Recent tests pushing the limits of AI coding assistants have revealed both real progress and persistent pitfalls. For this evaluation, we gave two models, GPT-4 and GPT-5, the same challenge: fix one particular type of coding error. GPT-4 proved capable and genuinely helpful. GPT-5 was a different story, producing some strong solutions alongside caveats and troubling weaknesses.
We ran this error message through each model ten times to see how it would respond, instructing the model to return its final code only, with no additional commentary. The outcome was a varied set of responses that illustrated both growing coding skill and a significant remaining gap in capability.
Performance of GPT-4
In the first round of evaluations, GPT-4 provided a useful answer on nearly every first attempt, fixing the erroneous output and producing accurate results on all ten runs. The model was not perfect, though: in three instances it broke the strict instruction to return only code and added unsolicited commentary.
Additionally, in three cases GPT-4 noted that a column might be missing from the dataset. Though these insights could point users toward further investigation, they did not directly fix the coding problem at hand. This blend of helpfulness and deviation from explicit instructions illustrates the challenge of refining AI responses to fully meet user expectations.
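GPT-4's observation about a possibly absent column maps to a simple diagnostic check. A minimal sketch, assuming a pandas DataFrame; the data and the column name `row_number` are hypothetical, since the evaluation's actual dataset was not published:

```python
import pandas as pd

# Hypothetical stand-in for the test data (names are assumptions).
df = pd.DataFrame({"value": [10, 20, 30]})

# The check GPT-4's commentary pointed toward: confirm the column
# exists before using it, rather than assuming it is present.
target = "row_number"  # assumed name of the column the code expected
if target in df.columns:
    result = df[target]
else:
    print(f"column {target!r} is missing; available: {list(df.columns)}")
```

A check like this surfaces the problem, but as the evaluation noted, it still leaves the user to supply the actual fix.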
Advancements with GPT-5
By contrast, GPT-5 took a more definitive approach to fixing the mistake. It identified a solution that proved effective every time it was tested. Its most useful move was generating a new column: it took the actual index of each row and added 1, enabling a more direct solution to the coding challenge.
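The index-plus-one fix described above can be sketched in pandas as follows; the DataFrame contents and the column name `row_number` are assumptions, since the original test code was not published:

```python
import pandas as pd

# Hypothetical data standing in for the evaluation's dataset.
df = pd.DataFrame({"value": [10, 20, 30]})

# The fix attributed to GPT-5: derive a new column from each row's
# index plus one, instead of referencing a column that may not exist.
df["row_number"] = df.index + 1

print(df["row_number"].tolist())  # [1, 2, 3]
```

Because a default pandas index starts at 0, adding 1 yields a 1-based row number, which sidesteps the missing-column lookup entirely.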
Despite this success, GPT-5 ran into difficulties. In six cases it simply attempted to run the code, hit the exceptions, and either errored out completely or populated the new column with the error messages themselves. In only one of the ten test cases did GPT-5 actually return the names of the columns in the dataframe, suggesting the user check whether the expected column exists without offering a concrete fix. The output of GPT-5's code sometimes looked like little more than a random number, a dramatic example of the disconnect between what you ask for and what you get.
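The error-message-in-column failure mode described above looks roughly like the following sketch; the data and column names (`missing_column`, `result`) are assumptions, not the evaluation's actual code:

```python
import pandas as pd

# Hypothetical data; the evaluation's real code was not published.
df = pd.DataFrame({"value": [10, 20, 30]})

# The failure mode: instead of resolving the missing column, the
# generated code catches the exception and writes the error message
# into the new column, so every row ends up holding error text.
try:
    df["result"] = df["missing_column"] * 2
except KeyError as exc:
    df["result"] = f"error: {exc}"

print(df["result"].iloc[0])  # error: 'missing_column'
```

Code like this runs without crashing, which is exactly what makes it dangerous: the dataframe looks populated, but the values are error strings rather than results.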
The Impact of AI on Coding Efficiency
The landscape of coding efficiency has been transformed by AI assistance, but not without nuance. A task that takes five hours with AI help can still balloon to seven or eight hours when the assistant goes astray, even if the same work might have taken ten hours or more without AI. This variability raises legitimate doubts about whether today's AI models are reliably productivity-enhancing.
Earlier AI models, such as the first iterations of Claude, showed troubling tendencies when prompted with difficult or unresolvable issues: sometimes they simply shrugged, offering no acceptable or viable solution. Contemporary models such as GPT-5 far more often arrive at an answer, yet those answers can miss the mark or fail to address the important parts of the problem. This unpredictability can frustrate developers who have come to expect consistent, dependable output from AI coding assistants.


