Recent audits of AI coding assistants have found a dramatic difference in performance between first- and second-generation systems. A study conducted by a tech executive who makes extensive use of large language model (LLM) generated code as CEO of Carrington Labs highlights the evolving capabilities of these tools. The results indicate that GPT-4 provided helpful responses in the majority of cases, while its successor, GPT-5, performed better on every coding challenge in the test set.
Across twenty tests, ten each with GPT-4 and GPT-5, the author treated each model as a limitlessly patient assistant and assessed the tangible product of its effort. The primary concerns were utility, ease of use, and effectiveness at resolving the coding task. The findings have generated considerable conversation about the future landscape of AI coding assistants as they continue to develop heading into 2025.
Performance of GPT-4
The trials highlighted both where GPT-4 shines and where it falls short. Despite producing plausible answers in every test, its practical performance record was problematic. When asked to increment the value of the 'index_value' column by 1 in a dataframe called 'df', GPT-4 showed a clear pattern, one that regular users of these tools will likely recognize.
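For context, the requested change is a one-liner in pandas. The sketch below uses a toy dataframe as a stand-in, since the actual test data was not published:

    import pandas as pd

    # Toy stand-in for the test data; here 'index_value' actually exists.
    df = pd.DataFrame({"index_value": [0, 1, 2]})

    # The direct answer the prompt asks for: add 1 to every row of the column.
    df["index_value"] = df["index_value"] + 1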
In 9 of 10 test cases, GPT-4 simply returned the list of columns available in that dataframe. Though informative, this response failed to accomplish the actual task, which was to change the underlying data as requested. GPT-4 did recommend first checking whether the column exists and, failing that, at least making sure the mistake is caught and reported promptly. This showed awareness of the pitfall, but it stopped well short of offering a direct remediation.
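GPT-4's exact output was not published, but the check-first pattern it described, combined with its observed habit of listing the available columns, would look roughly like this:

    import pandas as pd

    # Toy dataframe deliberately missing 'index_value', to trigger the check.
    df = pd.DataFrame({"other_column": [0, 1, 2]})

    # Illustrative sketch of the defensive pattern GPT-4 recommended, not its
    # verbatim code: verify the column exists before mutating it.
    if "index_value" in df.columns:
        df["index_value"] = df["index_value"] + 1
    else:
        # Mirror GPT-4's observed behaviour: report the columns that do exist.
        print(f"Column 'index_value' not found; available columns: {list(df.columns)}")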
Even with its limitations, GPT-4 was consistent in the quality and detail of the feedback it produced. Such responses can strike someone seeking coding help as brusque, and users looking for a plain coding answer may be disappointed or even annoyed by them, but they provided valuable clues about the underlying user errors.
Advancements with GPT-5
In stark contrast to its predecessor, GPT-5 showed a dramatic improvement on those exact same tests. Unlike the older model, it consistently located effective solutions, and it showed a greater ability to understand context and reason about root causes. Given the same instruction, to increment the 'index_value' column by 1, GPT-5 took on the challenge far more intelligently.
In three instances during testing, GPT-5 ignored the directive to return only code and instead explained that the column was likely absent from the dataset. That behavior reflects just how far AI coding assistants have come: anticipating user needs has become essential to producing genuinely useful solutions.
Additionally, in 6 of 10 instances GPT-5 tried to run the code it generated, including error-handling functions. As it explained in one response: "To make my code more robust, I added exceptions that prompt an error message if the correct column can't be located." If execution still failed, a new column was populated with a helpful error message. This proactive approach did more than prevent or flag issues; it helped users understand how to fix the problem.
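The article does not reproduce GPT-5's code, but the behaviour described, catching the failure and falling back to a message column, might look like this sketch:

    import pandas as pd

    # Toy dataframe again missing 'index_value', so the fallback path runs.
    df = pd.DataFrame({"other_column": [0, 1, 2]})

    # Illustrative sketch of GPT-5's reported error handling; names are assumed.
    try:
        df["index_value"] = df["index_value"] + 1
    except KeyError:
        # Rather than failing silently, record a helpful message in a new column.
        df["error_message"] = "Column 'index_value' was not found in this dataframe."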
In one standout instance, GPT-5 actually rewrote the initial code to incorporate these improvements. That degree of flexibility is a significant step beyond earlier AI coding tools. More importantly, it serves users by addressing the coding traps they run into every day.
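A rewritten version folding both behaviours into a single routine might look like the following (again a sketch under assumed names, not GPT-5's actual output):

    import pandas as pd

    def increment_column(df: pd.DataFrame, column: str = "index_value") -> pd.DataFrame:
        """Add 1 to `column` if it exists; otherwise record a helpful message."""
        if column in df.columns:
            df[column] = df[column] + 1
        else:
            df["error_message"] = f"Column '{column}' was not found in this dataframe."
        return df

    df = increment_column(pd.DataFrame({"other_column": [0, 1, 2]}))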
Observations on Evolving Models
The testing also offers a snapshot of where foundation-model capabilities stand as the field moves towards 2025, with GPT-4 and GPT-5 serving as bookends for very different degrees of effectiveness. Prior models, such as Claude, struggled badly with these sorts of ill-posed problems: when a prompt was more of a stretch goal, the Claude models had difficulty opening up productive paths of thought. In most such instances they simply "threw their hands up," failing to provide users with workable alternatives.
As advanced models such as GPT-5 emerge, they are increasingly used to solve substantial problems directly; on other occasions, they deliberately opt to give less helpful answers. Such inconsistency raises questions about the reliability of AI coding assistants and about what users should expect of them. As users, whether professionals or members of the public, come to depend on these tools for decision-making, understanding their strengths and weaknesses will be essential.
Notably, the author's systematic testing was spurred by several months of anecdotal experience. Having noticed a decline in performance from some AI coding assistants, the author sought to quantify those observations and better understand how these tools are evolving. The results indicate that, despite many positive strides, there is still a long way to go on reliability and consistency.

