The Declining Performance of AI Coding Assistants Raises Concerns

By Tina Reynolds

Recent assessments of AI coding assistants have revealed large variance in performance, especially between models like GPT-4 and GPT-5. The CEO of Carrington Labs recently finished a series of tests intended to measure how fast the team produces usable lines of code. The tests found that GPT-4's responses, while often technically accurate, were undercut by conspicuous shortcomings. One key aspect set GPT-5 apart: it was incredibly reliable. The results raise a series of fundamental questions about the future trajectory of AI coding technologies.

The forum member who ran the tests started by reporting error logs to each model. The evaluation covered nine distinct ChatGPT models, primarily variants of GPT-4 alongside the latest GPT-5. The objective was clear: obtain completed code without any accompanying commentary. Each model's responses were then ranked by the user as helpful, unhelpful, or harmful, providing a rigorous side-by-side comparison of the two model families in a consistent environment and under standardized conditions.

Performance Analysis of GPT-4

Overall, GPT-4 turned in the less impressive performance over the course of the test. Asked in ten different one-shot scenarios to return working code, it tended instead to offer merely informative answers, and a clear pattern developed throughout the trials: in 90% of instances, GPT-4 simply generated the column names in the dataframe. While technically accurate, this response did not meet the user's expectations for code they could act on.
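
For context, here is a minimal sketch of what such a reply amounts to, assuming a pandas dataframe; the data and column names are invented for illustration. Rather than code that transforms the data, the model effectively returned the dataframe's column listing.

```python
import pandas as pd

# Hypothetical dataframe standing in for the user's dataset.
df = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

# The technically accurate but unhelpful pattern described above:
# listing the column names instead of producing actionable code.
print(df.columns.tolist())  # ['name', 'score']
```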

In one notable case, GPT-4 ignored the user's request to return only the code. It prefaced its answer with an explanation suggesting that the dataset was missing a specific column. While helpful in its own right, this commentary sidetracked the goal of delivering polished, production-ready code. In another case, GPT-4 simply repeated existing code, showing that it hit a wall when asked to create new solutions.

GPT-4 also took initiative by attempting to run the code itself. When it was unable to identify the indicated column, it appended exception handling that surfaced an error message in a new column. This approach yielded mixed results across six test cases, highlighting a tendency toward defensive error handling rather than straightforward solutions. Despite GPT-4's continued utility, these glaring shortcomings raised red flags about whether it can function effectively as a coding assistant over the long term.
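
As a rough illustration of the defensive pattern described, here is a minimal pandas sketch. The column names ("target", "result", "error") and the doubling operation are assumptions for illustration, not the user's actual task.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

try:
    # Attempt to use the column the user pointed to ("target" is assumed).
    df["result"] = df["target"] * 2
except KeyError as exc:
    # Rather than flagging the problem to the user, the reported pattern
    # records the failure as an error message in a new column.
    df["error"] = f"column not found: {exc}"

print(df)
```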

The Advancements of GPT-5

In contrast to its predecessor, GPT-5 proved to be a far better coding assistant. It delivered on user requests in a clear, user-friendly way that reliably met expectations. Remarkably, GPT-5 arrived at a solution every time while avoiding the pitfalls that tripped up GPT-4.

One of its most remarkable qualities was its simplicity. To create the new column, GPT-5 simply took the index of each row and added one. This direct strategy significantly improved the user's productivity and illustrated just how far GPT-5 had pulled ahead of previous models.
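
Based on the description above, GPT-5's approach amounts to a one-liner in pandas; the sample data and the new column's name are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]})

# The reported approach: take each row's index and add one
# to populate the new column.
df["row_number"] = df.index + 1

print(df)
#    value  row_number
# 0     10           1
# 1     20           2
# 2     30           3
```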

Yet those user experiences point to an important trend from 2025: by the end of that year, most foundational AI models had reached peak performance and started to backslide. That leaves open the question of where coding assistants go from here and whether their performance can keep improving.

Impact on Productivity

For the new CEO of Carrington Labs, these findings carry enormous implications for efficiency. AI support once reduced projects from ten hours of work down to five. Now, due to issues with AI-generated code, the same tasks frequently require seven or eight hours or more.

These AI tools are deeply embedded in everything the team does, so the waning reliability of the models deepens worry about depending on them for high-stakes work. End-user assessments like this one underscore the need for continual iteration and improvement if AI is to become a trusted, responsible coding companion.