Recent large-scale evaluations of generative AI coding assistants have documented a troubling drop in performance. The testing covered a range of models, including GPT-4 and GPT-5, and found their results to be highly inconsistent. With hundreds of thousands of developers already using these tools to accelerate their coding work, the findings raise critical questions about AI’s effectiveness and reliability in programming.
The comparison spanned nine iterations of ChatGPT, with the aim of determining whether AI coding assistants were improving or regressing over time. The results were both encouraging and concerning: some models showed enhanced problem-solving abilities, while others showed alarming regressions. Over the past several months, developers have observed a troubling trend: core AI models appear to have reached a quality plateau by 2025, and recent iterations even show signs of decline.
Performance of GPT-4
Across repeated tests, GPT-4 proved reliably helpful: in 10 consecutive runs, it provided useful information every time. Yet it still showed inconsistency, failing on three of those occasions to follow a clear instruction to output only code. This lapse became frustrating for users who had come to expect accurate, instruction-following results from coding assistants.
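For readers who want to reproduce this kind of check, here is a minimal sketch of a repeat-run instruction-following test. It assumes the OpenAI Python SDK; the prompt text and the code-only heuristic are illustrative assumptions, not the original evaluators' harness.

```python
# Sketch of a repeat-run instruction-following check. Assumptions: the
# OpenAI Python SDK is installed and OPENAI_API_KEY is set; the prompt
# and the "looks like code" heuristic are hypothetical, not the
# original evaluation harness.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Write a Python function that reverses a string. "
    "Output only code, with no explanation."
)

def looks_like_code_only(text: str) -> bool:
    # Crude heuristic: a code-only reply should open with a code token
    # rather than prose such as "Sure, here is...".
    first = text.strip().splitlines()[0] if text.strip() else ""
    return first.startswith(("def ", "import ", "from ", "class "))

failures = 0
for _ in range(10):
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT}],
    )
    if not looks_like_code_only(reply.choices[0].message.content):
        failures += 1

print(f"Instruction-following failures: {failures}/10")
```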
Additionally, in three of 20 cases GPT-4 responded that a requested column was probably not included in the dataset. While such explanations can be valid, they suggest the model was sidestepping straightforward coding questions. This has led developers to question its usefulness in real-world applications, given its propensity to offer excuses rather than definitive answers.
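For context, that kind of question has a definitive answer that is straightforward to produce. Here is a minimal pandas sketch of the check a decisive response would contain; the file name and column name are hypothetical.

```python
# Sketch of the definitive check the model could have produced instead
# of speculating. Assumptions: pandas is installed; "data.csv" and the
# column name "revenue" are hypothetical stand-ins.
import pandas as pd

df = pd.read_csv("data.csv")

column = "revenue"
if column in df.columns:
    print(df[column].describe())
else:
    print(f"Column '{column}' is not in the dataset. Available: {list(df.columns)}")
```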
Even with these limitations, GPT-4 remains an essential tool for millions of developers: tasks that would take seven or eight hours by hand can be done in five minutes with AI assistance. Still, frustration over its degrading reliability has pushed some users to seek alternatives.
Advancements with GPT-5
Tested against the same bar as its predecessor GPT-4, GPT-5 showed significant advances in problem-solving: it identified a correct solution in every run. This consistency is a notable step forward for AI coding assistants, and it strengthens developers’ confidence in what these tools can actually deliver.
One of the most impressive aspects of GPT-5’s performance was how cleanly it created an additional column: it simply added 1 to each row’s index in the sheet (a minimal sketch of the approach appears below). This simple but effective technique handled the task well and highlighted the model’s potential for real-world coding work. By solving a challenge that had long tripped up previous iterations, GPT-5 raises the bar for AI coding assistants.
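Here is a minimal pandas sketch of the index-plus-one approach described above; the file names and the column name are assumptions for illustration.

```python
# Sketch of the index-plus-one column described above. Assumptions:
# pandas (with openpyxl for Excel I/O); "sheet.xlsx" and the column
# name "row_number" are hypothetical.
import pandas as pd

df = pd.read_excel("sheet.xlsx")

# Add a column equal to each row's zero-based index plus 1,
# i.e. a one-based row number.
df["row_number"] = df.index + 1

df.to_excel("sheet_with_row_numbers.xlsx", index=False)
```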
That said, GPT-5’s improvements do not fully alleviate concerns about the broader trajectory of AI performance. Although this model performed very well in these limited tests, other newer models produced mixed results: some solved the problems successfully, while others appeared to sidestep them altogether, leaving developers uncertain about the reliability of these tools in high-stakes environments.
The Shift in AI Performance Trends
As developers evaluate the efficacy of AI coding assistants, they see a pronounced downtrend over time. Earlier Claude models had been criticized for shrugging off tougher questions and offering little help on difficult or multi-step queries. Newer models, by contrast, sometimes provided an answer but other times skirted the question and avoided engaging with it altogether.
This troubling inconsistency makes clear why AI coding assistants must change direction. Researchers were initially highly optimistic about these tools, but that excitement faded as performance dropped and the models failed to consistently deliver on user expectations. Developers now face a dilemma: rely on AI tools that no longer guarantee quality assistance, or revert to manual coding processes that are considerably more time-consuming.
These core models are becoming less performant, raising for some developers the troubling question of whether the new AI technologies are genuinely improving things or merely papering over deeper problems. With the industry changing so quickly, it is important to track these trends and address the deficiencies that keep AI coding assistants from being fully effective.