In recent months, the performance of AI coding assistants, specifically ChatGPT models such as GPT-4 and GPT-5, has come under scrutiny. An analysis conducted by a technology executive revealed a troubling trend: while these models once provided reliable assistance, their effectiveness appears to have diminished. This article examines how well GPT-4 and GPT-5 actually perform by running both models through the same open coding problem and judging them on the quality of their solutions.
The author, John F. D. Frumkin, president and CEO of Carrington Labs, has deeply integrated LLM-generated code into their coding workflows. This firsthand experience underscores how much reliable AI coding assistance matters in software development. They ran ten repetitions of the same task on each model and judged the resulting outputs as helpful, useless, or counterproductive. The results of this study call into question the value of technologies on which many practitioners now rely.
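The scoring protocol described above can be sketched in a few lines of Python. The per-run verdicts here are invented for illustration; the article reports the protocol (ten repetitions, three verdict categories) but not the raw per-run labels.

```python
from collections import Counter

# Hypothetical verdicts for ten runs of one model; the article
# describes the three categories but not the individual outcomes.
verdicts = [
    "helpful", "helpful", "useless", "counterproductive", "helpful",
    "helpful", "useless", "helpful", "helpful", "helpful",
]

# Tally the categories and compute a simple helpfulness rate.
tally = Counter(verdicts)
helpful_rate = tally["helpful"] / len(verdicts)

print(dict(tally))   # {'helpful': 7, 'useless': 2, 'counterproductive': 1}
print(helpful_rate)  # 0.7
```

A tally like this makes cross-model comparison straightforward: run the same ten tasks through each model and compare the resulting rates.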
Performance Comparison of AI Models
GPT-4 returned helpful responses on all ten attempts; each interaction offered substantive insight toward solving the coding problem. GPT-5, by contrast, underperformed in a surprising number of areas: it provided concrete answers to some challenges but fell flat on others.
For example, in one task GPT-5 easily created a new column in the dataset by incrementing the index of each row by one. In six of the ten attempts, however, the model had difficulty following through and running the code, either throwing errors or populating the new column with error messages instead. Inconsistent outputs like these undermine reliability, especially next to GPT-4’s more consistent results.
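For reference, the task GPT-5 handled correctly, adding a column containing each row's index plus one, is a one-liner in pandas. The DataFrame and column names below are illustrative assumptions, not taken from the author's actual dataset.

```python
import pandas as pd

# Illustrative stand-in for the dataset (column names are assumed).
df = pd.DataFrame({"value": [10, 20, 30]})

# New column: each row's index incremented by one,
# the operation described in the article.
df["row_number"] = df.index + 1

print(df["row_number"].tolist())  # [1, 2, 3]
```

The simplicity of this operation is the point: failures on a task this small are what make the six error-ridden runs notable.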
On one of the ten attempts, GPT-5 simply repeated the original code without providing any improvement or solution. This habit of regressing to past code without offering upgrades highlights the need for continued iteration on AI coding partners. As the tasks coders tackle grow more complicated, so does the expectation that these tools provide real value.
Trends in AI Coding Assistance
In the author’s experience, this has been the trend with most modern AI coding assistants. Models prior to Claude, and even Claude itself, had difficulty solving complicated multi-step issues. Their successors, such as GPT-4 and GPT-5, occasionally provide answers, but they often miss the crucial questions or respond only partially. This pattern suggests that, despite improvements, progress has not kept pace with user demand and expectations.
The author had expected core models such as GPT-4 to start hitting a quality wall by 2025, and recent analyses bear this out: a downward trend in effectiveness no longer allows users to work at full productivity. Tasks that would once have taken five hours of work with AI assistance now frequently stretch to seven or eight, as much as 60 percent longer, when better alternatives don’t exist.
As AI coding assistants are incorporated into more workflows, their reliability is key. The mixed performance in these challenges highlights the dangers of adopting these technologies at face value, without oversight.
Implications for Software Development
In short, these findings carry significant ramifications for software developers on the front lines, as well as for executives and policymakers. As more AI tools are used to automate portions of the coding process, it’s important to understand their limitations. The contrasting performances of GPT-4 and GPT-5 underscore the need for thorough evaluation before incorporating such models into development practices.
The fact that AI can generate code doesn’t mean developers should stop thinking critically or solving problems themselves. As these tools progress, programmers need to remain vigilant, actively iterating on output and performing quality control throughout their coding efforts.


