Recent evaluations of AI coding assistants have shown alarming patterns, leading many developers to question the efficacy of these tools. As the CEO of Carrington Labs, the author works with a great deal of LLM-generated code these days, and they ran a set of experiments on different versions of ChatGPT. The results made clear an alarming deterioration in performance, especially among more recent model generations.
The author ran ten trials using different versions of ChatGPT, primarily variations of GPT-4 and GPT-5, to assess their problem-solving capabilities. During the trials, GPT-4 regularly provided high-quality responses across every test case, and GPT-5 reliably produced the right solution every time it was tested. The other versions, however, raised concerns about their reliability and utility in coding tasks.
Trials and Outcomes
In the test case, the author submitted an error message to nine separate versions of ChatGPT, instructing each one to repair the given bug and return just the corrected code with no explanation. The worst-performing models produced the least useful results: in nine out of ten test cases, ChatGPT simply output the list of columns in the dataframe rather than cutting straight to fixing the error, as sketched below.
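The article does not reproduce the original snippet, so the following is a minimal, hypothetical sketch of the kind of failure described: a pandas KeyError for a missing column, answered with a diagnostic listing of the dataframe's columns rather than an actual fix. The dataframe contents and the `row_number` column name are assumptions, not the author's actual test case.

```python
import pandas as pd

# Hypothetical dataframe that lacks the column the buggy code expects.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [10, 20, 30]})

try:
    print(df["row_number"])  # the original bug: raises KeyError: 'row_number'
except KeyError as err:
    print(f"KeyError: {err}")

# The diagnostic-only response most models reportedly gave:
# listing the dataframe's columns instead of repairing the bug.
print(df.columns.tolist())  # ['name', 'value']
```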
One response echoed the author's initial instinct: check for the column first, and if it is not there, create it right away. That approach, though, felt like a heavy lift for developers just looking for a quick win. While GPT-4 showcased a consistent ability to provide helpful output, the newer GPT-5 distinguished itself by proposing a direct solution: it suggested taking the actual index of each row and adding one to create the missing column, demonstrating its effectiveness at this kind of coding problem. A sketch of that fix follows.
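Again as a hedged illustration rather than the author's verbatim output, this is what GPT-5's suggested fix would look like in pandas, assuming the missing column is a one-based row number and the dataframe uses the default integer index:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "value": [10, 20, 30]})

# GPT-5's reported suggestion: build the missing column directly from
# each row's index plus one (assumes the default RangeIndex: 0, 1, 2, ...).
df["row_number"] = df.index + 1

print(df)
#   name  value  row_number
# 0    a     10           1
# 1    b     20           2
# 2    c     30           3
```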
A Worrying Trend
These results point to a deeply troubling tendency across AI coding assistants. As the author noted about such abstentions, even older models such as Claude tend to “shrug their shoulders” when faced with problems they cannot solve. By comparison, more recent models often deliver results that are only partially correct or completely off-target. This inconsistency calls into question the reliability of AI as an effective resource in coding environments.
Previously, a task that took around ten hours without AI support could be done in roughly five hours with it. Now, due to growing inefficiencies, that same work frequently takes seven to eight hours or more. Developers today have to spend more time figuring out how to address problems that AI tools should be able to handle for them. This shift erodes productivity and creates new issues, prompting the question: what aggregate benefit do these AI assistants actually provide to software development? The arithmetic below makes the shrinking advantage concrete.
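As a quick worked example using the author's own figures (the 7.5-hour value is an assumed midpoint of the stated seven-to-eight-hour range):

```python
baseline_hours = 10.0   # task time without AI support
with_ai_before = 5.0    # earlier experience with AI support
with_ai_now = 7.5       # assumed midpoint of "seven to eight hours"

# The speedup has fallen from 2.0x to roughly 1.33x.
print(baseline_hours / with_ai_before)         # 2.0
print(round(baseline_hours / with_ai_now, 2))  # 1.33
```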
Quality Plateau and Decline
Analysis of these trials’ results revealed that most core models hit a quality peak around 2025, and their performance appears to have backslid in the releases since. Studies have shown that developers mostly use AI coding assistants to boost their productivity, so this downturn has many questioning whether the tools will remain worth relying on.
As AI technology continues to evolve, developers and companies will need to adapt their expectations and workflows accordingly. The author’s observations highlight a clear opportunity to refine and build upon existing AI coding assistants. Without major improvements, these tools risk becoming more of a productivity trap than a productivity boon.

