The Decline of AI Coding Assistants Raises Concerns

By Tina Reynolds

Concerns about the performance of AI coding assistants have surfaced in recent weeks, pointing to a troubling trend. The author, now the CEO of Carrington Labs, is a heavy user of large language models (LLMs) and has noticed a downturn in the performance of these tools over the last few years. To get a systematic sense of ChatGPT’s overall abilities, they tested versions 3.5, 4.0, and even 5.0. The results revealed inconsistencies in how reliably each model produced working coding solutions.

As AI continues to grow and develop, so too do expectations of how it should perform. The author’s test results raise serious questions about the reliability of AI coding assistants, and critics warn about the impact of these tools on overall software development productivity.

Performance Comparison Between GPT-4 and GPT-5

In the most recent round of testing, GPT-4 turned in a sufficient performance, producing helpful output on every run. The model’s coding ability is what first jump-started public conversation about these tools, by demonstrating how capably it could answer programming questions.

GPT-5, by contrast, was expected to make that experience better. In one test it got there through brute force, reading the index of each row and adding 1. This approach produced working solutions time after time, but it had its drawbacks. GPT-5 was billed as a major improvement across the board, yet debilitating syntax errors and inconsistent logic plagued its output, exposing a difficulty with basic coding concepts.
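
The article does not reproduce the generated code, but a brute-force solution of that shape might look something like the following sketch, assuming a pandas workflow; the DataFrame and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for the test task.
df = pd.DataFrame({"value": [10, 20, 30]})

# Brute-force style: walk every row and derive the new value from the
# row's index plus one, one cell at a time.
row_numbers = []
for idx in df.index:
    row_numbers.append(idx + 1)
df["row_number"] = row_numbers

# The vectorized equivalent computes the same column in a single step.
df["row_number"] = df.index + 1
```

Both versions produce the same column; the loop simply does far more work to get there, which is the kind of drawback the testing surfaced.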

Surprisingly, when asked to debug mistakes in code, the two models turned out to work in fundamentally different ways. GPT-4 occasionally deviated from the instructions by adding comments alongside the code, and its fixes at times surfaced problems such as empty columns. GPT-5, by comparison, provided an additional, simpler solution, with one notable caveat: when the specified column didn’t exist, its code would either raise an error or populate the new column with the corresponding error message.
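
Again, the generated code isn’t shown, but the two fallback behaviors described above can be sketched as follows. This is a minimal illustration assuming a pandas workflow; the function name, the doubling transformation, and the strict flag are all hypothetical.

```python
import pandas as pd

def add_derived_column(df: pd.DataFrame, source: str, strict: bool = True) -> pd.DataFrame:
    """Add a column derived from `source`, handling a missing source
    column in one of the two ways described above."""
    if source not in df.columns:
        if strict:
            # Fail loudly: raise an error when the column doesn't exist.
            raise KeyError(f"column {source!r} not found")
        # Otherwise record the problem in the new column itself.
        df[f"{source}_derived"] = f"error: column {source!r} not found"
        return df
    # Normal path: the hypothetical transformation just doubles the values.
    df[f"{source}_derived"] = df[source] * 2
    return df
```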

The Claude Models’ Mixed Performance

It was just as surprising to many that the older Claude models lagged so noticeably behind the rapid progress made by the GPT variants. These legacy models failed on intractable issues: they were completely unhelpful, effectively shrugging their shoulders and dismissing the challenges rather than providing valuable support. This behavior revealed serious gaps in their abilities as programmers.

More recent Claude models tended to be more capable. In many cases, they succeeded at solving major environment problems. Even so, they were prone to avoiding hard challenges in favor of “sweeping them under the rug.” This inconsistency further complicates the landscape of AI coding assistants, as developers cannot rely on these tools to address issues comprehensively.

These performance differences across models raise questions about how much efficacy can be expected from their deployment in practice. Developers have been turning to AI in droves for code assistance, yet the unpredictable nature of these tools often works against productivity rather than improving it.

Impact on Development Timeframes

The ramifications of these performance failures go well beyond some incorrect code. Tasks the author once completed in five hours with AI assistance, and ten hours without it, are now taking seven or eight hours, sometimes more. This added development time is a major drain on project timelines and overall efficiency.

To probe further, the author’s experiment consisted of forwarding a single error message to nine different versions of ChatGPT, seeking to understand whether the performance decline was a systemic problem across all models. The findings indicated a concerning trend: most core models seemed to have reached a quality plateau around 2025 and have recently shown signs of decline.
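
The article doesn’t say how the prompts were submitted, but the experiment is straightforward to replicate programmatically. Below is a minimal sketch using the OpenAI Python client; the model identifiers and error message are illustrative stand-ins, and the script assumes an API key is configured in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

error_message = "KeyError: 'customer_id'"  # stand-in for the real error text
model_versions = ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]  # illustrative subset

# Send the identical prompt to each model so the replies can be
# compared side by side.
for model in model_versions:
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Help me debug this error:\n{error_message}",
        }],
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```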

This decline raises important questions about the future of AI coding assistants and, more broadly, their integration into the software development pipeline. As developers come to expect these tools to facilitate their work, ongoing reliability issues may force a reevaluation of how AI fits into their processes.