The Decline of AI Coding Assistants: A Closer Look
Recent evaluations of artificial intelligence coding assistants like ChatGPT paint a surprising picture. Although models like GPT-4 and GPT-5 once showed promise in generating useful code, their reliability has since declined. Jamie Twiss, a PhD candidate in AI performance evaluation, ran evaluations to test the ability of these models at coding tasks. His findings point to a drop-off in effectiveness, leaving many to wonder what’s next for AI in the world of programming.

In Twiss’s trials, GPT-4 produced a usable response in most of the ten tests, but it did not always follow the prompt’s instruction to return only code. In three cases, GPT-4 supplied explanations instead of the requested output, reading more like the introduction to a high school essay than code, a concrete failure to follow user instructions. GPT-5, by comparison, was more successful, solving the challenges reliably every time, though each solution came with important caveats.

The Performance of GPT Models

GPT-4 did impress at points, for instance by proposing columns that might be missing from datasets, which Twiss found the most helpful insight of the evaluation. But it also showed weaknesses. When presented with an unclear question, GPT-4 often either repeated the question back or fabricated information rather than asking for clarification. These were rated the least useful and most counterproductive responses, respectively.

For its part, GPT-5 took a much simpler path. It modified each row’s index, allowing new columns to be created rapidly. While this approach worked most of the time, it was not without issues: even GPT-5’s first drafts contained syntax mistakes and lapses in reasoning. Still, despite these shortcomings, it produced usable code more often than its predecessor.
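The article does not show Twiss’s actual code, but the index-based approach it attributes to GPT-5 might look something like this sketch in Python with pandas (the data and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical data; the article does not say which datasets were used.
df = pd.DataFrame({"value": [10, 20, 30]})

# The index-based approach the article describes: expose each row's index
# as a regular column, then use it to derive new columns quickly.
df.index.name = "row_id"
df = df.reset_index()
df["prev_value"] = df["value"].shift(1)  # new column derived per row
```

The appeal of this pattern is speed: once the index is a column, new per-row columns can be added in one vectorized step each, which may explain why the model reached for it.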

The Claude models present a different narrative. Earlier versions apparently struggled with more complex queries, frequently offering boilerplate catch-all answers that weren’t particularly helpful. Newer versions of Claude improved significantly, though they still sometimes sidestepped difficult questions rather than answering them.

Implications for Developers

The diminishing accuracy of AI coding assistants presents a troubling dynamic for software developers who may be counting on these technologies. Tasks that once took around five hours with AI assistance now balloon to seven or eight hours, and in some instances even longer. The shift signals a changed landscape for AI tools: they still run, but they may no longer deliver the expected returns in time savings and productivity.

The CEO of Carrington Labs recently highlighted the challenges developers face when using large language models (LLMs) for coding tasks. After documenting the needed improvements in detail, they submitted error messages to different versions of ChatGPT, primarily variants of GPT-4 and GPT-5. The request was explicit: return only completed code that resolves the error, with no additional commentary.
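The article does not quote the exact prompt, but an instruction of that kind might be assembled along these lines (the wording, function name, and sample error below are all hypothetical):

```python
def build_fix_prompt(error_message: str, code_snippet: str) -> str:
    """Build a code-only repair prompt in the style the article describes.

    Hypothetical wording -- the source does not quote the actual prompt.
    """
    return (
        "The following code produces an error.\n\n"
        f"Code:\n{code_snippet}\n\n"
        f"Error:\n{error_message}\n\n"
        "Return only the completed, corrected code. "
        "Do not include any explanation or commentary."
    )

# Example usage with a made-up error message:
prompt = build_fix_prompt(
    "NameError: name 'pd' is not defined",
    "df = pd.DataFrame({'a': [1]})",
)
```

Pinning the output format this way makes responses easy to score automatically: any reply containing prose instead of code is an immediate instruction-following failure.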

This straightforward approach measured how accurately these models could perform individual coding tasks when given detailed instructions. The results were mixed: as of 2025, most core models appear to have reached a quality plateau, and some have recently begun to decline.

The Future of AI Coding Assistants

Given this trajectory, it’s worth asking what role AI coding assistants should ultimately play in the software development process. As the industry continues to depend on these tools, knowing their limitations will be key. The mixed results of Twiss’s trials are a reminder that AI systems, particularly ones with public-facing impact, need continuous improvement and innovation.

GPT-4 and GPT-5 have advanced by leaps and bounds, but persistent issues with prompt adherence and output quality underscore the limitations that remain. Now more than ever, developers need to lead with their own skills, supported by AI rather than replaced by it. Relying on these models alone will not produce smart, context-sensitive, and efficient code.