Declining Performance of AI Coding Assistants Raises Concerns

By Tina Reynolds

Over the last few months, the performance of AI coding assistants has both dazzled and worried developers and tech aficionados alike. An experiment conducted by the CEO of Carrington Labs highlighted a troubling trend: many AI models, once reliable, have begun to falter in their ability to assist with coding tasks. The experiment ran ten trials across a variety of models, from GPT-4 through to GPT-5, with the objective of measuring how well each could tackle concretely defined programming tasks.

The key takeaway from the trials was that GPT-4 provided useful answers more consistently, while Claude models failed far more often. Even GPT-5's improvements were not remarkable, as it remained prone to significant inconsistencies. As AI coding assistants become standard tools in enterprise software development, it is increasingly important to know what they can do and what they can't.

Performance Assessment of GPT-4 and Claude Models

In the evaluation, GPT-4 showed dependable performance, providing high-quality responses consistently across the tests. In a typical case, the job was as simple as incrementing an 'index_value' column in a pandas DataFrame named 'df'. GPT-4 executed the task correctly in nine of ten attempts; in the single failure, rather than doing what it was asked, it simply printed out the column names in the DataFrame.
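The task as described admits a one-line solution. A minimal sketch, assuming (as the article does) a DataFrame named `df` with an `index_value` column; the sample data here is this editor's own:

```python
import pandas as pd

# Sample data standing in for the article's DataFrame 'df',
# which has a column named 'index_value' to be incremented.
df = pd.DataFrame({"index_value": [0, 1, 2], "name": ["a", "b", "c"]})

# The correct behavior: add one to every value in the column.
df["index_value"] = df["index_value"] + 1

print(df["index_value"].tolist())  # → [1, 2, 3]
```

The failing run, by contrast, amounted to something like `print(df.columns)`, which inspects the DataFrame but leaves the column untouched.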

GPT-4 also included a helpful comment in its code, suggesting users verify that the column exists before proceeding. This preventative habit gave users support in diagnosing problems they encountered, but in the failing run it still left the user's actual task unaccomplished: the model answered with the same confident tone even when the code it produced was wrong.
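A guard of the kind GPT-4 reportedly suggested might look like the following. The exact check and wording are this editor's reconstruction, not the model's verbatim output:

```python
import pandas as pd

df = pd.DataFrame({"index_value": [0, 1, 2]})

# Verify the column exists before modifying it, as GPT-4's comment advised.
if "index_value" in df.columns:
    df["index_value"] = df["index_value"] + 1
else:
    print("Column 'index_value' not found; available columns:", list(df.columns))
```

The check is cheap and sensible on its own; the article's complaint is that it accompanied code that did not complete the task.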

When given unsolvable prompts, the Claude models were simply unresponsive. Rather than offering appropriate help or explanations, they just "threw up their hands" and left users stranded. The gap between GPT-4's accuracy and Claude's failures is concerning, especially given how quickly these large language models are changing.

Advancements and Inconsistencies in GPT-5

The evaluation also set out to determine how much progress GPT-5 had made over prior iterations. On the incrementing task, it consistently did the right thing, taking the original index of each row as-is and adding one to it. This was a positive step, suggesting a deeper understanding of how to actually solve the problem. Yet gaps remained.
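The approach attributed to GPT-5 above can be sketched as follows. This is a reconstruction from the article's description, not the model's actual output, and it assumes the DataFrame has its default positional index:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"]})

# GPT-5's reported approach: derive the column from each row's
# original index, plus one.
df["index_value"] = df.index + 1

print(df["index_value"].tolist())  # → [1, 2, 3]
```

Note this only coincides with incrementing an existing 'index_value' column when that column mirrors the row index; on arbitrary data the two operations differ.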

In other tests, GPT-5 didn't follow the brief, which instructed it to return only code. In three specific cases it failed to stick to its lane, returning detailed explanations of what might be wrong with the column's presence in the dataset. That extra context could actually be useful to developers, but it was at odds with the users' very clear instruction to bring back code only.

In six of these cases, GPT-5 ran the code correctly but added exceptions that threw errors, or populated the new column with error strings, where the original column was absent. This inconsistency highlights a crucial aspect of AI coding assistants: while they may offer valuable suggestions or alternatives, their reliability can vary dramatically based on how well they interpret user instructions.

The Impact on Development Time and User Experience

The author's observations show a marked increase in development time when leaning on AI coding assistants. Tasks that previously took around five hours with AI assistance now often require seven or eight hours, or even longer. This lost time has serious implications for productivity, not to mention for the long-term standing of these tools.

The author, who has made a habit of using LLM-generated code at Carrington Labs, expressed concern over what he sees as the diminishing effectiveness of AI-based coding assistants. These tools were meant to increase productivity and cut busywork; the reality today, he argues, is movement in the opposite direction.

The experiments covered nine ChatGPT variants, homing in primarily on GPT-4 and GPT-5 models. As these tests demonstrate, it is critical to continually test, evaluate, and improve this technology. As these systems are increasingly adopted within development workflows, addressing their limitations is imperative if they are to remain useful resources for programmers.