Decline in AI Coding Assistants Raises Concerns Among Developers

A new study of the effect AI coding assistants are having finds them falling woefully short. Yet a simple, systematic test conducted by an industry expert uncovered these dismal outcomes. The native image of the author The author, Dr. Groves, is CEO of Carrington Labs. He wanted to see if there was any empirical evidence…

Tina Reynolds Avatar

By

Decline in AI Coding Assistants Raises Concerns Among Developers

A new study of the effect AI coding assistants are having finds them falling woefully short. Yet a simple, systematic test conducted by an industry expert uncovered these dismal outcomes. The native image of the author The author, Dr. Groves, is CEO of Carrington Labs. He wanted to see if there was any empirical evidence to back up the perceived decline in quality. The test results paint a pretty disturbing picture. AI improvements can’t come soon enough—users feel that many AI models (the ones you’ve probably heard of) are underwhelming.

The evaluation was conducted by querying ten parallel instances of ChatGPT, although mostly based on the GPT-4/5 architectures. To do so, the author adapted each version to correct the same central coding issue. They asked that only the final code be submitted back, with no additional comments. Some of the discoveries demonstrate significant gaps in performance between each statutory or regulatory version of the EIS tested. This should raise serious questions about the reliability of AI coding tools.

Testing the Capabilities of ChatGPT

Specifically, in the early stages of the public test, the replier helped the author repropose a nasty error message to nine distinct models of ChatGPT. These alter egos covered GPT-4 variants as well as GPT-5 versions. Each iteration was then expected to offer a better answer to fix the mistake. The outcomes were inconclusive. Though GPT-4 always provided valuable responses, GPT-5 had a 100% success rate.

OpenAI’s approach with GPT-5 was to adopt a simple but great approach. Specifically, it recommended one new column be created from the index of each row. You just have to take that index and add one to it. We found this approach to be very effective, demonstrating the model’s ability to produce useful, relevant code based on focused prompts.

It wasn’t the case for the other versions of ChatGPT that performed wildly differently. In nine of ten test cases, these versions just printed out the full list of columns in the dataframe. What they didn’t do, unfortunately, was remedy the error in question. This response only serves to demonstrate how poorly the issue is actually understood. It further exposes a lack of innovation and ability to go-to-market with a compelling solution.

Patterns of Noncompliance

In doing so, the author found that multiple iterations of ChatGPT exhibited obvious signs of failure to adhere to the set directions. In three instances, despite explicit requests for only code, the models included comments suggesting that users check for column presence and rectify any issues. This unexpected break from the directive suggests an area of weakness in the understanding of user intent.

Six other builds attempted to run the code. They soon ran into exceptions that would either raise an error or populate the new column with an error message when the specified column couldn’t be found. Such behavior demonstrates a lack of creativity and ingenuity in problem solving to work through programming obstructions.

Surprisingly, on one occasion even a version of ChatGPT simply repeated the original code with no suggestions at all. This back-and-forth mechanic indicates a failure of problem-solving ability built into some versions of AI coding assistants.

Implications for Developers

This comprehensive test should sound alarm bells for developers. In short, they rely on AI coding assistants in their own work. Like many of you, the author has been worried by a disturbing pattern over the last few months. After ~2025, many of the core models appear to have reached a quality plateau. The further back in history you go, the more obvious this decline is.

If AI coding assistants are to be widely adopted in software development workflows, their reliability must be unquestionable. Developers love these tools because they help automate time-consuming processes, allowing their teams to be more productive. The evidence seems to suggest a decline in quality and flexibility. As a direct consequence, developers are beginning to doubt the trustworthiness of AI generated code.

This experience has informed how we approach evaluation through a social equity lens and the need for ongoing assessment and iteration within AI systems. Developers—your author included—are already out there building things with these new tools. As can be expected, there is an urgent demand for improvements that will address existing outcry and advance trial performance.