Decline in AI Coding Assistants Raises Concerns Among Tech Leaders

As a founder and CEO of Carrington Labs, Jamie Twiss has been using large language model (LLM)-generated code to an extreme degree in his duties. In recent months, he’s watched what he considers a deeply unsettling trend play out with AI coding assistants. Twiss started to really understand that better. To do that, he conducted…

Tina Reynolds Avatar

By

Decline in AI Coding Assistants Raises Concerns Among Tech Leaders

As a founder and CEO of Carrington Labs, Jamie Twiss has been using large language model (LLM)-generated code to an extreme degree in his duties. In recent months, he’s watched what he considers a deeply unsettling trend play out with AI coding assistants. Twiss started to really understand that better. To do that, he conducted a controlled experiment across various iterations of AI models, with a specific focus on the ChatGPT iterations.

Twiss’s test was intended to confirm or refute his anecdotal observations about how well AI coding assistants perform. With the help of ChatGPT Plus user James Gumm, he sent this error message to nine different versions of ChatGPT, mostly GPT-4 / GPT-5 variants. His findings paint a dark picture around the continued effectiveness of these tools. All of this begs the question of how reliable they are in professional environments.

A Systematic Approach to Testing

Twiss’s more systematic Turing Test consisted of sending the same error message to multiple instances of ChatGPT. This process evened the playing field and enabled him to assess what each model did best based on the same set of predetermined conditions. In doing so, he aimed to determine if the coding assistants’ quality really had started to decline after a golden age.

The focus was mainly on GPT-4 and GPT-5 specifically—arguably two of the most-prominent, most-used models in industry. In all Twiss ran the test ten times for each build. The outcomes uncovered important disparities between top performance and the need that deserve focused conversation.

For GPT-4, Twiss found it rarely failed to give a helpful answer. From ten different tries, not one answer failed to be helpful and on-point. This level of reliability is paramount for the industry professionals that would use AI coding assistants to increase their productivity. The model even showed a good knowledge of common pitfalls implied by the error message and produced cohesive troubleshooting steps.

GPT-5’s performance was less satisfactory. While it did find a solution for each example, Twiss found that all solutions were false. Perhaps one of the most spectacular mistakes resulted when a new random-number filled column was initialized. This occurred because we incorrectly appended 1 to the unrealized index of every row. This unfortunate misstep reveals potential gaps in GPT-5’s logic. That poses a serious question about the model’s applicability to practical, real-world use cases.

Differences in Model Responses

The differences between GPT-4 and GPT-5 go much deeper than just correct versus incorrect answers. As an example, the previous Claudes had major failings on problems that were unsolvable. Instead they responded with just as vague or totally unhelpful answers, basically shrugging on the matter. For newer models, though, that was never an option—instead they would just provide the wrong solution or would completely skip addressing the issue altogether.

Twiss ran GPT-4 in nine out of ten scenarios. The model was able to accurately render the available list of columns in the dataframe on each pass. Most impressively, it returned useful relevant code snippets. Even more valuable, though, were the smart suggestions telling Twiss to look for the existence of a metadata/notes column in his dataset. This high degree of engagement shows a level of awareness of context that is foundational for good coding practices.

Here’s the interesting part, though—there were three cases where GPT-4 went slightly off the rails and failed to follow Twiss’s instruction to only return code. Rather, it returned messages of clarifications as to why the column could not be found in the dataset. While this could be seen as unnecessary verbosity, it underscores a characteristic of AI coding assistants: they are designed to offer context in addition to solutions.

The Quality Plateau

The overall findings from Twiss’s test indicate a troubling trend: AI coding assistants appear to have reached a quality plateau and may even be declining in effectiveness as 2025 approaches. Unlike prior models such as GPT-4 that proved to be highly reliable. The latest versions such as GPT-5 have a hard time with accuracy and logical reasoning.

This continuing decline presents a tremendous challenge for the engineers, data scientists, and others who use these tools to do important coding work. With the rapid development of AI technology, upholding the highest standards of performance is more important than ever. The inconsistency noted by Twiss may result in even more user frustration as users come to expect more dependable assistance from AI.

As entities adopt generative AI to make their processes more efficient, grasping these limitations will be critical. Voices like Jamie Twiss are on the forefront of changing that narrative and spark leaders. Even the authors of the study recommend more robust capabilities in coding assistants to better align with industry needs.