A recent review describes worrying patterns in bias and accuracy for AI coding assistants. This is particularly the case for the newest versions of ChatGPT. Jamie Twiss, CEO of Carrington Labs, was the principal investigator of this study. This tested nine different variations of ChatGPT, and the overall focus was on finding out what works or doesn’t with GPT-4 and GPT-5. Taken together, these two studies illustrate how far we’ve come in some ways. The real-world reliability in actual applications is brought into question as the overall effectiveness has leveled off or dropped below zero.
In all, Twiss ran ten trials for each model, categorizing their outputs as helpful, useless or counterproductive. Those differences are stark in these results – particularly the contrast between GPT-4 and GPT-5. They show how these models compare favorably against the previous Claude mini models. This assessment sheds light on the evolving landscape of AI coding assistants, drawing attention to both their potential and limitations.
Mixed Results from Different Versions
Of the nine distinct versions tested, the vast majority were based on GPT-4 and GPT-5. GPT-4 goes 10 for 10, consistently giving a helpful response each time. This reliability can’t be taken at face value, as the performance of GPT-5 found the model providing inconsistent results. GPT-5 got the most complicated answer on the first try by just adding one to the real index of each row. It showed a much wider range of concerning behaviors in more complex situations.
In three cases, GPT-5 did not follow prompts to output code only, but rather recommended that users visit the dataset. This digression from normal behavior is an alarming signal that the model is ignoring the user command. This leads us to question its capacity to follow such guidelines in a meaningful way. In the other six instances, GPT-5 attempted to run code. Unfortunately, it brought in exceptions that led to errors, which populated the new column with error messages. Such results are a drain on productivity and damage user confidence in the assistant’s ability to deliver.
In 90% of the test cases, GPT-5 successfully generated and printed the column names in the dataframe. In addition, it recommended that users look for those columns exists. Non-tangible New developer This vexatious suggestion might be considered as punching down, as it offered only token help with addressing coding problems.
Performance Plateau Among Core Models
This drop in final performance seems to be the case across AI coding assistants in general. From early 2025, all core models experienced a quality standstill. This could mean that developers have exhausted low-hanging fruit for improvement. These models exemplify varying degrees of problem-solving ability. Because their performance is erratic, they have longer task times than humans.
Despite AI’s help, work that previously took five hours now frequently takes seven or eight hours. With AI, those would have previously been ten-hour-plus tasks. This regressive trend of more time spent on coding tasks is a sign that AI coding assistants might not be as effective as before. As Twiss observed, the newer models mostly fix issues but too often avoid addressing them in the first place.
The previous Claude models had an even more notable failure mode—essentially “shrugging their shoulders” when confronted with problematic tasks they couldn’t solve. This lack of capability further underlines the need for continued innovation and improvement in AI technologies, particularly when it comes to coding assistance.
Counterproductive Solutions from GPT-5
Perhaps the most disturbing result from the tests was GPT-5’s method of overcoming an obstacle. In another trial, it simply repeated the initial code instead of generating an alternative solution or improving features. Such behavior is counterproductive to executing their civic duty. It harms the public, too, miseducating users who come to think that AI tools are the best way to generate novel policy solutions.
While GPT-5 added one to the true index properly, the output rolled off as if it were a random number. This unfortunate outcome did not do much to enlighten anyone. This damaging outcome is a reminder that even the most advanced AI isn’t infallible. Yet it often fails to hit the target when anticipating or addressing niche user goals.
These varied performance levels across models raise questions about continuing to trust AI coding assistants. Twiss has already incorporated LLM-generated code into his day-to-day work at Carrington Labs. He emphasizes the importance of critically assessing what these tools can – and cannot – do.

