Over the past few months, an alarming trend has taken shape around AI coding assistants and their performance. Jamie Twiss, CEO of Carrington Labs, a company known for the predictive-analytics risk models it provides to lenders, has watched the tools grow less effective. Through repeated hands-on testing, Twiss has assembled anecdotal evidence of profound inconsistencies in the dependability of widely used AI models, particularly the Claude and ChatGPT series.
His latest round of testing ran ten trials per model and categorized each output as helpful, useless, or counterproductive. GPT-4 repeatedly produced helpful responses across all test scenarios. The newer Claude models, by comparison, underperformed, frequently failing to solve problems or giving unsatisfactory answers. These inconsistencies raise serious concerns about the limitations of today's AI coding assistants.
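The scoring scheme described, ten trials per model with each outcome labelled helpful, useless, or counterproductive, can be tallied with a few lines of Python. This is an illustrative sketch only: the trial data below mirrors the results reported in the article, not Twiss's raw records.

```python
from collections import Counter

# Illustrative trial data, mirroring the article's reported results:
# GPT-4 helpful on every trial; GPT-5 mostly non-answers.
trials = {
    "gpt-4": ["helpful"] * 10,
    "gpt-5": ["useless"] * 9 + ["helpful"],
}

# Tally each model's outcomes into helpful / useless / counterproductive counts.
for model, outcomes in trials.items():
    tally = Counter(outcomes)
    print(model, dict(tally))
```

A tally like this makes the headline numbers in the article (10/10 helpful versus 9/10 useless) easy to read off at a glance.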
Observations from Testing
Twiss’s methodology was simple but smart and thorough. He fed the same erroneous code to nine different versions of ChatGPT, from GPT-4 to GPT-5, asking each for a solution containing only finished code and no explanation. The results were striking.
In nine of ten test cases, all GPT-5 did was print out the columns of a dataframe, leaving the developer to check whether the expected columns still existed rather than fixing anything. This response points to a potential regression in problem-solving capability relative to previous models. GPT-4, on the other hand, produced a helpful answer on its first try every time, showcasing its consistency and efficiency.
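For readers unfamiliar with what such a non-answer looks like, here is a hypothetical reconstruction (the actual prompt and code from Twiss's tests are not public, and the dataframe below is invented for illustration). The "solution" amounts to listing the dataframe's columns so the developer can verify they still exist, while the underlying bug goes untouched:

```python
import pandas as pd

# Invented example data standing in for the real (unpublished) dataframe.
df = pd.DataFrame({"customer_id": [1, 2, 3], "risk_score": [0.2, 0.5, 0.9]})

# The kind of non-answer described: print the columns for the developer
# to inspect. The actual problem is left unsolved.
print(df.columns.tolist())
```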
GPT-5’s one successful case was a very different story: it found a path by simply taking the literal index of each row and adding one. That approach happened to work, but it underscored that, despite pockets of progress, the big-picture trend for many models is a decline in performance.
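The "literal index plus one" approach can be sketched in plain Python. The data and variable names here are hypothetical, since the original task was not published; the sketch assumes the goal was assigning a 1-based row number.

```python
# Hypothetical sketch of the blunt approach described: rather than a
# general solution, walk the literal positional index of each row and
# add one, here to assign a 1-based row number.
rows = [{"value": 10}, {"value": 20}, {"value": 30}]
numbered = [{**row, "row_number": i + 1} for i, row in enumerate(rows)]
print(numbered)
```

It works for this concrete input, which matches the article's point: the answer happened to be right without reflecting a general problem-solving method.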
The Claude Models’ Limitations
Twiss’s observations aren’t limited to the ChatGPT series of models; they extend to the Claude models as well. These models have shown a troubling pattern of “shrugging their shoulders” in the face of unsolvable issues. Rather than making a real attempt at an answer, they fall back on cop-outs or non-answers. This is highly problematic for developers who depend on AI tools to help them write code.
The more recent iterations of the Claude models show similarly mixed performance. Even when they appear to solve a problem, it is often because they haven’t actually fixed the issue or have done only the bare minimum to help. This kind of inconsistency across models can be very disruptive to productivity and to the overall experience of developers looking for trustworthy coding assistance.
Twiss’s findings highlight how important it is for users of these AI coding assistants to evaluate the tools’ limitations and the validity of their capabilities. As codebases grow in size and complexity, robust problem-solving tools will matter more than ever. Users will be less inclined to trust these systems if the unpredictable performance of the Claude models makes them second-guess the outcomes.
Implications for Users
As CEO of Carrington Labs, Twiss uses AI-generated code to develop new products and improve operations. His experience exemplifies a growing concern within the tech industry about the declining effectiveness of AI coding assistants. Developers and organizations that depend on such tools must remain vigilant about their limitations and the potential impact on their workflows.
These inconsistencies are common even with newer models such as GPT-5 and Claude. If users are not aware of these pitfalls, the result can be longer development times and more errors. Proponents of AI must accept that while it is a powerful tool, it is not a magic bullet.
Additionally, the results encourage discussion of what lies ahead for AI coding utilities. If current trends continue, dependence on these tools will require additional training or human intervention to produce quality outputs.

