AI Coding Assistants Show Declining Performance According to CEO

Jamie Twiss, Carrington Labs’ CEO, called the implications “grave”. He personally (and understandably) has deep misgivings about the deteriorating performance of AI coding assistants, particularly over the past few months. Carrington Labs focus on delivering predictive-analytics-based risk models to lenders, which are very data and code driven. Twiss ran a detailed pilot test to assess…

Tina Reynolds Avatar

By

AI Coding Assistants Show Declining Performance According to CEO

Jamie Twiss, Carrington Labs’ CEO, called the implications “grave”. He personally (and understandably) has deep misgivings about the deteriorating performance of AI coding assistants, particularly over the past few months. Carrington Labs focus on delivering predictive-analytics-based risk models to lenders, which are very data and code driven. Twiss ran a detailed pilot test to assess the learning potential of different generative AI models. What his observations uncover is a deeply disturbing trend that just might threaten the overall efficiency of software development.

Twiss ran a Turing-like test by sending the same error message to nine different versions of ChatGPT. These were largely different flavors of the GPT-4 and GPT-5 models. What his findings show Specifically, his findings indicate that great machine learning models persist while others just can’t do the basics of coding. This drop in performance casts doubt on the overall reliability of AI coding assistants as useful tools for development.

Methodology of the Test

To test what AI assistants have to offer coding before incorporating them into teams, Twiss adapted a simple but organized method. He wanted to know if the overall American decline in performance was a real trend, or just a lot of anecdotal evidence. The test consisted of systematically prompting the different models in test with an error message.

Twiss concentrated his efforts on the newer GPT-4 and GPT-5 models. He executed the tests ten times for each model, testing their responses against the resulting error messages. Taking such a systematic approach gave us a comprehensive understanding of how various versions of AI performed on coding tasks.

The results from Twiss’s test were revealing. GPT-4 reliably gave relevant advice, resolving the error message on each occasion without fail. By comparison, GPT-5 produced answers that were functional, but incorrect. For example, one fix only produced a new column filled with gibberish numbers instead of addressing the root cause.

Performance Breakdown

The failure of Twiss’s test revealed significant gaps in performance for GPT-4 vs. GPT-5. Though GPT-4 still had a flawless success rate in correcting the error messages, GPT-5’s performance was less consistent. While it produced solutions in every case, many were too weak, introducing even more complexity and burden.

In six cases, though, GPT-5 tried to run the code. It rolled in an exception that either raised an exception or populated the new column with the exception message. In three others, it simply did not answer at all, disregarding Twiss’s exact instruction to only return code. Instead, it recommended that users look and see if the column exists in the downloaded dataset. Failure to provide such responses shows a fundamental misunderstanding of or inability to execute following directives with specificity.

In each of nine out of ten test cases, GPT-5 without prompting printed a list of the columns in the dataframe. It did have something to say about testing for the existence of those columns. This disposition seems to portend a future where newer models only offer half measures, or simply don’t take on challenges head on.

Insights on Model Evolution

AI performance is at a high-water mark, and overall performance is rapidly declining among various models. This trend has huge implications for the many similar versions of coding assistants. He noted that previous models, such as Claude, have a greater inclination to skip over infeasible questions. By comparison, more recent models sometimes present incorrect answers or attempt to skirt real problems altogether.

What comes after the performance plateau core models are expected to reach by the end of 2025 has many developers and industry leaders worried. As these models improve, they don’t all appear to be keeping up with basic coding duties that form the foundation of software development. Such a decline could prevent many productivity increases and decrease faith in AI-aided coding methods.

A short time ago, AI coding assistants were considered miraculous tools that would increase efficiency and accuracy in coding tasks. Twiss’s recent discoveries indicate an important place for additional innovative development within these systems. Those very same systems, if we don’t continually improve them, can in fact make them dangerous.