In recent months, the role of AI coding assistants has come under scrutiny, particularly as their performance appears to fluctuate. High-profile figures, including the CEO of Carrington Labs, have embraced these tools, especially ChatGPT, for code generation. A closer look, however, tells a more complicated story: what works better in the newer GPT-5 is something of a mixed bag.
The author ran ten trials each on GPT-4 and GPT-5, judging how the models behaved when presented with an example error message. The results point to a clear disparity in reliability. GPT-4 gave a good response in all ten attempts. GPT-5 also produced a solution each time, and occasionally offered unexpected insights, but its behavior was less consistent.
Performance Comparison of AI Models
The tests covered nine pre-release builds of ChatGPT, including multiple GPT-4 variants as well as GPT-5, with the goal of seeing how each would respond to a typical coding challenge. GPT-4 performed excellently, providing relevant and helpful answers every time. That consistency has helped make it a go-to tool for developers looking to simplify their code-writing tasks.
GPT-5's responses were more revealing. Although it arrived at the right answer every time, it sometimes deviated from the original request. In follow-up prompts, for example, GPT-5 ignored instructions to return only the code in three out of four cases, adding extra contextual explanation instead. This behavior is a double-edged sword: it demonstrates the model's capacity to offer broader justification, but it can frustrate developers who want nothing more than a code snippet.
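The "code only" failure mode is easy to check mechanically. The sketch below is hypothetical — the article does not describe the author's tooling — but it shows one way to flag a response that includes prose outside a single fenced code block:

```python
import re

# Build the triple-backtick fence marker dynamically so the snippet itself
# contains no literal fence characters.
FENCE = "`" * 3

def is_code_only(response: str) -> bool:
    """Return True if the response is nothing but one fenced code block."""
    # One fence, an optional language tag, the code body, a closing fence,
    # and nothing else but surrounding whitespace.
    pattern = rf"^\s*{FENCE}[\w+-]*\n.*?\n{FENCE}\s*$"
    return re.fullmatch(pattern, response, flags=re.DOTALL) is not None

# Two illustrative responses: one compliant, one with extra chatter.
compliant = f"{FENCE}python\nprint('fixed')\n{FENCE}"
chatty = f"Here is what went wrong:\n{compliant}\nLet me know!"
```

Under a check like this, three of the four follow-up responses described above would be flagged as non-compliant.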
Interestingly, in one trial GPT-5 recognized that a column was probably missing from the dataset. That observation reflects reasoning that goes beyond first-tier coding help, and insights of this kind are valuable for users who may not realize there is a deeper problem with their data.
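The missing-column case maps onto a very common class of runtime error. The sketch below is purely illustrative — the article does not share the actual dataset or column names — but it shows the kind of lookup failure a model might correctly diagnose as a data problem rather than a code problem:

```python
# Hypothetical dataset, stored as a plain dict of columns.
dataset = {
    "user_id": [1, 2, 3],
    "score": [0.4, 0.9, 0.7],
}

def mean_of_column(data: dict, column: str) -> float:
    """Average a column, failing with a clear message if it is absent."""
    if column not in data:
        # A bare KeyError would be cryptic; naming the available columns
        # points the user at the real issue: the data, not the code.
        raise KeyError(f"column {column!r} not found; available: {sorted(data)}")
    values = data[column]
    return sum(values) / len(values)
```

Calling `mean_of_column(dataset, "age")` here raises a KeyError listing the columns that do exist, which is essentially the diagnosis GPT-5 offered unprompted.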
The Decline in AI Model Quality
Even as models such as GPT-5 demonstrate remarkable progress, AI coding assistants are making many observers anxious about where the technology is headed. As 2025 unfolded, a number of foundation models seemed to plateau in quality before eventually regressing. This decline raises important questions about the sustainability of improvements in AI technology.
Earlier models such as Claude were more easily stumped by hard tasks. In some cases they provided all of the required information, but their responses were generic and uninformative. In recent iterations, some models have made genuine advances in problem-solving ability, while others have learned to dodge problems, kicking the can down the road rather than addressing the need directly.
Few things are more satisfying for a developer than code that just works, and few are more painful than code that fails in random, unpredictable ways. This case study illustrates the fine line between output that genuinely solves a problem and output that merely appears to. It also underscores the risk in our increasing dependence on AI coding assistants.
User Trials and Classifications
The author scored the outputs of the ten trials run against each model, classifying every response as productive, unproductive, or detrimental based on how useful and successful it was at clearing the error message. This approach gives a clearer picture of how well these models are actually working.
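A simple tally over the trial labels makes this kind of comparison concrete. The per-trial labels below are invented for illustration — the article reports GPT-4 as consistently helpful and GPT-5 as more variable, but publishes no per-trial data — while the three categories match the ones described above:

```python
from collections import Counter

# Illustrative labels only; not the author's actual trial data.
trials = {
    "GPT-4": ["productive"] * 10,
    "GPT-5": ["productive"] * 7 + ["unproductive"] * 2 + ["detrimental"],
}

def summarize(labels: list[str]) -> dict[str, int]:
    """Count how many trials fell into each category."""
    return dict(Counter(labels))

for model, labels in trials.items():
    print(model, summarize(labels))
```

Even this crude summary surfaces the pattern the author describes: a uniform column for one model and a spread across categories for the other.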
The results indicated that GPT-4's outputs were consistently helpful across the trials, while GPT-5's responses varied significantly. GPT-5 always offered a feasible solution, but its habit of deviating from clearly stated instructions often produced unwanted results. This inconsistency is an important factor for users to keep in mind when choosing an AI coding assistant.
As developers continue to bring these tools into their workflows, they need to weigh the promise of greater capability against the risk of unpredictable output. The future of AI coding assistants is territory that warrants continued exploration and adjustment.

