The Evolution of AI Coding Assistants Reveals Mixed Performance

By Tina Reynolds

Recent benchmarks of AI coding assistants reveal one stark trend: large differences in performance between models. This analysis compares GPT-4 and GPT-5, two iterations of OpenAI's large language models, focusing on their effectiveness in supporting coding tasks. Our initial testing confirmed that despite a high useful-response rate, GPT-4 occasionally struggled to follow explicit instructions. GPT-5, by contrast, reliably provided strong solutions in every test, a significant advance in coding assistance.

The tests were performed under the direction of the CEO of Carrington Labs, using successive versions of ChatGPT to assess how effectively each iteration handled coding tasks. The tasks in the project's scope typically took 5-10 hours to complete without AI assistance; with these models, the time fell to roughly 7-8 hours. This is a practical example of how AI can be applied to automate parts of coding work.

Performance Comparison: GPT-4 Versus GPT-5

In our testing, GPT-4 returned a helpful response on the first run every time. However, it ignored explicit instructions in three cases where the prompt asked for finished code only: rather than returning just the code, GPT-4 supplemented it with explanations, suggesting a fundamental misunderstanding of what was required.

GPT-4 also flagged possible problems in the data set three times, pointing out that a specific key column was probably missing. This shows it can be a powerful tool for first shining light on gaps in the information it is given. Unfortunately, it still failed to offer clear coding guidance when the data was incomplete.
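The article does not name the column GPT-4 flagged, so as an illustration only, a defensive check in the spirit of that warning might look like the following sketch; the column name customer_id and the helper require_columns are hypothetical, not from the benchmark:

```python
import pandas as pd

def require_columns(df: pd.DataFrame, required: list[str]) -> None:
    """Raise a clear error if any expected key column is missing."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"Data set is missing key column(s): {missing}")

# Hypothetical usage: 'customer_id' stands in for whatever key column
# the benchmark's data set was expected to contain.
df = pd.DataFrame({"name": ["Ann", "Ben"]})  # note: no 'customer_id'
require_columns(df, ["customer_id"])         # raises KeyError
```

Making the gap explicit up front, as GPT-4 did, is more useful than generating code that fails later on the absent column.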

GPT-5 showed a tremendous leap in capability, solving all ten test cases successfully. In one task, it produced a new column by numbering each row of the original index starting from 1, a simple and effective approach: GPT-5's solution worked every time it was tested.
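A minimal sketch of that kind of solution, assuming the task used a pandas DataFrame (the sample data and the column name row_number are illustrative, not taken from the benchmark):

```python
import pandas as pd

# Sample data standing in for the benchmark's data set.
df = pd.DataFrame({"value": ["a", "b", "c"]})

# Add a new column that numbers each row of the original index,
# starting from 1 rather than 0.
df["row_number"] = range(1, len(df) + 1)

print(df)
#   value  row_number
# 0     a           1
# 1     b           2
# 2     c           3
```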

The Reliability of AI Models Over Time

The path for AI coding assistants has been bumpy. Claude's predecessors, when faced with impossible tasks, essentially answered with "no helpful answers found," leaving users in the dark without useful output or redirection. This unreliability raised questions about the utility of these models in professional settings.

Neural-network-based language models, especially newer architectures like those in the GPT series, have had mixed success. While some queries were answered well, many were ignored or answered poorly. This inconsistency creates a frustrating experience for users looking for dependable coding help from these tools.

This gap in performance illustrates limitations that have recently surfaced among foundational AI models. Many appear to have hit a quality plateau around 2025, and research indicates performance has deteriorated in places since then. Carrington Labs' testing further illustrates this trend: progress has clearly been made, but more work remains.

Implications for Future Development

These findings highlight the need for continual evaluation and improvement of AI coding assistants. As developers refine these models, awareness of their limitations will be key to informing future versions.

The difference in problem-solving approach between GPT-4 and GPT-5 is striking. The next generation of models will succeed if they are built with a user-first approach that puts instruction-following at the forefront. This would improve the user experience and make teams more efficient by reducing the amount of revision and correction work needed.

AI technology is moving incredibly fast. Organizations such as Carrington Labs will be instrumental in giving the industry real-world feedback that drives continuous improvement. By using LLM-generated code extensively in their operations, they can offer insight into practical applications and highlight areas where AI assistants fall short.