The Evolving Landscape of AI Coding Assistants

By Tina Reynolds

In recent months, the performance of AI coding assistants has become a hot-button topic among developers and tech enthusiasts alike. While recent scorecards reveal progress, specifically for GPT-4 and its successor, GPT-5, familiar hang-ups remain, some of which have improved and some of which have gotten worse. These issues limit the tools' usability and value for teaching and learning programming. GPT-4 performed impressively on coding challenges across the board, while GPT-5's advances in critical thinking, though welcome, present their own set of challenges.

During our testing, GPT-4 gave helpful answers on nine of ten runs, but it struggled when prompted to produce code exclusively, contradicting that instruction in three cases. In those instances, GPT-4 flagged possible dataset problems by recommending that a particular column be added before proceeding. GPT-5, for its part, proved the more successful problem-solver overall, succeeding on every test case it faced.

Performance Analysis of GPT-4

In our tests, GPT-4 showed admirable consistency, providing helpful answers on nine out of ten tries. From the perspective of someone who teaches programming, it exhibited an impressive grasp of coding concepts and included rationales for its generated outputs. Notably, in nine of the ten test cases it printed the list of columns in the dataframe, helping users understand the context of its suggestions.
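As a minimal sketch of that habit (the dataframe contents and column names here are hypothetical, not taken from the model's actual output), a response that surfaces the available columns before operating on them might look like this:

```python
import pandas as pd

# Hypothetical dataset standing in for whatever the user had loaded.
df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [91, 88]})

# Surface the available columns before operating on them,
# as GPT-4 did in nine of the ten test cases.
print("Columns in dataframe:", list(df.columns))
```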

Yet GPT-4's eagerness to patch any possible gaps in the dataset led it to make changes outside the intended output structure. In three scenarios, the prompt told it to respond with code only. Instead, it called attention to the fact that an essential column might be absent. We understand that this approach will disappoint users looking for a straightforward answer.

To be fair, GPT-4 goes out of its way to instruct the user to check whether certain columns exist, and that kind of guidance helps users troubleshoot data-related issues more efficiently. Even so, its occasional disregard for simple, explicit instructions makes its responsiveness to user needs questionable.
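A hedged sketch of the kind of defensive check GPT-4 tended to suggest (the column name student_id and the sample data are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [91, 88]})

# Check that a required column exists before using it, in the spirit
# of GPT-4's guidance. 'student_id' is a hypothetical column name.
required = "student_id"
if required not in df.columns:
    raise KeyError(
        f"Expected column '{required}' not found; available: {list(df.columns)}"
    )
```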

Advancements with GPT-5

In sharp contrast to its predecessor, GPT-5 proved surprisingly good at consistently producing effective solutions. Notably, it employed a simple yet effective method: taking the actual index of each row and adding 1 to it to create a new column. This direct approach resolved the issue and reflected a more advanced grasp of coding logic.
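In pandas terms (the dataframe and the column name row_number are illustrative, not taken from the model's actual output), that approach amounts to:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Grace", "Linus"]})

# Add 1 to each row's existing index to create a 1-based row-number column.
# This assumes the default RangeIndex (0, 1, 2, ...).
df["row_number"] = df.index + 1

print(df)
#     name  row_number
# 0    Ada           1
# 1  Grace           2
# 2  Linus           3
```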

Though GPT-5 has leapt forward in overall capability, it is still imperfect. Early versions of the model were prone to syntax errors and incorrect logic, which made them less effective on code-oriented tasks. As the model matured, it began producing more precise output, and its reliability improved across the board.

The productivity gain from AI-assisted coding is still an open question for many developers. Tasks that used to take a couple of hours have recently stretched to three, four, five, sometimes even seven or eight hours or more. This shift toward AI-driven programming raises a quality-versus-efficiency conundrum: what is sacrificed in the push to generate code quickly?

Decline in Code Quality

AI experts have documented that the quality of code produced by LLMs hit a ceiling in 2025, and with the increase in vulnerabilities in recent years there has been a profound loss of that quality. Users have experienced scenarios where their code runs but returns completely random or gibberish output, one of the worst possible outcomes for developers who depend on these tools to produce functional code.

Previous models such as Claude showed a pattern of refusing to answer unanswerable questions. Newer models have exhibited mixed results, at times addressing complaints but other times avoiding them altogether. This inconsistency is particularly concerning because it suggests that AI coding assistants are still evolving and developing.