OpenAI Unveils New Benchmark Showcasing AI Models Against Human Professionals

By Lisa Wong

Maxwell Zeff, a senior reporter at TechCrunch covering AI, spent a week in San Francisco reporting on OpenAI’s latest step in AI evaluation. On Thursday, the company released its latest benchmark. The tool will serve as a baseline for measuring how the AI models OpenAI trains perform against human professionals across various industries. The evaluation is intended as a first step toward a better understanding of how AI can be used responsibly and effectively in the workforce.

The benchmark, named GDPval, evaluates AI capabilities in nine sectors. The sectors were chosen because they are among the largest contributors to U.S. gross domestic product, and they include healthcare, finance, manufacturing, and government. Tejal Patwardhan, head of OpenAI’s evaluations team, was positive about the results seen so far with GDPval.

“Because the model is getting good at some of these things,” – Dr. Aaron Chatterji

Earlier OpenAI models fell well short of human-level performance on these tasks. GPT-4o, the company’s most powerful model when it premiered roughly 15 months ago, won or tied against human professionals only 13.7% of the time. The new benchmark aims to show whether advances in AI models, including competitors such as Anthropic’s Claude, translate into real-world work.

OpenAI attributes Claude’s higher scores to its tendency to produce aesthetically pleasing graphics rather than to stronger underlying performance. Silicon Valley tech companies use GDPval and other benchmarks to measure progress in AI development. Other widely used assessments include AIME 2025, which tests competition-level math, and GPQA Diamond, which tests PhD-level science questions.

One prompt from the benchmark asked investment bankers to map the competitive landscape of the rapidly evolving last-mile delivery industry. Their reports were then compared with AI-generated reports on the same task. This kind of task shows where AI capabilities meet the realities of professional work.

As AI capabilities grow, it is up to professionals to use these tools to boost their productivity, Dr. Chatterji said.

“People in those jobs can now use the model, increasingly as capabilities get better, to offload some of their work and do potentially higher value things,” – Dr. Aaron Chatterji

Maxwell Zeff has long reported on how new technologies shape society. He has covered the rise of artificial intelligence and the collapse of Silicon Valley Bank. Before joining TechCrunch, he wrote for Gizmodo, Bloomberg, and MSNBC, where he built a reputation as a knowledgeable voice in the field.

When not reporting, Zeff can be found hiking, biking, and enjoying the Bay Area’s rich culinary scene. His varied interests inform his sensitivity to the ways technology reshapes everyday life.

TechCrunch is hosting a large gathering of entrepreneurs in San Francisco from October 27-29, 2025. The event is expected to reflect the latest developments in technology and AI, a field marked by rapid change over the past few years.