Global research team creates new exam to test the limits of artificial intelligence
Humanity’s Last Exam gives the sector a new way to measure advanced AI systems after long-used benchmarks such as MMLU became less useful at the top end.
A global team of researchers has created a new benchmark for artificial intelligence after older academic-style tests became less effective at separating the strongest models.
The benchmark, Humanity’s Last Exam, comprises 2,500 expert-level questions spanning mathematics, the humanities, the natural sciences, and specialized fields, and is designed to test capabilities that current systems still struggle to handle. The work was published in Nature, with contributors from institutions around the world, including Texas A&M University.
Benchmarks such as Massive Multitask Language Understanding, commonly known as MMLU, once acted as a meaningful test of model capability. However, the strongest AI systems now perform so well on them that they reveal less about the current limits of artificial intelligence. Humanity’s Last Exam was created as a harder alternative: candidate questions were screened against leading models, and items those models answered correctly were removed from the final set.
A benchmark built for the post-MMLU era
The benchmark is designed to cover a wide range of disciplines and specialist areas. Questions were written and reviewed by experts from multiple fields, with the goal of creating problems that require depth of knowledge rather than simple retrieval of information.
The intention was not to create a consumer-facing test but a research tool that can help evaluate advanced systems at a time when many established AI benchmarks are approaching saturation.
Tung Nguyen, Instructional Associate Professor in the Department of Computer Science and Engineering at Texas A&M University, says: “When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding.”
He adds: “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”
Current models still struggle on the exam
Early results suggest the benchmark is significantly more difficult for current AI systems than previous tests. Some earlier models scored only a few percent on the exam, while more recent systems have improved but still fall well short of expert-level human performance across many specialist areas.
Nguyen contributed 73 of the benchmark’s 2,500 public questions, the largest number in mathematics and computer science among the project’s contributors. He says: “Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do.” He adds: “Benchmarks provide the foundation for measuring progress and identifying risks.”
For educators and researchers, the benchmark highlights a growing challenge in evaluating artificial intelligence. As systems improve rapidly, older assessments can stop providing useful signals about capability.
The project also points to a broader issue in discussions about AI and education: strong performance on familiar tests does not necessarily reflect deeper understanding of complex subjects.
Nguyen continues: “This isn’t a race against AI.” He adds: “It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters.”
Nguyen also points to the scale of the collaboration behind the benchmark: “What made this project extraordinary was the scale.” He continues: “Experts from nearly every discipline contributed. It wasn’t just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today’s AI systems — perhaps ironically, it’s humans working together.”
ETIH Innovation Awards 2026
The ETIH Innovation Awards 2026 are now open and recognize education technology organizations delivering measurable impact across K–12, higher education, and lifelong learning. The awards are open to entries from the UK, the Americas, and internationally, with submissions assessed on evidence of outcomes and real-world application.