Researchers at Stanford University introduce a cost-effective and efficient way to evaluate AI language models
A new paper from Stanford researchers, published at the International Conference on Machine Learning, presents a cost-effective and efficient way to evaluate AI language models.

As growing numbers of new AI language models are launched each year, it can be difficult - and costly - to demonstrate the benefits of a new model.
“This evaluation process can often cost as much or more than the training itself,” explains co-author Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Lab (SAIL).
“We’ve built an infrastructure that allows us to adaptively select subsets of questions based on difficulty. It levels the playing field.”
“The key observation we make is that you must also account for how hard the questions are,” adds Sanmi Koyejo, an Assistant Professor of Computer Science at the School of Engineering, who led the research.
“Some models may do better or worse just by luck of the draw. We’re trying to anticipate that and adjust for it to make fairer comparisons.”
Koyejo, Truong, and colleagues borrowed a concept from education known as Item Response Theory, which Koyejo compares to standardized tests such as the SAT, where questions have differing levels of difficulty.
The team analyzes questions and scores the answers provided by AI language models, using generative AI to create questions tailored to different levels of difficulty. The researchers claim this approach has reduced the cost of testing by 50 to 80 percent.
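The underlying idea can be sketched in a few lines. The example below is a minimal illustration, not the authors' code: it assumes a standard two-parameter Item Response Theory model with made-up item parameters, and shows how the next question could be chosen adaptively as the one that is most informative about a model's current estimated ability.

```python
import numpy as np

# Two-parameter IRT model: the probability that a model with ability `theta`
# answers an item with discrimination `a` and difficulty `b` correctly
# follows a logistic curve.
def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Fisher information of an item at the current ability estimate; higher
# values mean the item tells us more about this model's ability.
def fisher_information(theta, a, b):
    p = p_correct(theta, a, b)
    return (a ** 2) * p * (1.0 - p)

# Adaptive selection: among unanswered items, pick the most informative one.
def next_item(theta_hat, items, answered):
    candidates = [i for i in range(len(items)) if i not in answered]
    return max(candidates, key=lambda i: fisher_information(theta_hat, *items[i]))

# Toy usage: five questions as (discrimination, difficulty) pairs - hypothetical values.
items = [(1.0, -2.0), (1.2, -1.0), (0.9, 0.0), (1.1, 1.0), (1.3, 2.0)]
theta_hat = 0.0          # current ability estimate for the model under test
answered = set()
choice = next_item(theta_hat, items, answered)
print(f"Ask item {choice} next (difficulty {items[choice][1]})")
```

In a full system, item difficulties and discriminations would be estimated from scored responses, and the ability estimate would be updated after each answer, which is what lets a small, well-chosen subset of questions stand in for an entire benchmark.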
Koyejo has tested the system against 22 datasets and 172 language models and found that it adapts easily to both new models and new questions. The researchers say the findings will allow for better diagnostics and more accurate performance evaluations of AI language models.