Law professors prefer AI answers in 75% of blind comparisons, Stanford study finds

19 Jun

Sixteen professors across 14 US law schools rated AI responses more highly than faculty-written answers and flagged them as harmful less often in a study of first-year contracts tutoring.

***A Stanford-led study found that law professors preferred AI-generated answers in about 75% of blind comparisons with responses written by fellow instructors.***

A Stanford University-led study has found that law professors preferred answers generated by large language models to responses written by fellow instructors in a blinded test of short-answer tutoring for first-year contracts courses.

The 16 participating professors, representing 14 US law schools, completed 2,918 anonymized comparisons between answers written by instructors and responses produced by Google Gemini 2.5 Pro and NotebookLM. Across the human-versus-AI comparisons, the models recorded an average win rate of 75.33%.

Researchers asked professors to choose the answer they would rather give to a student during office hours or after class. The questions covered case and code recall, legal doctrine, hypothetical scenarios and policy issues, including questions without a single clearly correct answer.

The professors also assessed whether individual responses could hinder student learning. AI-generated answers were flagged as harmful in 3.53% of cases, compared with an average of 12.06% for answers written by participating instructors.

The study evaluates the quality of short answers rather than whether students learn more when using an AI tutor. Its authors say further classroom trials are needed before the findings can support decisions about deploying AI systems across legal education.

AI answers lead across every question category

The researchers selected 40 questions from a larger pool created by participating contracts professors. Each professor then answered questions they had not written, while Gemini 2.5 Pro and NotebookLM produced responses to the same material.

The AI answers were calibrated to broadly match the length of faculty responses. Before evaluation, the answers were anonymized and lightly standardized to reduce clues about whether they had been produced by a person or an AI system.

Gemini 2.5 Pro achieved a 75.92% average win rate against instructors, while NotebookLM recorded 74.75%. The advantage remained across all four question categories, including hypothetical and policy questions that required professors to weigh competing arguments rather than identify one factual answer.

The models also performed at a similar level to the strongest instructor in the study, although the exact ranking varied depending on the statistical method used. NotebookLM outperformed every individual instructor in raw comparisons, with one tie, while a Bradley-Terry ranking placed Gemini first and the strongest human instructor second.
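To illustrate why the two ranking methods can disagree, here is a minimal Python sketch of both: a raw win rate, and a Bradley-Terry fit using the standard iterative (minorization-maximization) update. It is not the authors' code. The function names, the `(winner, loser)` input format and the sample data are assumptions made for illustration, and ties are ignored for simplicity.

```python
from collections import defaultdict


def raw_win_rates(comparisons):
    """Share of comparisons each source won. `comparisons` is a list of
    (winner, loser) pairs; this input format is an assumption."""
    wins, games = defaultdict(int), defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {name: wins[name] / games[name] for name in games}


def bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths, where the modelled probability
    that i beats j is strength[i] / (strength[i] + strength[j])."""
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    players = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 for p in players}
    for _ in range(iterations):
        updated = {}
        for p in players:
            # Sum over opponents: comparisons played / combined strength.
            denom = sum(
                pair_counts.get(frozenset((p, q)), 0) / (strength[p] + strength[q])
                for q in players
                if q != p
            )
            updated[p] = wins[p] / denom if denom > 0 else strength[p]
        # Rescale so the strengths sum to the number of players.
        total = sum(updated.values())
        strength = {p: s * len(players) / total for p, s in updated.items()}
    return strength


# Hypothetical data: source "A" is preferred over source "B" in 3 of 4 comparisons.
sample = [("A", "B")] * 3 + [("B", "A")]
print(raw_win_rates(sample))
print(bradley_terry(sample))
```

A raw win rate treats every comparison alike, whereas a Bradley-Terry fit weighs each win by the estimated strength of the opponent, so the two orderings can differ when sources did not all face the same mix of opponents.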

Julian Nyarko, Stanford Law School professor and co-author of the paper, says: "We were frankly surprised by the magnitude of the results. These weren't just simple questions with obvious answers. Many of them required synthesizing complex material, applying it to new situations, and explaining legal concepts in ways that would help students develop their own analytical skills."

Faculty-written answers showed a wider spread in harmfulness ratings, ranging from 1% to 39.75% across individual instructors. Gemini was flagged as harmful in 3.41% of cases and NotebookLM in 3.64%.

The researchers found that answer length and other writing characteristics explained only part of the preference for AI responses. Even after accounting for features including structure, clarity, confidence, legal references and pedagogical support, the models won more often than those features predicted.

Study focuses on judgment rather than one correct answer

Most previous assessments of AI tutoring have concentrated on subjects where responses can be measured against a fixed answer. The Stanford-led study instead examined whether AI-generated legal explanations aligned with the professional standards professors use when evaluating arguments involving uncertainty and competing interpretations.

Sarath Sanga, co-author and professor at Yale Law School, says: "In most fields where AI gets tested, there's a right answer. In law, there often isn't. Two opposing arguments can both be good. What we wanted to know is whether AI can meet the latent professional standard that lawyers use to evaluate each other's arguments. In this case, the answer was yes."

The professors showed greater agreement than would be expected if their decisions were based only on individual preferences, according to the paper. Agreement was highest on policy questions, followed by questions involving the recall of cases or legal codes.

The human evaluation covered two systems: a standard version of Gemini 2.5 Pro, and NotebookLM, which had access to the contracts casebook used by the participating professors. Despite that additional grounding, the study found no meaningful advantage for NotebookLM on questions whose answers were contained in the casebook.

The researchers suggest several possible explanations, including the strength of knowledge already present in the base model and the risk that providing a large volume of retrieved material can introduce irrelevant context. They stress that the study did not test those explanations directly.

A separate analysis used Llama-4 Maverick as an automated judge to compare additional models, after first checking its decisions against those of the participating professors. Every AI system included in that extended ranking placed above the aggregated human instructors, but the authors acknowledge that AI-based judging can be affected by position, verbosity and model-family bias.
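As a rough illustration of position bias, one common mitigation is to ask the judge twice with the answer order swapped and to count a preference only when both verdicts agree. The Python sketch below shows that idea only. The article does not describe how the study handled the issue, and the `judge` callable and both stub judges are hypothetical stand-ins rather than a real model API.

```python
def debiased_preference(judge, question, answer_a, answer_b):
    """Query a pairwise judge in both orders. `judge(question, first, second)`
    is a hypothetical callable returning 1 if it prefers the first answer
    shown and 2 if it prefers the second."""
    verdict_ab = judge(question, answer_a, answer_b)
    verdict_ba = judge(question, answer_b, answer_a)
    if verdict_ab == 1 and verdict_ba == 2:
        return "a"
    if verdict_ab == 2 and verdict_ba == 1:
        return "b"
    # The verdict flipped with the ordering, so treat it as no preference.
    return "tie"


def always_first(question, first, second):
    """Stub judge with pure position bias: always picks the first answer."""
    return 1


def longer_wins(question, first, second):
    """Stub judge with pure verbosity bias: picks the longer answer."""
    return 1 if len(first) >= len(second) else 2


print(debiased_preference(always_first, "q", "short", "a much longer answer"))
print(debiased_preference(longer_wins, "q", "short", "a much longer answer"))
```

Swapping the order neutralizes the position-biased stub, which ends in a tie, but not the verbosity-biased one, which still picks the longer answer in both orders. That is one reason length calibration and checks against human raters remain relevant.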

Researchers caution against wholesale adoption

The study involved a small and self-selecting group. Sixteen of the 60 professors invited to participate completed the research, and the final sample included a higher proportion of professors from top-14 US law schools than the wider pool.

Participating professors were also more likely to be tenured, while women accounted for 25% of participants compared with 33% of the full group invited. The authors say the shared professional standard identified in the results may therefore reflect the particular instructors who took part rather than legal educators more broadly.

The research was limited to brief, written responses in first-year contracts courses. It did not measure longer tutoring conversations, students' ability to retain information, academic performance or whether regular access to AI changes critical thinking.

Nyarko says: "Our study evaluates the quality of answers given by AI tools. But how to implement these tools to most effectively improve student learning is still an open question. So we're not advocating for wholesale adoption of AI tutors. But our data suggests that blanket skepticism may be equally unwarranted. The conversation should shift from whether AI can give accurate, high quality responses to how we can deploy it responsibly to the benefit of our students."

The authors propose course-based randomized controlled trials as a next step. They also recommend that any legal education system using AI should include clear limits, citations to course materials, refusal mechanisms for uncertain questions and routes for escalating complex cases to instructors.

The paper reports that no external funding was received and that its authors declared no competing interests. Its supporting data and code are expected to be placed in a public repository when the research is published.
