OpenAI puts AI research capabilities to the test with LifeSciBench


22 Jun

The benchmark uses 750 expert-authored tasks to assess scientific reasoning, data interpretation and research decisions across life science workflows.

***LifeSciBench tests how AI systems perform across applied life science research tasks, including analysis, experimental design and evidence handling.***

OpenAI has introduced LifeSciBench, a benchmark designed to test whether AI systems can handle the research tasks used in drug discovery and the life sciences, rather than only answering structured biology questions.

LifeSciBench contains 750 tasks spanning seven research workflows and seven biological domains. The tasks were created by 173 scientists with Ph.D.-level training and experience in biotechnology or pharmaceutical research.

The benchmark assesses models through free-response tasks involving scientific evidence, experimental design, analysis, validation, translation and communication. It includes 1,062 supporting artifacts such as figures, PDFs, tables, genomic sequences, molecular structures, chemical files and web references.

OpenAI’s published results show that GPT-Rosalind achieved an overall exact pass rate of 36.1 percent, compared with 25.7 percent for GPT-5.5. Performance remained lower on tasks involving scientific design, analysis, exact calculations and information contained in external artifacts.

OpenAI says LifeSciBench measures performance on self-contained research tasks and does not establish whether AI systems accelerate drug discovery or improve research outcomes. The organization plans to connect future benchmark results with studies of models operating in live research environments.

Benchmark built around applied research work

OpenAI developed the LifeSciBench taxonomy after surveying practicing life scientists about the workflows they use in applied research.

The seven categories are evidence handling; analysis; design, optimization and prediction; scientific reasoning; validation and operations; translation; and scientific communication.

Each task is written as a request that a scientist might give to a knowledgeable collaborator. Models receive a scientific prompt, supporting context or artifacts and instructions to produce a free-response answer.

Task-specific rubrics assess whether responses include the required scientific claims, calculations, decisions, evidence, caveats and formatting. LifeSciBench contains 19,020 rubric criteria, an average of approximately 25 for each task.
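The per-task average can be checked directly from the two totals reported above. This is a minimal arithmetic sketch; the variable names are illustrative and not part of the benchmark.

```python
# Totals as reported for LifeSciBench in the article.
TOTAL_RUBRIC_CRITERIA = 19_020
TOTAL_TASKS = 750

# Mean number of rubric criteria per task.
avg_criteria_per_task = TOTAL_RUBRIC_CRITERIA / TOTAL_TASKS

print(f"Average criteria per task: {avg_criteria_per_task:.2f}")
```

The result, 25.36, is consistent with the "approximately 25" figure.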

Seventy-nine percent of the tasks require multiple reasoning or decision-making steps, with an average of four steps per task. More than half, 53 percent, require models to interpret or combine information from at least one supporting artifact.

Yin He, who works with startups at OpenAI, described the benchmark’s focus in a LinkedIn post.

"The goal is not to build systems that simply ace biology exams. It is to understand whether they can contribute meaningfully to the work of discovering and developing new medicines," she wrote.

GPT-Rosalind improves results but most tasks remain unsolved

OpenAI reports that GPT-Rosalind improved on GPT-5.5 across several LifeSciBench workflows, particularly scientific communication and translation.

The scientific communication pass rate increased from 56.3 percent for GPT-5.5 to 71.1 percent for GPT-Rosalind. However, OpenAI notes that this category contains only nine tasks and says the result should be interpreted cautiously.

For translation tasks, which cover the process of moving research from preclinical evidence toward clinical use, the pass rate increased from 36.8 percent for GPT-5.5 to 57.7 percent for GPT-Rosalind.

GPT-Rosalind also recorded higher rubric scores on tasks requiring actionable outputs and the handling of uncertainty. Its score for expert-useful or actionable responses was 44.7 percent, compared with 29.1 percent for GPT-5.5. For uncertainty and caveat handling, the scores were 44.8 percent and 29.3 percent, respectively.

The overall exact pass rate of 36.1 percent means GPT-Rosalind still failed to meet the benchmark’s 70 percent task-level success threshold on almost two-thirds of tasks.
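One way to read a threshold-based pass rate is sketched below. It assumes, hypothetically, that each task receives a rubric score between 0 and 1 and counts as passed when that score is at least 0.70; the article does not describe how OpenAI weights or aggregates individual rubric criteria. The function name and the four sample scores are invented for illustration and are not benchmark data.

```python
PASS_THRESHOLD = 0.70  # task-level success threshold cited in the article


def pass_rate(task_scores, threshold=PASS_THRESHOLD):
    """Fraction of tasks whose rubric score meets the threshold.

    Assumption: each score is the share of rubric credit earned on one task.
    """
    if not task_scores:
        raise ValueError("task_scores must not be empty")
    passed = sum(1 for score in task_scores if score >= threshold)
    return passed / len(task_scores)


# Toy scores, not real benchmark results: two of four tasks reach 0.70.
toy_scores = [0.90, 0.50, 0.70, 0.20]
toy_rate = pass_rate(toy_scores)

# Share of tasks below the threshold implied by the reported 36.1 percent.
reported_pass_pct = 36.1
reported_fail_pct = round(100 - reported_pass_pct, 1)

print(f"Toy pass rate: {toy_rate:.0%}")
print(f"Reported share of tasks not passed: {reported_fail_pct} percent")
```

Under this reading, a 36.1 percent pass rate leaves 63.9 percent of tasks below the threshold, which matches the "almost two-thirds" description.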

Artifacts and exact outputs expose model limits

Performance fell when models had to work with files, web sources or scientific artifacts rather than prompt text alone.

GPT-Rosalind achieved a 45.1 percent pass rate on text-only tasks, falling to 28.1 percent on tasks involving artifacts or URLs. GPT-5.5 dropped from 29.9 percent on text-only tasks to 21.9 percent when artifacts were included.
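The size of each drop can be expressed in percentage points and as a relative decline. The short sketch below does this using only the four pass rates reported above; the dictionary layout and helper name are illustrative choices.

```python
# Pass rates in percent, as reported: (text-only, with artifacts or URLs).
pass_rates = {
    "GPT-Rosalind": (45.1, 28.1),
    "GPT-5.5": (29.9, 21.9),
}


def artifact_drop(text_only, with_artifacts):
    """Return (percentage-point drop, relative drop in percent)."""
    points = text_only - with_artifacts
    relative = points / text_only * 100
    return round(points, 1), round(relative, 1)


drops = {model: artifact_drop(*rates) for model, rates in pass_rates.items()}

for model, (points, relative) in drops.items():
    print(f"{model}: -{points} points ({relative} percent relative decline)")
```

By this calculation GPT-Rosalind loses 17.0 points (about 37.7 percent of its text-only pass rate) and GPT-5.5 loses 8.0 points (about 26.8 percent), so the stronger model gives up more ground when artifacts are involved while still scoring higher overall.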

OpenAI says models struggled to extract information from complex figures and large sequence files and then incorporate that evidence into their final answers.

Design, optimization and prediction remained one of the most difficult workflows for GPT-Rosalind, with a pass rate of 30.7 percent. Its analysis pass rate was 30.3 percent.

Tasks requiring precise outputs produced lower results. GPT-Rosalind achieved a 14.8 percent pass rate on numeric tasks, 24 percent on sequence or structure outputs and 27.3 percent on construct-generation tasks.

LifeSciBench was independently assessed by 453 reviewers who had not written its tasks. OpenAI reports that 97 percent held a Ph.D. or equivalent doctorate and that reviewer agreement exceeded 96 percent across measures covering real-world relevance, scientific reasoning, grounding and overall usefulness.

OpenAI says the next stage will involve deployment studies examining model use across live research workflows, repeated rounds of reasoning and experimental follow-up. LifeSciBench currently covers self-contained tasks, and OpenAI cautions that benchmark performance should not be treated as direct evidence of downstream scientific impact.
