Google extends AMIE beyond diagnosis in Nature disease management study

22 Jun

The Gemini-based research system matched primary care physicians overall across 100 simulated multi-visit cases but remains outside clinical use.

***Google Research and Google DeepMind have tested AMIE's ability to manage health conditions across multiple simulated patient appointments.***

Google Research and Google DeepMind have extended the Articulate Medical Intelligence Explorer (AMIE) from one-off diagnostic conversations to disease management across multiple appointments, with a Nature study comparing the medical AI system against 21 board-certified primary care physicians.

Google announced the research on June 17. The paper was accepted by Nature on June 4 and released as an accelerated article preview, following earlier research into AMIE's diagnostic reasoning capabilities.

The randomized, blinded study covered 100 simulated patient scenarios across cardiology, pulmonology, obstetrics and gynecology, urology, gastroenterology, neurology, and musculoskeletal care. Each scenario involved three text-based appointments, with patient symptoms, treatment responses, and test results changing between visits.

Primary care physicians and patient actors were based in India and Canada. Thirty specialist physicians assessed the conversations and management plans, while the cases incorporated guidance from the UK National Institute for Health and Care Excellence (NICE) and BMJ Best Practice.

AMIE was rated as non-inferior to primary care physicians in overall management reasoning. It received higher scores for the precision of investigation and treatment recommendations and for aligning plans with clinical guidelines, although the research was conducted in a simulated setting rather than with patients receiving clinical care.

Google is now studying how AMIE could operate in clinical environments. The next phase includes an ongoing nationwide virtual care study with Included Health and a clinical feasibility study with Beth Israel Deaconess Medical Center.

AMIE compared with doctors across three simulated visits

The researchers adapted the Objective Structured Clinical Examination (OSCE), an assessment format used in medical education, to evaluate how AMIE and primary care physicians handled changing patient information over time.

The same patient actor completed each scenario with AMIE and a physician in a randomized and blinded order. Full transcripts from previous visits were available during the second and third appointments, allowing both AMIE and the participating doctors to review earlier information.

Specialist physicians evaluated the appropriateness, completeness, precision, and guideline alignment of each management plan.

AMIE received favorable ratings for the overall appropriateness of its plans in 95 percent of first visits, 96 percent of second visits, and 98 percent of third visits. The corresponding results for primary care physicians were 72 percent, 80 percent, and 81 percent.

The difference was larger for the precision of treatment recommendations. AMIE scored 96 percent, 95 percent, and 95 percent across the three visits, compared with 62 percent, 65 percent, and 67 percent for participating physicians.

AMIE also scored higher for investigation precision, receiving favorable ratings in 98 percent, 99 percent, and 98 percent of visits. Physician scores were 87 percent, 82 percent, and 82 percent.
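The visit-by-visit gaps are easier to compare as percentage-point differences. The short Python sketch below recomputes them from the figures quoted above; the dictionary layout and the names `RATINGS`, `gaps`, and `GAPS` are illustrative assumptions, not anything taken from the paper. These are raw differences only and say nothing about statistical significance.

```python
# Favorable-rating percentages per visit (first, second, third),
# copied from the figures reported in this article.
# The data layout and all names here are illustrative assumptions.
RATINGS = {
    "overall appropriateness": {"amie": (95, 96, 98), "pcp": (72, 80, 81)},
    "treatment precision": {"amie": (96, 95, 95), "pcp": (62, 65, 67)},
    "investigation precision": {"amie": (98, 99, 98), "pcp": (87, 82, 82)},
}


def gaps(ratings):
    """Return AMIE-minus-physician gaps in percentage points, per visit."""
    return {
        axis: tuple(a - p for a, p in zip(scores["amie"], scores["pcp"]))
        for axis, scores in ratings.items()
    }


GAPS = gaps(RATINGS)
for axis, per_visit in GAPS.items():
    print(f"{axis}: {per_visit}")
```

On these numbers the treatment-precision gap (34, 30, and 28 points) is the widest of the three at every visit.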

The study found no evaluation area in which primary care physicians significantly outperformed AMIE. However, the researchers noted that participating physicians were based in India and Canada while the cases were grounded in UK guidance, which may have affected their familiarity with the selected recommendations.

The participating doctors were given access to the same guideline collection and could review it without a time limit after each simulated consultation.

Two AI agents combine patient conversations and clinical guidance

AMIE uses a two-agent architecture intended to separate real-time patient communication from longer clinical reasoning.

The Dialogue Agent communicates with the patient, gathers information, maintains the history of previous appointments, and delivers the system's response.

A second Management Reasoning Agent, known as the Mx Agent, analyzes the patient information, retrieves clinical guidance, and creates structured plans covering investigations, treatments, prescriptions, and follow-up care.
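The division of labor between the two agents can be pictured with a minimal Python sketch. Google has not released AMIE's code, so every class, method, and field name below is hypothetical, and the planning logic is a placeholder; the sketch only shows a dialogue component that keeps the visit history and hands it to a separate component returning a structured plan.

```python
from dataclasses import dataclass, field


@dataclass
class ManagementPlan:
    """Structured plan with the four sections the article describes."""
    investigations: list[str] = field(default_factory=list)
    treatments: list[str] = field(default_factory=list)
    prescriptions: list[str] = field(default_factory=list)
    follow_up: list[str] = field(default_factory=list)


class MxAgent:
    """Hypothetical stand-in for the Management Reasoning Agent."""

    def plan(self, history: list[str]) -> ManagementPlan:
        # Placeholder only: a real system would reason over the history
        # and retrieved clinical guidance before filling in the plan.
        return ManagementPlan(follow_up=[f"Review after visit {len(history)}"])


class DialogueAgent:
    """Hypothetical stand-in for the patient-facing Dialogue Agent."""

    def __init__(self, mx_agent: MxAgent) -> None:
        self.mx_agent = mx_agent
        self.history: list[str] = []  # one entry per appointment so far

    def handle_visit(self, patient_message: str) -> ManagementPlan:
        # Keep the full history so later visits can draw on earlier ones.
        self.history.append(patient_message)
        return self.mx_agent.plan(self.history)


agent = DialogueAgent(MxAgent())
for message in ("Initial symptoms", "Test results back", "Response to treatment"):
    latest_plan = agent.handle_visit(message)
print(latest_plan.follow_up)
```

Keeping the history in the dialogue component and the plan construction in a second component mirrors the separation described above, without implying anything about how AMIE implements it.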

The evaluated AMIE system was built on Gemini 1.5 Flash. It used Gemini's long-context capabilities to process patient conversations alongside clinical documents and drug formularies.

The full research corpus contained 627 clinical guidance documents and approximately 10.5 million tokens, exceeding the model's context window. AMIE therefore conducted an initial retrieval process before analyzing an average of six relevant documents at the same time.

The documents included 527 NICE guidance publications and 100 BMJ Best Practice documents. Individual recommendations generated by the Mx Agent included citations intended to show which clinical documents supported them.
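The retrieve-then-read pattern described above, where a corpus too large for the context window is first narrowed to a handful of documents, can be sketched in a few lines of Python. This is not AMIE's retrieval method, which the article does not detail: the word-overlap scoring is a deliberately simple stand-in, the toy corpus is invented, and the names `tokenize`, `retrieve`, and `CORPUS` are assumptions. Only the default of six documents echoes the average reported in the study.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: dict[str, str], k: int = 6) -> list[str]:
    """Return ids of up to k documents sharing the most words with the query.

    Word overlap is a stand-in scoring rule chosen for brevity; documents
    with no overlap are dropped, and ties are broken by document id.
    """
    query_terms = tokenize(query)
    scored = [
        (len(query_terms & tokenize(text)), doc_id)
        for doc_id, text in corpus.items()
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked[:k] if score > 0]


# Invented toy corpus for illustration; not real guideline text.
CORPUS = {
    "guideline-asthma": "asthma inhaler step up therapy wheeze",
    "guideline-hypertension": "blood pressure hypertension ace inhibitor",
    "guideline-migraine": "migraine headache triptan prophylaxis",
}

selected = retrieve("patient reports wheeze despite inhaler", CORPUS)
print(selected)
```

Only the selected documents would then be placed in the model's context alongside the conversation, which is what keeps the prompt within the window.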

Google Staff Research Scientist Mike Schaekermann said in a LinkedIn post: "At Google, we firmly believe that a responsible approach to conversational AI in health should adopt high standards of evidence generation, similar to other interventions in medicine."

The paper also reported tests involving Gemini 2.5 Flash, but the central physician comparison used the AMIE configuration built on Gemini 1.5 Flash.

Medication benchmark exposes limits alongside higher difficult-question scores

The researchers also created RxQA, a 600-question benchmark testing medication knowledge and reasoning.

RxQA was developed from the United States Food and Drug Administration's OpenFDA data and the British National Formulary. Board-certified pharmacists reviewed the questions and classified them by difficulty.

AMIE and participating physicians completed the questions under two conditions: first without reference materials, then with access to external drug information.

On the higher-difficulty questions, AMIE achieved 50.6 percent accuracy without external information, compared with 41.5 percent for physicians. With drug references available, AMIE scored 57.9 percent and physicians scored 47.8 percent.
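Two comparisons are folded into those figures: the gap between AMIE and physicians under each condition, and how much each group gained from having references. The Python sketch below separates them using only the percentages quoted above; the names `HARD` and `diff` are assumptions made for illustration.

```python
# Accuracy (%) on the higher-difficulty RxQA questions, as reported in
# this article. The layout and names are illustrative assumptions.
HARD = {
    "amie": {"closed_book": 50.6, "with_references": 57.9},
    "physicians": {"closed_book": 41.5, "with_references": 47.8},
}


def diff(a: float, b: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(a - b, 1)


# Gap between AMIE and physicians under each condition.
gap_closed = diff(HARD["amie"]["closed_book"], HARD["physicians"]["closed_book"])
gap_refs = diff(HARD["amie"]["with_references"], HARD["physicians"]["with_references"])

# Gain each group made when drug references were available.
gain_amie = diff(HARD["amie"]["with_references"], HARD["amie"]["closed_book"])
gain_pcp = diff(HARD["physicians"]["with_references"], HARD["physicians"]["closed_book"])

print(f"AMIE lead: {gap_closed} points closed-book, {gap_refs} with references")
print(f"Gain from references: AMIE {gain_amie}, physicians {gain_pcp}")
```

On these figures AMIE led by 9.1 points without references and 10.1 points with them, while references added 7.3 points for AMIE and 6.3 for physicians.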

There was no statistically significant difference between AMIE and the participating doctors on the lower-difficulty questions. Even with reference materials, the highest result recorded by either group across the benchmark was 73.8 percent, meaning more than a quarter of answers were wrong even in the best case.

The researchers said the findings do not show that AMIE is ready for clinical care. The evaluation used scripted cases, patient actors, text chat, and short intervals between appointments, without electronic health records, live prescribing systems, pharmacists, or the broader complexity of routine clinical practice.

The paper also identified possible errors within AMIE's internal reasoning traces, even when those errors did not appear in its final management plans. Google has not released AMIE's model code or weights, citing the risks of unmonitored use in medical settings.

Schaekermann added: "Though further work is needed for safe real-world implementation, our recent clinical feasibility study with Beth Israel Deaconess Medical Center and our ongoing nationwide study with Included Health are important steps towards that goal. We believe that systems like AMIE could one day augment care and give doctors back time with their patients where it truly matters."

Google plans to continue working with research partners, healthcare providers, and regulators to evaluate AMIE. The nationwide Included Health study will test the system in real-world virtual care, but Google has not provided a timeline for clinical availability.
