A team of researchers at Stanford University has released MedAgentBench, a new benchmark suite designed to evaluate large language model (LLM) agents in the healthcare context. Unlike earlier question-answering datasets, MedAgentBench provides a virtual Electronic Health Record (EHR) environment in which AI systems must interact, plan, and carry out multi-step clinical tasks. This marks a significant shift from testing static reasoning to evaluating agent performance in live, tool-based medical workflows.
Why does healthcare need agent benchmarks?
LLMs have moved beyond static chat-based interactions toward agentic behavior, including high-level instruction interpretation, API calls, patient data integration, and complex multi-step processes. In medicine, this evolution helps address staff shortages, documentation burdens, and administrative inefficiencies.
While general-purpose agent benchmarks exist (such as AgentBench, AgentBoard, and tau-bench), healthcare has lacked a standardized benchmark that captures medical data complexity, FHIR interoperability, and longitudinal patient records. MedAgentBench fills this gap by offering a reproducible and clinically relevant evaluation framework.
What does MedAgentBench include?
How are the tasks configured?
MedAgentBench consists of 300 tasks across 10 categories, all written by licensed physicians. The tasks include searching patient information, monitoring lab results, documentation, test ordering, referrals, and medication management. Tasks average 2-3 steps and mirror workflows encountered in inpatient and outpatient care.
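For illustration, a single task of this kind could be represented roughly as the record below. The field names are hypothetical assumptions for the sketch, not MedAgentBench's actual schema.

```python
# Hypothetical sketch of one benchmark task record.
# Field names are illustrative assumptions, not MedAgentBench's actual format.
task = {
    "id": "task_042",
    "category": "medication_management",   # one of the 10 task categories
    "instruction": "Order 650 mg acetaminophen PO q6h PRN for patient S1023.",
    "patient_id": "S1023",                 # links to one of the 100 patient profiles
    "expected_steps": 3,                   # tasks average 2-3 steps
    "success_criteria": "MedicationRequest created with correct dose, route, frequency",
}
```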
Which patient data back the benchmark?
The benchmark uses 100 realistic patient profiles extracted from Stanford's STARR data repository, comprising over 700,000 records spanning labs, vitals, diagnoses, procedures, and medication orders. The data were de-identified and jittered for privacy while preserving clinical validity.
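The paper's exact de-identification pipeline is not reproduced here, but date jittering is commonly implemented by shifting every timestamp in a patient's record by one random per-patient offset, which hides real dates while preserving the clinically meaningful intervals between events. A minimal sketch under that assumption:

```python
import random
from datetime import datetime, timedelta

def jitter_timestamps(records, patient_id, max_days=180):
    """Shift all of one patient's timestamps by a single random offset.

    One offset per patient hides real dates while keeping intervals
    between events intact (e.g., lab trends). Illustrative only; this
    is not Stanford's actual pipeline.
    """
    rng = random.Random(patient_id)  # deterministic per-patient offset
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return [{**rec, "timestamp": rec["timestamp"] + offset} for rec in records]

labs = [{"code": "HGB", "value": 13.2, "timestamp": datetime(2021, 3, 1)},
        {"code": "HGB", "value": 12.8, "timestamp": datetime(2021, 3, 8)}]
print(jitter_timestamps(labs, patient_id="S1023"))
```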
How is the environment built?
The environment is FHIR-compliant and supports both reading (GET) and modifying (POST) EHR data, so AI systems can simulate realistic clinical interactions such as documenting vitals and placing medication orders. This design allows the benchmark to transfer directly to live EHR systems.
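To make the interaction model concrete, here is a minimal sketch of the two operation types against a generic FHIR server. The base URL and resource contents are placeholders, and the benchmark's own tool wrappers may differ.

```python
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # placeholder, not the benchmark's endpoint

# GET: retrieve the most recent glucose lab result (LOINC 2345-7) for a patient.
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "S1023", "code": "2345-7", "_sort": "-date", "_count": 1},
)
resp.raise_for_status()
latest = resp.json()["entry"][0]["resource"]
print(latest["valueQuantity"]["value"], latest["valueQuantity"]["unit"])

# POST: document a vital sign (heart rate) as a new Observation resource.
new_obs = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4",
                         "display": "Heart rate"}]},
    "subject": {"reference": "Patient/S1023"},
    "valueQuantity": {"value": 78, "unit": "beats/minute"},
}
created = requests.post(f"{FHIR_BASE}/Observation", json=new_obs)
created.raise_for_status()
```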
How are models evaluated?
Metric: task success rate (SR), measured with strict pass@1 to reflect real-world safety requirements.
Models tested: 12 leading LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek-V3, Qwen2.5, and Llama 3.3.
Agent orchestrator: a baseline orchestration setup with 9 FHIR functions, limited to 8 interaction rounds per task (a simplified loop is sketched below).
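A baseline orchestrator of this kind typically loops the model over tool calls until it produces a final answer or exhausts the round limit; under strict pass@1, a task counts as successful only if the single first attempt ends in a fully correct state. The sketch below is a simplification with assumed names (`llm.next_step`, the tool interface), not the paper's actual code.

```python
MAX_ROUNDS = 8  # the benchmark caps each task at 8 interaction rounds

def run_task(llm, task, fhir_tools):
    """Minimal agent loop: the LLM either calls one of the 9 FHIR tools
    or emits a final answer. Illustrative only; the real orchestrator
    and tool-calling interface may differ."""
    history = [{"role": "user", "content": task["instruction"]}]
    for _ in range(MAX_ROUNDS):
        step = llm.next_step(history, tools=fhir_tools)  # hypothetical LLM wrapper
        if step.is_final_answer:
            return step.answer
        result = fhir_tools[step.tool_name](**step.arguments)  # GET/POST on the EHR
        history.append({"role": "tool", "content": result})
    return None  # round limit exhausted counts as failure

def pass_at_1(results, checker):
    """Strict pass@1: one attempt per task, credit only for a correct outcome."""
    return sum(checker(task, answer) for task, answer in results) / len(results)
```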
Which model performed best?
Claude 3.5 Sonnet v2: best overall success rate (69.67%), especially strong on query tasks (85.33%).
GPT-4o: 64.0% success, with balanced query and action performance.
DeepSeek-V3: 62.67% success, leading among open-weight models.
Observation: most models handled query tasks well but struggled with action-based tasks requiring safe multi-step execution.

What errors did the models make?
Two dominant failure patterns emerged.
Instruction-adherence failures: invalid API calls or incorrectly formatted JSON.
Output mismatches: returning a full sentence where structured numeric values are required.
These errors highlight a critical accuracy and reliability gap for clinical deployment.
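To see why both patterns matter for automated grading, consider a checker that expects a structured numeric value: an agent that replies with prose, or with malformed JSON, fails even if the underlying answer is clinically correct. A hypothetical sketch (not the benchmark's actual grader):

```python
import json

def check_numeric_answer(raw_output, expected, tol=1e-6):
    """Grade a final answer that must be JSON like {"value": 7.2}.

    Illustrates the two dominant failure modes:
    - instruction-adherence failure: output is not valid JSON at all
    - output mismatch: parses, but lacks the required structured number
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "instruction-adherence failure: invalid JSON"
    value = parsed.get("value") if isinstance(parsed, dict) else None
    if not isinstance(value, (int, float)):
        return False, "output mismatch: expected a structured numeric value"
    return abs(value - expected) <= tol, "checked"

print(check_numeric_answer('{"value": 7.2}', 7.2))             # (True, "checked")
print(check_numeric_answer('The glucose is 7.2 mmol/L', 7.2))  # invalid JSON
```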
Summary
MedAgentBench establishes the first large-scale benchmark for assessing LLM agents in a realistic EHR setting, combining 300 physician-written tasks with a FHIR-compliant environment and 100 patient profiles. The results show that current models are capable but of limited reliability: Claude 3.5 Sonnet v2 leads at 69.67%, and a gap remains between successful querying and safe action execution. Although constrained by single-center data and an EHR-centric scope, MedAgentBench provides an open and reproducible framework for driving the next generation of trustworthy healthcare AI agents.

Michal Sutter is a data science professional with a Master's degree in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


