Stanford Researchers Introduced MedAgentBench: A Real-World Benchmark for Healthcare AI Agents

AllTopicsToday
Last updated: September 16, 2025 7:49 am
Published: September 16, 2025

A team of researchers at Stanford University has released MedAgentBench, a new benchmark suite designed to evaluate large language model (LLM) agents in healthcare contexts. Unlike earlier question-answering datasets, MedAgentBench provides a virtual electronic health record (EHR) environment in which AI systems must interact, plan, and carry out multi-step clinical tasks. This marks a significant shift from testing static reasoning to assessing agent capabilities in live, tool-based medical workflows.

https://ai.nejm.org/doi/full/10.1056/aidbp2500144

Why does healthcare need agent benchmarks?

Today's LLMs have moved beyond static chat-based interactions toward agentic behavior, including interpreting high-level instructions, calling APIs, integrating patient data, and automating complex processes. In medicine, this evolution could help address staff shortages, documentation burden, and administrative inefficiencies.

While general-purpose agent benchmarks exist (such as AgentBench, AgentBoard, and tau-bench), healthcare lacked a standardized benchmark that captures the complexity of medical data, FHIR interoperability, and longitudinal patient records. MedAgentBench fills this gap by offering a reproducible and clinically relevant evaluation framework.

What does MedAgentBench contain?

How are the tasks structured?

MedAgentBench consists of 300 tasks across 10 categories, written by licensed physicians. These tasks include patient information retrieval, lab result tracking, documentation, test ordering, referrals, and medication management. Tasks average 2-3 steps and mirror workflows encountered in inpatient and outpatient care.

What patient data supports the benchmark?

The benchmark uses 100 realistic patient profiles extracted from Stanford's STARR data repository, comprising over 700,000 records, including labs, vitals, diagnoses, procedures, and medication orders. Data were de-identified and jittered for privacy while preserving clinical validity.
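The article does not describe how the jittering was done. As a rough illustration only, one common approach is to shift all of a patient's timestamps by the same random offset, which hides real dates while keeping the intervals between events intact. The function name, the 30-day window, and the sample dates below are assumptions, not details from the benchmark.

```python
import random
from datetime import datetime, timedelta


def jitter_patient_dates(dates, max_shift_days=30, seed=None):
    """Shift every date for one patient by the same random number of days.

    Hypothetical helper: the window size and method are illustrative
    assumptions, not the procedure used to build the benchmark.
    """
    rng = random.Random(seed)
    shift = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + shift for d in dates]


# Made-up lab timestamps for a single patient.
original = [datetime(2023, 1, 5, 8, 30), datetime(2023, 1, 9, 14, 0)]
jittered = jitter_patient_dates(original, seed=7)

# The gap between events survives the shift, so trends stay clinically meaningful.
print(jittered[1] - jittered[0] == original[1] - original[0])
```

Using one offset per patient, rather than one per record, is what keeps longitudinal patterns such as time between lab draws usable.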

How is the environment built?

The environment is FHIR-compliant and supports both retrieving (GET) and modifying (POST) EHR data. AI systems can simulate realistic clinical interactions, such as documenting vitals and placing medication orders. This design makes the benchmark directly translatable to live EHR systems.
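To make the GET/POST distinction concrete, here is a minimal sketch of how a client might build a FHIR search request and a FHIR create request using only the Python standard library. The base URL, patient identifier, and helper names are hypothetical and not taken from the benchmark; the requests are constructed but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL of a local FHIR server; not taken from the article.
FHIR_BASE = "http://localhost:8080/fhir"


def build_search(resource_type, **params):
    """Build (but do not send) a FHIR search: GET /<type>?<params>."""
    url = f"{FHIR_BASE}/{resource_type}?{urlencode(params)}"
    return Request(url, method="GET", headers={"Accept": "application/fhir+json"})


def build_create(resource):
    """Build (but do not send) a FHIR create: POST /<resourceType> with a JSON body."""
    url = f"{FHIR_BASE}/{resource['resourceType']}"
    body = json.dumps(resource).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/fhir+json"})


# Query a patient's heart-rate observations (LOINC 8867-4), then record a new one.
search = build_search("Observation", patient="example-patient-id", code="8867-4")
vital = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4",
                         "display": "Heart rate"}]},
    "subject": {"reference": "Patient/example-patient-id"},
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
}
create = build_create(vital)

print(search.get_method(), search.full_url)
print(create.get_method(), create.full_url)
```

An agent in this kind of environment would emit the equivalent of these two calls: a read to gather context, followed by a write that changes the record.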

How are models evaluated?

  • Metric: Task success rate (SR), measured with strict pass@1 to reflect real-world safety requirements.
  • Models tested: 12 leading LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek-V3, Qwen2.5, and Llama 3.3.
  • Agent orchestrator: A baseline orchestration setup with 9 FHIR functions, limited to 8 interaction rounds per task.
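The evaluation setup above can be sketched in a few lines: each task gets a single attempt (pass@1), the agent has at most 8 rounds to produce an answer, and the success rate is the share of tasks passed. The `agent_step` callback and the pass count in the example are hypothetical, chosen only for illustration.

```python
from typing import Callable, Iterable, Optional

MAX_ROUNDS = 8  # interaction cap per task, as stated in the article


def run_task(agent_step: Callable[[int], Optional[str]]) -> Optional[str]:
    """Call a hypothetical agent once per round.

    Returns the agent's final answer, or None if it uses up all rounds
    without finishing (which counts as a failure under pass@1).
    """
    for round_index in range(MAX_ROUNDS):
        answer = agent_step(round_index)
        if answer is not None:
            return answer
    return None


def success_rate(outcomes: Iterable[bool]) -> float:
    """pass@1 success rate: one attempt per task, percentage of tasks passed."""
    results = list(outcomes)
    return 100.0 * sum(results) / len(results)


# Illustrative count only: 209 passes out of 300 tasks.
outcomes = [True] * 209 + [False] * 91
print(f"{success_rate(outcomes):.2f}%")  # prints 69.67%
```

Because there are no retries, a single malformed call or wrong write fails the whole task, which is the point of a strict metric in a clinical setting.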

Which models performed best?

  • Claude 3.5 Sonnet v2: Best overall success (69.67%), especially strong on query tasks (85.33%).
  • GPT-4o: 64.0% success, with balanced query and action performance.
  • DeepSeek-V3: 62.67% success, leading among open-weight models.
  • Observation: Most models excelled at query tasks but struggled with action-based tasks that require safe multi-step execution.

What errors did the models make?

Two dominant failure patterns emerged:

  • Instruction adherence failures: invalid API calls or incorrect JSON formatting.
  • Output mismatch: providing full sentences when structured numerical values were required.

These errors highlight gaps in the precision and reliability that are critical for clinical deployment.
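Both failure patterns can be caught with simple checks before an agent's output reaches an EHR. The sketch below assumes a hypothetical action format (a JSON object with `method` and `url` fields); the benchmark's actual schema is not described in this article.

```python
import json
import math


def parse_action(raw):
    """Parse an agent action, raising ValueError on malformed JSON or missing fields.

    The {"method", "url"} shape is an illustrative assumption.
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(action, dict) or "method" not in action or "url" not in action:
        raise ValueError("action must be an object with 'method' and 'url'")
    if action["method"] not in {"GET", "POST"}:
        raise ValueError(f"unsupported method: {action['method']}")
    return action


def is_numeric_answer(answer):
    """True only if the answer is a bare finite number, not a sentence containing one."""
    try:
        value = float(answer.strip())
    except ValueError:
        return False
    return math.isfinite(value)


print(is_numeric_answer("5.4"))                                # True
print(is_numeric_answer("The potassium level is 5.4 mmol/L"))  # False
```

Rejecting a full sentence where a number is expected mirrors the output-mismatch failure: the content may be right, but a downstream system cannot use it.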

Summary

MedAgentBench establishes the first large-scale benchmark for evaluating LLM agents in a realistic EHR setting, combining 300 clinician-written tasks with a FHIR-compliant environment and 100 patient profiles. The results show strong potential but limited reliability: Claude 3.5 Sonnet v2 leads at 69.67%, highlighting the gap between successful querying and safe action execution. Although constrained by single-center data and an EHR-centric scope, MedAgentBench provides an open and reproducible framework for driving the next generation of trustworthy healthcare AI agents.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
