BENCHMARKS

Ranked #1 across the benchmarks that matter in clinical AI

Independent benchmarks across medical coding, speech recognition, clinical reasoning, and agentic AI. Tested head-to-head against the largest AI labs in the world.

Significant outperformance across the full clinical-reasoning stack

Corti Symphony leads the next best competitor across agent reasoning, medical coding, and speech accuracy.

The hardest problems in healthcare won't be solved by the biggest models. They'll be solved by specialized, clinical-grade models, validated in production.

Clinical tasks have no margin for error. A model that scores well on general benchmarks performs very differently when the task is ICD-10 coding, clinical speech recognition, or multi-step reasoning over a patient record. The gap between claimed and measured performance is where most healthcare AI falls apart.

Corti's benchmarks are run on real clinical tasks, against real inputs, head-to-head with the largest AI labs in the world. The results are published so you can verify them yourself.

AGENTS

Outperforming every major LLM on OpenAI's clinical benchmark

The highest performance across all major models on HealthBench Professional.

#1 on HealthBench Professional

HealthBench Professional is an open benchmark developed by OpenAI to evaluate LLMs on the kinds of tasks clinicians actually bring to AI in practice. Each example is drawn from real physician interactions, scored against rubrics written and adjudicated by three or more physicians. The benchmark was deliberately designed to be challenging, with difficult examples enriched roughly 3.5 times their natural prevalence in the dataset. Corti Symphony achieves an overall score of 60.5%, placing it above all other models, including OpenAI's GPT-5.4, Anthropic's Claude Opus 4.7, and, notably, ChatGPT for Clinicians, OpenAI's own direct-to-clinician product, which scores 59% on the same benchmark.

The safest model under the most difficult conditions

Red teaming refers to the adversarial subset of HealthBench Professional, where clinicians deliberately attempt to surface failure modes rather than test routine usage. Roughly one third of the benchmark consists of red teaming examples. It is the harder portion of the evaluation, designed to expose how models behave under pressure. Corti Symphony scores 59.0 on red teaming overall, nearly double GPT-5.4's 30.3, with particularly strong performance in the research category where Symphony scores a perfect 100.0 against GPT-5.4's 64.0.

MEDICAL CODING

Leading automated medical coding benchmarks

Ranked #1 on performance across synthetic, academic, and real-world clinical data.

Accuracy Comparison

F-score is the harmonic mean of precision and recall, used here to evaluate medical coding accuracy across diagnostic and procedural codes. Corti scores 0.74, ahead of Anthropic at 0.59, Amazon at 0.58, OpenAI at 0.48, and Google at 0.45.
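
As a quick illustration of the arithmetic behind these figures, the sketch below computes an F1 score from a precision/recall pair and derives the relative improvement over the next-best system. The precision and recall inputs are hypothetical, chosen only to show how a 0.74 F1 can arise; the 0.74 and 0.59 scores come from the comparison above.

```python
# Minimal sketch: how an F1 score combines precision and recall,
# and how the ">25%" relative-improvement figure is derived.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision/recall pair: an F1 of 0.74 can arise from
# many combinations; these inputs are illustrative only.
print(round(f1(0.76, 0.72), 2))  # ~0.74

# Relative improvement of the top F1 (0.74) over the next-best (0.59):
print(round((0.74 - 0.59) / 0.59 * 100, 1))  # ~25.4, hence ">25%"
```
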
74%
F1 Accuracy. Outperforming Anthropic (59%), Amazon (58%), OpenAI (48%), and Google (45%).
>25%
Improvement vs. the next-best system. Beating competitors on both precision and recall.
#1
On noisy clinical datasets. Maintaining lead in real clinical environments, outperforming OpenAI and Google.

SPEECH TO TEXT

The most accurate medical speech to text API

Validated on 150,000+ medical terms across 14 languages and every clinical specialty.

Word Error Rate (WER)

Word Error Rate (WER) measures the accuracy of automatic speech recognition (ASR) systems by calculating how often a model substitutes, inserts, or deletes words compared to a correct reference transcript. It's expressed as a percentage, where lower is better and 0% is a perfect transcription.

In clinical settings, WER directly affects patient safety. A misheard drug name or dosage in a medical document is not just a typo. This makes low WER a functional requirement, not just a performance metric.

The chart compares leading ASR models on clinical speech. Corti Symphony's 2% WER represents four times fewer errors than NVIDIA's Parakeet and Canary Qwen at 8%, and three times fewer than OpenAI Whisper at 6%.
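
To make the metric concrete, here is a minimal WER implementation using word-level edit distance. The example sentences are invented for illustration and are not drawn from the benchmark.

```python
# Minimal sketch of how Word Error Rate is computed: the word-level
# Levenshtein distance (substitutions + insertions + deletions)
# divided by the number of words in the reference transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted drug name in a ten-word order: a 10% WER.
print(wer("give 5 mg of midazolam iv now and monitor sats",
          "give 5 mg of lorazepam iv now and monitor sats"))  # 0.1
```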

CLINICAL REASONING

The best fact extraction tool for ambient AI in healthcare

Clinically validated to reduce irrelevant note detail by 65% and cut post-visit edit times.
Benchmarked against the standard. The groundedness score confirms that facts extracted by FactsR™ are traceable to something actually said in the consultation.
Built for real time. FactsR™ extracts, refines, and validates clinical facts as the consultation unfolds. By the time the visit ends, the note is already populated with facts and their respective groupings.
API-native. A single API call returns structured, validated clinical facts with timestamps and confidence scores. Integrate it into any architecture in no time.
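
As a sketch of what a single-call integration could look like: the endpoint, payload, and response fields below are hypothetical placeholders for illustration, not Corti's actual API.

```python
# Illustrative only: a sketch of "one API call returns structured,
# validated clinical facts". The endpoint, payload, and field names
# below are hypothetical placeholders, not Corti's actual API.
import requests

resp = requests.post(
    "https://api.example.com/v1/facts",           # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},
    json={"transcript": "Patient reports chest pain since this morning..."},
)
for fact in resp.json()["facts"]:                 # hypothetical schema
    # Each fact carries its text, a group, a timestamp, and a confidence.
    print(fact["text"], fact["group"], fact["timestamp"], fact["confidence"])
```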

FactsR™ Groundedness

Groundedness measures whether the content in a generated note can be traced back to what was actually said in the consultation. An ungrounded note invents or distorts clinical detail, which in a medical context is a liability. FactsR™ maintains groundedness at 94%, comfortably above the level seen in clinician-written reference notes on the Primock57 benchmark.
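
As a simplified illustration of the idea, not the actual evaluation protocol, groundedness can be thought of as the share of extracted facts whose content can be located in the transcript. The token-overlap check below is a deliberately crude proxy:

```python
# Crude proxy for groundedness (illustration only): the share of
# extracted facts whose content words all appear in the transcript.

def grounded_share(facts: list[str], transcript: str) -> float:
    words = set(transcript.lower().split())
    grounded = sum(all(w in words for w in fact.lower().split())
                   for fact in facts)
    return grounded / len(facts)

transcript = "patient reports sharp chest pain radiating to the left arm"
facts = ["chest pain", "pain radiating to left arm", "shortness of breath"]
print(grounded_share(facts, transcript))  # 2/3: the last fact is ungrounded
```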

FactsR™ Conciseness

Conciseness measures how much extraneous detail ends up in a clinical note. Verbose AI output creates cognitive overhead, forcing clinicians to sift through noise to find what matters. FactsR™ reduces conciseness errors by 86%, bringing them down from 14.3% to 2.0%, by validating each extracted fact before it reaches the note.

FactsR™ Completeness

Completeness measures whether clinically relevant information from the consultation makes it into the final note. Missed findings, omitted medications, or undocumented symptoms can affect care decisions downstream. FactsR™ reduces completeness errors by 49%, with missing content dropping from 23.3% to 11.7%, through its real-time extraction and refinement loop.
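
For readers who want to verify the relative-reduction arithmetic behind the conciseness and completeness figures:

```python
# Worked check of the relative-reduction arithmetic quoted above.

def relative_reduction(before_pct: float, after_pct: float) -> float:
    return (before_pct - after_pct) / before_pct * 100

print(round(relative_reduction(14.3, 2.0), 1))   # 86.0 -> conciseness errors cut by 86%
print(round(relative_reduction(23.3, 11.7), 1))  # 49.8 -> quoted as 49% above
```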

Frequently asked questions

Should I interpret the benchmark scores as real-world accuracy rates?

No. The dataset deliberately overrepresents difficult and adversarial examples by roughly 3.5x. A 60% score here can coexist with strong performance in typical clinical use. The benchmark is a stress test, not an average-case measurement.

Regarding safety, what does the red teaming score actually mean?

About a third of the benchmark consists of physicians deliberately trying to break models. Strategies include false premises, role-play framing, and presenting questionable diagnoses as fact. Symphony scores 59.0 on this subset against GPT-5.4's 30.3. That gap reflects robustness under adversarial clinical pressure, not just routine performance.

What kinds of tasks does the HealthBench Professional benchmark actually test?

Three categories reflecting real clinical workflows: care consult (differential diagnosis, treatment reasoning), writing and documentation (note generation, coding, patient messaging), and medical research (synthesizing and finding clinical evidence).

Is Symphony HIPAA compliant?

Yes. Symphony is built for healthcare environments, including support for sovereign cloud deployments for organisations with strict data residency requirements.

How accurate is the speech recognition in practice?

2% Word Error Rate (WER) on clinical speech across 150,000+ medical terms and 14 languages. In practical terms, that's three to four times fewer errors than the next best alternatives.

What makes FactsR different from standard note summarization?

It runs in real time during the consultation rather than after. By the time the visit ends, structured facts are already extracted, validated, and timestamped.