BENCHMARKS

Ranked #1 across the benchmarks that matter in clinical AI

Independent benchmarks across medical coding, speech recognition, clinical reasoning, and agentic AI. Tested head-to-head against the largest AI labs in the world.

Significant outperformance across the full clinical-reasoning stack

Corti Symphony leads the next best competitor across agent reasoning, medical coding, and speech accuracy.

The hardest problems in healthcare won't be solved by the biggest models. They'll be solved by specialized, clinical-grade models, validated in production.

Clinical tasks have no margin for error. A model that scores well on general benchmarks performs very differently when the task is ICD-10 coding, clinical speech recognition, or multi-step reasoning over a patient record. The gap between claimed and measured performance is where most healthcare AI falls apart.

Corti's benchmarks are run on real clinical tasks, against real inputs, head-to-head with the largest AI labs in the world. The results are published so you can verify them yourself.

AGENTS

Outperforming every major LLM on OpenAI's clinical benchmark

The highest performance across all major models on HealthBench Professional.

#1 on HealthBench Professional

HealthBench Professional is an open benchmark developed by OpenAI to evaluate LLMs on the kinds of tasks clinicians actually bring to AI in practice. Each example is drawn from real physician interactions, scored against rubrics written and adjudicated by three or more physicians. The benchmark was deliberately designed to be challenging, with difficult examples enriched roughly 3.5 times their natural prevalence in the dataset. Corti Symphony achieves an overall score of 60.5%, placing it above all other models, including OpenAI's GPT-5.4, Anthropic's Claude Opus 4.7, and, notably, ChatGPT for Clinicians, OpenAI's own direct-to-clinician product, which scores 59% on the same benchmark.

The safest model under the most difficult conditions

Red teaming refers to the adversarial subset of HealthBench Professional, where clinicians deliberately attempt to surface failure modes rather than test routine usage. Roughly one third of the benchmark consists of red teaming examples. It is the harder portion of the evaluation, designed to expose how models behave under pressure. Corti Symphony scores 59.0 on red teaming overall, nearly double GPT-5.4's 30.3, with particularly strong performance in the research category where Symphony scores a perfect 100.0 against GPT-5.4's 64.0.

MEDICAL CODING

Leading automated medical coding benchmarks

Ranked #1 on performance across synthetic, academic, and real-world clinical data.

Accuracy Comparison

F-score is the harmonic mean of precision and recall, used here to evaluate medical coding accuracy across diagnostic and procedural codes. Corti scores 0.74, ahead of Anthropic at 0.59, Amazon at 0.58, OpenAI at 0.48, and Google at 0.45.
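
As a quick illustration of the arithmetic behind these figures, the sketch below computes an F1 score from a precision/recall pair and derives the relative improvement over the next-best system. The precision and recall inputs are hypothetical, chosen only to show how a 0.74 F1 can arise; the 0.74 and 0.59 scores come from the comparison above.

```python
# Minimal sketch: how an F1 score combines precision and recall,
# and how the ">25%" relative-improvement figure is derived.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision/recall pair: an F1 of 0.74 can arise from
# many combinations; these inputs are illustrative only.
print(round(f1(0.76, 0.72), 2))  # ~0.74

# Relative improvement of the top F1 (0.74) over the next-best (0.59):
print(round((0.74 - 0.59) / 0.59 * 100, 1))  # ~25.4, hence ">25%"
```
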
74%
F1 Accuracy. Outperforming Anthropic (59%), Amazon (58%), OpenAI (48%), and Google (45%).
>25%
Improvement vs. the next-best system. Beating competitors on both precision and recall.
#1
On noisy clinical datasets. Maintaining lead in real clinical environments, outperforming OpenAI and Google.

SPEECH TO TEXT

The most accurate medical speech to text API

Validated on 150,000+ medical terms across 14 languages and every clinical specialty.

Word Error Rate (WER)

Word Error Rate (WER) measures the accuracy of automatic speech recognition (ASR) systems by calculating how often a model substitutes, inserts, or deletes words compared to a correct reference transcript. It's expressed as a percentage, where lower is better and 0% is a perfect transcription.

In clinical settings, WER directly affects patient safety. A misheard drug name or dosage in a medical document is not just a typo. This makes low WER a functional requirement, not just a performance metric.

The chart compares leading ASR models on clinical speech. Corti Symphony's 2% WER represents four times fewer errors than NVIDIA's Parakeet and Canary Qwen at 8%, and three times fewer than OpenAI Whisper at 6%.
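
To make the metric concrete, here is a minimal WER implementation using word-level edit distance. The example sentences are invented for illustration and are not drawn from the benchmark.

```python
# Minimal sketch of how Word Error Rate is computed: the word-level
# Levenshtein distance (substitutions + insertions + deletions)
# divided by the number of words in the reference transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted drug name in a ten-word order: a 10% WER.
print(wer("give 5 mg of midazolam iv now and monitor sats",
          "give 5 mg of lorazepam iv now and monitor sats"))  # 0.1
```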

CLINICAL REASONING

The best fact extraction tool for ambient AI in healthcare

Clinically validated to reduce irrelevant note detail by 65% and cut post-visit edit times.
Benchmarked against the standard. The groundedness score confirms that facts extracted by FactsR™ are traceable to something actually said in the consultation.
Built for real time. FactsR™ extracts, refines, and validates clinical facts as the consultation unfolds. By the time the visit ends, the note is already populated with facts and their respective groupings.
API-native. A single API call returns structured, validated clinical facts with timestamps and confidence scores. Integrate it into any architecture in no time.
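
As a sketch of what a single-call integration could look like: the endpoint, payload, and response fields below are hypothetical placeholders for illustration, not Corti's actual API.

```python
# Illustrative only: a sketch of "one API call returns structured,
# validated clinical facts". The endpoint, payload, and field names
# below are hypothetical placeholders, not Corti's actual API.
import requests

resp = requests.post(
    "https://api.example.com/v1/facts",           # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},
    json={"transcript": "Patient reports chest pain since this morning..."},
)
for fact in resp.json()["facts"]:                 # hypothetical schema
    # Each fact carries its text, a group, a timestamp, and a confidence.
    print(fact["text"], fact["group"], fact["timestamp"], fact["confidence"])
```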

FactsR™ Groundedness

Groundedness measures whether the content in a generated note can be traced back to what was actually said in the consultation. An ungrounded note invents or distorts clinical detail, which in a medical context is a liability. FactsR™ maintains groundedness at 94%, comfortably above the level seen in clinician-written reference notes on the Primock57 benchmark.
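
As a simplified illustration of the idea, not the actual evaluation protocol, groundedness can be thought of as the share of extracted facts whose content can be located in the transcript. The token-overlap check below is a deliberately crude proxy:

```python
# Crude proxy for groundedness (illustration only): the share of
# extracted facts whose content words all appear in the transcript.

def grounded_share(facts: list[str], transcript: str) -> float:
    words = set(transcript.lower().split())
    grounded = sum(all(w in words for w in fact.lower().split())
                   for fact in facts)
    return grounded / len(facts)

transcript = "patient reports sharp chest pain radiating to the left arm"
facts = ["chest pain", "pain radiating to left arm", "shortness of breath"]
print(grounded_share(facts, transcript))  # 2/3: the last fact is ungrounded
```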

FactsR™ Conciseness

Conciseness measures how much extraneous detail ends up in a clinical note. Verbose AI output creates cognitive overhead, forcing clinicians to sift through noise to find what matters. FactsR™ reduces conciseness errors by 86%, bringing them down from 14.3% to 2.0%, by validating each extracted fact before it reaches the note.

FactsR™ Completeness

Completeness measures whether clinically relevant information from the consultation makes it into the final note. Missed findings, omitted medications, or undocumented symptoms can affect care decisions downstream. FactsR™ reduces completeness errors by 49%, with missing content dropping from 23.3% to 11.7%, through its real-time extraction and refinement loop.
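
For readers who want to verify the relative-reduction arithmetic behind the conciseness and completeness figures:

```python
# Worked check of the relative-reduction arithmetic quoted above.

def relative_reduction(before_pct: float, after_pct: float) -> float:
    return (before_pct - after_pct) / before_pct * 100

print(round(relative_reduction(14.3, 2.0), 1))   # 86.0 -> conciseness errors cut by 86%
print(round(relative_reduction(23.3, 11.7), 1))  # 49.8 -> quoted as 49% above
```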

Frequently asked questions

Should I interpret the benchmark scores as real-world accuracy rates?

No. The dataset deliberately overrepresents difficult and adversarial examples by roughly 3.5x. A 60% score here can coexist with strong performance in typical clinical use. The benchmark is a stress test, not an average-case measurement.

Regarding safety, what does the red teaming score actually mean?

About a third of the benchmark consists of physicians deliberately trying to break models. Strategies include false premises, role-play framing, and presenting questionable diagnoses as fact. Symphony scores 59.0 on this subset against GPT-5.4's 30.3. That gap reflects robustness under adversarial clinical pressure, not just routine performance.

What kinds of tasks does the HealthBench Professional benchmark actually test?

Three categories reflecting real clinical workflows: care consult (differential diagnosis, treatment reasoning), writing and documentation (note generation, coding, patient messaging), and medical research (synthesizing and finding clinical evidence).

Is Symphony HIPAA compliant?

Yes. Symphony is built for healthcare environments, including support for sovereign cloud deployments for organisations with strict data residency requirements.

How accurate is the speech recognition in practice?

2% Word Error Rate (WER) on clinical speech across 150,000+ medical terms and 14 languages. In practical terms, that's three to four times fewer errors than the next best alternatives.

What makes FactsR different from standard note summarization?

It runs in real time during the consultation rather than after. By the time the visit ends, structured facts are already extracted, validated, and timestamped.