Symphony for speech-to-text

Medical speech-to-text API built for clinical language

Symphony for Speech-to-Text is clinical speech recognition for healthcare developers. Real-time medical dictation, ambient documentation, and batch audio, with structured transcripts via REST and WebSocket.

98.6% word accuracy (1.4% Word Error Rate) in English real-time
#1 on medical dictation benchmarks in English, French, and German
98.5% formatted entity recall on dosages, units, and measurements
50% fewer missed terms with keyterm biasing

Trusted for more than 1 million interactions every week

Clinical speech recognition API

Three endpoints cover every clinical audio workflow. All three run through the same underlying pipeline: medical recognition, structured formatting, and contextual correction.
/transcribe

Real-time stateless dictation

WebSocket streaming for low-latency dictation. Send audio, receive interim and final transcripts. Built for EHR navigation, structured data capture, and voice UI control.

/streams

Real-time stateful transcription

WebSocket streaming for conversational clinical audio tied to an ongoing session. Speaker diarization, audio health events, and contextual correction run natively. Built for ambient documentation workflows.

/transcripts

Async batch processing

REST endpoint for pre-recorded audio and asynchronous processing. Upload files and retrieve transcripts at scale. Same quality, no streaming infrastructure required.
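
One way to keep the three modes straight in client code is a small lookup table. This is a minimal sketch: the paths, transports, and mode descriptions come from the text above, while the `Workflow` labels and the `pick_endpoint` helper are hypothetical names used for illustration and are not part of any Corti SDK.

```python
from dataclasses import dataclass
from enum import Enum


class Workflow(Enum):
    """Hypothetical workflow labels, used only in this illustration."""
    DICTATION = "dictation"   # clinician dictating into a field
    AMBIENT = "ambient"       # multi-speaker clinical conversation
    BATCH = "batch"           # pre-recorded audio files


@dataclass(frozen=True)
class Endpoint:
    path: str        # endpoint path as named in the text
    transport: str   # "websocket" or "rest"
    mode: str        # short description taken from the text


ENDPOINTS = {
    Workflow.DICTATION: Endpoint("/transcribe", "websocket", "real-time stateless"),
    Workflow.AMBIENT: Endpoint("/streams", "websocket", "real-time stateful"),
    Workflow.BATCH: Endpoint("/transcripts", "rest", "async batch"),
}


def pick_endpoint(workflow: Workflow) -> Endpoint:
    """Return the endpoint description that matches a workflow."""
    return ENDPOINTS[workflow]


print(pick_endpoint(Workflow.AMBIENT).path)  # prints "/streams"
```

Keeping the mapping in one place means the rest of an application only decides which workflow it is in, not which transport to open.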

Ship speech-enabled clinical software faster

An orchestrated pipeline, not a single-pass model.

Symphony treats speech-to-text as a structured inference problem. Recognition, formatting, and correction run as an orchestrated pipeline rather than collapsing into one step. The result is text that is clinically accurate and operationally ready.

Stream in real time or upload batch.

Stateless real-time dictation, stateful conversational transcription, and async batch processing all run through the same underlying pipeline. Build across modes without managing separate vendors or accepting differences in accuracy.

Inject context. Bias toward the keywords that matter.

Supply a terminology list at inference time, and Symphony biases recognition toward the terms in your context - rare drug names, facility abbreviations, clinician identifiers. No fine-tuning, no retraining cycle. Reduces missed terms by ~50% without affecting precision.
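
One way to check the missed-term claim on your own audio is to count which expected terms fail to appear in a transcript, with and without a keyterm list. The sketch below is a local illustration only: `missed_terms` and `term_recall` are hypothetical helpers, the matching is a simple case-insensitive phrase search, and the two transcripts are invented strings rather than Symphony output.

```python
import re


def missed_terms(transcript: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that do not occur in the transcript."""
    text = transcript.lower()
    missed = []
    for term in expected_terms:
        # Match the whole phrase, not a fragment inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if not re.search(pattern, text):
            missed.append(term)
    return missed


def term_recall(transcript: str, expected_terms: list[str]) -> float:
    """Fraction of expected terms found in the transcript."""
    if not expected_terms:
        return 1.0
    return 1 - len(missed_terms(transcript, expected_terms)) / len(expected_terms)


# Invented example data for illustration.
keyterms = ["apixaban", "metoprolol", "St. Mary's", "HbA1c"]
without_biasing = "Patient on a pixaban and metoprolol, HbA1c checked at Saint Marys."
with_biasing = "Patient on apixaban and metoprolol, HbA1c checked at Saint Mary's."

print(missed_terms(without_biasing, keyterms))  # ['apixaban', "St. Mary's"]
print(missed_terms(with_biasing, keyterms))     # ["St. Mary's"]
```

Running the same term list against both transcripts gives a before/after count you can compare directly with the reduction quoted above.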

Catch quality problems before they reach the transcript.

Symphony Audio Health Events surface real-time quality signals during the interaction, so your pipeline can prompt a correction rather than surface a bad transcript after the fact.
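
A client might react to such signals roughly as follows. This is a sketch under stated assumptions: the text does not give the event format, so the JSON shape (`type`, `issue`) and the issue names below are hypothetical placeholders, not Symphony's actual payload.

```python
import json
from typing import Optional

# Hypothetical issue names mapped to prompts shown to the clinician.
PROMPTS = {
    "low_volume": "Please move closer to the microphone.",
    "clipping": "Audio is distorted; please lower the input gain.",
    "background_noise": "Background noise is high; please reduce it if possible.",
}


def prompt_for_message(raw: str) -> Optional[str]:
    """Return a user-facing prompt for an audio health event, else None.

    Assumes (hypothetically) that events arrive as JSON objects such as
    {"type": "audio_health", "issue": "low_volume"}.
    """
    message = json.loads(raw)
    if message.get("type") != "audio_health":
        return None  # transcripts and other messages are handled elsewhere
    return PROMPTS.get(message.get("issue"), "Audio quality issue detected.")


print(prompt_for_message('{"type": "audio_health", "issue": "low_volume"}'))
```

The point of the pattern is that the prompt reaches the speaker while the interaction is still running, so the audio can be corrected at the source.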

Structured outputs your agents can act on directly.

Entities are structured, punctuation is rendered, and dosages and units are in the right form. Your agents and downstream systems can act on the transcript without a post-processing layer.

Benchmarks

The most accurate speech-to-text for clinical use cases

Best-in-class performance across medical terminology, formatting, and dictation commands.

Accuracy comparison

[Charts: Keyword Accuracy; Word Accuracy Rate]

1.4% Word Error Rate. Outperforms ElevenLabs, OpenAI, AWS, NVIDIA, and Google by more than 20% in both real-time and offline English benchmarks.
98% Formatting Accuracy. Dosages, units, dates, and measurements render correctly for downstream use.
25%+ improvement vs. legacy dictation. Symphony 4.1% WER vs. Dragon Medical One 5.7% WER on medical dictation benchmarks.
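
For readers who want to reproduce this kind of number, word accuracy is simply 1 minus Word Error Rate, which is how 1.4% WER corresponds to 98.6% word accuracy. The sketch below computes WER with the standard word-level edit distance. It assumes lowercasing and whitespace tokenization only; published benchmarks usually apply more careful text normalization, which is not shown here, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "patient takes metoprolol 25 mg twice daily with food today"
hypothesis = "patient takes metropolol 25 mg twice daily with food today"

wer = word_error_rate(reference, hypothesis)
print(f"WER {wer:.1%}, word accuracy {1 - wer:.1%}")  # WER 10.0%, word accuracy 90.0%
```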

Hear it from them

What builders are saying

"By adding Corti’s API directly into our platform, we’re giving customers the latest capabilities without forcing them to learn new systems or abandon familiar workflows."
Dr. Thomas Brauner
CEO, Speech Processing Solutions
"In a clinical conversation, every word matters - a missed medication name, a misheard dosage, or a mistranscribed symptom can change the meaning of an encounter. Symphony’s accuracy on clinical terminology gives us the foundation to bring more trusted AI capabilities into clinical workflows with our Voicepoint Xenon® platform."
Pierre Corboz
Head of Solutions & Business Development, Voicepoint

Symphony for Speech-to-Text Capabilities

Commands

Control your application interface by voice, using spoken words to trigger actions and navigate fields.

Formatting

Render dosages, units, dates, and measurements in clinical form automatically. No post-processing needed.

Diarization

Attribute speech to the right speaker across multi-participant conversations.

Interim results

Show transcript output in real time as the clinician speaks, word by word.

Audio health events

Surface input quality signals in the API response, before a bad transcript reaches your users.

Auto/spoken punctuation

Handle punctuation at the recognition layer for a natural dictation experience.

Replacement rules

Control how words, phrases, and acronyms appear in final transcript output.

Custom dictionary

Improve recognition for proper nouns, facility names, and specialty-specific terminology.
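
To make the replacement-rules idea above concrete, here is a minimal local sketch of what such rules do to final text. It is an illustration of the concept, not Symphony's implementation or API: the `apply_replacements` helper and the plain phrase-to-output dictionary are assumptions made for the example.

```python
import re


def apply_replacements(text: str, rules: dict[str, str]) -> str:
    """Replace whole phrases, case-insensitively, using a phrase -> output map."""
    # Longest phrases first, so a multi-word rule wins over a shorter one.
    for phrase in sorted(rules, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        replacement = rules[phrase]
        # A function replacement avoids backslash handling in the output text.
        text = re.sub(pattern, lambda _m, r=replacement: r, text, flags=re.IGNORECASE)
    return text


rules = {
    "blood pressure": "BP",
    "milligrams": "mg",
    "twice a day": "BID",
}

print(apply_replacements("Blood pressure stable. Continue 25 milligrams twice a day.", rules))
# BP stable. Continue 25 mg BID.
```

Because the rules run one after another, the output of one rule can be matched by a later rule; keep rule outputs distinct from rule phrases if that matters for your vocabulary.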

Production-ready

Global coverage across 15+ languages

Available today through a unified API supporting live streaming dictation, conversational speech, and batch audio file processing. A single integration handles every voice workflow your product needs.

Build any voice-powered healthcare workflow

Symphony supports dictation, ambient documentation, and agentic workflows through a single API. No separate systems. No accuracy tradeoffs between modes.

Stream a conversation. Receive a structured transcript.

Connect to /streams for stateful, real-time conversational transcription. Audio is associated with an ongoing interaction - diarization, contextual correction, and audio health events run natively in this mode.

Speaker diarization segments the transcript by doctor and patient automatically

Audio health events surface input quality issues in real time, before the encounter ends

Get started with the Corti SDK in JavaScript and C# .NET
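
Once segments carry a speaker label, turning them into readable turns is a small grouping step. The sketch below assumes a hypothetical segment shape (a dictionary with `speaker` and `text` keys) because the text does not specify the transcript schema; it is written in plain Python for illustration and does not use the Corti SDK.

```python
from itertools import groupby


def merge_turns(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(segments, key=lambda seg: seg["speaker"]):
        turns.append({
            "speaker": speaker,
            "text": " ".join(seg["text"] for seg in group),
        })
    return turns


# Invented example segments in the assumed shape.
segments = [
    {"speaker": "doctor", "text": "Any chest pain?"},
    {"speaker": "patient", "text": "No."},
    {"speaker": "patient", "text": "Just some shortness of breath."},
    {"speaker": "doctor", "text": "Since when?"},
]

for turn in merge_turns(segments):
    print(f'{turn["speaker"]}: {turn["text"]}')
```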

Low-latency transcription built for command-and-control.

Connect to /transcribe for stateless, real-time dictation. Built for command-and-control workflows where spoken commands control the interface, edit text, and trigger formatting operations.

Industry-leading accuracy on medical dictation benchmarks, outperforming Dragon Medical One by 20%+

Spoken punctuation, measurements, and abbreviations handled at the recognition layer

Supply a keyterm list at inference time to bias recognition toward your vocabulary

Give your agents clean, structured input. Not raw text.

Symphony delivers structured, validated outputs your agents can act on directly. It handles the gap between what a clinician says and what your software needs to do.

Structured outputs let voice directly drive workflows, orders, and documentation

Context injection and controllable outputs mean your agent starts from verified input

Controllable outputs reduce the surface area for LLM hallucination downstream

Compare

A speech-to-text pipeline built for healthcare

Not adapted from general audio.

General-purpose APIs don't solve for clinical use.

VS. GENERIC ASR
Feature | Symphony | Generic ASR
Medical term accuracy | Native |
Real-time formatting | Native |
Spoken punctuation | Native |
Custom commands | Native |
Speaker diarization | Native | Partial
Audio health events | Native |

Legacy software wasn't designed for builders.

Build the next generation of dictation applications. Without the legacy constraints.

Feature | Symphony | Legacy Dictation
Medical vocabulary | Best-in-class | Built-in
Real-time formatting | Native, structured | Native, unstructured
Flexible developer API | Yes | No API
Embeddable in your app | Yes | Client-side software
Speaker diarization | Native | No
Structured outputs | Agent-ready | Cursor input

Discover the research behind Symphony for Speech-to-Text

Nine years of peer-reviewed research, published at NeurIPS, ICML, ICLR, and ACL. Now shipping as an API.

Start building with Symphony for Speech-to-Text

$50 free credits. Full API access. No card required.

Frequently asked questions

How accurate is Symphony for Speech-to-Text?

Symphony is top-ranked on medical dictation benchmarks across English, French, and German, and is available across 14 languages worldwide. It outperforms OpenAI, ElevenLabs, Whisper, and Parakeet on clinical audio, and matches or exceeds state-of-the-art on general-purpose benchmarks.

What does Symphony for Speech-to-Text cost?

$0.0065 per audio minute, all inclusive. Diarization, contextual correction, keyterm biasing, audio health events, multichannel speaker attribution, and all supported languages are included at no extra cost. No separate charges for transcript output.

10 minutes = $0.07. 1 hour = $0.39.
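
The arithmetic behind these figures can be checked in a few lines. This is a sketch: the per-minute price comes from the answer above, while rounding to the nearest cent (half up) is an assumption made to reproduce the quoted examples, since the billing rounding rule is not stated. The `cost_usd` and `minutes_for_credit` helpers are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_MINUTE = Decimal("0.0065")  # USD per audio minute, from the text


def cost_usd(minutes) -> Decimal:
    """Cost of the given audio minutes, rounded to the nearest cent (assumed)."""
    total = PRICE_PER_MINUTE * Decimal(str(minutes))
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def minutes_for_credit(credit_usd) -> int:
    """Whole audio minutes that a credit balance covers."""
    return int(Decimal(str(credit_usd)) / PRICE_PER_MINUTE)


print(cost_usd(10))            # 0.07  (0.065 before rounding)
print(cost_usd(60))            # 0.39
print(minutes_for_credit(50))  # 7692
```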

Is Symphony HIPAA compliant?

Yes. Symphony is built for healthcare environments, with sovereign cloud deployment options for organizations with strict data residency requirements.

How does Symphony handle hallucinations?

Symphony is stress-tested on non-speech audio and aggressively segmented inputs. It reports lower spurious insertion rates than every major competing system - an important signal for production clinical deployments.

How do the three Symphony for Speech-to-Text endpoints differ?

/transcribe is a stateless WebSocket endpoint for real-time dictation and voice-controlled interfaces. /streams is a stateful WebSocket endpoint for conversational transcription, where audio is associated with an ongoing interaction. /transcripts is an async REST endpoint for batch processing of pre-recorded audio files. All three share the same underlying pipeline with no performance differences between modes.

How does Corti Symphony for Speech-to-Text compare to Dragon Medical One?

On MedDictate, a realistic English medical dictation benchmark, Symphony achieves 4.6% WER, compared to 5.7% for Dragon Medical One, with higher medical term recall and a lower false discovery rate (0.79% vs. 1.33%). Unlike front-end dictation applications, Symphony is an API built for developers - no client-side software, no vendor lock-in, and structured outputs that downstream systems can act on directly.

How do I evaluate Symphony?

Every developer gets $50 of API credits to start - enough to process roughly 7,700 minutes of audio at $0.0065 per minute.

Get an API key, run your own audio through the appropriate endpoint, and measure results against a gold standard transcript using Corti Canal, our open-source evaluation tool that reports Word Error Rate, Character Error Rate, and Medical Term Recall.

If you want help setting up a benchmark or interpreting results, reach out to help@corti.ai.