Symphony for Speech-to-Text is clinical speech recognition for healthcare developers. Real-time medical dictation, ambient documentation, and batch audio, with structured transcripts via REST and WebSocket.
Trusted for more than 1 million interactions every week
Real-time stateless dictation
WebSocket streaming for low-latency dictation. Send audio, receive interim and final transcripts. Built for EHR navigation, structured data capture, and voice UI control.
Real-time stateful transcription
WebSocket streaming for conversational clinical audio tied to an ongoing session. Speaker diarization, audio health events, and contextual correction run natively. Built for ambient documentation workflows.
Async batch processing
REST endpoint for pre-recorded audio and asynchronous processing. Upload files and retrieve transcripts at scale. Same quality, no streaming infrastructure required.
Symphony treats speech-to-text as a structured inference problem. Recognition, formatting, and correction run as an orchestrated pipeline rather than collapsing into one step. The result is text that is clinically accurate and operationally ready.
Stateless real-time dictation, stateful conversational transcription, and async batch processing all run through the same underlying model. Build across modes without managing separate vendors or accepting differences in accuracy.
Supply a terminology list at inference time, and Symphony biases recognition toward the terms in your context: rare drug names, facility abbreviations, clinician identifiers. No fine-tuning, no retraining cycle. This reduces missed terms by roughly 50% without affecting precision.
Symphony Audio Health Events surface real-time quality signals during the interaction, so your pipeline can prompt a correction rather than surface a bad transcript after the fact.
Entities are structured, punctuation is rendered, dosages and units are in the right form. Your agents and downstream systems can act on the transcript without a post-processing layer.

Commands
Control your application interface by voice, using spoken words to trigger actions and navigate fields.
Formatting
Render dosages, units, dates, and measurements in clinical form automatically. No post-processing needed.
Diarization
Attribute speech to the right speaker across multi-participant conversations.
Interim results
Show transcript output in real time as the clinician speaks, word by word.
Audio health events
Surface input quality signals in the API response, before a bad transcript reaches your users.
Auto/spoken punctuation
Handle punctuation at the recognition layer for a natural dictation experience.
Replacement rules
Control how words, phrases, and acronyms appear in final transcript output.
Custom dictionary
Improve recognition for proper nouns, facility names, and specialty-specific terminology.
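To make the replacement-rules idea above concrete, here is a minimal Python sketch of what such a rule does to final text. It is an illustration only: the `apply_replacements` helper and the rule format (a plain dict mapping a spoken form to a written form) are assumptions made for this example, not Symphony's configuration schema, and in Symphony the rules run inside the service rather than in client code.

```python
import re

def apply_replacements(text: str, rules: dict) -> str:
    """Replace whole words or phrases, case-insensitively (hypothetical helper)."""
    if not rules:
        return text
    # Longest keys first so multi-word phrases win over shorter overlapping ones.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in rules.items()}
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)

# Example rules, invented for illustration.
rules = {"b i d": "BID", "cat scan": "CT scan"}
print(apply_replacements("Order a cat scan and give the dose b i d.", rules))
# Order a CT scan and give the dose BID.
```

Matching on word boundaries keeps a rule from firing inside a longer word, which matters for short acronyms.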
Available today through a unified API supporting live streaming dictation, conversational speech, and batch audio file processing. A single integration handles every voice workflow your product needs.
Connect to /streams for stateful, real-time conversational transcription. Audio is associated with an ongoing interaction. Diarization, contextual correction, and audio health events run natively in this mode.
Speaker diarization segments the transcript by doctor and patient automatically
Audio health events surface input quality issues in real time, before the encounter ends
Get started with the Corti SDK in JavaScript and C# .NET
Connect to /transcribe for stateless, real-time dictation. Built for command-and-control workflows where spoken commands control the interface, edit text, and trigger formatting operations.
Industry-leading accuracy on medical dictation benchmarks, outperforming Dragon Medical One by 20%+
Spoken punctuation, measurements, and abbreviations handled at the recognition layer
Supply a keyterm list at inference time to bias recognition toward your vocabulary
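A client of a real-time dictation endpoint usually has to reconcile interim and final results: interim text is provisional and gets overwritten, while final text is committed. The Python sketch below shows that bookkeeping only. The message shape (a `type` of `"interim"` or `"final"` plus a `text` field) and the `TranscriptBuffer` class are hypothetical, chosen for illustration; the actual message format should be taken from the API reference.

```python
from dataclasses import dataclass, field

@dataclass
class TranscriptBuffer:
    """Tracks committed text plus the latest provisional segment (hypothetical)."""
    finals: list = field(default_factory=list)
    interim: str = ""

    def handle(self, message: dict) -> None:
        # Assumed message shape: {"type": "interim" | "final", "text": str}
        if message.get("type") == "final":
            self.finals.append(message["text"])
            self.interim = ""  # the provisional text is superseded
        elif message.get("type") == "interim":
            self.interim = message["text"]  # overwrite, never append

    def display(self) -> str:
        parts = self.finals + ([self.interim] if self.interim else [])
        return " ".join(parts)

buf = TranscriptBuffer()
for msg in [
    {"type": "interim", "text": "patient reports"},
    {"type": "interim", "text": "patient reports chest"},
    {"type": "final", "text": "Patient reports chest pain."},
    {"type": "interim", "text": "no shortness"},
]:
    buf.handle(msg)
print(buf.display())  # Patient reports chest pain. no shortness
```

Keeping interim text separate from committed text means a UI can render it differently (for example, greyed out) and never has to undo text it already treated as final.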
Symphony delivers structured, validated outputs your agents can act on directly. It handles the gap between what a clinician says and what your software needs to do.
Structured outputs let voice directly drive workflows, orders, and documentation
Context injection and controllable outputs mean your agent starts from verified input
Controllable outputs reduce the surface area for LLM hallucination downstream
Nine years of peer-reviewed research, published at NeurIPS, ICML, ICLR, and ACL. Now shipping as an API.
$50 free credits. Full API access. No card required.
Symphony is top-ranked on medical dictation benchmarks across English, French, and German, and is available across 14 languages worldwide. It outperforms OpenAI, ElevenLabs, Whisper, and Parakeet on clinical audio, and matches or exceeds state-of-the-art on general-purpose benchmarks.
$0.0065 per audio minute, all inclusive. Diarization, contextual correction, keyterm biasing, audio health events, multichannel speaker attribution, and all supported languages are included at no extra cost. No separate charges for transcript output.
10 minutes = $0.07. 1 hour = $0.39.
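The per-minute arithmetic is easy to check. A minimal Python sketch, assuming the flat $0.0065 rate quoted above with no tiers or minimums; `estimate_cost` is a name made up for this example:

```python
from decimal import Decimal

RATE_PER_MINUTE = Decimal("0.0065")  # USD per audio minute, from the pricing above

def estimate_cost(minutes: float) -> Decimal:
    """Unrounded cost in USD for a given number of audio minutes."""
    return RATE_PER_MINUTE * Decimal(str(minutes))

print(estimate_cost(10))  # 0.0650
print(estimate_cost(60))  # 0.3900
```

The $0.07 figure for 10 minutes is the unrounded $0.065 rounded up to the nearest cent.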
Yes. Symphony is built for healthcare environments, with sovereign cloud deployment options for organizations with strict data residency requirements.
Symphony is stress-tested on non-speech audio and aggressively segmented inputs. It reports lower spurious insertion rates than every major competing system, an important signal for production clinical deployments.
/transcribe is a stateless WebSocket endpoint for real-time dictation and voice-controlled interfaces. /streams is a stateful WebSocket endpoint for conversational transcription, where audio is associated with an ongoing interaction. /transcripts is an async REST endpoint for batch processing of pre-recorded audio files. All three share the same underlying pipeline with no performance differences between modes.
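The choice between the three endpoints can be summarized as a lookup. The sketch below is illustrative Python, not SDK code: the `Mode` enum and `endpoint_for` helper are hypothetical names, and only the relative paths come from this page (the stateful path is written as /streams, matching the connection instructions earlier on the page). Host, version prefix, and authentication are deliberately omitted.

```python
from enum import Enum

class Mode(Enum):
    DICTATION = "dictation"        # stateless, real-time
    CONVERSATION = "conversation"  # stateful, real-time
    BATCH = "batch"                # async, pre-recorded audio

# (relative path, transport) per mode
ENDPOINTS = {
    Mode.DICTATION: ("/transcribe", "websocket"),
    Mode.CONVERSATION: ("/streams", "websocket"),
    Mode.BATCH: ("/transcripts", "rest"),
}

def endpoint_for(mode: Mode) -> tuple:
    """Return the (path, transport) pair for a workflow mode."""
    return ENDPOINTS[mode]

print(endpoint_for(Mode.BATCH))  # ('/transcripts', 'rest')
```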
On MedDictate, a realistic English medical dictation benchmark, Symphony achieves 4.6% WER, compared to 5.7% for Dragon Medical One, with higher medical term recall and a lower false discovery rate (0.79% vs. 1.33%). Unlike front-end dictation applications, Symphony is an API built for developers: no client-side software, no vendor lock-in, and structured outputs that downstream systems can act on directly.
Every developer gets $50 of API credits to start, enough to process roughly 7,700 minutes of audio at $0.0065 per minute.
Get an API key, run your own audio through the appropriate endpoint, and measure results against a gold standard transcript using Corti Canal, our open-source evaluation tool that reports Word Error Rate, Character Error Rate, and Medical Term Recall.
If you want help setting up a benchmark or interpreting results, reach out to help@corti.ai.
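For orientation, Word Error Rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The Python sketch below computes it with a standard dynamic-programming table. It is not Corti Canal's implementation: the `wer` function, its whitespace tokenization, and its lowercasing are simplifying assumptions, and real evaluations normally apply more careful text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            )
        prev = curr
    return prev[-1] / len(ref)

gold = "start metoprolol 25 mg twice daily"
out = "start metoprolol 25 mg twice a day"
print(round(wer(gold, out), 3))  # 0.333
```

In the example, one substitution ("daily" to "a") and one insertion ("day") against six reference words give 2/6. Because insertions count as errors, WER can exceed 100% on very noisy output.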