By Lars Maaløe, Co-Founder & CTO, Corti
Although dictation tools have long been staples of healthcare technology, the journey to building robust, accurate, and trusted ASR systems for healthcare is anything but simple - and far from complete.
While speech recognition is often seen as a solved problem in many industries, in healthcare, it's a very different story. Unlike dictating a grocery list or a voice memo, healthcare speech involves dense, domain-specific language. Mistakes go much further than frustration or inefficiency - they can be dangerous.
At Corti, we recently launched a healthcare-specialized Dictation API built on our Solo foundation model. It lets developers build applications that compete with established incumbents from day one, with early users reporting up to 99% accuracy versus the industry standard of 92%. Along the way, we've learned firsthand what makes healthcare ASR so uniquely difficult - and how to help developers rise to the challenge:
1. Domain-specific vocabulary: Why general ASR fails
General-purpose speech recognition models like Whisper - trained on massive but largely unfiltered internet datasets - simply weren't designed to handle the complexity of medical language. On medical dictation, even state-of-the-art general models can show word error rates of 40% or more.
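For readers less familiar with the metric: word error rate (WER) is the word-level edit distance between the reference transcript and the model's output, normalized by the length of the reference. A minimal TypeScript sketch:

```typescript
// Word error rate (WER): word-level edit distance between a reference
// transcript and the ASR hypothesis, divided by reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);

  // Levenshtein dynamic program over words:
  // substitutions, insertions, and deletions each cost 1.
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,       // deletion
        d[i][j - 1] + 1,       // insertion
        d[i - 1][j - 1] + cost // substitution or match
      );
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}

// Two substituted words out of two reference words -> WER of 1.0:
console.log(wordErrorRate("glioblastoma multiforme", "global stormer"));
```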
Doctors speak in Latin terms, medication names, abbreviations, and clinical shorthand that would leave most people scratching their heads. As I often explain, if you listened to two physicians talking in their normal expert language, you would understand very little. And unlike ambient documentation, which largely captures the simplified language of doctor-patient conversation, clinician-only settings shift into an entirely different medical register spanning hundreds of thousands of specialized terms.
Worse, when general-purpose encoder-decoder models encounter unfamiliar terms, they often hallucinate. A phrase like "glioblastoma multiforme" might turn into "global stormer," or "I have a brain tumor" could be mistranscribed as "I have a pain tremor." These aren't hypothetical examples - they represent real risks when using general-purpose ASR in healthcare.
With Corti's Dictation API, we've flipped this paradigm. Our ASR models are hyper-specialized for healthcare, trained on rich, medically annotated datasets. They don't just try to guess; they flag uncertainty, assigning confidence levels to critical terms so clinicians always know where to look twice.
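As a concrete sketch of how a client might consume such confidence signals - the response shape and threshold here are illustrative assumptions, not our actual API schema (that lives at docs.corti.ai):

```typescript
// Hypothetical word-level result shape - illustrative only; consult
// docs.corti.ai for the actual Dictation API response schema.
interface TranscriptWord {
  text: string;
  confidence: number; // 0..1, the model's certainty for this word
}

// Illustrative threshold; in practice this would be tuned per deployment.
const LOW_CONFIDENCE = 0.85;

// Surface the terms a clinician should look at twice.
function uncertainTerms(words: TranscriptWord[]): TranscriptWord[] {
  return words.filter((w) => w.confidence < LOW_CONFIDENCE);
}

const words: TranscriptWord[] = [
  { text: "glioblastoma", confidence: 0.78 },
  { text: "multiforme", confidence: 0.91 },
];
console.log(uncertainTerms(words)); // [{ text: "glioblastoma", ... }]
```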
2. Embedding semantic and contextual intelligence
In clinical conversations, context is everything. Take the word "cold" - it might describe a viral infection or a patient's temperature sensitivity. Without understanding the broader sentence, it's impossible to know whether the clinician is describing a common cold or noting that a patient feels cold, which could point to shock or hypothermia.
That's why building truly intelligent ASR for healthcare requires multi-level contextual modeling.
At Corti, our models work in tiers (sketched in code below):
- A fast model for real-time input/output
- A more contextual model for refined understanding
- A large language model (LLM) layer to interpret broader semantic meaning across the full interaction
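A conceptual sketch of how these tiers compose - the interfaces are stand-ins for illustration, not our internal architecture:

```typescript
// Conceptual tiers only - these interfaces are illustrative stand-ins,
// not Corti's internal architecture.
interface FastASR {
  transcribe(audio: ArrayBuffer): Promise<string>; // low-latency first pass
}
interface ContextModel {
  refine(draft: string): Promise<string>; // sentence-level refinement
}
interface SemanticLLM {
  // Interprets meaning across the whole interaction, e.g. deciding
  // whether "cold" means an infection or a temperature complaint.
  resolve(transcript: string, history: string[]): Promise<string>;
}

async function transcribeTiered(
  audio: ArrayBuffer,
  history: string[], // earlier utterances from the same interaction
  fast: FastASR,
  context: ContextModel,
  llm: SemanticLLM
): Promise<string> {
  const draft = await fast.transcribe(audio); // shown to the user immediately
  const refined = await context.refine(draft);
  return llm.resolve(refined, history);
}
```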
We've reverse-engineered our own systems to evaluate how much context is required to resolve medical ambiguities - because we believe understanding wins out when patient safety is on the line.
3. Balancing data privacy with data hunger
Healthcare AI lives in the shadow of stringent data regulations. Rightly so. But that doesn't mean we have to choose between privacy and performance.
At Corti, we've built anonymization pipelines for both audio and text, enabling us to process sensitive data safely and legally under GDPR and HIPAA frameworks. But we didn't stop there - we also generate synthetic data on the fly, creating training datasets that mimic real clinical scenarios without compromising a single patient's privacy.
This dual approach means our models get smarter while staying compliant, unlocking new capabilities without crossing ethical lines. For healthcare developers, this solves one of the most challenging obstacles to building effective AI solutions.
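To make the text half of this concrete: below is a deliberately toy de-identification pass. Real anonymization pipelines (ours included) rely on trained models for both audio and text, not regex heuristics like these:

```typescript
// Toy text de-identification pass - real pipelines use trained NER
// models and audio-level redaction, not regex heuristics like these.
const PHI_PATTERNS: [RegExp, string][] = [
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"],          // US SSN format
  [/\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/g, "[DATE]"], // simple date
  [/\b[A-Z][a-z]+ [A-Z][a-z]+\b/g, "[NAME]"],   // crude name heuristic
];

function deidentify(text: string): string {
  return PHI_PATTERNS.reduce((t, [re, tag]) => t.replace(re, tag), text);
}

console.log(deidentify("John Smith, DOB 4/12/1980, reports chest pain."));
// -> "[NAME], DOB [DATE], reports chest pain."
```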
4. Real-time transcription: The non-negotiable standard
In clinical workflows, delay is the enemy. Whether you're a radiologist needing to dictate notes in sync with your cursor or a family physician juggling multiple consultations, real-time transcription should not be seen as a luxury - it's table stakes for patient safety.
Legacy systems like those from Microsoft Nuance or Solventum Fluency Direct understood this early on, evolving from offline transcription services (sometimes powered by human transcriptionists in call centers) to real-time solutions. But many newer tools fall back on post-consultation summaries or batch transcription. That's not good enough.
Real-time output is less error-prone because users can immediately spot and correct mistakes during the conversation. When users can't see the transcript until after the interaction, those errors become much more disruptive to workflow and potentially dangerous.
Our Dictation API delivers near-instant transcription, empowering users to correct misfires in the moment, maintain cognitive flow, and reduce documentation fatigue. It's about aligning with how real clinicians actually work - not how engineers imagine they might.
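In the browser, a streaming integration typically looks something like the sketch below - the endpoint URL and message shape are placeholders rather than our actual interface, which is documented at docs.corti.ai:

```typescript
// Placeholder endpoint and message shape - not Corti's actual API;
// see docs.corti.ai for the real streaming interface.
const socket = new WebSocket("wss://example.invalid/dictation/stream");

socket.onmessage = (event: MessageEvent) => {
  const { text, isFinal } = JSON.parse(event.data);
  // Render interim results immediately so the clinician can catch
  // misfires mid-dictation instead of after the encounter.
  renderTranscript(text, isFinal);
};

async function streamMicrophone(): Promise<void> {
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(mic);
  recorder.ondataavailable = (e) => socket.send(e.data); // audio chunk
  recorder.start(250); // emit a chunk every 250 ms for low latency
}

// Assume the host application supplies the rendering function.
declare function renderTranscript(text: string, isFinal: boolean): void;
```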
A new standard in healthcare ASR
Launching our Dictation API was about more than building another transcription tool. It was about showing what's possible when you go deep on a vertical instead of staying generic. It was about enabling any developer - from large EHR providers to scrappy healthtech startups - to build world-class medical dictation into new or existing systems in minutes, not months.
With just a few lines of code, developers can access our foundation model for industry-leading medical dictation. Our included web component SDK allows implementation in browser applications in less than an hour - a process that traditionally takes weeks or even months.
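For illustration, mounting such a component might look like the sketch below - the package name, tag name, attribute, and event are placeholders; the real names are in our documentation:

```typescript
// All names below are placeholders for illustration - check
// docs.corti.ai for the actual web component SDK names.
import "@corti/dictation-web-component"; // hypothetical package name

const dictation = document.createElement("corti-dictation"); // hypothetical tag
dictation.setAttribute("api-key", "YOUR_API_KEY");

// Hypothetical event carrying the live transcript text.
dictation.addEventListener("transcript", (e: Event) => {
  console.log((e as CustomEvent<string>).detail);
});

document.body.appendChild(dictation);
```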
And most importantly, it was about helping healthcare professionals focus on care, not keyboards. From radiology suites to emergency departments, family doctors to psychologists, our dictation technology adapts to each unique healthcare environment, transforming clinical documentation from a burden into an asset. And it continuously improves.
As we continue to push the boundaries of AI in medicine, we're reminded daily that accuracy, context, privacy, and speed aren't nice-to-haves - they're survival essentials. And by meeting these challenges head-on, we're building not just better technology, but a better standard for healthcare itself.
Explore Corti's Dictation API
Visit docs.corti.ai to learn how our API and SDK can elevate your medical software with intelligent, real-time dictation. Whether you're building for radiology, emergency medicine, mental health, or primary care, we've built this technology to adapt to your world - not the other way around.