Mar 18, 2025

The future of speech to text

Speech-to-text technology has evolved remarkably over the past few years, yet it still falls short of achieving a seamless, human-like interaction. Today’s systems rely on a combination of Automatic Speech Recognition (ASR) and Large Language Models (LLMs).

Introduction

Speech-to-text technology has come a long way, but it remains further from optimal than many assume. Today, AI systems rely on two distinct technologies to handle spoken language: Automatic Speech Recognition (ASR) and Large Language Models (LLMs). ASR converts speech into text but does not understand meaning or intent. LLMs, on the other hand, process text, extract meaning, and generate responses, but they rely on accurate transcriptions to function properly. This disconnect introduces latency, accuracy issues, and context loss, making AI-driven conversations feel unnatural.

The next major advancement in speech AI will be the integration of ASR and LLMs into a unified system that can transcribe, understand, and respond in real time. However, several challenges must be addressed before this becomes a reality.

How speech-to-text works today

Current AI speech technology follows a three-step process. First, an ASR tool such as Whisper, AssemblyAI, or Deepgram transcribes spoken words into text. Second, the transcribed text is sent to an LLM, such as GPT-4, which interprets its meaning. Third, the LLM generates a response based on that interpretation. While this pipeline enables AI to interact with spoken language, it is inefficient due to delays, the lack of reasoning in ASR, and its dependence on transcription accuracy, among other disadvantages.
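To make the hand-off concrete, here is a minimal sketch of that pipeline, assuming the open-source openai-whisper package and the OpenAI Python client; the model choices, audio filename, and prompt framing are placeholders rather than a prescribed setup.

```python
# Minimal sketch of today's two-stage speech pipeline: ASR first, then an LLM.
# Assumes the open-source `openai-whisper` and `openai` packages; the audio
# filename and model choices are illustrative placeholders.
import whisper
from openai import OpenAI

asr_model = whisper.load_model("base")   # speech -> text, no understanding of meaning
client = OpenAI()                        # text -> response, never sees the audio

def respond_to_speech(audio_path: str) -> str:
    # Step 1: transcribe the full utterance. Any recognition errors are
    # baked into the text from this point on.
    transcript = asr_model.transcribe(audio_path)["text"]

    # Step 2: only now does the LLM see the words; it must trust the transcript.
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript}],
    )
    # Step 3: return the generated reply.
    return completion.choices[0].message.content

print(respond_to_speech("caller_question.wav"))
```

Note how the two stages only communicate through a single string of text: everything the LLM will ever know about the caller has to survive that hand-off.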

The drawbacks of the current approach

The primary limitation of today's speech AI technology is that ASR and LLMs operate in isolation rather than as an integrated system. ASR is purely transcriptional and does not understand the meaning behind words, while LLMs rely on ASR's accuracy but do not have access to deeper speech-level insights. Additionally, ASR does not correct itself in real time, meaning that any transcription errors are passed to the LLM, which may then generate incorrect responses.
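A toy example makes the propagation visible; the utterance and the specific confusion below are invented for illustration, not output from any real ASR system.

```python
# Invented example of error propagation; not output from a real ASR run.
spoken     = "Please cancel my flight to Austin."
transcript = "Please cancel my flight to Boston."   # a plausible acoustic confusion

# The LLM never hears the audio, so it answers the question it was handed and
# confidently acts on the wrong city, with nothing to signal that anything is off.
prompt = f'Customer said: "{transcript}"\nDraft a reply confirming the action.'
print(prompt)
```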

Beyond this fundamental disconnect, speech AI faces several other challenges. Latency remains a significant issue, as the AI does not process speech as it is spoken but instead waits for full transcription before generating a response, introducing unnatural delays in conversation. For example, a reported case showed that after integrating LiteLLM into an ASR workflow, the average latency increased by approximately 40 milliseconds, rising from 180ms to 220ms (GitHub). This may not seem like much, but in live interactions, small delays compound, making AI responses feel sluggish. Additionally, LLMs use autoregressive decoding, meaning they generate text token by token, which further adds to response time and makes real-time speech interaction difficult (arXiv). Since voice data is more computationally intensive than text, processing it efficiently requires significant computational resources, further contributing to latency (Gladia).
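To show how those small delays stack up, here is a back-of-the-envelope sketch; every millisecond figure is an assumption chosen for illustration, except the 220ms ASR value, which echoes the reported case above.

```python
# How per-stage delays compound into one conversational turn. All figures are
# illustrative assumptions except the 220ms ASR value cited above.
stage_latency_ms = {
    "endpointing (waiting for the speaker to finish)": 300,
    "ASR transcription": 220,
    "LLM time to first token": 600,
    "remaining tokens streamed out": 800,
}

running_total = 0
for stage, ms in stage_latency_ms.items():
    running_total += ms
    print(f"{stage:<48} +{ms:>4} ms  (total {running_total} ms)")
# -> roughly 1.9 seconds before the user hears a complete answer
```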

Context loss is another major limitation, as ASR does not retain memory of previous sentences, and LLMs cannot adjust their responses based on new input as humans do. The result is an AI that does not truly "listen" in real time but instead processes speech in isolated chunks, making interactions feel robotic rather than fluid. This gap between recognition and understanding means that AI still struggles with tasks requiring dynamic, continuous interaction, such as live customer service, real-time translations, or voice assistants that need to follow multi-turn conversations. Until ASR and LLMs are fully integrated into a single system that processes speech dynamically, AI-driven voice interactions will remain inefficient and unnatural.
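A toy illustration of that chunk-isolated processing (the utterances and prompt wording are invented for the example): each chunk is prompted on its own, so the second request carries no memory of the first and the pronoun "it" cannot be resolved.

```python
# Toy illustration of chunk-isolated processing; utterances and prompt wording
# are invented for the example.
chunks = [
    "I ordered the blue headset last Tuesday.",
    "Can you tell me when it will arrive?",
]

for chunk in chunks:
    # Each prompt is built from a single chunk, so the second one has no memory
    # of the first and "it" is unresolvable without the earlier sentence.
    prompt = f'Customer said: "{chunk}"\nDraft a helpful reply.'
    print(prompt, end="\n---\n")
```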

The future: Merging ASR and LLMs

The next evolution in AI speech technology will involve integrating ASR and LLMs into a single system that can listen, understand, and respond without delays. Rather than having ASR transcribe first and then pass the text to an LLM, future AI models will process speech and meaning simultaneously, allowing for a much more natural interaction.

A merged system would enable lower latency by processing speech as it is spoken rather than waiting for full sentences before generating a response. Context awareness would improve dramatically, as AI would be able to track entire conversations rather than just isolated phrases. This advancement would also make real-time translation possible, allowing AI to instantly translate and respond in multiple languages. Moreover, accuracy would improve as AI could self-correct transcription errors based on contextual understanding, ensuring more reliable responses. With these improvements, AI-driven conversations would feel more natural and human-like, eliminating the awkward pauses and misunderstandings that exist today.
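No shipping model exposes an interface like this today, so the sketch below is purely conceptual; UnifiedSpeechModel, feed_audio, and partial_response are hypothetical names invented only to show the shape of a merged, streaming loop.

```python
# Purely conceptual: no unified ASR+LLM model exposes this interface today.
# UnifiedSpeechModel, feed_audio, and partial_response are hypothetical names.
from dataclasses import dataclass, field

@dataclass
class UnifiedSpeechModel:
    context: list = field(default_factory=list)   # persists across the whole conversation

    def feed_audio(self, frame: bytes):
        """Consume a short audio frame; return an updated partial hypothesis, if any."""
        ...  # speech and meaning would be modeled jointly here

    def partial_response(self):
        """Return (and keep revising) the reply so far, before the speaker has finished."""
        ...

def conversation_loop(model: UnifiedSpeechModel, audio_frames):
    for frame in audio_frames:
        hypothesis = model.feed_audio(frame)   # transcription updates continuously...
        if hypothesis:
            model.context.append(hypothesis)   # ...and context accumulates across turns
        reply = model.partial_response()       # the answer can begin before the sentence ends
        if reply:
            yield reply
```

The essential difference from today's pipeline is that transcription, context, and response generation all live inside one loop, so nothing has to wait for a finished sentence before work begins.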

Why hasn’t this happened yet?

Despite its potential, the integration of ASR and LLMs has not yet been fully realized due to several technical challenges. One of the biggest obstacles is the complexity of speech processing compared to text or images. Text and images are static, meaning AI can analyze them as fixed inputs. Speech, however, is dynamic and constantly evolving. It varies in tone, accent, and background noise, requiring continuous adjustments. Unlike OCR, which extracts text from an image in a single step, ASR must process changing audio in real time while accounting for interruptions and mid-sentence corrections.

Another major challenge is real-time latency. LLMs are not optimized for continuous streaming input; they generate responses token by token based on a fixed text input. This means they cannot adjust their output as they receive new speech data. ASR models like Whisper already take 300-500 milliseconds to process a sentence, and LLMs require additional time to analyze the text and generate a response. This results in a lag of one to two seconds, making AI conversations feel slow and unresponsive. Humans do not wait to hear an entire sentence before understanding its meaning, so for AI to achieve real-time conversation, it must be able to process speech continuously, rather than in isolated chunks.
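The one-to-two-second figure follows from simple arithmetic; in the sketch below, the ASR range is the one cited above, while the prompt-processing time, per-token decoding speed, and reply length are assumed values for illustration.

```python
# Back-of-the-envelope derivation of the one-to-two-second lag. The ASR range is
# the 300-500ms cited above; the other values are assumptions for illustration.
asr_ms = (300, 500)          # per-sentence transcription time
first_token_ms = 500         # assumed prompt processing before decoding starts
ms_per_token = 30            # assumed autoregressive decoding speed
reply_tokens = 30            # assumed length of a short spoken-style reply

llm_ms = first_token_ms + ms_per_token * reply_tokens        # 1400 ms
low, high = asr_ms[0] + llm_ms, asr_ms[1] + llm_ms           # 1700-1900 ms
print(f"end-to-end lag: {low / 1000:.1f}-{high / 1000:.1f} s")
```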

Beyond latency, computational costs remain a significant barrier. Running ASR and LLMs in parallel requires massive processing power, which makes large-scale real-time deployment impractical. Whisper and other ASR models are already expensive to run on GPUs, and merging ASR and LLMs into a single system would demand even greater computational efficiency. Until AI models can process speech with lower hardware costs, full integration will remain out of reach for most applications.

Where are we headed?

In the short term, ASR and LLMs will continue working together as separate systems, improving incrementally in accuracy and speed. However, as AI research advances, models will become better optimized for real-time speech processing, significantly reducing latency. Eventually, AI will be capable of instantly transcribing, analyzing, translating, and responding in one continuous process, much like how humans engage in conversation.

The primary challenge today is not technological feasibility but optimization. As AI models become more efficient, faster, and better at handling speech context, the gap between recognition and understanding will close. When that happens, speech-to-text will no longer feel like a step-by-step process but a seamless, natural exchange of ideas between humans and AI. While we are not there yet, the future of speech AI is closer than ever.

We partner closely with the most progressive companies in the world to improve their customer support operations.

Get started with Custos

Your emails are valuable—find out how Custos can help you speed them up.

Join waitlist
