How to Design a Fully Streaming Voice Agent with End-to-End Latency Budgets, Incremental ASR, LLM Streaming, and Real-Time TTS

AllTopicsToday
Published: January 20, 2026

In this tutorial, you'll build an end-to-end streaming voice agent that mirrors how modern low-latency conversational systems work in real time. You simulate the entire pipeline, from chunked audio input and streaming speech recognition to incremental language model inference and streamed text-to-speech output, while explicitly tracking latency at every stage. By enforcing tight latency budgets and observing metrics such as time to first token and time to first audio, we focus on the practical engineering tradeoffs that shape a responsive voice-based user experience. Check out the full code here.

import time
import asyncio
import numpy as np
from collections import deque
from dataclasses import dataclass
from typing import List, AsyncIterator
from enum import Enum
import matplotlib.pyplot as plt


@dataclass
class LatencyMetrics:
    audio_chunk_received: float = 0.0
    asr_started: float = 0.0
    asr_partial: float = 0.0
    asr_complete: float = 0.0
    llm_started: float = 0.0
    llm_first_token: float = 0.0
    llm_complete: float = 0.0
    tts_started: float = 0.0
    tts_first_chunk: float = 0.0
    tts_complete: float = 0.0

    def get_time_to_first_audio(self) -> float:
        return self.tts_first_chunk - self.asr_complete if self.tts_first_chunk and self.asr_complete else 0.0

    def get_total_latency(self) -> float:
        return self.tts_complete - self.audio_chunk_received if self.tts_complete else 0.0


@dataclass
class LatencyBudgets:
    asr_processing: float = 0.1
    asr_finalization: float = 0.3
    llm_first_token: float = 0.5
    llm_token_generation: float = 0.02
    tts_first_chunk: float = 0.2
    tts_chunk_generation: float = 0.05
    time_to_first_audio: float = 1.0


class AgentState(Enum):
    LISTENING = "listening"
    PROCESSING_SPEECH = "processing_speech"
    THINKING = "thinking"
    SPEAKING = "speaking"
    INTERRUPTED = "interrupted"

This defines the core data structures and state representations that let you track delays throughout the audio pipeline. We formalize the ASR, LLM, and TTS timing signals to ensure consistent measurements across all stages, and establish a clear agent state machine that governs how the system transitions across conversational turns. Check out the full code here.
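To see how the metrics and budgets fit together, here is a minimal sketch using a trimmed-down stand-in for the dataclass above, populated with hypothetical timestamps:

```python
from dataclasses import dataclass

# Trimmed-down stand-in for the LatencyMetrics dataclass above,
# populated with hypothetical timestamps (seconds since turn start).
@dataclass
class TurnTimings:
    asr_complete: float
    tts_first_chunk: float

    def time_to_first_audio(self) -> float:
        # Gap between the final transcript and the first synthesized audio.
        return self.tts_first_chunk - self.asr_complete

m = TurnTimings(asr_complete=0.45, tts_first_chunk=1.10)
ttfa = m.time_to_first_audio()
budget = 1.0  # mirrors LatencyBudgets.time_to_first_audio
print(f"time-to-first-audio: {ttfa:.2f}s (within budget: {ttfa <= budget})")
```

Comparing each measured interval against its budget in this way is how the agent later decides whether a turn met its responsiveness target.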

class AudioInputStream:
    def __init__(self, sample_rate: int = 16000, chunk_duration_ms: int = 100):
        self.sample_rate = sample_rate
        self.chunk_duration_ms = chunk_duration_ms
        self.chunk_size = int(sample_rate * chunk_duration_ms / 1000)

    async def stream_audio(self, text: str) -> AsyncIterator[np.ndarray]:
        # Assume ~150 words per minute at ~5 characters per word
        chars_per_second = (150 * 5) / 60
        duration_seconds = len(text) / chars_per_second
        num_chunks = int(duration_seconds * 1000 / self.chunk_duration_ms)
        for _ in range(num_chunks):
            chunk = np.random.randn(self.chunk_size).astype(np.float32) * 0.1
            await asyncio.sleep(self.chunk_duration_ms / 1000)
            yield chunk

This simulates real-time audio input by dividing the audio into fixed-duration chunks that arrive asynchronously. We model realistic speech rates and streaming behavior to mimic live microphone input, and this stream serves as the foundation for testing the delay-sensitive components downstream. Check out the full code here.
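The chunk sizing above is simple arithmetic worth checking explicitly: at a 16 kHz sample rate, a 100 ms chunk holds 1,600 samples, and the assumed speech rate determines how many chunks an utterance spans:

```python
# Chunk sizing arithmetic used by AudioInputStream above.
sample_rate = 16000        # Hz
chunk_duration_ms = 100    # each chunk covers 100 ms of audio

chunk_size = int(sample_rate * chunk_duration_ms / 1000)
print(chunk_size)  # 1600 samples per chunk

# At ~150 words/min and ~5 chars/word, speech covers ~12.5 chars/sec,
# so a 25-character utterance spans ~2 s, i.e. 20 chunks of 100 ms.
chars_per_second = (150 * 5) / 60
duration_seconds = 25 / chars_per_second
num_chunks = int(duration_seconds * 1000 / chunk_duration_ms)
print(num_chunks)  # 20
```

Shrinking the chunk duration lowers input latency per chunk but increases the number of scheduling events, which is exactly the tradeoff a real capture pipeline faces.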

class StreamingASR:
    def __init__(self, latency_budget: float = 0.1):
        self.latency_budget = latency_budget
        self.silence_threshold = 0.5

    async def transcribe_stream(
        self, audio_stream: AsyncIterator[np.ndarray], ground_truth: str
    ) -> AsyncIterator[tuple[str, bool]]:
        words = ground_truth.split()
        words_transcribed = 0
        silence_duration = 0.0
        chunk_count = 0
        async for chunk in audio_stream:
            chunk_count += 1
            await asyncio.sleep(self.latency_budget)
            if chunk_count % 3 == 0 and words_transcribed < len(words):
                words_transcribed += 1
                yield " ".join(words[:words_transcribed]), False
            audio_power = np.mean(np.abs(chunk))
            silence_duration = silence_duration + 0.1 if audio_power < 0.05 else 0.0
            if silence_duration >= self.silence_threshold:
                await asyncio.sleep(0.2)
                yield ground_truth, True
                return
        yield ground_truth, True

This implements a streaming ASR module that emits partial transcriptions before producing the final result, progressively revealing text the way modern ASR systems operate in real time. We also add silence-based endpointing to approximate the detection of utterance end. Check out the full code here.
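The silence-based endpointing above is essentially an energy threshold with a hangover timer: quiet frames accumulate silence time, any loud frame resets it. A standalone sketch with synthetic frames and illustrative thresholds:

```python
import numpy as np

def detect_endpoint(frames, power_threshold=0.05, silence_needed_s=0.5, frame_s=0.1):
    """Return the index of the frame at which the utterance is judged finished, or None."""
    silence = 0.0
    for i, frame in enumerate(frames):
        power = np.mean(np.abs(frame))
        # Accumulate silence across quiet frames; any loud frame resets the timer.
        silence = silence + frame_s if power < power_threshold else 0.0
        if silence >= silence_needed_s:
            return i
    return None

speech = [np.full(1600, 0.2) for _ in range(5)]   # loud frames (power 0.2)
quiet = [np.zeros(1600) for _ in range(5)]        # silent frames (power 0.0)
endpoint = detect_endpoint(speech + quiet)
print(endpoint)  # 9: the fifth silent frame completes the 0.5 s hangover
```

Production systems typically replace the raw energy threshold with a trained voice-activity-detection model, but the hangover-timer structure is the same.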

class StreamingLLM:
    def __init__(self, time_to_first_token: float = 0.3, tokens_per_second: float = 50):
        self.time_to_first_token = time_to_first_token
        self.tokens_per_second = tokens_per_second

    async def generate_response(self, prompt: str) -> AsyncIterator[str]:
        responses = {
            "hello": "Hello! How can I help you today?",
            "weather": "The weather is sunny and the temperature is 72°F.",
            "time": "The current time is 2:30 PM.",
            "default": "Okay, let me help you.",
        }
        response = responses["default"]
        for key in responses:
            if key in prompt.lower():
                response = responses[key]
                break
        await asyncio.sleep(self.time_to_first_token)
        for word in response.split():
            yield word + " "
            await asyncio.sleep(1.0 / self.tokens_per_second)


class StreamingTTS:
    def __init__(self, time_to_first_chunk: float = 0.2, chars_per_second: float = 15):
        self.time_to_first_chunk = time_to_first_chunk
        self.chars_per_second = chars_per_second

    async def synthesize_stream(self, text_stream: AsyncIterator[str]) -> AsyncIterator[np.ndarray]:
        first_chunk = True
        buffer = ""
        async for text in text_stream:
            buffer += text
            if len(buffer) >= 20 or first_chunk:
                if first_chunk:
                    await asyncio.sleep(self.time_to_first_chunk)
                    first_chunk = False
                duration = len(buffer) / self.chars_per_second
                yield np.random.randn(int(16000 * duration)).astype(np.float32) * 0.1
                buffer = ""
                await asyncio.sleep(duration * 0.5)

This snippet models a streaming language model and a streaming text-to-speech engine that work together. The LLM yields its response token by token to capture time-to-first-token behavior, and the TTS converts the incremental text into audio chunks to simulate speech synthesis that starts before the full response is available. Check out the full code here.

class StreamingVoiceAgent:
    def __init__(self, latency_budgets: LatencyBudgets):
        self.budgets = latency_budgets
        self.audio_stream = AudioInputStream()
        self.asr = StreamingASR(latency_budgets.asr_processing)
        self.llm = StreamingLLM(
            latency_budgets.llm_first_token,
            1.0 / latency_budgets.llm_token_generation,
        )
        self.tts = StreamingTTS(
            latency_budgets.tts_first_chunk,
            1.0 / latency_budgets.tts_chunk_generation,
        )
        self.state = AgentState.LISTENING
        self.metrics_history: List[LatencyMetrics] = []

    async def process_turn(self, user_input: str) -> LatencyMetrics:
        metrics = LatencyMetrics()
        start_time = time.time()
        metrics.audio_chunk_received = time.time() - start_time
        audio_gen = self.audio_stream.stream_audio(user_input)
        metrics.asr_started = time.time() - start_time
        transcription = ""
        async for text, is_final in self.asr.transcribe_stream(audio_gen, user_input):
            if is_final:
                metrics.asr_complete = time.time() - start_time
                transcription = text
        metrics.llm_started = time.time() - start_time
        response = ""
        async for token in self.llm.generate_response(transcription):
            if not metrics.llm_first_token:
                metrics.llm_first_token = time.time() - start_time
            response += token
        metrics.llm_complete = time.time() - start_time
        metrics.tts_started = time.time() - start_time

        async def text_stream():
            for word in response.split():
                yield word + " "

        async for _ in self.tts.synthesize_stream(text_stream()):
            if not metrics.tts_first_chunk:
                metrics.tts_first_chunk = time.time() - start_time
        metrics.tts_complete = time.time() - start_time
        self.metrics_history.append(metrics)
        return metrics

This coordinates the full voice agent by connecting audio input, ASR, LLM, and TTS into a single asynchronous pipeline. It records a timestamp at each transition to compute the key latency metrics, and treats each user turn as an independent experiment to enable systematic performance analysis. Check out the full code here.

async def run_demo():
    budgets = LatencyBudgets(
        asr_processing=0.08,
        llm_first_token=0.3,
        llm_token_generation=0.02,
        tts_first_chunk=0.15,
        time_to_first_audio=0.8,
    )
    agent = StreamingVoiceAgent(budgets)
    inputs = [
        "Hello, how are you today?",
        "What's the weather like?",
        "Can you tell me the time?",
    ]
    for text in inputs:
        await agent.process_turn(text)
        await asyncio.sleep(1)


if __name__ == "__main__":
    asyncio.run(run_demo())

This runs the complete system over multiple conversation turns so you can observe the consistency and variance in latency. Aggressive latency budgets stress the pipeline under realistic constraints, and these runs let you validate whether the system meets your responsiveness goals across interactions.
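The imports above pull in matplotlib, so a natural next step is to plot the per-turn milestones against the budget. A sketch with hypothetical timings (in practice you would read them from agent.metrics_history after run_demo):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical per-turn milestone timings (seconds from turn start),
# standing in for values collected from agent.metrics_history.
turns = ["turn 1", "turn 2", "turn 3"]
llm_first_token = [0.42, 0.39, 0.44]
tts_first_chunk = [0.71, 0.66, 0.74]

fig, ax = plt.subplots()
ax.plot(turns, llm_first_token, marker="o", label="LLM first token")
ax.plot(turns, tts_first_chunk, marker="s", label="TTS first chunk")
ax.axhline(0.8, linestyle="--", label="time-to-first-audio budget")
ax.set_ylabel("seconds from turn start")
ax.legend()
fig.savefig("latency_milestones.png")
```

A plot like this makes budget violations visible at a glance: any milestone crossing the dashed budget line marks a turn that missed its responsiveness target.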

In conclusion, we demonstrated how a complete streaming voice agent can be orchestrated as a single asynchronous pipeline with clear stage boundaries and measurable performance guarantees. We showed that combining partial ASR, token-level LLM streaming, and early-start TTS reduces perceived delay, even though total computation time remains significant. This approach helps you reason systematically about turn-taking, responsiveness, and optimization levers, and provides a solid foundation for extending the system to real-world deployments using production ASR, LLM, and TTS models.

Check out the full code here. Also, feel free to follow us on Twitter, join the 100,000+ ML SubReddit, and subscribe to our newsletter. On Telegram? You can now join us there as well.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understood by a wide audience. The platform boasts over 2 million monthly views, attesting to its popularity among readers.
