Build a Real-Time AI Emergency Voice Agent with LangChain

We've all been in an emergency where every second matters. Someone's life is at risk, and you're panicking. Now imagine that, in this moment of distress, a helpline asks you to press numbers on your keypad to reach the right agent. Pure chaos, right? What we need here is someone who listens and acts immediately instead of passing the call along, and who does so without dropping the call.

In this blog, we'll tackle this challenge by building our very own AI Emergency Helpline voice agent. The agent listens to a caller's spoken distress, triages the situation, dispatches the right emergency service, and keeps the caller calm, all in real time, all over voice.

No typing. No menus. Just talk.

Why an Emergency Helpline?

Perhaps the most common examples of voice assistants in use today are food ordering and music streaming. These "functional" use cases are relatively harmless from a user-experience perspective, but easily forgettable. An emergency helpline, on the other hand, is entirely different.

For this use case, latency is a critical factor, the tone of the voice assistant can affect who receives help first, and there is no fallback method for dispatching an emergency vehicle such as an ambulance. Every design decision made within this pipeline can therefore have real consequences, which makes this one of the most valuable use cases to gain experience from.

How the Pipeline Works?

The sandwich architecture consists of three independent components, each designed to run concurrently. Each stage begins processing before the stage ahead of it has finished, i.e.:

while the caller is speaking, transcription begins in the middle of their sentence,

the reasoning agent begins reasoning over the earlier turns while the caller finishes their sentence,

text-to-speech begins synthesizing the response to that sentence while the reasoning agent is still generating it.

If everything is implemented correctly, the entire process completes in less than ten seconds. With well-timed execution, this allows the audio to stream continuously, with no interruptions in delivery.
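The overlap described above can be sketched with three chained async generators. This is a toy illustration rather than the article's pipeline: the stage names `transcribe`, `reason`, and `synthesize`, the `log` list, and the fake word stream are all assumptions made for the example.

```python
import asyncio
from typing import AsyncIterator, List

log: List[str] = []  # records the order in which stages touch each item

async def fake_audio() -> AsyncIterator[str]:
    # Stand-in for microphone chunks (hypothetical data).
    for word in ["help", "my", "husband"]:
        await asyncio.sleep(0)
        yield word

async def transcribe(audio: AsyncIterator[str]) -> AsyncIterator[str]:
    async for chunk in audio:
        log.append(f"stt:{chunk}")
        yield chunk

async def reason(words: AsyncIterator[str]) -> AsyncIterator[str]:
    async for word in words:
        log.append(f"agent:{word}")
        yield word.upper()

async def synthesize(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    async for token in tokens:
        log.append(f"tts:{token}")
        yield token.encode()

async def run_pipeline() -> List[bytes]:
    # Each stage consumes the previous stage's output item by item.
    stream = synthesize(reason(transcribe(fake_audio())))
    return [audio async for audio in stream]

output = asyncio.run(run_pipeline())
```

Inspecting `log` afterwards shows that the first item passes through all three stages before the second item is even read, which is the property that lets later stages start before earlier ones have seen their whole input.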

Getting Started with the Voice Agent

You'll need API keys for AssemblyAI (real-time STT) and OpenAI (both the agent brain and TTS). Using OpenAI TTS lets you consolidate the agent and speech synthesis under one provider and one API key.

Here is the command needed to install the required libraries:

!pip install langchain langgraph assemblyai websockets fastapi uvicorn openai

Set the following environment variables:

export ASSEMBLYAI_API_KEY="your_key"
export OPENAI_API_KEY="your_key"
export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="your_key"

You should enable LangSmith so that every conversation between your agent and a caller is recorded as an audit trail and can double as a support ticket. Auditing supports compliance and debugging by documenting what your agent said and when.
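Before starting the server, it can help to fail fast when a key is missing. The helper below is a minimal sketch of such a check; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced here, not part of any of the libraries above.

```python
import os
from typing import List, Mapping

# Keys the pipeline cannot run without (the LangSmith keys are optional here).
REQUIRED_KEYS = ["ASSEMBLYAI_API_KEY", "OPENAI_API_KEY"]

def missing_keys(env: Mapping[str, str], required: List[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]

# Typical use at startup:
#     missing = missing_keys(os.environ)
#     if missing:
#         raise SystemExit(f"Missing environment variables: {missing}")
```

Passing the environment in as a mapping keeps the check easy to exercise without touching the real process environment.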

Stage 1: Speech-to-Text with AssemblyAI

In the STT stage, we transcribe the caller's voice live. We'll use AssemblyAI's WebSocket API in a producer-consumer model, where audio chunks go in and transcripts come out at the same time.

from typing import AsyncIterator
import asyncio
import contextlib

# AssemblyAISTT and VoiceAgentEvent are assumed to be defined elsewhere in the project.

async def stt_stream(
    audio_stream: AsyncIterator[bytes],
) -> AsyncIterator[VoiceAgentEvent]:
    stt = AssemblyAISTT(sample_rate=16000)

    async def send_audio():
        # Producer: forward caller audio to AssemblyAI as it arrives.
        try:
            async for chunk in audio_stream:
                await stt.send_audio(chunk)
        finally:
            await stt.close()

    send_task = asyncio.create_task(send_audio())

    try:
        # Consumer: yield transcript events while audio is still being sent.
        async for event in stt.receive_events():
            yield event
    finally:
        send_task.cancel()

        with contextlib.suppress(asyncio.CancelledError):
            await send_task

        await stt.close()

The two key event types are STT Chunk and STT Output. STT Chunk contains partial transcripts generated while the caller is speaking, allowing a human supervisor to monitor the conversation in real time. STT Output is the final punctuated transcript used by the agent to trigger actions.
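The article does not show how these events are defined, so the following is only one plausible shape for them. The class names `STTChunkEvent` and `STTOutputEvent`, the `transcript` field, and the helper `should_trigger_agent` are assumptions; only the `type` value `"stt_output"` is taken from the agent code later in this post.

```python
from dataclasses import dataclass
from typing import Literal, Union

@dataclass
class STTChunkEvent:
    """Partial transcript emitted while the caller is still speaking."""
    transcript: str
    type: Literal["stt_chunk"] = "stt_chunk"

@dataclass
class STTOutputEvent:
    """Final, punctuated transcript for a completed utterance."""
    transcript: str
    type: Literal["stt_output"] = "stt_output"

STTEvent = Union[STTChunkEvent, STTOutputEvent]

def should_trigger_agent(event: STTEvent) -> bool:
    # Only final transcripts should start an agent turn; partials are for display.
    return event.type == "stt_output"
```

Keeping a string `type` field on every event lets downstream stages filter with a simple comparison instead of `isinstance` checks.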

When using AssemblyAI for a helpline, the content safety detection flag should be enabled if it is available for your transcription mode. It surfaces early warnings of distress signals in the transcript metadata before the agent processes the text, giving the agent more time to determine an appropriate response.

Stage 2: The Emergency Triage Agent

The second stage is the Emergency Triage Agent. This is where the agent analyzes the transcript received from a caller, evaluates whether help is required, determines which tool should be used, and interacts with the caller in a calm manner.

The agent has four tools available to perform these tasks: location lookup, emergency dispatch, escalation to a live operator, and de-escalation of non-life-threatening distress to reduce emotional discomfort.

from uuid import uuid4

from langchain.agents import create_agent
from langchain.messages import HumanMessage
from langgraph.checkpoint.memory import InMemorySaver

# Active call registry
active_calls = {}

def get_caller_location(caller_id: str) -> str:
    """Look up the caller's registered address or last known GPS location."""
    locations = {
        "caller_001": "12 MG Street, Bengaluru, Karnataka 560001",
        "caller_002": "45 Park Road, Kolkata, West Bengal 700016",
    }

    return locations.get(
        caller_id,
        "Location not found. Ask caller to confirm address.",
    )

def dispatch_emergency(service: str, location: str, severity: str) -> str:
    """Dispatch police, ambulance, or fire services to a location."""
    valid_services = ["ambulance", "police", "fire"]

    if service.lower() not in valid_services:
        return f"Unknown service: {service}. Use ambulance, police, or fire."

    return (
        f"{service.capitalize()} dispatched to {location}. "
        f"Severity: {severity}. ETA: 8-12 minutes. "
        f"Reference: EM-{uuid4().hex[:6].upper()}"
    )

def escalate_to_human(caller_id: str, reason: str) -> str:
    """Escalate the call to a human operator when the situation exceeds AI capability."""
    active_calls[caller_id] = {
        "status": "escalated",
        "reason": reason,
    }

    return (
        f"Escalating call {caller_id} to human operator. "
        f"Reason: {reason}. Hold time: under 2 minutes."
    )

def calming_protocol(situation: str) -> str:
    """Return guided breathing or grounding instructions for distressed callers."""
    return (
        "I hear you. You are safe right now. "
        "Take a slow breath in for 4 counts, hold for 4, out for 4. "
        "I'm here with you."
    )

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[
        get_caller_location,
        dispatch_emergency,
        escalate_to_human,
        calming_protocol,
    ],
    system_prompt="""You are ARIA, an AI emergency response assistant for a 24/7 helpline.

Your job is to stay calm, assess the situation quickly, and take the right action.

Rules you must always follow:

- Always acknowledge the caller's distress before asking questions.
- Ask only one question at a time. Never overwhelm a panicking caller.
- If someone mentions chest pain, difficulty breathing, or unconsciousness, dispatch an ambulance immediately.
- If someone mentions violence, threats, or a break-in, dispatch police immediately.
- If the situation is unclear or an emotional crisis, use the calming protocol first.
- Escalate to a human operator if the caller is unresponsive or the situation is ambiguous.
- Keep every response under 3 sentences. Short and clear saves lives.
- Do NOT use emojis, asterisks, bullet points, or markdown. You are speaking aloud.""",
    checkpointer=InMemorySaver(),  # keeps per-thread conversation state in memory
)

The InMemorySaver checkpointer plays a crucial role here because it allows ARIA to remember the entire call history, including:

what the caller said three turns ago,

what has already been dispatched,

whether the caller has confirmed their location, and so on.

Without memory, every response would start from a blank state, which is very problematic in an urgent situation.
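To make the idea of per-call memory concrete, here is a deliberately simplified stand-in. It is not how LangGraph's InMemorySaver is implemented; `ThreadMemory` and the sample messages are hypothetical and only show history being kept separately for each thread ID.

```python
from collections import defaultdict
from typing import Dict, List

class ThreadMemory:
    """Toy per-thread message store, keyed by a call's thread ID."""

    def __init__(self) -> None:
        self._history: Dict[str, List[str]] = defaultdict(list)

    def append(self, thread_id: str, message: str) -> None:
        self._history[thread_id].append(message)

    def history(self, thread_id: str) -> List[str]:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._history.get(thread_id, []))

memory = ThreadMemory()
memory.append("call-1", "caller: My husband collapsed.")
memory.append("call-1", "agent: Ambulance dispatched.")
memory.append("call-2", "caller: Someone is breaking in.")
```

Two simultaneous calls never see each other's turns, while each call can look back over everything said so far on its own thread.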

Next, consider the streaming agent function.

# AgentChunkEvent is assumed to be defined alongside VoiceAgentEvent.

async def agent_stream(
    event_stream: AsyncIterator[VoiceAgentEvent],
) -> AsyncIterator[VoiceAgentEvent]:
    thread_id = str(uuid4())  # Unique per call session

    async for event in event_stream:
        yield event  # pass upstream events through unchanged

        if event.type == "stt_output":
            # A final transcript starts a new agent turn on this call's thread.
            stream = agent.astream(
                {"messages": [HumanMessage(content=event.transcript)]},
                {"configurable": {"thread_id": thread_id}},
                stream_mode="messages",
            )

            async for message, _ in stream:
                if message.text:
                    yield AgentChunkEvent.create(message.text)

stream_mode="messages" sends tokens to TTS as they are produced. ARIA's first words can start being spoken before she has finished generating the full response. This is the difference between a 400-millisecond response and a 2-second response!

Stage 3: Text-to-Speech with OpenAI TTS

OpenAI TTS is the natural choice: you are already using an OpenAI API key for your agent, so this means one API key, one SDK, and no extra accounts. The tts-1 model was built for real-time, streamed text-to-speech rendering. The shimmer voice is calm, clear, and measured, all appropriate tones for a helpline.

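import asyncio

from openai import AsyncOpenAI
from utils import merge_async_iters

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# TTSChunkEvent is assumed to be defined alongside VoiceAgentEvent.

async def tts_stream(
    event_stream: AsyncIterator[VoiceAgentEvent],
) -> AsyncIterator[VoiceAgentEvent]:
    # Completed sentences flow from process_upstream to synthesize_audio.
    # None marks the end of the call.
    sentences: asyncio.Queue = asyncio.Queue()

    async def process_upstream() -> AsyncIterator[VoiceAgentEvent]:
        text_buffer = []
        try:
            async for event in event_stream:
                yield event  # pass every upstream event through

                if event.type == "agent_chunk":
                    text_buffer.append(event.text)
                    # Hand the buffer off once it ends on sentence punctuation.
                    if event.text.rstrip().endswith((".", "!", "?")):
                        sentences.put_nowait("".join(text_buffer))
                        text_buffer.clear()
        finally:
            if text_buffer:
                sentences.put_nowait("".join(text_buffer))
            sentences.put_nowait(None)

    async def synthesize_audio() -> AsyncIterator[VoiceAgentEvent]:
        while (text := await sentences.get()) is not None:
            if not text.strip():
                continue

            async with client.audio.speech.with_streaming_response.create(
                model="tts-1",
                voice="shimmer",  # Calm, composed: right for emergencies
                input=text,
                response_format="pcm",  # Raw PCM for lowest latency playback
            ) as response:
                async for chunk in response.iter_bytes(chunk_size=4096):
                    yield TTSChunkEvent.create(chunk)

    async for event in merge_async_iters(
        process_upstream(),
        synthesize_audio(),
    ):
        yield event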

tts-1 begins streaming audio chunks as soon as the first portion has been synthesized rather than waiting until the entire response has been rendered. Use response_format="pcm" to skip the overhead of a container and stream audio straight into the WebSocket byte stream. tts-1-hd offers higher quality, but adds roughly 200 ms of latency compared with tts-1. For the best performance on an emergency helpline, the tts-1 model is the recommended option.
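To get a feel for what each 4096-byte chunk means in playback time, the arithmetic can be sketched as below. It assumes the PCM output is 24 kHz, 16-bit, mono, which is how OpenAI's documentation describes the pcm format at the time of writing; check the current docs before relying on these constants. The function name `chunk_duration_ms` is made up for this example.

```python
# Assumed PCM parameters for OpenAI TTS output (verify against current docs).
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # 16-bit samples
CHANNELS = 1          # mono

def chunk_duration_ms(chunk_bytes: int) -> float:
    """Playback time, in milliseconds, of a raw PCM chunk of the given size."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return 1000 * chunk_bytes / bytes_per_second

duration = chunk_duration_ms(4096)  # roughly 85 ms of audio per chunk
```

Under these assumptions one second of speech is 48,000 bytes, so each chunk sent over the WebSocket carries well under a tenth of a second of audio.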

There are several voice options available to you: alloy is a neutral and confident voice; echo has a little warmth; shimmer is calm and steady. All three are good choices for helpline contexts, while you should avoid fable and onyx because they may be too casual or too authoritative, respectively.

Using merge_async_iters, you can perform text accumulation and audio synthesis concurrently, so that your audio byte stream begins to flow immediately after the first sentence has been completed.
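The `utils` module that provides `merge_async_iters` is not shown in this post, so the version below is only a guess at what such a helper might look like. It interleaves items from several async iterators as they become available; the `ticker` generator and `demo` function are hypothetical test scaffolding.

```python
import asyncio
from typing import AsyncIterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of one source iterator

async def merge_async_iters(*iterators: AsyncIterator[T]) -> AsyncIterator[T]:
    """Yield items from all iterators in the order they become available."""
    queue: asyncio.Queue = asyncio.Queue()

    async def drain(iterator: AsyncIterator[T]) -> None:
        try:
            async for item in iterator:
                queue.put_nowait(item)
        finally:
            queue.put_nowait(_DONE)

    tasks = [asyncio.create_task(drain(it)) for it in iterators]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is _DONE:
                remaining -= 1
            else:
                yield item
    finally:
        # Stop any producers still running; errors are not re-raised in this sketch.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

async def ticker(label: str, count: int, delay: float) -> AsyncIterator[str]:
    for i in range(count):
        await asyncio.sleep(delay)
        yield f"{label}{i}"

async def demo() -> List[str]:
    merged_stream = merge_async_iters(ticker("a", 3, 0.01), ticker("b", 2, 0.015))
    return [item async for item in merged_stream]

merged = asyncio.run(demo())
```

Each source keeps its own internal order, but the relative order between sources depends on timing, which is exactly what allows audio chunks to be emitted while text is still arriving.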

Wiring the Full Pipeline

LangChain's RunnableGenerator connects all three stages into a single composable pipeline:

import contextlib

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from langchain_core.runnables import RunnableGenerator

app = FastAPI()

# Chain the three stages: audio bytes in, voice agent events out.
pipeline = (
    RunnableGenerator(stt_stream)
    | RunnableGenerator(agent_stream)
    | RunnableGenerator(tts_stream)
)

@app.websocket("/ws/{caller_id}")
async def websocket_endpoint(websocket: WebSocket, caller_id: str):
    await websocket.accept()

    active_calls[caller_id] = {"status": "active"}

    async def audio_stream():
        # Yield raw audio frames until the caller disconnects.
        try:
            while True:
                data = await websocket.receive_bytes()
                yield data
        except WebSocketDisconnect:
            return

    try:
        async for event in pipeline.atransform(audio_stream()):
            if event.type == "tts_chunk":
                await websocket.send_bytes(event.audio)
    finally:
        # Runs on a normal hang-up and on a mid-call connection loss alike.
        active_calls[caller_id]["status"] = "ended"
        with contextlib.suppress(RuntimeError):
            await websocket.close()

Note the caller_id in the WebSocket path. Each call is tracked from the start of the connection until its end. The call's entry in the registry is updated even if the connection is lost mid-call (which can happen during real emergencies).
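The guarantee that the registry is updated even on a dropped connection comes from the try/finally pattern. The sketch below isolates that pattern without FastAPI; `handle_call`, the `drop_mid_call` flag, and the use of `ConnectionError` to simulate a lost connection are all assumptions for illustration.

```python
import asyncio
from typing import Dict

active_calls: Dict[str, Dict[str, str]] = {}

async def handle_call(caller_id: str, drop_mid_call: bool) -> None:
    active_calls[caller_id] = {"status": "active"}
    try:
        await asyncio.sleep(0)  # stand-in for streaming audio back and forth
        if drop_mid_call:
            raise ConnectionError("caller disconnected")
    finally:
        # Executes whether the call ended cleanly or the connection dropped.
        active_calls[caller_id]["status"] = "ended"

async def demo() -> None:
    await handle_call("caller_001", drop_mid_call=False)
    try:
        await handle_call("caller_002", drop_mid_call=True)
    except ConnectionError:
        pass  # the registry has already been updated by the finally block

asyncio.run(demo())
```

Both entries end up marked as ended, so a monitoring dashboard reading the registry would not show a dropped call as still active.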

Testing the Voice Agent

We've built the entire pipeline, and now we'll test it against different scenarios.

Scenario 1: Medical Call for Chest Pain

A woman's husband collapses with chest pain and a numb left arm. ARIA identifies a cardiac emergency, dispatches an ambulance, and gives her instructions while she waits.

Response:

Scenario 2: Break-In and Active Threat

A caller is hiding in their bedroom while someone breaks in downstairs. ARIA dispatches police immediately and keeps the caller quiet and still until help arrives.

Response:

Scenario 3: Fire with Smoke and Confusion

A neighbour spots thick smoke coming from the flat next door, with no sign of the occupant. ARIA dispatches the fire department and guides the caller to evacuate and alert the building.

Response:

Scenario 4: Emotional Crisis Due to a Panic Attack

A caller hasn't left their flat in three days and is hyperventilating with no clear emergency. ARIA applies the calming protocol first, then dispatches an ambulance when breathing difficulty is confirmed.

Response:

Conclusion

You now have an operational emergency agent at your disposal. ARIA listens 24/7, triages the situation, dispatches the right service through the correct channel, and speaks back to the caller in an accurate and calm voice in less than 700 ms. The sandwich architecture gives you full interchangeability of all components.

Future enhancements include call recording, per-response auditing, live monitoring dashboards for escalations, and voice activity detection for smoother interruptions. These can be added without rewriting the pipeline. Critical voice agents are harder to build than help desks because they must deliver urgent help, without silence, at the moment callers need it most.

Data Science Trainee at Analytics Vidhya
I'm currently working as a Data Science Trainee at Analytics Vidhya, where I focus on building data-driven solutions and applying AI/ML techniques to solve real-world business problems. My work allows me to explore advanced analytics, machine learning, and AI applications that empower organizations to make smarter, evidence-based decisions.
With a strong foundation in computer science, software development, and data analytics, I'm passionate about leveraging AI to create impactful, scalable solutions that bridge the gap between technology and business.