graph TD
  Input --> Speech-to-speech --> Output
  Input --> Speech-to-text --> LLM-or-Agentic-Workflow --> Text-to-speech --> Output

Voice Agents Stack Components

Pipeline Architecture

Speech-to-Speech Architecture

Latency of Human Conversations: Humans expect a response latency of around 236 ms in conversation. With the best architectural practices, a voice agent can achieve a latency of about 540 ms.

Task      Latency (ms)
VAD       20
EOU       100
ASR/STT   100-500
LLM       220-500
TTS       100-450
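
The per-stage figures can be added up to give a rough end-to-end budget. The sketch below is illustrative only. It assumes the stages run strictly one after another, with no streaming overlap and no network overhead, which the notes do not state. The names `STAGE_LATENCY_MS` and `pipeline_latency_range` are made up for this example. Under that assumption, the sum of the lower bounds matches the 540 ms figure quoted earlier.

```python
# Per-stage latency ranges in milliseconds, copied from the table.
# A single value is stored as (low, high) with low == high.
STAGE_LATENCY_MS = {
    "VAD": (20, 20),
    "EOU": (100, 100),
    "ASR/STT": (100, 500),
    "LLM": (220, 500),
    "TTS": (100, 450),
}


def pipeline_latency_range(stages):
    """Return (best_case, worst_case) in ms.

    Assumption: stages execute sequentially, so latencies simply add up.
    """
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = pipeline_latency_range(STAGE_LATENCY_MS)
print(f"best case: {best_ms} ms, worst case: {worst_ms} ms")
# best case: 540 ms, worst case: 1570 ms
```

Even the best case is more than twice the roughly 236 ms that people expect, and the worst case is well over a second. A real system that streams partial results between stages may see lower end-to-end latency than this simple sum suggests.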

Voice Agent Architecture

WebRTC stands for Web Real-Time Communication. It is a free, open-source project that provides web browsers and mobile applications with real-time communication via APIs.

Unique Challenges