In the world of AI voice agents, latency is everything. A delay of even 1-2 seconds can make conversations feel robotic and frustrating. At VaniAgent, we've engineered our platform to achieve sub-500ms latency, creating natural, human-like conversations that feel instantaneous.

Why Latency Matters in Voice AI

Human conversations have a natural rhythm. When someone asks a question, they expect a response within 200-500ms. Any longer, and the conversation feels awkward. Traditional phone systems add 100-300ms of network latency, so our AI processing budget is razor-thin.

Poor latency leads to:

Conversation overlap: Users start talking again before the AI responds
Reduced trust: Delays make the AI feel less intelligent
Lower completion rates: Users hang up on slow systems
Poor user experience: Natural flow is disrupted

Our goal: Keep total latency under 500ms from speech input to audio output.

The Latency Budget Breakdown

Let's break down where latency comes from in a typical AI voice interaction:

Network transmission (user → server): 50-150ms
Speech-to-Text (STT): 100-300ms
LLM processing: 200-800ms
Text-to-Speech (TTS): 100-400ms
Network transmission (server → user): 50-150ms

Total: 500-1800ms (way too slow!)

To hit our <500ms target, we need to optimize every layer.
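The arithmetic behind that total can be checked in a few lines. This is a minimal sketch using the ranges listed above; the `stages` array and the `sumBudget` helper are illustrative names, not part of the VaniAgent platform.

```javascript
// Per-stage latency ranges in milliseconds, copied from the breakdown above.
const stages = [
  { name: 'network in',  min: 50,  max: 150 },
  { name: 'stt',         min: 100, max: 300 },
  { name: 'llm',         min: 200, max: 800 },
  { name: 'tts',         min: 100, max: 400 },
  { name: 'network out', min: 50,  max: 150 },
];

// Sum the best-case and worst-case figures across all stages.
function sumBudget(list) {
  return list.reduce(
    (total, s) => ({ min: total.min + s.min, max: total.max + s.max }),
    { min: 0, max: 0 }
  );
}

const TARGET_MS = 500;
const total = sumBudget(stages);
const worstCaseOverrun = total.max - TARGET_MS;

console.log(`total: ${total.min}-${total.max}ms, worst-case overrun: ${worstCaseOverrun}ms`);
// total: 500-1800ms, worst-case overrun: 1300ms
```

Even the best case of the unoptimized pipeline only just meets the target, which leaves no slack for any single stage.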

Edge Computing Architecture

The foundation of our low-latency system is edge computing. Instead of routing all traffic through centralized data centers, we deploy our voice processing infrastructure at the edge—geographically close to users.

Geographic Distribution

We run voice processing nodes in 15+ regions worldwide:

North America: US East, US West, US Central, Canada
Europe: London, Frankfurt, Paris, Amsterdam
Asia Pacific: Singapore, Tokyo, Sydney, Mumbai
Latin America: São Paulo, Mexico City

When a call comes in, we route it to the nearest edge node, reducing network latency by 50-80%.
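One way to decide which node is "nearest" is to compare measured round-trip times rather than map distance. The sketch below assumes RTT probes have already been collected per region; `pickNearestRegion`, `median`, and the region ids are hypothetical and do not describe VaniAgent's actual routing logic.

```javascript
// Median of a list of numbers (does not mutate the input).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Pick the edge region with the lowest median round-trip time.
// `probes` maps a region id to RTT samples in milliseconds.
function pickNearestRegion(probes) {
  let best = null;
  for (const [region, samples] of Object.entries(probes)) {
    if (samples.length === 0) continue; // no data: region not probed or unreachable
    const rtt = median(samples);
    if (best === null || rtt < best.rtt) best = { region, rtt };
  }
  return best; // null when no region has samples
}

const nearest = pickNearestRegion({
  'us-east': [82, 85, 90],
  'frankfurt': [24, 21, 30],
  'singapore': [190, 185, 201],
});
console.log(nearest); // { region: 'frankfurt', rtt: 24 }
```

The median is used instead of the mean so that a single slow probe does not push a call to a farther region.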

Edge Node Architecture

Each edge node runs a complete voice processing stack:

┌─────────────────────────────────────┐
│         Edge Node (Region)          │
├─────────────────────────────────────┤
│  WebSocket Gateway (Load Balanced)  │
│  ├─ Connection pooling              │
│  └─ Session management              │
├─────────────────────────────────────┤
│  Audio Processing Pipeline          │
│  ├─ Streaming STT (Deepgram)        │
│  ├─ LLM Inference (GPT-4/Claude)    │
│  └─ Streaming TTS (ElevenLabs)      │
├─────────────────────────────────────┤
│  Redis Cache (Hot data)             │
│  └─ Conversation context            │
└─────────────────────────────────────┘

WebSocket Optimization

HTTP request/response cycles add significant overhead. We use WebSockets for bidirectional, persistent connections: the handshake happens once when the connection is opened, and every audio frame after that is sent without per-request setup cost.

Connection Setup

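// Client-side WebSocket setup with optimizations.
// These options belong to the Node.js `ws` package; the browser WebSocket
// constructor takes only a URL and an optional protocols argument.
import WebSocket from 'ws';

const ws = new WebSocket('wss://edge.vaniagent.com/voice', {
  perMessageDeflate: false, // Disable compression for lower latency
  maxPayload: 256 * 1024,   // 256KB max message size
});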

ws.binaryType = 'arraybuffer'; // Efficient binary audio handling

ws.onopen = () => {
  // Send initial configuration
  ws.send(JSON.stringify({
    type: 'config',
    sampleRate: 16000,
    encoding: 'opus',
    language: 'en-US'
  }));
};

ws.onmessage = (event) => {
  if (event.data instanceof ArrayBuffer) {
    // Audio data - play immediately
    playAudioChunk(event.data);
  } else {
    // JSON control messages
    handleControlMessage(JSON.parse(event.data));
  }
};

Key Optimizations

Disable compression: Per-message deflate adds 10-30ms. We send pre-compressed audio instead.
Binary frames: Use binary WebSocket frames for audio, JSON for control messages.
Connection pooling: Reuse connections across multiple calls.
Keepalive tuning: A 30-second keepalive ping prevents idle connections from being dropped while adding negligible overhead.

Audio Codec Selection

Choosing the right audio codec is critical for balancing quality and latency.

Codec Comparison

Codec	Bitrate	Latency	Quality	Use Case
Opus	16-32 kbps	5-10ms	Excellent	Our default choice
G.711	64 kbps	0.125ms	Good	Legacy systems
AAC	32-128 kbps	50-100ms	Excellent	Too slow for real-time
MP3	128-320 kbps	100-200ms	Good	Not suitable

We use Opus for its optimal balance:

Low algorithmic latency (5-10ms with short frames; roughly 26.5ms at the default 20ms frame size)
Excellent quality at 24 kbps
Built-in packet loss concealment
Wide browser support

Opus Configuration

const opusConfig = {
  sampleRate: 16000,      // 16kHz is sufficient for speech
  channels: 1,            // Mono for voice
  bitrate: 24000,         // 24 kbps
  frameSize: 20,          // 20ms frames (320 samples)
  complexity: 5,          // Balance quality/CPU (0-10)
  packetLossPerc: 10,     // Expect 10% packet loss
  useDTX: true,           // Discontinuous transmission
  useInbandFEC: true      // Forward error correction
};
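The numbers in this configuration are linked by simple arithmetic, worked through below. `frameMath` is an illustrative helper, not part of any Opus binding, and the byte count is an average because Opus uses a variable bitrate by default.

```javascript
// Derive per-frame numbers from an Opus-style configuration.
function frameMath({ sampleRate, bitrate, frameSize }) {
  const samplesPerFrame = (sampleRate * frameSize) / 1000; // frameSize is in ms
  const framesPerSecond = 1000 / frameSize;
  const bytesPerFrame = bitrate / 8 / framesPerSecond;      // average encoded size
  return { samplesPerFrame, framesPerSecond, bytesPerFrame };
}

const frame = frameMath({ sampleRate: 16000, bitrate: 24000, frameSize: 20 });
console.log(frame);
// { samplesPerFrame: 320, framesPerSecond: 50, bytesPerFrame: 60 }
```

At about 60 bytes per frame, each audio message is small enough that per-message compression has little left to save, and frames sent during silence shrink further when DTX is enabled.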

Streaming Everything

The biggest latency win comes from streaming at every layer. Instead of waiting for complete responses, we stream partial results.

Streaming STT

// Deepgram streaming STT (v2-style Node SDK API)
const { Deepgram } = require('@deepgram/sdk');

const deepgram = new Deepgram(apiKey);
const dgConnection = deepgram.transcription.live({
  model: 'nova-2',
  language: 'en-US',
  punctuate: true,
  interim_results: true,  // Get partial transcripts
  endpointing: 300,       // 300ms silence = end of speech
  vad_events: true        // Voice activity detection
});

dgConnection.on('transcriptReceived', (message) => {
  // Some SDK versions deliver the payload as a JSON string
  const data = typeof message === 'string' ? JSON.parse(message) : message;
  const transcript = data.channel?.alternatives?.[0]?.transcript;
  if (!transcript) return; // Skip metadata events and empty results

  if (data.is_final) {
    // Final transcript - send to LLM
    processUserInput(transcript);
  } else {
    // Interim result - can show in UI
    updateTranscript(transcript);
  }
});
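Deepgram's streaming results also carry a `speech_final` flag that marks the end of an utterance once the endpointing silence is reached, while `is_final` only marks a finalized segment. One possible refinement, sketched here under that assumption, is to collect finalized segments and hand the LLM a whole utterance. `createUtteranceAssembler` and `msg` are hypothetical helpers, and the messages are simulated rather than captured from the API.

```javascript
// Collect finalized segments and emit one utterance when the speaker stops.
function createUtteranceAssembler(onUtterance) {
  let segments = [];
  return function handleResult(result) {
    const text = result.channel.alternatives[0].transcript.trim();
    if (result.is_final && text) segments.push(text); // ignore interim and empty results
    if (result.speech_final && segments.length > 0) {
      onUtterance(segments.join(' '));
      segments = [];
    }
  };
}

// Simulated results containing only the fields this sketch reads.
const msg = (transcript, is_final, speech_final) => ({
  is_final,
  speech_final,
  channel: { alternatives: [{ transcript }] },
});

const utterances = [];
const handleResult = createUtteranceAssembler((u) => utterances.push(u));
handleResult(msg('I need to', false, false));            // interim: skipped
handleResult(msg('I need to reschedule', true, false));  // finalized segment
handleResult(msg('my appointment.', true, true));        // endpoint reached
console.log(utterances); // [ 'I need to reschedule my appointment.' ]
```

Waiting for the endpoint avoids sending the LLM half a sentence, at the cost of the endpointing delay itself, so the silence threshold is a direct latency trade-off.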

Streaming LLM

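// OpenAI streaming completion
import OpenAI from 'openai';

const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment

const stream = await openai.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: conversationHistory,
  stream: true,
  max_tokens: 150,
  temperature: 0.7
});

let sentenceBuffer = '';
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  sentenceBuffer += content;

  // Send to TTS as soon as the buffer ends with sentence punctuation
  if (/[.!?]\s*$/.test(sentenceBuffer)) {
    streamToTTS(sentenceBuffer.trim());
    sentenceBuffer = '';
  }
}

// Flush trailing text that did not end with punctuation (e.g. a max_tokens cutoff)
if (sentenceBuffer.trim()) {
  streamToTTS(sentenceBuffer.trim());
}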
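The sentence-splitting step can be isolated and tested without calling an LLM. The sketch below is one possible approach, not the platform's implementation: `createSentenceChunker` is a hypothetical helper that treats punctuation followed by whitespace as a boundary, so a decimal such as "$3.50" split across two deltas is not cut in half.

```javascript
// Buffer streamed text deltas and emit complete sentences.
function createSentenceChunker(onSentence) {
  let buffer = '';
  // A sentence ends at one or more of . ! ? followed by whitespace.
  const boundary = /[.!?]+\s+/g;
  return {
    push(delta) {
      buffer += delta;
      let lastEnd = 0;
      let match;
      boundary.lastIndex = 0;
      while ((match = boundary.exec(buffer)) !== null) {
        const end = match.index + match[0].length;
        onSentence(buffer.slice(lastEnd, end).trim());
        lastEnd = end;
      }
      buffer = buffer.slice(lastEnd); // keep the unfinished tail
    },
    flush() {
      const rest = buffer.trim();
      buffer = '';
      if (rest) onSentence(rest); // emit whatever remains at end of stream
    },
  };
}

const sentences = [];
const chunker = createSentenceChunker((s) => sentences.push(s));
for (const delta of ['Sure', '. Your total is $3', '.50', ' today. Any', 'thing else']) {
  chunker.push(delta);
}
chunker.flush();
console.log(sentences);
// [ 'Sure.', 'Your total is $3.50 today.', 'Anything else' ]
```

Requiring whitespace after the punctuation means the final sentence is only emitted on `flush()`, and abbreviations such as "Dr. Smith" still split early, so this rule trades a little latency and accuracy for simplicity.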

Streaming TTS

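// ElevenLabs streaming TTS
// (`elevenlabs.textToSpeech` is assumed to be a client wrapper that returns an
// async iterable of audio chunks; check the exact call against your SDK version)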
const ttsStream = await elevenlabs.textToSpeech({
  voice_id: 'voice_id_here',
  text: textChunk,
  model_id: 'eleven_turbo_v2',  // Fastest model
  output_format: 'pcm_16000',
  optimize_streaming_latency: 4  // Max optimization
});

// Stream audio chunks directly to WebSocket
for await (const audioChunk of ttsStream) {
  ws.send(audioChunk);  // Send immediately, don't buffer
}

Intelligent Caching

We cache aggressively to avoid redundant processing:

Conversation Context Cache

// Redis cache for conversation state
const cacheKey = `conversation:${callId}`;
await redis.setex(cacheKey, 3600, JSON.stringify({
  history: conversationHistory,
  context: userContext,
  lastActivity: Date.now()
}));

TTS Audio Cache

Common phrases are pre-generated and cached:

const commonPhrases = [
  "How can I help you today?",
  "Let me check that for you.",
  "Is there anything else I can help with?"
];

// Pre-generate and cache on startup
for (const phrase of commonPhrases) {
  const audio = await generateTTS(phrase);
  await redis.set(`tts:${hash(phrase)}`, audio);
}
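The snippet above leaves the `hash` helper and the read path undefined. A minimal sketch of both follows, assuming a SHA-256 digest of the normalized phrase as the key. `phraseKey`, `getAudio`, and `fakeTts` are hypothetical names, a `Map` stands in for Redis, and the calls are synchronous only to keep the example short; real cache and TTS calls would be awaited.

```javascript
const crypto = require('node:crypto');

// Normalize so trivial differences in case or spacing map to the same key.
// Lowercasing is an assumption; drop it if casing affects pronunciation.
function phraseKey(phrase) {
  const normalized = phrase.trim().replace(/\s+/g, ' ').toLowerCase();
  const digest = crypto.createHash('sha256').update(normalized).digest('hex');
  return `tts:${digest}`;
}

// Cache-first lookup. `cache` needs get/set; `synthesize` stands in for the TTS call.
function getAudio(phrase, cache, synthesize) {
  const key = phraseKey(phrase);
  let audio = cache.get(key);
  if (audio === undefined) {
    audio = synthesize(phrase); // cache miss: pay the TTS cost once
    cache.set(key, audio);
  }
  return audio;
}

const cache = new Map();
let ttsCalls = 0;
const fakeTts = (text) => { ttsCalls += 1; return Buffer.from(text); };

getAudio('How can I help you today?', cache, fakeTts);
getAudio('  how can I help you   today? ', cache, fakeTts);
console.log(ttsCalls); // 1
```

A cache hit replaces the whole TTS stage with a key lookup, which is why greetings and other fixed phrases are good candidates for pre-generation.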

Measuring Latency

We instrument every layer to track latency in production:

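// serverReceiveTime is captured when the first audio frame reaches the server.
// requestTimestamp and clientReceiveTime come from the client, so the two network
// figures are only accurate when client and server clocks are synchronized.
const metrics = {
  networkIn: serverReceiveTime - requestTimestamp,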
  sttLatency: sttEndTime - sttStartTime,
  llmLatency: llmEndTime - llmStartTime,
  ttsLatency: ttsEndTime - ttsStartTime,
  networkOut: clientReceiveTime - sendTime,
  totalLatency: clientReceiveTime - requestTimestamp
};

// Log to monitoring system
logger.info('voice_latency', metrics);

// Alert if latency exceeds threshold
if (metrics.totalLatency > 500) {
  alerting.warn('High latency detected', metrics);
}
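Per-request metrics like these are usually summarized as percentiles rather than averages, because a few slow calls are what users notice. Below is a minimal sketch of a nearest-rank percentile; the `percentile` helper and the sample values are illustrative, not production data.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical total-latency samples in milliseconds.
const totals = [310, 295, 402, 388, 350, 460, 330, 298, 415, 720];
console.log(percentile(totals, 50)); // 350
console.log(percentile(totals, 95)); // 720
```

In this sample the median looks healthy while the P95 exposes the 720ms outlier, which is the kind of gap an average would hide.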

Real-World Performance

Our production metrics (P95):

Total latency: 420ms
STT: 180ms
LLM: 140ms
TTS: 80ms
Network: 20ms

We consistently hit sub-500ms latency for 95% of interactions.

Best Practices

Choose the nearest region: Always route to the closest edge node
Use streaming: Stream at every layer—STT, LLM, TTS
Optimize audio: Use Opus codec with 20ms frames
Cache aggressively: Pre-generate common responses
Monitor continuously: Track latency metrics in production
Tune for your use case: Balance latency vs. quality based on needs

Conclusion

Achieving sub-500ms latency requires optimization at every layer: edge computing, WebSocket connections, streaming protocols, and intelligent caching. By treating latency as a first-class concern, we've built AI voice agents that feel truly conversational.

The result? Natural conversations that users trust, higher completion rates, and a better overall experience.

Ready to build low-latency voice AI? Get started with VaniAgent and experience the difference.
