Engineering

Reducing Voice Latency to <500ms: A Deep Dive into Edge Architecture

Engineering Team
October 12, 2024
8 min read

In the world of AI voice agents, latency is everything. A delay of even 1-2 seconds can make conversations feel robotic and frustrating. At VaniAgent, we've engineered our platform to achieve sub-500ms latency, creating natural, human-like conversations that feel instantaneous.

Why Latency Matters in Voice AI

Human conversations have a natural rhythm. When someone asks a question, they expect a response within 200-500ms. Any longer, and the conversation feels awkward. Traditional phone systems add 100-300ms of network latency, so our AI processing budget is razor-thin.

Poor latency leads to:

  • Conversation overlap: Users start talking again before the AI responds
  • Reduced trust: Delays make the AI feel less intelligent
  • Lower completion rates: Users hang up on slow systems
  • Poor user experience: Natural flow is disrupted

Our goal: Keep total latency under 500ms from speech input to audio output.

The Latency Budget Breakdown

Let's break down where latency comes from in a typical AI voice interaction:

  1. Network transmission (user → server): 50-150ms
  2. Speech-to-Text (STT): 100-300ms
  3. LLM processing: 200-800ms
  4. Text-to-Speech (TTS): 100-400ms
  5. Network transmission (server → user): 50-150ms

Total: 500-1800ms (way too slow!)

To hit our <500ms target, we need to optimize every layer.
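
As a quick check of the arithmetic above, this sketch sums the per-stage ranges and reports how far the worst case overshoots the target. The helper names (totalBudget, overshoot) and the stage labels are illustrative and not part of any VaniAgent API.

```javascript
// Per-stage latency ranges in milliseconds, copied from the breakdown above.
const stages = [
  { name: 'network in',  min: 50,  max: 150 },
  { name: 'stt',         min: 100, max: 300 },
  { name: 'llm',         min: 200, max: 800 },
  { name: 'tts',         min: 100, max: 400 },
  { name: 'network out', min: 50,  max: 150 },
];

// Sum the best-case and worst-case latency when stages run strictly in sequence.
function totalBudget(stageList) {
  return stageList.reduce(
    (acc, s) => ({ min: acc.min + s.min, max: acc.max + s.max }),
    { min: 0, max: 0 }
  );
}

// How far the worst case exceeds a target, in milliseconds (0 if within target).
function overshoot(stageList, targetMs) {
  return Math.max(0, totalBudget(stageList).max - targetMs);
}

const total = totalBudget(stages);
console.log(`sequential total: ${total.min}-${total.max}ms`);
console.log(`worst-case overshoot vs 500ms: ${overshoot(stages, 500)}ms`);
```

Because even the best case already uses the whole 500ms, shaving individual stages is not enough on its own; the stages also have to overlap, which is what streaming provides.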

Edge Computing Architecture

The foundation of our low-latency system is edge computing. Instead of routing all traffic through centralized data centers, we deploy our voice processing infrastructure at the edge—geographically close to users.

Geographic Distribution

We run voice processing nodes in 15+ regions worldwide, including:

  • North America: US East, US West, US Central, Canada
  • Europe: London, Frankfurt, Paris, Amsterdam
  • Asia Pacific: Singapore, Tokyo, Sydney, Mumbai
  • Latin America: São Paulo, Mexico City

When a call comes in, we route it to the nearest edge node, reducing network latency by 50-80%.
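
One way to pick the "nearest" node is to probe each region and choose the lowest median round-trip time. The sketch below assumes probe results are already collected; the function names, region keys, and RTT values are hypothetical and only illustrate the selection step, not how VaniAgent's router actually works.

```javascript
// Median of a list of RTT samples (ms); less sensitive to one-off spikes than the mean.
function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Pick the region whose median RTT is lowest.
// rttSamplesByRegion: { regionName: [rttMs, ...] } (hypothetical probe results)
function pickNearestRegion(rttSamplesByRegion) {
  let best = null;
  for (const [region, samples] of Object.entries(rttSamplesByRegion)) {
    if (samples.length === 0) continue; // no probe data for this region, skip it
    const rtt = median(samples);
    if (best === null || rtt < best.rtt) best = { region, rtt };
  }
  return best; // null when no region has any samples
}

// Illustrative numbers only.
console.log(pickNearestRegion({
  'us-east': [80, 85, 300],
  'mumbai': [20, 25, 22],
  'frankfurt': [],
}));
```

Measured RTT tends to be a better signal than geographic distance, since routing paths do not always follow geography.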

Edge Node Architecture

Each edge node runs a complete voice processing stack:

┌─────────────────────────────────────┐
│         Edge Node (Region)          │
├─────────────────────────────────────┤
│  WebSocket Gateway (Load Balanced)  │
│  ├─ Connection pooling              │
│  └─ Session management              │
├─────────────────────────────────────┤
│  Audio Processing Pipeline          │
│  ├─ Streaming STT (Deepgram)        │
│  ├─ LLM Inference (GPT-4/Claude)    │
│  └─ Streaming TTS (ElevenLabs)      │
├─────────────────────────────────────┤
│  Redis Cache (Hot data)             │
│  └─ Conversation context            │
└─────────────────────────────────────┘

WebSocket Optimization

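HTTP request/response cycles add significant overhead. We use WebSockets for bidirectional, persistent connections: the handshake happens once when the connection opens, so individual audio frames and control messages carry no per-request setup cost.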

Connection Setup

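// Client-side WebSocket setup with optimizations, using the Node.js `ws` package.
// The browser WebSocket constructor does not accept an options object, so
// perMessageDeflate and maxPayload below apply only to `ws`.
import WebSocket from 'ws';

const ws = new WebSocket('wss://edge.vaniagent.com/voice', {
  perMessageDeflate: false, // Disable compression for lower latency
  maxPayload: 256 * 1024,   // 256KB max message size
});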

ws.binaryType = 'arraybuffer'; // Efficient binary audio handling

ws.onopen = () => {
  // Send initial configuration
  ws.send(JSON.stringify({
    type: 'config',
    sampleRate: 16000,
    encoding: 'opus',
    language: 'en-US'
  }));
};

ws.onmessage = (event) => {
  if (event.data instanceof ArrayBuffer) {
    // Audio data - play immediately
    playAudioChunk(event.data);
  } else {
    // JSON control messages
    handleControlMessage(JSON.parse(event.data));
  }
};

Key Optimizations

  1. Disable compression: Per-message deflate adds 10-30ms. We send pre-compressed audio instead.
  2. Binary frames: Use binary WebSocket frames for audio, JSON for control messages.
  3. Connection pooling: Reuse connections across multiple calls.
  4. Keepalive tuning: 30-second keepalive prevents connection drops without overhead.
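
The keepalive item can be sketched as a ping/pong heartbeat. This is a minimal sketch, assuming a socket object that exposes ping(), terminate(), and on('pong', ...) the way the Node `ws` package's WebSocket does; startKeepalive and its return shape are hypothetical names chosen for illustration.

```javascript
// Heartbeat sketch: ping every intervalMs and drop the connection if the
// previous ping was never answered with a pong.
function startKeepalive(socket, intervalMs = 30000) {
  let alive = true;
  socket.on('pong', () => { alive = true; });

  // One heartbeat tick. Returns false when the connection was terminated.
  function beat() {
    if (!alive) {
      clearInterval(timer);
      socket.terminate(); // peer did not answer the last ping
      return false;
    }
    alive = false; // set back to true by the 'pong' handler
    socket.ping();
    return true;
  }

  const timer = setInterval(beat, intervalMs);
  if (typeof timer.unref === 'function') timer.unref(); // do not keep the process alive
  return { beat, stop: () => clearInterval(timer) };
}
```

A dead peer is detected after at most two intervals, so a shorter interval catches failures sooner at the cost of slightly more traffic.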

Audio Codec Selection

Choosing the right audio codec is critical for balancing quality and latency.

Codec Comparison

Codec   Bitrate        Latency       Quality     Use Case
Opus    16-32 kbps     5-26.5ms      Excellent   Our default choice
G.711   64 kbps        0.125ms       Good        Legacy systems
AAC     32-128 kbps    50-100ms      Excellent   Too slow for real-time
MP3     128-320 kbps   100-200ms     Good        Not suitable

We use Opus for its optimal balance:

  • Low algorithmic latency (as low as 5ms at the smallest frame sizes, about 26.5ms by default with 20ms frames)
  • Excellent quality at 24 kbps
  • Built-in packet loss concealment
  • Wide browser support

Opus Configuration

const opusConfig = {
  sampleRate: 16000,      // 16kHz is sufficient for speech
  channels: 1,            // Mono for voice
  bitrate: 24000,         // 24 kbps
  frameSize: 20,          // 20ms frames (320 samples)
  complexity: 5,          // Balance quality/CPU (0-10)
  packetLossPerc: 10,     // Expect 10% packet loss
  useDTX: true,           // Discontinuous transmission
  useInbandFEC: true      // Forward error correction
};
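
To make the numbers in this configuration concrete, the following sketch derives the per-packet figures implied by the sample rate, bitrate, and frame size. packetStats is a hypothetical helper, and the payload size is an average that ignores transport headers and the effect of DTX or variable bitrate.

```javascript
// Derive per-packet numbers from an Opus-style config (values from the config above).
function packetStats({ sampleRate, bitrate, frameSize }) {
  const samplesPerFrame = (sampleRate * frameSize) / 1000; // frameSize is in ms
  const packetsPerSecond = 1000 / frameSize;
  const avgPayloadBytes = bitrate / 8 / packetsPerSecond;  // average, before headers
  return { samplesPerFrame, packetsPerSecond, avgPayloadBytes };
}

console.log(packetStats({ sampleRate: 16000, bitrate: 24000, frameSize: 20 }));
```

At these settings each packet carries 320 samples in roughly 60 bytes, and 50 packets are sent per second while the speaker is talking.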

Streaming Everything

The biggest latency win comes from streaming at every layer. Instead of waiting for complete responses, we stream partial results.

Streaming STT

// Deepgram streaming STT
const deepgram = new Deepgram(apiKey);
const dgConnection = deepgram.transcription.live({
  model: 'nova-2',
  language: 'en-US',
  punctuate: true,
  interim_results: true,  // Get partial transcripts
  endpointing: 300,       // 300ms silence = end of speech
  vad_events: true        // Voice activity detection
});

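dgConnection.on('transcriptReceived', (message) => {
  // Some SDK versions deliver the payload as a JSON string, so parse defensively
  const data = typeof message === 'string' ? JSON.parse(message) : message;
  const transcript = data.channel?.alternatives?.[0]?.transcript ?? '';
  if (!transcript) return; // Skip empty results and non-transcript messages

  if (data.is_final) {
    // Final transcript - send to LLM
    processUserInput(transcript);
  } else {
    // Interim result - can show in UI
    updateTranscript(transcript);
  }
});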
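
A single user turn can arrive as several finalized segments, so it may help to join them before calling the LLM. The sketch below assumes each parsed message carries is_final, a speech_final flag that marks the detected end of speech, and channel.alternatives[0].transcript; createUtteranceAssembler is a hypothetical helper, and the message shape should be checked against the STT provider's current response format.

```javascript
// Collect finalized transcript segments and emit one utterance per detected endpoint.
function createUtteranceAssembler(onUtterance) {
  let segments = [];
  return function handleMessage(data) {
    const text = (data.channel?.alternatives?.[0]?.transcript ?? '').trim();
    // Only finalized segments are kept; interim results are superseded later.
    if (data.is_final && text !== '') segments.push(text);
    // End of speech: hand the joined utterance to the caller and reset.
    if (data.speech_final && segments.length > 0) {
      onUtterance(segments.join(' '));
      segments = [];
    }
  };
}

// Usage with hand-written messages (illustrative only).
const handle = createUtteranceAssembler((utterance) => console.log(utterance));
const msg = (transcript, is_final, speech_final) =>
  ({ is_final, speech_final, channel: { alternatives: [{ transcript }] } });
handle(msg('I want', false, false));
handle(msg('I want to book', true, false));
handle(msg('a flight', true, true));
```

This trades a little latency for fewer, more complete LLM requests, so whether it is worthwhile depends on how short the endpointing window is.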

Streaming LLM

// OpenAI streaming completion
const stream = await openai.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: conversationHistory,
  stream: true,
  max_tokens: 150,
  temperature: 0.7
});

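let sentenceBuffer = '';
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  sentenceBuffer += content;

  // Send to TTS as soon as the buffer ends with a complete sentence
  // (trailing whitespace after the punctuation is allowed)
  if (/[.!?]\s*$/.test(sentenceBuffer)) {
    streamToTTS(sentenceBuffer.trim());
    sentenceBuffer = '';
  }
}

// Flush any trailing text that did not end with sentence punctuation
if (sentenceBuffer.trim()) {
  streamToTTS(sentenceBuffer.trim());
}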

Streaming TTS

// ElevenLabs streaming TTS
const ttsStream = await elevenlabs.textToSpeech({
  voice_id: 'voice_id_here',
  text: textChunk,
  model_id: 'eleven_turbo_v2',  // Fastest model
  output_format: 'pcm_16000',
  optimize_streaming_latency: 4  // Max optimization
});

// Stream audio chunks directly to WebSocket
for await (const audioChunk of ttsStream) {
  ws.send(audioChunk);  // Send immediately, don't buffer
}

Intelligent Caching

We cache aggressively to avoid redundant processing:

Conversation Context Cache

// Redis cache for conversation state
const cacheKey = `conversation:${callId}`;
await redis.setex(cacheKey, 3600, JSON.stringify({
  history: conversationHistory,
  context: userContext,
  lastActivity: Date.now()
}));
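
Reading the state back at the start of each turn is the other half of this cache. A minimal sketch, assuming the same key format as above and a client that exposes an async get(key); loadConversation and the shape of the fallback state are illustrative assumptions.

```javascript
// Load cached conversation state, falling back to a fresh state on a cache
// miss or on unparseable JSON. `redis` only needs an async get(key) here.
async function loadConversation(redis, callId) {
  const fresh = { history: [], context: {}, lastActivity: Date.now() };
  const raw = await redis.get(`conversation:${callId}`);
  if (raw == null) return fresh; // expired or never written
  try {
    return JSON.parse(raw);
  } catch {
    return fresh; // treat corrupt entries as a cache miss
  }
}
```

Falling back to a fresh state keeps a call alive if the entry expired mid-conversation, at the cost of losing earlier context.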

TTS Audio Cache

Common phrases are pre-generated and cached:

const commonPhrases = [
  "How can I help you today?",
  "Let me check that for you.",
  "Is there anything else I can help with?"
];

// Pre-generate and cache on startup
for (const phrase of commonPhrases) {
  const audio = await generateTTS(phrase);
  await redis.set(`tts:${hash(phrase)}`, audio);
}
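
The snippet above leaves hash undefined and does not show the lookup path. One possible shape is sketched below, using SHA-256 from Node's built-in crypto module for the key and a get-or-generate helper for reads. The names normalizePhrase, ttsCacheKey, and getOrGenerateTTS are hypothetical, the cache only needs async get and set, and storing binary audio in a real Redis client needs buffer-aware calls that this sketch does not cover.

```javascript
const { createHash } = require('node:crypto');

// Collapse whitespace so trivially different spellings share one cache entry.
function normalizePhrase(phrase) {
  return phrase.trim().replace(/\s+/g, ' ');
}

// Stable cache key: "tts:" followed by the SHA-256 hex digest of the phrase.
function ttsCacheKey(phrase) {
  const digest = createHash('sha256').update(normalizePhrase(phrase)).digest('hex');
  return `tts:${digest}`;
}

// Return cached audio when present; otherwise synthesize, store, and return it.
// `generate` stands in for whatever TTS call is in use.
async function getOrGenerateTTS(cache, phrase, generate) {
  const key = ttsCacheKey(phrase);
  const cached = await cache.get(key);
  if (cached != null) return cached;
  const audio = await generate(normalizePhrase(phrase));
  await cache.set(key, audio);
  return audio;
}
```

With this in place, a cache hit skips the TTS stage entirely for that phrase, and a miss costs one extra cache round trip.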

Measuring Latency

We instrument every layer to track latency in production:

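// Timing for one conversational turn, in milliseconds.
// networkIn, networkOut, and totalLatency compare timestamps taken on different
// machines (client and server), so they are only meaningful if the clocks are
// synchronized or the client reports its own measurements.
const metrics = {
  networkIn: Date.now() - requestTimestamp,
  sttLatency: sttEndTime - sttStartTime,
  llmLatency: llmEndTime - llmStartTime,
  ttsLatency: ttsEndTime - ttsStartTime,
  networkOut: clientReceiveTime - sendTime,
  totalLatency: clientReceiveTime - requestTimestamp
};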

// Log to monitoring system
logger.info('voice_latency', metrics);

// Alert if latency exceeds threshold
if (metrics.totalLatency > 500) {
  alerting.warn('High latency detected', metrics);
}
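
Per-request alerts are noisy, so latency is usually summarized as a percentile over a window of samples. The sketch below uses the nearest-rank definition; percentile is a hypothetical helper and the sample values are made up for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Illustrative totalLatency samples in milliseconds.
const window = [420, 380, 510, 400, 390];
console.log('P50:', percentile(window, 50));
console.log('P95:', percentile(window, 95));
```

Per-stage percentiles do not generally add up to the end-to-end percentile, so the total is best computed from the per-request totalLatency values directly.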

Real-World Performance

Our production metrics (P95):

  • Total latency: 420ms
  • STT: 180ms
  • LLM: 140ms
  • TTS: 80ms
  • Network: 20ms

We consistently hit sub-500ms latency for 95% of interactions.

Best Practices

  1. Choose the nearest region: Always route to the closest edge node
  2. Use streaming: Stream at every layer—STT, LLM, TTS
  3. Optimize audio: Use Opus codec with 20ms frames
  4. Cache aggressively: Pre-generate common responses
  5. Monitor continuously: Track latency metrics in production
  6. Tune for your use case: Balance latency vs. quality based on needs

Conclusion

Achieving sub-500ms latency requires optimization at every layer: edge computing, WebSocket connections, streaming protocols, and intelligent caching. By treating latency as a first-class concern, we've built AI voice agents that feel truly conversational.

The result? Natural conversations that users trust, higher completion rates, and a better overall experience.

Ready to build low-latency voice AI? Get started with VaniAgent and experience the difference.
