
Support
+91 73375 92673
Quick note
Compare metered billing against unlimited Vani TTS before you pick a plan.

In the world of AI voice agents, latency is everything. A delay of even 1-2 seconds can make conversations feel robotic and frustrating. At VaniAgent, we've engineered our platform to achieve sub-500ms latency, creating natural, human-like conversations that feel instantaneous.
Human conversations have a natural rhythm. When someone asks a question, they expect a response within 200-500ms; any longer, and the conversation feels awkward. Traditional phone systems already add 100-300ms of network latency, which leaves only about 200-400ms of a 500ms target for speech recognition, LLM inference, and speech synthesis combined.
Poor latency leads to:
Our goal: Keep total latency under 500ms from speech input to audio output.
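As a rough illustration of how thin that budget is, the sketch below subtracts the 100-300ms telephony overhead quoted above from the 500ms goal. The function and constant names are illustrative assumptions, not part of the VaniAgent platform.

```javascript
// Hypothetical helper: how much of the end-to-end target is left for
// STT + LLM + TTS once network latency has been accounted for.
function processingBudgetMs(targetMs, networkMs) {
  return Math.max(0, targetMs - networkMs);
}

const TARGET_MS = 500; // end-to-end goal stated in the article

// Network overhead range quoted for traditional phone systems.
const bestCase = processingBudgetMs(TARGET_MS, 100);  // 400
const worstCase = processingBudgetMs(TARGET_MS, 300); // 200

console.log(`Processing budget: ${worstCase}-${bestCase}ms`);
```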
Let's break down where latency comes from in a typical AI voice interaction:
Total: 500-1800ms (way too slow!)
To hit our <500ms target, we need to optimize every layer.
The foundation of our low-latency system is edge computing. Instead of routing all traffic through centralized data centers, we deploy our voice processing infrastructure at the edge—geographically close to users.
We run voice processing nodes in 15+ regions worldwide:
When a call comes in, we route it to the nearest edge node, reducing network latency by 50-80%.
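One simple way to pick a node is to probe each region and choose the lowest round-trip time. The sketch below assumes the RTT measurements are already available; the region names and numbers are invented for illustration, and the production routing layer may work quite differently (for example, via DNS-based or anycast routing).

```javascript
// Hypothetical sketch: choose the edge region with the lowest measured RTT.
// `rttByRegion` maps a region name to a round-trip time in milliseconds.
function pickNearestRegion(rttByRegion) {
  let best = null;
  for (const [region, rttMs] of Object.entries(rttByRegion)) {
    if (!Number.isFinite(rttMs)) continue; // skip failed probes
    if (best === null || rttMs < best.rttMs) {
      best = { region, rttMs };
    }
  }
  return best; // null when no region answered
}

// Example measurements (invented values, for illustration only).
const probes = { 'ap-south': 38, 'eu-west': 142, 'us-east': 215, 'sa-east': NaN };
const nearest = pickNearestRegion(probes);
console.log(nearest); // { region: 'ap-south', rttMs: 38 }
```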
Each edge node runs a complete voice processing stack:
```
┌─────────────────────────────────────┐
│ Edge Node (Region)                  │
├─────────────────────────────────────┤
│ WebSocket Gateway (Load Balanced)   │
│  ├─ Connection pooling              │
│  └─ Session management              │
├─────────────────────────────────────┤
│ Audio Processing Pipeline           │
│  ├─ Streaming STT (Deepgram)        │
│  ├─ LLM Inference (GPT-4/Claude)    │
│  └─ Streaming TTS (ElevenLabs)      │
├─────────────────────────────────────┤
│ Redis Cache (Hot data)              │
│  └─ Conversation context            │
└─────────────────────────────────────┘
```
HTTP request/response cycles add significant overhead. We use WebSockets for bidirectional, persistent connections that eliminate handshake latency.
```javascript
// Client-side WebSocket setup with optimizations.
// The options object is supported by the Node.js `ws` package;
// the browser WebSocket constructor does not accept it.
import WebSocket from 'ws';

const ws = new WebSocket('wss://edge.vaniagent.com/voice', {
  perMessageDeflate: false, // Disable compression for lower latency
  maxPayload: 256 * 1024,   // 256KB max message size
});
ws.binaryType = 'arraybuffer'; // Efficient binary audio handling

ws.onopen = () => {
  // Send initial configuration
  ws.send(JSON.stringify({
    type: 'config',
    sampleRate: 16000,
    encoding: 'opus',
    language: 'en-US'
  }));
};

ws.onmessage = (event) => {
  if (event.data instanceof ArrayBuffer) {
    // Audio data - play immediately
    playAudioChunk(event.data);
  } else {
    // JSON control messages
    handleControlMessage(JSON.parse(event.data));
  }
};
```
Choosing the right audio codec is critical for balancing quality and latency.
| Codec | Bitrate | Latency | Quality | Use Case |
|---|---|---|---|---|
| Opus | 16-32 kbps | 5-66.5ms (about 26.5ms with 20ms frames) | Excellent | Our default choice |
| G.711 | 64 kbps | 0.125ms | Good | Legacy systems |
| AAC | 32-128 kbps | 50-100ms | Excellent | Too slow for real-time |
| MP3 | 128-320 kbps | 100-200ms | Good | Not suitable |
We use Opus for its optimal balance:
```javascript
const opusConfig = {
  sampleRate: 16000,    // 16kHz is sufficient for speech
  channels: 1,          // Mono for voice
  bitrate: 24000,       // 24 kbps
  frameSize: 20,        // 20ms frames (320 samples)
  complexity: 5,        // Balance quality/CPU (0-10)
  packetLossPerc: 10,   // Expect 10% packet loss
  useDTX: true,         // Discontinuous transmission
  useInbandFEC: true    // Forward error correction
};
```
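The numbers in this configuration are related by simple arithmetic. The sketch below derives samples per frame, packets per second, and the average encoded payload per packet from the values above. The helper name is an illustrative assumption, and real packets vary in size because Opus is variable-bitrate by default and DTX suppresses packets during silence.

```javascript
// Hypothetical helper: derive per-frame figures from an Opus-style config.
function frameStats({ sampleRate, bitrate, frameSize }) {
  const samplesPerFrame = (sampleRate * frameSize) / 1000; // frameSize is in ms
  const framesPerSecond = 1000 / frameSize;
  const avgBytesPerFrame = bitrate / 8 / framesPerSecond;  // payload only, no headers
  return { samplesPerFrame, framesPerSecond, avgBytesPerFrame };
}

const stats = frameStats({ sampleRate: 16000, bitrate: 24000, frameSize: 20 });
console.log(stats);
// { samplesPerFrame: 320, framesPerSecond: 50, avgBytesPerFrame: 60 }
```

Each packet carries 20ms of audio, so the frame size also sets a floor on how long the sender must wait before it can transmit anything.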
The biggest latency win comes from streaming at every layer. Instead of waiting for complete responses, we stream partial results.
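```javascript
// Deepgram streaming STT (v2-style Node SDK interface)
import { Deepgram } from '@deepgram/sdk';

const deepgram = new Deepgram(apiKey);
const dgConnection = deepgram.transcription.live({
  model: 'nova-2',
  language: 'en-US',
  punctuate: true,
  interim_results: true, // Get partial transcripts
  endpointing: 300,      // 300ms silence = end of speech
  vad_events: true       // Voice activity detection
});

dgConnection.on('transcriptReceived', (message) => {
  // The live socket may deliver raw JSON text; parse it before use
  const data = typeof message === 'string' ? JSON.parse(message) : message;
  const transcript = data.channel?.alternatives?.[0]?.transcript;
  if (!transcript) return; // Skip VAD/metadata events and empty results

  if (data.is_final) {
    // Final transcript - send to LLM
    processUserInput(transcript);
  } else {
    // Interim result - can show in UI
    updateTranscript(transcript);
  }
});
```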
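```javascript
// OpenAI streaming completion
import OpenAI from 'openai';

const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment

const stream = await openai.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: conversationHistory,
  stream: true,
  max_tokens: 150,
  temperature: 0.7
});

let pendingText = '';
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  pendingText += content;
  // Send to TTS as soon as we have a complete sentence.
  // Test the buffered text and tolerate trailing whitespace.
  if (/[.!?]\s*$/.test(pendingText)) {
    streamToTTS(pendingText);
    pendingText = '';
  }
}
// Flush any trailing text that did not end with punctuation
if (pendingText.trim()) {
  streamToTTS(pendingText);
}
```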
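The sentence-boundary step can be isolated into a small, testable helper. The sketch below is one possible approach, not the platform's actual implementation: it only treats punctuation as a boundary once whitespace follows it, so a token stream that splits "$3.50" across chunks is not cut in half. The name `createSentenceChunker` is hypothetical, and abbreviations such as "Dr. Smith" would still be split.

```javascript
// Hypothetical helper: buffer streamed tokens and emit complete sentences.
function createSentenceChunker(onSentence) {
  let buffer = '';
  const boundary = /[.!?]+\s+/; // punctuation followed by whitespace

  return {
    // Add one streamed token and emit any sentences it completes.
    push(token) {
      buffer += token;
      let match = boundary.exec(buffer);
      while (match !== null) {
        const end = match.index + match[0].length;
        const sentence = buffer.slice(0, end).trim();
        if (sentence) onSentence(sentence);
        buffer = buffer.slice(end);
        match = boundary.exec(buffer);
      }
    },
    // Call once the stream ends to emit whatever is left.
    flush() {
      const rest = buffer.trim();
      buffer = '';
      if (rest) onSentence(rest);
    },
  };
}

const spoken = [];
const chunker = createSentenceChunker((sentence) => spoken.push(sentence));
const tokens = ['Sure', '!', ' Your', ' total', ' is', ' $3', '.', '50', '.',
                ' Any', 'thing', ' else', '?'];
for (const token of tokens) chunker.push(token);
chunker.flush();
console.log(spoken);
// [ 'Sure!', 'Your total is $3.50.', 'Anything else?' ]
```

Waiting for the following whitespace delays each sentence by roughly one token, which is usually a small price compared with sending a half-finished sentence to TTS.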
```javascript
// ElevenLabs streaming TTS
// `elevenlabs` is the application's TTS client; exact method names
// vary by SDK and version.
const ttsStream = await elevenlabs.textToSpeech({
  voice_id: 'voice_id_here',
  text: textChunk,
  model_id: 'eleven_turbo_v2',   // Fastest model
  output_format: 'pcm_16000',
  optimize_streaming_latency: 4  // Max optimization
});

// Stream audio chunks directly to WebSocket
for await (const audioChunk of ttsStream) {
  ws.send(audioChunk); // Send immediately, don't buffer
}
```
We cache aggressively to avoid redundant processing:
```javascript
// Redis cache for conversation state (expires after 1 hour)
const cacheKey = `conversation:${callId}`;
await redis.setex(cacheKey, 3600, JSON.stringify({
  history: conversationHistory,
  context: userContext,
  lastActivity: Date.now()
}));
```
Common phrases are pre-generated and cached:
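```javascript
import { createHash } from 'node:crypto';

// Stable cache key derived from the phrase text
const hash = (text) => createHash('sha256').update(text).digest('hex');

const commonPhrases = [
  "How can I help you today?",
  "Let me check that for you.",
  "Is there anything else I can help with?"
];

// Pre-generate and cache on startup
for (const phrase of commonPhrases) {
  const audio = await generateTTS(phrase);
  await redis.set(`tts:${hash(phrase)}`, audio);
}
```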
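At request time, the cache is consulted before any synthesis happens. The sketch below shows that read path under stated assumptions: an in-memory `Map` stands in for Redis, `synthesize` stands in for the real TTS call, and the names `createTtsCache` and `keyFor` are hypothetical.

```javascript
// Hypothetical sketch: serve cached audio when available, otherwise synthesize.
function createTtsCache(synthesize) {
  const cache = new Map(); // stand-in for Redis
  let hits = 0;
  let misses = 0;

  // Collapse whitespace so trivially different strings share one entry.
  const keyFor = (text) => text.trim().replace(/\s+/g, ' ');

  return {
    async get(text) {
      const key = keyFor(text);
      if (cache.has(key)) {
        hits += 1;
        return cache.get(key);
      }
      misses += 1;
      const audio = await synthesize(key);
      cache.set(key, audio);
      return audio;
    },
    stats: () => ({ hits, misses }),
  };
}

// Usage with a fake synthesizer that returns a labelled placeholder.
const ttsCache = createTtsCache(async (text) => `audio(${text})`);
const demo = (async () => {
  await ttsCache.get('How can I help you today?');    // miss: synthesized
  await ttsCache.get('How can I help you  today? ');  // hit: same normalized key
  return ttsCache.stats();
})();
demo.then((result) => console.log(result)); // { hits: 1, misses: 1 }
```

Two concurrent misses for the same phrase would both call the synthesizer in this version; storing the pending promise in the map is one way to avoid that.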
We instrument every layer to track latency in production:
```javascript
// All values are in milliseconds. The two network legs compare timestamps
// taken on different machines, so they are only meaningful when the client
// and server clocks are synchronized.
const metrics = {
  networkIn: Date.now() - requestTimestamp,
  sttLatency: sttEndTime - sttStartTime,
  llmLatency: llmEndTime - llmStartTime,
  ttsLatency: ttsEndTime - ttsStartTime,
  networkOut: clientReceiveTime - sendTime,
  totalLatency: clientReceiveTime - requestTimestamp
};

// Log to monitoring system
logger.info('voice_latency', metrics);

// Alert if latency exceeds threshold
if (metrics.totalLatency > 500) {
  alerting.warn('High latency detected', metrics);
}
```
Our production metrics (P95):
We consistently hit sub-500ms latency for 95% of interactions.
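For readers unfamiliar with the P95 figure, the sketch below shows one common way to compute it (the nearest-rank method) from a batch of total-latency samples. The sample values are invented for illustration and are not VaniAgent production data; the helper name is an assumption.

```javascript
// Hypothetical helper: nearest-rank percentile over latency samples (ms).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Invented sample data, for illustration only.
const totals = [310, 295, 420, 380, 460, 330, 350, 610, 340, 370,
                300, 325, 345, 390, 410, 365, 355, 335, 315, 480];
const p95 = percentile(totals, 95);
console.log(p95, p95 <= 500 ? 'within target' : 'over target');
// prints: 480 within target
```

Note that a single 610ms outlier does not move the P95 here, which is why percentile targets are usually tracked alongside a maximum or P99.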
Achieving sub-500ms latency requires optimization at every layer: edge computing, WebSocket connections, streaming protocols, and intelligent caching. By treating latency as a first-class concern, we've built AI voice agents that feel truly conversational.
The result? Natural conversations that users trust, higher completion rates, and a better overall experience.
Ready to build low-latency voice AI? Get started with VaniAgent and experience the difference.