The voice pipeline

Every call runs the same real-time loop: the caller speaks, STT transcribes the audio, the LLM generates a reply, and TTS speaks it back. The loop repeats for the duration of the call. Each stage is configurable: you choose the provider and model, and OneInbox handles streaming, connection management, and timing.

Stage 1 — Speech-to-text (STT)

When the caller speaks, audio streams in real time to the STT provider. OneInbox uses end-of-speech detection, based on a pause threshold rather than a fixed timer, to decide when the caller has finished a turn. The resulting transcript is passed immediately to the LLM.

What you configure: provider, model, language.

Platform providers (no credential needed): Deepgram (nova-3, flux-en, flux-multi), Whisper, AssemblyAI, Azure.

Deepgram nova-3 is the default and is recommended for most use cases: it offers the lowest transcription latency, which directly reduces overall response time.
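To illustrate what "pause threshold, not a fixed timer" means, here is a minimal sketch of end-of-speech detection. It is not OneInbox's implementation: the class name, the 700 ms threshold, and the 20 ms frame size are all assumptions chosen for the example, and it presumes some upstream voice-activity detector has already labelled each audio frame as speech or silence.

```python
from dataclasses import dataclass


@dataclass
class EndOfSpeechDetector:
    """Flags the end of a turn once silence has lasted pause_threshold_ms."""

    pause_threshold_ms: int = 700  # hypothetical value, not a OneInbox default
    frame_ms: int = 20             # assumed duration of each audio frame
    _heard_speech: bool = False
    _silence_ms: int = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one frame; return True when the caller's turn has ended."""
        if is_speech:
            # Any speech resets the silence counter, so short pauses
            # in the middle of a sentence do not end the turn.
            self._heard_speech = True
            self._silence_ms = 0
            return False
        if not self._heard_speech:
            return False  # leading silence before the caller starts talking
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.pause_threshold_ms:
            # Turn is over: reset state so the next turn starts clean.
            self._heard_speech = False
            self._silence_ms = 0
            return True
        return False
```

The turn ends relative to when the caller stops talking, however long they spoke, which is the difference from a fixed timer that would cut every turn at the same length.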

Stage 2 — Language model (LLM)

The transcript arrives at the LLM along with the full conversation history, the system prompt, and any knowledge base context. The LLM generates a reply.

Streaming: the LLM streams tokens as it generates them. OneInbox begins passing text to TTS as soon as the first sentence boundary is detected; it does not wait for the full reply.

Tool calls: if the LLM decides to call a tool (transfer, SMS, booking, etc.), OneInbox executes the tool, returns the result to the LLM, and the LLM continues generating. The caller hears normal audio while this happens: the agent keeps speaking rather than going silent.

What you configure: provider, model, system prompt, temperature, tools, knowledge bases.

Platform providers (no credential needed): OpenAI, Shisa. Anthropic and Groq require a credential (BYOK).
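The sentence-boundary handoff described above can be sketched as a small generator that buffers streamed tokens and emits each sentence as soon as it is complete. This is an illustration only: the function name and the boundary rule (a `.`, `!` or `?` followed by whitespace) are assumptions, and a production splitter would need to handle abbreviations, decimals, and similar cases.

```python
import re
from typing import Iterable, Iterator

# Naive boundary rule (an assumption for this sketch):
# a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of LLM tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence
        # and can be handed to TTS right away.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished tail for the next token
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


tokens = ["Sure", ", I can", " help.", " What", " time", " works", " for you?"]
for sentence in sentences_from_tokens(tokens):
    print(sentence)
# Sure, I can help.
# What time works for you?
```

Because the first sentence is yielded while later tokens are still arriving, speech synthesis can start before the LLM has finished the reply.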

Stage 3 — Text-to-speech (TTS)

As the LLM streams reply text, TTS converts it to audio in real time. Audio is streamed back to the caller as it is generated, so the caller starts hearing the agent speak before the full reply is finished.

What you configure: provider, voice, speed, stability.

Platform providers (no credential needed): Cartesia, Deepgram, ElevenLabs, OpenAI, Minimax, Shisa.

Interruption handling

When a caller speaks while the agent is talking, OneInbox detects it and stops the agent’s audio immediately — the caller’s new speech goes straight into a new STT → LLM → TTS cycle. The interruption_sensitivity field (0.0–1.0) controls how easily the agent can be interrupted.
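As a rough model of the cutoff behaviour, the sketch below plays audio chunks until an interruption flag is set, then drops the rest of the reply. Everything here is hypothetical (function names, the `asyncio.Event` flag, the list standing in for the audio output), and it does not model how `interruption_sensitivity` influences detection, since the source does not specify that.

```python
import asyncio
from typing import List, Sequence, Tuple


async def play_reply(
    chunks: Sequence[str], interrupted: asyncio.Event, played: List[str]
) -> bool:
    """Play chunks until done or interrupted.

    Returns True if the whole reply was played, False if it was cut off.
    """
    for chunk in chunks:
        if interrupted.is_set():
            return False        # stop immediately and drop remaining audio
        played.append(chunk)    # stand-in for writing audio to the call
        await asyncio.sleep(0)  # yield so the interruption detector can run
    return True


async def demo() -> Tuple[bool, List[str]]:
    interrupted = asyncio.Event()
    played: List[str] = []

    async def caller_barges_in() -> None:
        await asyncio.sleep(0)  # let the agent start speaking first
        interrupted.set()       # stand-in for "caller speech detected"

    finished, _ = await asyncio.gather(
        play_reply(["a1", "a2", "a3", "a4"], interrupted, played),
        caller_barges_in(),
    )
    return finished, played


if __name__ == "__main__":
    finished, played = asyncio.run(demo())
    print("finished:", finished, "chunks played:", played)
```

Checking the flag before every chunk keeps the cutoff latency bounded by one chunk's duration, which is why small audio chunks matter for responsive interruption.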

Latency

Response latency is the sum of three segments:
| Segment | What it measures |
| --- | --- |
| STT latency | Time from caller finishing speaking to transcript arriving at the LLM |
| LLM TTFT (time to first token) | Time from transcript arriving to the LLM producing its first token |
| TTS TTFB (time to first byte) | Time from first token arriving at TTS to first audio byte produced |

Because all three stages stream in parallel, the caller typically starts hearing the agent’s reply within 800–1500 ms of finishing their own sentence, depending on provider choices.
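As a worked example of the sum, the snippet below adds the three segments. The millisecond values are made up for illustration and are not measurements of any provider.

```python
def response_latency_ms(stt_ms: float, llm_ttft_ms: float, tts_ttfb_ms: float) -> float:
    """Response latency = STT latency + LLM TTFT + TTS TTFB."""
    return stt_ms + llm_ttft_ms + tts_ttfb_ms


# Hypothetical per-segment timings, in milliseconds.
total = response_latency_ms(stt_ms=300, llm_ttft_ms=450, tts_ttfb_ms=200)
print(total)  # 950
```

With these assumed numbers the total falls inside the 800–1500 ms range quoted above. Swapping any one provider changes only its own term in the sum.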

Call types

| Type | Transport | How it connects |
| --- | --- | --- |
| Browser call | WebRTC | Via the Web SDK or POST /v1/calls/web. No phone number needed; runs entirely over the internet |
| Outbound phone call | PSTN / SIP | Agent dials to_number from from_number via POST /v1/calls |
| Inbound phone call | PSTN / SIP | Caller dials your number, which is routed to the agent assigned to that number |

The same agent config works for all three call types.
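The sketch below shows how an outbound call request to POST /v1/calls might be assembled. Only the path and the `to_number` and `from_number` fields come from this page. The base URL is a placeholder, and the `agent_id` field and the Bearer-token header are assumptions, so check the API reference for the real values. The function only builds the request and does not send it.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder, not the real OneInbox host


def build_outbound_call_request(
    api_key: str, agent_id: str, from_number: str, to_number: str
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/calls request."""
    payload = {
        "agent_id": agent_id,        # hypothetical field name
        "from_number": from_number,  # field named in the call types table
        "to_number": to_number,      # field named in the call types table
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/calls",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_outbound_call_request(
    api_key="YOUR_API_KEY",
    agent_id="agent_123",
    from_number="+15550100",
    to_number="+15550199",
)
# To actually place the call: urllib.request.urlopen(request)
```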

What OneInbox manages for you

  • Real-time audio streaming between STT, LLM, and TTS
  • End-of-speech detection and turn management
  • Interruption detection and agent audio cutoff
  • Tool execution and result injection into the LLM context
  • Knowledge base retrieval and context injection
  • Silence timeout and end-call-phrase detection
  • Transcripts, call metadata, and AI summaries
  • Telephony routing (PSTN/SIP) for phone calls
  • WebRTC session tokens for browser calls

Next steps

  • Quickstart — make your first call in minutes
  • Agents — full agent configuration reference
  • LLMs — configure the AI brain
  • Voices — browse available voices and configure TTS