The voice pipeline
Every call runs the same real-time loop: the caller speaks, the audio is transcribed, the LLM generates a reply, and TTS speaks it back. This happens continuously for the duration of the call. Each stage is configurable: you choose the provider and model. OneInbox handles streaming, connection management, and timing.
Stage 1 — Speech-to-text (STT)
When the caller speaks, audio streams in real time to the STT provider. OneInbox uses end-of-speech detection to know when the caller has finished a turn, based on a pause threshold rather than a fixed timer. The resulting transcript is passed immediately to the LLM.
What you configure: provider, model, language.
Platform providers (no credential needed): Deepgram (nova-3, flux-en, flux-multi), Whisper, AssemblyAI, Azure.
Deepgram nova-3 is the default and recommended for most use cases — it offers the lowest transcription latency, which directly reduces overall response time.
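The pause-threshold idea behind end-of-speech detection can be illustrated with a small sketch. This is not OneInbox's implementation: the class name, the 700 ms threshold, and the 20 ms frame size are all assumptions chosen for illustration, and the per-frame speech flag is assumed to come from some voice-activity detector.

```python
from dataclasses import dataclass


@dataclass
class EndOfSpeechDetector:
    """Flags the end of a turn once silence has lasted pause_threshold_ms."""

    pause_threshold_ms: int = 700  # hypothetical value, not a OneInbox default
    frame_ms: int = 20             # duration of one audio frame (assumed)
    _silence_ms: int = 0
    _heard_speech: bool = False

    def push(self, is_speech: bool) -> bool:
        """Feed one frame's voice-activity flag; return True when the turn ends."""
        if is_speech:
            # Any speech resets the silence counter, so short pauses
            # inside a sentence do not end the turn.
            self._heard_speech = True
            self._silence_ms = 0
            return False
        if not self._heard_speech:
            return False  # silence before any speech is not a turn
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.pause_threshold_ms:
            # Reset so the detector is ready for the next turn.
            self._heard_speech = False
            self._silence_ms = 0
            return True
        return False


detector = EndOfSpeechDetector(pause_threshold_ms=700, frame_ms=20)
# 200 ms speech, 100 ms pause, 100 ms speech, then 800 ms of silence.
frames = [True] * 10 + [False] * 5 + [True] * 5 + [False] * 40
turn_ends = [i for i, flag in enumerate(frames) if detector.push(flag)]
print(turn_ends)  # [54]
```

The 100 ms pause in the middle does not end the turn because it is shorter than the threshold. A fixed timer would instead cut the caller off after a set duration regardless of whether they were still speaking.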
Stage 2 — Language model (LLM)
The transcript arrives at the LLM along with the full conversation history, the system prompt, and any knowledge base context. The LLM generates a reply.
Streaming: the LLM streams tokens as it generates them. OneInbox begins passing text to TTS as soon as the first sentence boundary is detected; it does not wait for the full reply.
Tool calls: if the LLM decides to call a tool (transfer, SMS, booking, etc.), OneInbox executes the tool, returns the result to the LLM, and the LLM continues generating. The caller hears normal audio while this happens: the agent keeps speaking rather than going silent.
What you configure: provider, model, system prompt, temperature, tools, knowledge bases.
Platform providers (no credential needed): OpenAI, Shisa. Anthropic and Groq require a credential (BYOK).
Stage 3 — Text-to-speech (TTS)
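TTS receives the reply in sentence-sized pieces, because the hand-off from the LLM happens at sentence boundaries rather than at the end of the reply. A minimal sketch of that hand-off follows. The function name and the boundary rule (a period, exclamation mark, or question mark followed by whitespace) are assumptions for illustration; OneInbox's actual boundary detection is not documented here.

```python
import re
from typing import Iterable, Iterator

# Assumed rule: a sentence ends at ., ! or ? followed by whitespace.
# This is a simplification; it would also split after "Dr. " for example.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as its boundary appears."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            yield buffer[: match.start()]   # hand this sentence to TTS now
            buffer = buffer[match.end():]   # keep the unfinished remainder
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


llm_tokens = ["Sure", ", I", " can", " help", ".", " What", " day",
              " works", "?", " Let", " me", " check"]
for sentence in sentences_from_tokens(llm_tokens):
    print(sentence)
# Sure, I can help.
# What day works?
# Let me check
```

Because the generator yields while tokens are still arriving, the first sentence can be synthesized while the LLM is still producing the second.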
As the LLM streams reply text, TTS converts it to audio in real time. Audio is streamed back to the caller as it is generated, so the caller starts hearing the agent speak before the full reply is finished.
What you configure: provider, voice, speed, stability.
Platform providers (no credential needed): Cartesia, Deepgram, ElevenLabs, OpenAI, Minimax, Shisa.
Interruption handling
When a caller speaks while the agent is talking, OneInbox detects it and stops the agent’s audio immediately. The caller’s new speech goes straight into a new STT → LLM → TTS cycle. The interruption_sensitivity field (0.0–1.0) controls how easily the agent can be interrupted.
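One way to picture the cutoff behavior is the sketch below. It rests on two assumptions that the text above does not confirm: that a higher interruption_sensitivity makes the agent easier to interrupt, and that detection compares a per-frame speech probability against a threshold of 1.0 minus the sensitivity. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def should_interrupt(speech_probability: float,
                     interruption_sensitivity: float) -> bool:
    """Decide whether caller audio counts as an interruption.

    ASSUMPTION: higher sensitivity means a lower bar for interrupting.
    """
    if not 0.0 <= interruption_sensitivity <= 1.0:
        raise ValueError("interruption_sensitivity must be within 0.0-1.0")
    required_confidence = 1.0 - interruption_sensitivity
    return speech_probability >= required_confidence


def play_until_interrupted(agent_chunks: Sequence[str],
                           caller_probs: Sequence[float],
                           interruption_sensitivity: float
                           ) -> Tuple[List[str], bool]:
    """Play agent audio chunk by chunk, stopping as soon as the caller cuts in.

    Returns the chunks actually played and whether an interruption occurred.
    """
    played: List[str] = []
    for chunk, prob in zip(agent_chunks, caller_probs):
        if should_interrupt(prob, interruption_sensitivity):
            return played, True  # drop the rest of the agent's audio
        played.append(chunk)
    return played, False


chunks = ["a", "b", "c", "d"]
caller = [0.1, 0.2, 0.9, 0.1]  # illustrative speech probabilities per chunk
print(play_until_interrupted(chunks, caller, 0.5))   # (['a', 'b'], True)
print(play_until_interrupted(chunks, caller, 0.05))  # (['a', 'b', 'c', 'd'], False)
```

Under these assumptions, a low sensitivity lets the agent talk through background noise, while a high one yields the floor at the first sign of caller speech.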
Latency
Response latency is the sum of three segments:
| Segment | What it measures |
|---|---|
| STT latency | Time from caller finishing speaking to transcript arriving at the LLM |
| LLM TTFT (time to first token) | Time from transcript arriving to the LLM producing its first token |
| TTS TTFB (time to first byte) | Time from first token arriving at TTS to first audio byte produced |
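As a worked example of the sum, the sketch below adds the three segments and reports which one dominates. The class name and the millisecond figures are illustrative assumptions, not measurements of any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBreakdown:
    """The three latency segments from the table, in milliseconds."""

    stt_ms: float       # caller stops speaking -> transcript at the LLM
    llm_ttft_ms: float  # transcript arrives -> first LLM token
    tts_ttfb_ms: float  # first token at TTS -> first audio byte

    @property
    def response_ms(self) -> float:
        """Total response latency: the sum of the three segments."""
        return self.stt_ms + self.llm_ttft_ms + self.tts_ttfb_ms

    def slowest_segment(self) -> str:
        """Name the segment contributing the most latency."""
        segments = {
            "STT latency": self.stt_ms,
            "LLM TTFT": self.llm_ttft_ms,
            "TTS TTFB": self.tts_ttfb_ms,
        }
        return max(segments, key=segments.get)


# Made-up numbers for illustration only.
sample = LatencyBreakdown(stt_ms=200.0, llm_ttft_ms=350.0, tts_ttfb_ms=150.0)
print(sample.response_ms)        # 700.0
print(sample.slowest_segment())  # LLM TTFT
```

Breaking the total down this way shows which stage to reconfigure first when a call feels slow.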
Call types
| Type | Transport | How it connects |
|---|---|---|
| Browser call | WebRTC | Via the Web SDK or POST /v1/calls/web. No phone number needed — runs entirely over the internet |
| Outbound phone call | PSTN / SIP | Agent dials to_number from from_number via POST /v1/calls |
| Inbound phone call | PSTN / SIP | Caller dials your number — routed to the agent assigned to that number |
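An outbound phone call request might be assembled as in the sketch below. Only the POST /v1/calls path and the to_number and from_number fields come from the table above. The host name, the Bearer authorization header, and the agent_id field are assumptions; check the API reference for the real values. The request is built but not sent.

```python
import json
import urllib.request

API_BASE = "https://api.example.com"  # placeholder host, not the real one


def build_outbound_call_request(api_key: str, agent_id: str,
                                from_number: str,
                                to_number: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/calls request."""
    payload = {
        "agent_id": agent_id,        # hypothetical field name
        "from_number": from_number,  # number the agent dials from
        "to_number": to_number,      # number the agent dials
    }
    return urllib.request.Request(
        url=f"{API_BASE}/v1/calls",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_outbound_call_request(
    api_key="YOUR_API_KEY",
    agent_id="agent_123",
    from_number="+15550100",
    to_number="+15550199",
)
print(request.get_method(), request.full_url)
# POST https://api.example.com/v1/calls
```

Passing the request to urllib.request.urlopen would send it. That step is left out here because the host is a placeholder.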
What OneInbox manages for you
- Real-time audio streaming between STT, LLM, and TTS
- End-of-speech detection and turn management
- Interruption detection and agent audio cutoff
- Tool execution and result injection into the LLM context
- Knowledge base retrieval and context injection
- Silence timeout and end-call-phrase detection
- Transcripts, call metadata, and AI summaries
- Telephony routing (PSTN/SIP) for phone calls
- WebRTC session tokens for browser calls
Next steps
- Quickstart — make your first call in minutes
- Agents — full agent configuration reference
- LLMs — configure the AI brain
- Voices — browse available voices and configure TTS