How it works - OneInbox

The voice pipeline

Every call runs the same real-time loop: the caller speaks, the audio is transcribed, the LLM generates a reply, and TTS speaks it back. This happens continuously for the duration of the call. Each stage is configurable — you choose the provider and model. OneInbox handles streaming, connection management, and timing.
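As a rough mental model, one turn of this loop can be sketched as three chained stages. The stage functions below are stand-in stubs written for illustration (their names and behavior are assumptions, not OneInbox code): the "STT" stub decodes bytes to words, the "LLM" stub echoes the transcript, and the "TTS" stub encodes text back to bytes.

```python
from typing import Iterator


def stt(audio_chunks: Iterator[bytes]) -> str:
    """Stub STT: pretend each audio chunk decodes to exactly one word."""
    return " ".join(chunk.decode("utf-8") for chunk in audio_chunks)


def llm(transcript: str, history: list[str]) -> Iterator[str]:
    """Stub LLM: record the turn in the history, then stream a canned reply word by word."""
    history.append(f"caller: {transcript}")
    reply = f"You said: {transcript}"
    history.append(f"agent: {reply}")
    for word in reply.split():
        yield word


def tts(text_stream: Iterator[str]) -> Iterator[bytes]:
    """Stub TTS: emit one 'audio' chunk per piece of text as soon as it arrives."""
    for text in text_stream:
        yield text.encode("utf-8")


def run_turn(audio_chunks: Iterator[bytes], history: list[str]) -> list[bytes]:
    """One pass through the loop: caller audio in, agent audio out."""
    transcript = stt(audio_chunks)
    return list(tts(llm(transcript, history)))
```

Because `llm` and `tts` are generators, each piece of reply text flows into the next stage as soon as it is produced, which mirrors the streaming behavior described for the real pipeline.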

Stage 1 — Speech-to-text (STT)

When the caller speaks, audio streams in real time to the STT provider. OneInbox uses end-of-speech detection to know when the caller has finished a turn, based on a pause threshold rather than a fixed timer. The resulting transcript is passed immediately to the LLM.

What you configure: provider, model, language.

Platform providers (no credential needed): Deepgram (nova-3, flux-en, flux-multi), Whisper, AssemblyAI, Azure. Deepgram nova-3 is the default and is recommended for most use cases: it offers the lowest transcription latency, which directly reduces overall response time.
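The pause-threshold idea behind end-of-speech detection can be illustrated with a small sketch. This is not OneInbox's detector: the class name, the 700 ms threshold, and the 20 ms frame size are all assumed values, and the per-frame speech flag is taken as given (in practice it would come from a voice activity detector).

```python
from dataclasses import dataclass


@dataclass
class EndOfSpeechDetector:
    """Flags the end of a turn after a continuous pause, not after a fixed timer."""

    pause_threshold_ms: int = 700  # hypothetical value, not a OneInbox default
    frame_ms: int = 20             # assumed duration of each audio frame
    heard_speech: bool = False
    silence_ms: int = 0

    def push(self, frame_is_speech: bool) -> bool:
        """Feed one frame; return True once the caller has finished a turn."""
        if frame_is_speech:
            self.heard_speech = True
            self.silence_ms = 0    # any speech resets the pause counter
            return False
        if not self.heard_speech:
            return False           # silence before any speech is not a turn
        self.silence_ms += self.frame_ms
        if self.silence_ms >= self.pause_threshold_ms:
            self.heard_speech = False
            self.silence_ms = 0
            return True
        return False


def turn_end_frame(frames, detector):
    """Return the index of the frame at which the turn ends, or None."""
    for i, is_speech in enumerate(frames):
        if detector.push(is_speech):
            return i
    return None
```

With these assumed values, a 200 ms mid-sentence pause does not end the turn, while 700 ms of continuous silence after speech does. A fixed timer, by contrast, would cut off a slow speaker regardless of whether they had stopped talking.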

Stage 2 — Language model (LLM)

The transcript arrives at the LLM along with the full conversation history, the system prompt, and any knowledge base context. The LLM generates a reply.

Streaming: the LLM streams tokens as it generates them. OneInbox begins passing text to TTS as soon as the first sentence boundary is detected; it does not wait for the full reply.

Tool calls: if the LLM decides to call a tool (transfer, SMS, booking, etc.), OneInbox executes the tool, returns the result to the LLM, and the LLM continues generating. The caller hears normal audio while this happens: the agent keeps speaking rather than going silent.

What you configure: provider, model, system prompt, temperature, tools, knowledge bases.

Platform providers (no credential needed): OpenAI, Shisa. Anthropic and Groq require a credential (BYOK).
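The sentence-boundary handoff from the LLM to TTS can be sketched as a small generator. This is a simplified illustration, not OneInbox's implementation: the function name is hypothetical, and the boundary rule (sentence-ending punctuation followed by whitespace) is an assumption that will also split after abbreviations such as "Dr. ".

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ".", "!" or "?" followed by whitespace.
_BOUNDARY = re.compile(r"[.!?]\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as its boundary appears in the stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


tokens = ["Sure", ", I", " can", " help", ".", " What", " time", " works", "?", " Thanks"]
for sentence in sentences_from_tokens(tokens):
    print(sentence)
# Sure, I can help.
# What time works?
# Thanks
```

The first sentence is emitted as soon as the token after its period arrives, so a downstream TTS stage could start synthesizing while the model is still generating the rest of the reply.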

Stage 3 — Text-to-speech (TTS)

As the LLM streams reply text, TTS converts it to audio in real time. Audio is streamed back to the caller as it is generated, so the caller starts hearing the agent speak before the full reply is finished.

What you configure: provider, voice, speed, stability.

Platform providers (no credential needed): Cartesia, Deepgram, ElevenLabs, OpenAI, Minimax, Shisa.

Interruption handling

When a caller speaks while the agent is talking, OneInbox detects it and stops the agent’s audio immediately — the caller’s new speech goes straight into a new STT → LLM → TTS cycle. The interruption_sensitivity field (0.0–1.0) controls how easily the agent can be interrupted.
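One way to picture a 0.0–1.0 sensitivity setting is as a dial on how much caller speech is required before the agent is cut off. The sketch below is purely illustrative: the text above does not say how `interruption_sensitivity` is applied internally, so the linear mapping, the assumption that higher values mean easier interruption, the function names, and the 100–600 ms range are all hypothetical.

```python
def min_overlap_ms(interruption_sensitivity: float,
                   slowest_ms: float = 600.0,
                   fastest_ms: float = 100.0) -> float:
    """Map sensitivity to the caller speech (ms) needed to trigger an interruption.

    Assumed mapping: 0.0 requires `slowest_ms` of speech, 1.0 requires `fastest_ms`.
    """
    if not 0.0 <= interruption_sensitivity <= 1.0:
        raise ValueError("interruption_sensitivity must be within 0.0-1.0")
    return slowest_ms - interruption_sensitivity * (slowest_ms - fastest_ms)


def should_interrupt(caller_speech_ms: float, interruption_sensitivity: float) -> bool:
    """Decide whether the agent's audio should stop, under the assumed mapping."""
    return caller_speech_ms >= min_overlap_ms(interruption_sensitivity)
```

Under this assumed mapping, a short "mm-hm" of 200 ms would interrupt an agent at sensitivity 1.0 but not one at sensitivity 0.0.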

Latency

Response latency is the sum of three segments:

Segment	What it measures
STT latency	Time from caller finishing speaking to transcript arriving at the LLM
LLM TTFT (time to first token)	Time from transcript arriving to the LLM producing its first token
TTS TTFB (time to first byte)	Time from first token arriving at TTS to first audio byte produced

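Because each stage streams its output to the next instead of waiting for a complete result, only the time to the first output of each stage counts toward the delay. The caller typically starts hearing the agent’s reply within 800–1500 ms of finishing their own sentence, depending on provider choices.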
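The arithmetic is simple enough to capture in a few lines. The class below is a hypothetical helper for reasoning about a latency budget, and the numbers in the example are illustrative, not measurements of any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBreakdown:
    stt_ms: float        # caller stops speaking -> transcript reaches the LLM
    llm_ttft_ms: float   # transcript arrives -> first LLM token
    tts_ttfb_ms: float   # first token reaches TTS -> first audio byte

    @property
    def response_ms(self) -> float:
        """Response latency as the sum of the three segments."""
        return self.stt_ms + self.llm_ttft_ms + self.tts_ttfb_ms

    def slowest_segment(self) -> str:
        """Name the segment that contributes most to the total."""
        segments = {
            "STT latency": self.stt_ms,
            "LLM TTFT": self.llm_ttft_ms,
            "TTS TTFB": self.tts_ttfb_ms,
        }
        return max(segments, key=segments.get)


# Illustrative numbers only.
example = LatencyBreakdown(stt_ms=300, llm_ttft_ms=450, tts_ttfb_ms=200)
print(example.response_ms)        # 950
print(example.slowest_segment())  # LLM TTFT
```

Breaking the total down this way shows which provider choice is worth revisiting first when response time is too high.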

Call types

Type	Transport	How it connects
Browser call	WebRTC	Via the Web SDK or `POST /v1/calls/web`. No phone number needed — runs entirely over the internet
Outbound phone call	PSTN / SIP	Agent dials `to_number` from `from_number` via `POST /v1/calls`
Inbound phone call	PSTN / SIP	Caller dials your number — routed to the agent assigned to that number

The same agent config works for all three call types.
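As a sketch of what an outbound call request might look like, the snippet below builds (but does not send) a `POST /v1/calls` request using Python's standard library. Only the path and the `to_number` / `from_number` fields come from the table above; the base URL is a placeholder, and the `agent_id` field and Bearer-token header are assumptions to be checked against the API reference.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real OneInbox endpoint


def build_outbound_call_request(api_key: str, agent_id: str,
                                to_number: str, from_number: str) -> urllib.request.Request:
    """Build the HTTP request for an outbound phone call without sending it."""
    payload = {
        "agent_id": agent_id,        # assumed field name for selecting the agent
        "to_number": to_number,      # number the agent dials
        "from_number": from_number,  # number the call is placed from
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/calls",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out here because the real host and authentication details are not given in this section.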

What OneInbox manages for you

Real-time audio streaming between STT, LLM, and TTS
End-of-speech detection and turn management
Interruption detection and agent audio cutoff
Tool execution and result injection into the LLM context
Knowledge base retrieval and context injection
Silence timeout and end-call-phrase detection
Transcripts, call metadata, and AI summaries
Telephony routing (PSTN/SIP) for phone calls
WebRTC session tokens for browser calls

Next steps

Quickstart — make your first call in minutes
Agents — full agent configuration reference
LLMs — configure the AI brain
Voices — browse available voices and configure TTS
