> ## Documentation Index
> Fetch the complete documentation index at: https://docs.oneinbox.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# How it works

> What happens from the moment a call connects to the moment it ends — the full pipeline explained.

## The voice pipeline

Every call runs the same real-time loop: the caller speaks, the audio is transcribed, the LLM generates a reply, and TTS speaks it back. This happens continuously for the duration of the call.

```mermaid theme={null}
sequenceDiagram
  participant Caller
  participant STT as Speech-to-text (STT)
  participant LLM as Language model (LLM)
  participant TTS as Text-to-speech (TTS)
  participant Tool as Tool (optional)

  Caller->>STT: speaks
  STT->>LLM: transcript text
  LLM->>Tool: tool call (if triggered)
  Tool->>LLM: tool result
  LLM->>TTS: reply text (streamed)
  TTS->>Caller: spoken audio (streamed)
```

Each stage is configurable — you choose the provider and model. OneInbox handles streaming, connection management, and timing.

***

## Stage 1 — Speech-to-text (STT)

When the caller speaks, audio streams in real time to the STT provider. OneInbox uses **end-of-speech detection** to know when the caller has finished a turn — based on a pause threshold, not a fixed timer. The resulting transcript is passed immediately to the LLM.

**What you configure:** provider, model, language.

**Platform providers (no credential needed):** Deepgram (`nova-3`, `flux-en`, `flux-multi`), Whisper, AssemblyAI, Azure.

**Deepgram nova-3** is the default and recommended for most use cases — it offers the lowest transcription latency, which directly reduces overall response time.

***

## Stage 2 — Language model (LLM)

The transcript arrives at the LLM along with the full conversation history, the system prompt, and any knowledge base context. The LLM generates a reply.

**Streaming:** the LLM streams tokens as it generates them. OneInbox begins passing text to TTS as soon as the first sentence boundary is detected — it does not wait for the full reply.

**Tool calls:** if the LLM decides to call a tool (transfer, SMS, booking, etc.), OneInbox executes the tool, returns the result to the LLM, and the LLM continues generating. The caller hears normal audio while this happens — the agent keeps speaking rather than going silent.

**What you configure:** provider, model, system prompt, temperature, tools, knowledge bases.

**Platform providers (no credential needed):** OpenAI, Shisa. Anthropic and Groq require a credential (BYOK).

***

## Stage 3 — Text-to-speech (TTS)

As the LLM streams reply text, TTS converts it to audio in real time. Audio is streamed back to the caller as it is generated — the caller starts hearing the agent speak before the full reply is finished.

**What you configure:** provider, voice, speed, stability.

**Platform providers (no credential needed):** Cartesia, Deepgram, ElevenLabs, OpenAI, Minimax, Shisa.

***

## Interruption handling

When a caller speaks while the agent is talking, OneInbox detects it and stops the agent's audio immediately — the caller's new speech goes straight into a new STT → LLM → TTS cycle. The `interruption_sensitivity` field (0.0–1.0) controls how easily the agent can be interrupted.

***

## Latency

Response latency is the sum of three segments:

| Segment                            | What it measures                                                      |
| ---------------------------------- | --------------------------------------------------------------------- |
| **STT latency**                    | Time from caller finishing speaking to transcript arriving at the LLM |
| **LLM TTFT** (time to first token) | Time from transcript arriving to the LLM producing its first token    |
| **TTS TTFB** (time to first byte)  | Time from first token arriving at TTS to first audio byte produced    |

Because all three stages stream in parallel, the caller typically starts hearing the agent's reply within 800–1500 ms of finishing their own sentence, depending on provider choices.

***

## Call types

| Type                    | Transport  | How it connects                                                                                   |
| ----------------------- | ---------- | ------------------------------------------------------------------------------------------------- |
| **Browser call**        | WebRTC     | Via the Web SDK or `POST /v1/calls/web`. No phone number needed — runs entirely over the internet |
| **Outbound phone call** | PSTN / SIP | Agent dials `to_number` from `from_number` via `POST /v1/calls`                                   |
| **Inbound phone call**  | PSTN / SIP | Caller dials your number — routed to the agent assigned to that number                            |

The same agent config works for all three call types.

***

## What OneInbox manages for you

* Real-time audio streaming between STT, LLM, and TTS
* End-of-speech detection and turn management
* Interruption detection and agent audio cutoff
* Tool execution and result injection into the LLM context
* Knowledge base retrieval and context injection
* Silence timeout and end-call-phrase detection
* Transcripts, call metadata, and AI summaries
* Telephony routing (PSTN/SIP) for phone calls
* WebRTC session tokens for browser calls

***

## Next steps

* **[Quickstart](/guides/quickstart)** — make your first call in minutes
* **[Agents](/concepts/agents)** — full agent configuration reference
* **[LLMs](/guides/llms)** — configure the AI brain
* **[Voices](/guides/voices)** — browse available voices and configure TTS
