OpenAI – Voice Agents

  • Realtime speech-to-speech model: streams audio in and streams audio out as the answer (no separate TTS call).
  • Turn detection built in (VAD + optional semantic mode) so the agent knows when you’ve finished speaking and when it can start talking.
  • Tools/functions & MCP support to let the agent call your functions or connect to external data/services.
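A minimal sketch of what that single streaming session can look like from Node, assuming the beta WebSocket endpoint and event names from the Realtime reference (the model string, the `ws` package, and the bare `response.create` without prior mic audio are just one possible setup):

```ts
// Sketch: open a Realtime session and listen for streamed audio replies.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  // Normally you would first stream mic audio with input_audio_buffer.append;
  // here we just ask the model to speak so the event flow is visible.
  ws.send(JSON.stringify({ type: "response.create" }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.audio.delta") {
    // event.delta is base64-encoded audio to queue for playback
  }
  if (event.type === "response.done") {
    console.log("response finished");
  }
});
```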

Realtime vs. classic STT→LLM→TTS

Aspect                    | Realtime voice-to-voice            | Classic STT→LLM→TTS
Audio handling            | Native speech-in & speech-out      | 3 separate calls (STT, LLM, TTS)
Latency                   | Very low (streamed both ways)      | Higher (each stage adds delay)
Turn taking               | Built-in VAD/semantic endpointing  | You implement VAD/endpointing yourself
Interruptions (barge-in)  | Supported by design                | You must wire partial TTS + cancels
Integration surface       | Single session (WebRTC/WebSocket)  | Multiple APIs to orchestrate

Sources: OpenAI Realtime overview & best-practice write-ups.


How to “Create” one (UI + API)

A) From the OpenAI Realtime console (UI)

  1. Go to Audio → Realtime on the OpenAI Platform and click Create voice-to-voice.
  2. Pick a Model (e.g., latest realtime model), a Voice, and set Turn detection (Normal/Automatic, Semantic, or Disabled).
  3. Tune Threshold, Prefix padding, Silence duration, and the other options described below.
  4. Press Start to talk; the page streams mic audio to the model and plays the reply.
    (These controls map to the API options documented in the Realtime guides.)
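As a sketch of that mapping, the console controls correspond roughly to fields on a `session.update` event (continuing with the `ws` connection from the sketch above; field names follow the beta Realtime reference and may shift slightly in newer releases):

```ts
// Sketch: the API-side equivalents of the console's Voice and
// Turn detection controls.
ws.send(
  JSON.stringify({
    type: "session.update",
    session: {
      voice: "alloy",                         // "Voice" dropdown
      turn_detection: { type: "server_vad" }, // "Automatic (Normal)" mode
      // use { type: "semantic_vad" } for Semantic, or null for Disabled
    },
  })
);
```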

B) Configure session parameters (voice, turn detection, tools)


Voice
The TTS “persona” the Realtime model uses for audio replies. Choose from built-in voices; set per session.

Automatic turn detection — Normal
Server-side VAD (voice activity detection). It watches energy in the input signal to decide when you’ve stopped speaking, then the model replies automatically. Good default.

Semantic
Turn detection that looks at the meaning of your words (a semantic classifier) rather than just silence, to avoid premature cut-offs in short/fast speech. Useful in noisy/overlapping talk.

Disabled
No automatic turn detection; you manually control when the model should answer (e.g., via a button or programmatic event).
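A small sketch of the Semantic and Disabled modes as `session.update` payloads (same connection as above; with detection off, the commit + `response.create` pair is one way to wire a push-to-talk button, per the beta event names):

```ts
// Semantic: classifier-based endpointing instead of silence alone.
ws.send(JSON.stringify({
  type: "session.update",
  session: { turn_detection: { type: "semantic_vad" } },
}));

// Disabled: no automatic endpointing; you commit the buffered audio and
// request a reply yourself (e.g., when the user releases a talk button).
ws.send(JSON.stringify({
  type: "session.update",
  session: { turn_detection: null },
}));
ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
ws.send(JSON.stringify({ type: "response.create" }));
```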

Threshold (e.g., 0.50)
How sensitive VAD is to detecting speech vs. background. Lower = more sensitive (may trigger early); higher = less sensitive (may miss very soft starts). Tune together with silence duration.

Prefix padding (ms) (e.g., 300)
How many milliseconds before detected speech to include, so the very first phonemes aren’t clipped.

Silence duration (ms) (e.g., 500)
How long the input must be silent before the system decides “the user finished” and starts replying. Larger = more patient; smaller = snappier but risks talking over the user.
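The three knobs above sit inside the `turn_detection` object. A sketch using the example values from this page (they are illustrative defaults, not recommendations):

```ts
// Sketch: tuning server-side VAD sensitivity and patience.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: {
      type: "server_vad",
      threshold: 0.5,            // speech-vs-background sensitivity
      prefix_padding_ms: 300,    // audio kept before detected speech
      silence_duration_ms: 500,  // quiet time before "user finished"
    },
  },
}));
```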

Idle timeout
How long the session can sit without activity before it auto-ends or resets. Use to control cost and resource use. (Setting appears in Realtime session/config flows.)

Functions
Structured function calls the model can trigger (with JSON args) so your code runs business logic (e.g., “bookAppointment”). Same idea as Chat Completions tools, available in Realtime.
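A sketch of a function tool on a Realtime session, using the page’s `bookAppointment` example (the Realtime shape is flat — name/description/parameters at the top level — unlike Chat Completions’ nested form; the booking logic itself is hypothetical):

```ts
// Sketch: register a function tool, then handle the model's call.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [{
      type: "function",
      name: "bookAppointment",
      description: "Book an appointment for the caller",
      parameters: {
        type: "object",
        properties: {
          date: { type: "string" },
          service: { type: "string" },
        },
        required: ["date", "service"],
      },
    }],
    tool_choice: "auto",
  },
}));

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);
    // ...run your booking logic with args, then return the result and
    // ask for a spoken follow-up:
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify({ confirmed: true }),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
```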

MCP servers
External Model Context Protocol endpoints that expose tools/data (files, DBs, SaaS) to the agent in a standardized way. Think “USB-C for AI tools”.
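A sketch only: the field names below (server_label, server_url, authorization) are assumptions modeled on how MCP tools are configured elsewhere in the OpenAI API, so check the current Realtime reference for the exact shape.

```ts
// Sketch (assumed field names): attach a remote MCP server as a tool source.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [{
      type: "mcp",
      server_label: "calendar",               // hypothetical label
      server_url: "https://example.com/mcp",  // hypothetical endpoint
      authorization: process.env.MCP_TOKEN,   // if the server requires auth
    }],
  },
}));
```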

Model
The realtime model to run (e.g., the current “gpt-realtime”/“gpt-4o-realtime” variant). Newer releases may add features like SIP calling or improved speech quality.

User transcript model
Optional STT used to produce a text transcript of the user’s speech for logs/UX, separate from the speech-to-speech path. Often Whisper or similar; configurable in the Agents SDK.
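A sketch of enabling that side-channel transcript on the session ("whisper-1" is one commonly documented option; the speech-to-speech path is unaffected):

```ts
// Sketch: transcribe the user's audio for logging/UX.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    input_audio_transcription: { model: "whisper-1" },
  },
}));

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    console.log("user said:", event.transcript);
  }
});
```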

Noise reduction
Input denoising/suppression. Helps VAD and transcription in noisy rooms. Use either client-side DSP or session options if provided by your stack; pair with a sane threshold.

Model configuration
Per-session generation knobs (e.g., temperature, response format, stop sequences) and audio options. Set on session create or updated live via events.

Max tokens (e.g., 4096)
Upper bound for output tokens per response (not the full context window). Cap this to control verbosity/cost. Actual limits depend on the chosen realtime model.
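A sketch pulling the last few entries together as session-level knobs (the noise-reduction field is a newer addition, so treat its name and values as an assumption if your model/SDK version predates it):

```ts
// Sketch: per-session generation and audio options.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    temperature: 0.8,
    max_response_output_tokens: 4096,  // cap per response, not the context window
    input_audio_noise_reduction: { type: "near_field" },  // assumed field name
  },
}));
```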

Next