When upload_type is stream, Create Session returns a wss:// URL in the upload_url field instead of an HTTPS upload URL. Connect to it and stream audio in real time.
This is the alternative to Upload Audio (HTTP binary). A session uses one path or the other, decided by upload_type:
- chunked → HTTP binary upload
- stream → WebSocket (this page)
1. Get the WebSocket URL
upload_url is a full wss:// endpoint bound to this session. It contains an internal stream_id — connect to it exactly as returned.
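A minimal sketch of reading the URL out of a Create Session response without modifying it. Only the upload_url field name comes from this page; the function name, the sample response shape, and the example host are hypothetical.

```python
import json
from urllib.parse import urlsplit


def websocket_url_from_session(response_body: str) -> str:
    """Return upload_url unchanged after checking that it is a wss:// URL."""
    session = json.loads(response_body)
    url = session["upload_url"]
    if urlsplit(url).scheme != "wss":
        # A non-wss URL suggests the session was not created with upload_type: stream.
        raise ValueError(f"expected a wss:// upload_url, got: {url!r}")
    # Do not rebuild or normalize the URL; it carries the internal stream_id.
    return url
```

The URL is only inspected, never reassembled, so whatever the server embeds in it reaches the WebSocket client intact.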
2. Connect and stream
The server auto-detects the wire format from your first frame. Use whichever fits your client.
Option A: Raw binary PCM
Send raw 16-bit little-endian, mono, 16 kHz PCM audio as binary WebSocket frames.
Option B: JSON envelope (Twilio / Vobiz Media Streams compatible)
Send text frames using the Media Streams envelope. Audio payloads are base64-encoded PCM. Declare your sample rate in the start event’s mediaFormat.sampleRate if it differs from the 16 kHz default.
The server applies Voice Activity Detection (VAD) and accumulates speech-boundary-aware chunks (~10–25s), which are written to storage automatically.
3. Finish: close, then end the session
When the audio is done:
- Stop streaming: send a stop event (JSON mode) or simply close the WebSocket. The server flushes any buffered audio to storage.
- Call End Session: send POST /voice/v1/sessions/{session_id}/end. For protocol streaming sessions this is the single, canonical finalize trigger; closing the socket flushes audio but does not start processing on its own.
- Poll for results: poll Get Session at ~1-second intervals until the status is no longer 202.
Format summary
| Property | Value |
|---|---|
| Transport | WebSocket (wss://) |
| Audio encoding | 16-bit signed PCM, little-endian, mono |
| Default sample rate | 16000 Hz |
| Frame modes | Raw binary or JSON envelope (start / media / stop) |
| Chunking | Server-side VAD, ~10–25s speech-aware chunks |
| Finalize | stop / socket close → flush; End Session → process |
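The two frame modes in the table can be sketched as a pair of helpers that prepare outgoing frames. This is an illustration under assumptions, not the service's reference client: the function names and the 100 ms frame size are hypothetical, and the nesting of mediaFormat under a start object and of payload under a media object follows the Twilio Media Streams convention rather than anything this page specifies.

```python
import base64
import json
from typing import Iterator

BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def pcm_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 100) -> Iterator[bytes]:
    """Option A: split raw PCM into chunks, each sent as one binary frame."""
    # frame_ms is an assumed value; this page does not state a required frame size.
    frame_bytes = sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]


def envelope_messages(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 100) -> Iterator[str]:
    """Option B: wrap the same audio in start / media / stop text frames."""
    # Assumed Twilio-style layout: start.mediaFormat.sampleRate declares the rate.
    yield json.dumps({"event": "start", "start": {"mediaFormat": {"sampleRate": sample_rate}}})
    for frame in pcm_frames(pcm, sample_rate, frame_ms):
        payload = base64.b64encode(frame).decode("ascii")
        yield json.dumps({"event": "media", "media": {"payload": payload}})
    yield json.dumps({"event": "stop"})
```

Each item from pcm_frames would go out as one binary WebSocket frame, and each string from envelope_messages as one text frame, using whichever WebSocket client library you already have.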

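The finalize step (End Session, then polling until the status is no longer 202) can be sketched as below. The End Session path is the one given on this page; the helper names, the max_attempts cap, and the idea of passing in a get_status callable that performs your Get Session request are assumptions made to keep the sketch independent of any HTTP library.

```python
import time
from typing import Callable


def end_session_path(session_id: str) -> str:
    """Path to POST to once streaming has stopped."""
    return f"/voice/v1/sessions/{session_id}/end"


def poll_until_done(
    get_status: Callable[[], int],
    interval_s: float = 1.0,
    max_attempts: int = 600,  # assumed cap; this page gives no timeout
) -> int:
    """Call get_status about once per interval until it stops returning 202."""
    for _ in range(max_attempts):
        status = get_status()
        if status != 202:
            return status
        time.sleep(interval_s)
    raise TimeoutError("session still reported 202 after max_attempts polls")
```

Here get_status would wrap a Get Session request made with your own HTTP client and return the status it reports.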
