WEBSOCKET PROTOCOL

Premium sessions connect to the gatewayWsUrl returned by the session API. The UE runtime owns exactly one customer gateway connection and one signed renderer player for the session.

CONNECTION LIFECYCLE

Client                           Server
  |--- WS connect --------------->|
  |<-- { type: "connected",       |
  |      clientId }               |
  |                               |
  |--- { type: "authenticate",    |
  |      sessionToken }           |
  |<-- { type: "authenticated",   |
  |      sessionId, ... }         |
  |              OR               |
  |<-- { type: "auth_error",      |
  |      message }                |
  |                               |
  |<------- ping (30s) -----------|
  |-------- pong ---------------->|

The gateway sends a WebSocket ping every 30 seconds. If the client does not answer with a pong before the next ping is due, the gateway terminates the connection. A second active or pending connection for the same runtime is rejected with close code 4001.
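
The handshake in the diagram can be sketched as a small state machine that is independent of any particular WebSocket library. This is an illustration only: the `Handshake` class, its method names, and its state labels are hypothetical, and socket I/O is left to the caller. Ping/pong is not modelled here because most WebSocket client libraries answer pings automatically.

```python
import json


class AuthError(Exception):
    """Raised when the gateway answers with auth_error."""


class Handshake:
    """Tracks the connected -> authenticate -> authenticated exchange.

    Hypothetical helper: feed it each incoming text frame and send
    whatever it returns back to the gateway.
    """

    def __init__(self, session_token):
        self.session_token = session_token
        self.state = "connecting"
        self.client_id = None
        self.session_id = None

    def on_message(self, raw):
        """Return the JSON text to send next, or None if nothing is due."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if self.state == "connecting" and kind == "connected":
            # The gateway greeted us; reply with the session token.
            self.client_id = msg.get("clientId")
            self.state = "authenticating"
            return json.dumps(
                {"type": "authenticate", "sessionToken": self.session_token}
            )
        if self.state == "authenticating" and kind == "authenticated":
            self.session_id = msg.get("sessionId")
            self.state = "ready"
            return None
        if kind == "auth_error":
            self.state = "failed"
            raise AuthError(msg.get("message", "authentication failed"))
        return None  # messages outside the handshake are ignored here
```

A caller would pass every received frame to `on_message` until `state` becomes `"ready"`, and treat a close with code 4001 as a sign that another connection for the same runtime already exists.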

AUTHENTICATION

CLIENT → SERVER

{
"type": "authenticate",
"sessionToken": "st_abc123..."
}

SERVER → CLIENT (SUCCESS)

Field          Type                    Description
type           "authenticated"
sessionId      string                  Assigned session ID
channelName    string | null           Supabase realtime channel (prod)
expiresAt      string | null           Session expiration timestamp
signalingMode  "supabase" | "gateway"  Signaling transport
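
As an illustration of consuming the success response, the sketch below validates an already-decoded message and copies the documented fields into a typed record. The `AuthenticatedSession` and `parse_authenticated` names are hypothetical. `expiresAt` is kept as text because its timestamp format is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

SIGNALING_MODES = ("supabase", "gateway")


@dataclass(frozen=True)
class AuthenticatedSession:
    session_id: str
    channel_name: Optional[str]
    expires_at: Optional[str]  # left unparsed; format is an open question
    signaling_mode: str


def parse_authenticated(msg):
    """Validate an "authenticated" message that was decoded into a dict."""
    if msg.get("type") != "authenticated":
        raise ValueError("not an authenticated message")
    mode = msg.get("signalingMode")
    if mode not in SIGNALING_MODES:
        raise ValueError(f"unexpected signalingMode: {mode!r}")
    return AuthenticatedSession(
        session_id=msg["sessionId"],
        channel_name=msg.get("channelName"),
        expires_at=msg.get("expiresAt"),
        signaling_mode=mode,
    )
```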

CALL SESSION

call_start

{
"type": "call_start",
"fps": 30,
"sampleRate": 16000,
"providers": { "asr": "self", "llm": "self", "tts": "self" },
"systemPrompt": "You are a helpful assistant.",
"priorContext": [{ "role": "user", "content": "Hello" }],
"configOverrides": { "EOU_SILENCE_MS": 800 }
}

Field            Default  Description
fps              30       Animation frame rate
sampleRate       16000    Audio sample rate (Hz)
providers        {}       ASR/LLM/TTS provider overrides
systemPrompt     null     System prompt for LLM
priorContext     []       Prior conversation messages
configOverrides  {}       Runtime config overrides
disableA2F       false    Skip A2F, use amplitude blendshapes
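
Because every call_start field has a documented default, a client can send only the values it changes. The sketch below is one way to do that; `build_call_start` is a hypothetical helper, and it assumes the gateway applies the documented default for any field that is omitted.

```python
import json

# Defaults as documented for call_start.
CALL_START_DEFAULTS = {
    "fps": 30,
    "sampleRate": 16000,
    "providers": {},
    "systemPrompt": None,
    "priorContext": [],
    "configOverrides": {},
    "disableA2F": False,
}


def build_call_start(**overrides):
    """Return call_start JSON containing only the non-default fields."""
    unknown = set(overrides) - set(CALL_START_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown call_start fields: {sorted(unknown)}")
    msg = {"type": "call_start"}
    for key, value in overrides.items():
        # Skip values that match the default (assumes the gateway fills them in).
        if value != CALL_START_DEFAULTS[key]:
            msg[key] = value
    return json.dumps(msg)
```

For example, `build_call_start(systemPrompt="You are a helpful assistant.", configOverrides={"EOU_SILENCE_MS": 800})` produces a message with just `type`, `systemPrompt`, and `configOverrides`.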

call_text_input

// Send text input instead of voice
{ "type": "call_text_input", "text": "Hello, how are you?" }

Send text input to the AI during an active call session. Triggers the same LLM → TTS → A2F pipeline as voice input. Does not require an active microphone.

call_audio / call_image / call_stop

// Stream audio (base64 PCM)
{ "type": "call_audio", "audio": "<base64>" }
// Send camera frame (base64 JPEG)
{ "type": "call_image", "image": "<base64>" }
// Stop the call
{ "type": "call_stop" }
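
The input messages above can be built with the standard library alone. The sketch below makes two assumptions that this section does not confirm: that the "base64 PCM" payload is signed 16-bit little-endian mono, and that audio may be split into chunks of any convenient length (the 100 ms `chunk_ms` default is arbitrary). All function names are hypothetical.

```python
import base64
import json
import struct


def call_audio_message(samples):
    """Pack signed 16-bit samples as little-endian PCM inside call_audio."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    audio = base64.b64encode(pcm).decode("ascii")
    return json.dumps({"type": "call_audio", "audio": audio})


def iter_call_audio(samples, sample_rate=16000, chunk_ms=100):
    """Yield one call_audio message per chunk_ms of audio (chunk size assumed)."""
    step = sample_rate * chunk_ms // 1000
    for start in range(0, len(samples), step):
        yield call_audio_message(samples[start:start + step])


def call_text_input_message(text):
    """Build a call_text_input message for typed input."""
    return json.dumps({"type": "call_text_input", "text": text})


def call_stop_message():
    """Build the call_stop message."""
    return json.dumps({"type": "call_stop"})
```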

SERVER EVENTS

Events emitted by the server during a conversation turn, in approximate order.

ASR EVENTS

Event                Description
call_speech_started  User speech detected
call_interim         Interim ASR transcript (text, raw, timing)
call_utterance_end   End of utterance detected
call_transcript      Final accepted transcript

PIPELINE EVENTS

Event           Description
call_llm_start  LLM generation started
call_llm_ttft   LLM first token generated
call_llm_ttfs   LLM first complete sentence
call_llm_end    LLM generation complete
call_tts_start  TTS synthesis started
call_tts_ttfu   TTS first usable audio chunk
call_tts_end    TTS synthesis complete
call_a2f_start  Audio2Face processing started
call_a2f_ttfu   A2F first usable blendshape frame
call_a2f_end    A2F processing complete

RESPONSE EVENTS

Event                   Description
call_response           Full LLM response text
call_chunk              Audio + blendshape data for playback
call_response_complete  All chunks sent
call_buffer_end         Client audio playback estimated complete
turn_metrics            End-of-turn latency summary

CALL CHUNK SCHEMA

{
"type": "call_chunk",
"audio": "<base64 PCM16 audio at 24kHz>",
"blendshapes": [
[0.0, 0.1, ...], // frame 1: 52 ARKit blendshape weights
[0.0, 0.2, ...], // frame 2
...
],
"timestamp": 1234567890
}

Frames are sent in batches of 5 at the configured FPS. Each frame contains 52 ARKit blendshape weights (0-1 range) for facial animation.
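
As a worked example of the numbers above: a batch of 5 frames at 30 FPS covers 5/30 ≈ 0.167 s of animation, which corresponds to 4000 samples (8000 bytes) of 24 kHz PCM16 if the audio is mono and spans the same interval. Both of those conditions are assumptions, not statements from this schema. The hypothetical `decode_call_chunk` helper below decodes a chunk and reports both durations so a client can compare them.

```python
import base64

CHUNK_SAMPLE_RATE = 24000  # from the schema: PCM16 audio at 24 kHz
BLENDSHAPE_COUNT = 52      # ARKit weights per frame
BYTES_PER_SAMPLE = 2       # PCM16; mono is assumed


def decode_call_chunk(msg, fps=30):
    """Return (pcm_bytes, frames, audio_seconds, animation_seconds)."""
    if msg.get("type") != "call_chunk":
        raise ValueError("not a call_chunk message")
    pcm = base64.b64decode(msg["audio"])
    frames = msg["blendshapes"]
    for index, frame in enumerate(frames):
        if len(frame) != BLENDSHAPE_COUNT:
            raise ValueError(f"frame {index} has {len(frame)} weights")
    audio_seconds = len(pcm) / BYTES_PER_SAMPLE / CHUNK_SAMPLE_RATE
    animation_seconds = len(frames) / fps
    return pcm, frames, audio_seconds, animation_seconds
```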

TYPICAL TURN ORDER

call_speech_started
call_interim                 (repeated as ASR refines)
spec_start                   # speculative pipeline begins
call_llm_start → call_llm_ttft → call_llm_ttfs
call_tts_start → call_tts_ttfu
call_a2f_start → call_a2f_ttfu
call_interim                 (newer transcript)
spec_cancel                  # old spec cancelled
spec_start                   # new spec with updated transcript
call_fire_eou                # VAD silence threshold met
call_utterance_end
call_transcript              # final transcript
call_speculative_accepted
call_response                # full AI response text
call_chunk                   (repeated) # audio + blendshapes streamed
call_response_complete
call_buffer_end
turn_metrics
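
Since turn_metrics closes the sequence above, a client that logs events can use it as a turn boundary. The sketch below groups a flat event list into turns and counts how often the speculative pipeline was restarted within a turn. Both helper names are hypothetical. The code inspects only the `type` field, because the payloads of these events are not described here.

```python
from collections import Counter


def split_turns(events):
    """Group event dicts into turns, closing each turn at turn_metrics.

    Returns (finished_turns, pending), where pending holds the events
    of a turn that has not yet reached turn_metrics.
    """
    turns, current = [], []
    for event in events:
        current.append(event)
        if event.get("type") == "turn_metrics":
            turns.append(current)
            current = []
    return turns, current


def speculation_restarts(turn):
    """Count spec_cancel events, i.e. speculative runs that were discarded."""
    counts = Counter(event.get("type") for event in turn)
    return counts["spec_cancel"]
```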

SERVICE BOUNDARIES

ASR, VAD, A2F, LLM, TTS, and renderer control endpoints are not customer WebSocket APIs. The customer-facing gateway protocol is the call lifecycle above: authenticate, start or stop a call, stream audio/text/image input, and receive transcript, response, chunk, lifecycle, and metrics events.

The renderer player is a separate signed URL returned as rendererPlayerUrl. Pixel Streaming telemetry and WebRTC media state stay inside that player surface, not this call WebSocket.

ERROR TYPES

Type            Description
auth_error      Authentication failure
call_error      Call session error
call_asr_error  Call ASR processing error
error           Generic / malformed message error

All error messages include a message string and timestamp.
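
A client can route all four error types through one check, since each carries `message` and `timestamp`. The `as_error` helper below is a hypothetical sketch that works on an already-decoded message. It returns the timestamp unchanged because its format is not specified here.

```python
ERROR_TYPES = {
    "auth_error": "Authentication failure",
    "call_error": "Call session error",
    "call_asr_error": "Call ASR processing error",
    "error": "Generic / malformed message error",
}


def as_error(msg):
    """Return (type, message, timestamp) for error messages, else None."""
    kind = msg.get("type")
    if kind not in ERROR_TYPES:
        return None
    return kind, msg.get("message"), msg.get("timestamp")
```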