WEBSOCKET PROTOCOL
Premium sessions connect to the gatewayWsUrl returned by the session API. The UE runtime owns exactly one customer gateway connection and one signed renderer player for the session.
CONNECTION LIFECYCLE
Client                              Server
  |--- WS connect ------------------>|
  |<-- { type: "connected",          |
  |      clientId }                  |
  |                                  |
  |--- { type: "authenticate",       |
  |      sessionToken }              |
  |<-- { type: "authenticated",      |
  |      sessionId, ... }            |
  |            OR                    |
  |<-- { type: "auth_error",         |
  |      message }                   |
  |                                  |
  |<------- ping (30s) --------------|
  |-------- pong ------------------->|
The gateway sends a WebSocket ping every 30 seconds. If the client has not responded with a pong before the next ping cycle, the gateway terminates the connection. A second active or pending connection for the same runtime is rejected with close code 4001.
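The handshake in the diagram above can be sketched as a small state machine that is independent of any particular WebSocket library. This is an illustrative sketch, not the gateway's reference client: the class name `Handshake`, its state names, and the `on_close` helper are assumptions made for the example. Only the message types, field names, and close code 4001 come from the text. Ping/pong replies are normally handled by the WebSocket library itself and are not modelled here.

```python
import json
from typing import Optional

# Close code the gateway uses to reject a duplicate connection (from the text above).
DUPLICATE_CONNECTION_CLOSE_CODE = 4001


class Handshake:
    """Hypothetical helper that tracks connected -> authenticate -> authenticated."""

    def __init__(self, session_token: str) -> None:
        self.session_token = session_token
        self.state = "connecting"  # state names are illustrative, not part of the protocol
        self.client_id: Optional[str] = None
        self.session_id: Optional[str] = None
        self.error: Optional[str] = None

    def on_message(self, raw: str) -> Optional[str]:
        """Consume one server text frame; return the JSON reply to send, if any."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if self.state == "connecting" and kind == "connected":
            self.client_id = msg.get("clientId")
            self.state = "authenticating"
            return json.dumps(
                {"type": "authenticate", "sessionToken": self.session_token}
            )
        if self.state == "authenticating" and kind == "authenticated":
            self.session_id = msg.get("sessionId")
            self.state = "ready"
        elif self.state == "authenticating" and kind == "auth_error":
            self.error = msg.get("message")
            self.state = "failed"
        return None

    def on_close(self, code: int) -> None:
        """Record why the socket closed; 4001 means a duplicate connection."""
        if code == DUPLICATE_CONNECTION_CLOSE_CODE:
            self.state = "rejected_duplicate"
        else:
            self.state = "closed"
```

A caller would feed every incoming text frame to `on_message`, send any returned string back over the socket, and start the call only once `state` is `"ready"`.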
AUTHENTICATION
CLIENT → SERVER
{
  "type": "authenticate",
  "sessionToken": "st_abc123..."
}
SERVER → CLIENT (SUCCESS)
Field          Type                     Description
type           "authenticated"
sessionId      string                   Assigned session ID
channelName    string | null            Supabase realtime channel (prod)
expiresAt      string | null            Session expiration timestamp
signalingMode  "supabase" | "gateway"   Signaling transport

CALL SESSION
call_start
{
  "type": "call_start",
  "fps": 30,
  "sampleRate": 16000,
  "providers": { "asr": "self", "llm": "self", "tts": "self" },
  "systemPrompt": "You are a helpful assistant.",
  "priorContext": [{ "role": "user", "content": "Hello" }],
  "configOverrides": { "EOU_SILENCE_MS": 800 }
}
Parameter        Default  Description
fps              30       Animation frame rate
sampleRate       16000    Audio sample rate (Hz)
providers        {}       ASR/LLM/TTS provider overrides
systemPrompt     null     System prompt for LLM
priorContext     []       Prior conversation messages
configOverrides  {}       Runtime config overrides
disableA2F       false    Skip A2F, use amplitude blendshapes

call_text_input
// Send text input instead of voice
{ "type": "call_text_input", "text": "Hello, how are you?" }
Send text input to the AI during an active call session. Triggers the same LLM → TTS → A2F pipeline as voice input. Does not require an active microphone.
call_audio / call_image / call_stop
// Stream audio (base64 PCM)
{ "type": "call_audio", "audio": "<base64>" }

// Send camera frame (base64 JPEG)
{ "type": "call_image", "image": "<base64>" }

// Stop the call
{ "type": "call_stop" }
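The client messages in this section can be assembled with a few small helpers. The sketch below is a minimal illustration: the function names are hypothetical, and it assumes that sending every `call_start` field with its documented default is equivalent to omitting it, which the source does not state explicitly. The field names and default values are taken from the parameter table for `call_start`.

```python
import base64
import json
from typing import Any, Dict, List, Optional


def build_call_start(
    fps: int = 30,
    sample_rate: int = 16000,
    providers: Optional[Dict[str, str]] = None,
    system_prompt: Optional[str] = None,
    prior_context: Optional[List[Dict[str, str]]] = None,
    config_overrides: Optional[Dict[str, Any]] = None,
    disable_a2f: bool = False,
) -> str:
    """Serialize a call_start message using the documented defaults."""
    return json.dumps(
        {
            "type": "call_start",
            "fps": fps,
            "sampleRate": sample_rate,
            "providers": providers or {},
            "systemPrompt": system_prompt,
            "priorContext": prior_context or [],
            "configOverrides": config_overrides or {},
            "disableA2F": disable_a2f,
        }
    )


def build_call_audio(pcm: bytes) -> str:
    """Wrap raw PCM bytes as a base64 call_audio message."""
    encoded = base64.b64encode(pcm).decode("ascii")
    return json.dumps({"type": "call_audio", "audio": encoded})


def build_call_image(jpeg: bytes) -> str:
    """Wrap JPEG bytes as a base64 call_image message."""
    encoded = base64.b64encode(jpeg).decode("ascii")
    return json.dumps({"type": "call_image", "image": encoded})


def build_call_text_input(text: str) -> str:
    """Text input; triggers the same pipeline as voice input."""
    return json.dumps({"type": "call_text_input", "text": text})


def build_call_stop() -> str:
    """Stop the active call."""
    return json.dumps({"type": "call_stop"})
```

Each helper returns a JSON string ready to send as a WebSocket text frame, for example `build_call_start(system_prompt="You are a helpful assistant.")` followed by repeated `build_call_audio(...)` calls and a final `build_call_stop()`.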
SERVER EVENTS
Events emitted by the server during a conversation turn, in approximate order.
ASR EVENTS
call_speech_started   User speech detected
call_interim          Interim ASR transcript (text, raw, timing)
call_utterance_end    End of utterance detected
call_transcript       Final accepted transcript

PIPELINE EVENTS
call_llm_start   LLM generation started
call_llm_ttft    LLM first token generated
call_llm_ttfs    LLM first complete sentence
call_llm_end     LLM generation complete
call_tts_start   TTS synthesis started
call_tts_ttfu    TTS first usable audio chunk
call_tts_end     TTS synthesis complete
call_a2f_start   Audio2Face processing started
call_a2f_ttfu    A2F first usable blendshape frame
call_a2f_end     A2F processing complete

RESPONSE EVENTS
call_response            Full LLM response text
call_chunk               Audio + blendshape data for playback
call_response_complete   All chunks sent
call_buffer_end          Client audio playback estimated complete
turn_metrics             End-of-turn latency summary

CALL CHUNK SCHEMA
{
  "type": "call_chunk",
  "audio": "<base64 PCM16 audio at 24kHz>",
  "blendshapes": [
    [0.0, 0.1, ...],  // frame 1: 52 ARKit blendshape weights
    [0.0, 0.2, ...],  // frame 2
    ...
  ],
  "timestamp": 1234567890
}
Frames are sent in batches of 5 at the configured FPS. Each frame contains 52 ARKit blendshape weights (0-1 range) for facial animation.
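A client has to decode each `call_chunk` before it can play the audio and drive the face rig. The sketch below shows one way to do that with the standard library. It assumes the PCM16 payload is little-endian and mono, which the schema does not specify, and the function names are hypothetical. The 24 kHz rate and the 52 weights per frame come from the schema above.

```python
import base64
import json
import struct
from typing import List, Tuple

CHUNK_SAMPLE_RATE = 24000  # Hz, from the call_chunk schema
BLENDSHAPE_COUNT = 52      # ARKit weights per frame, from the schema


def decode_call_chunk(raw: str) -> Tuple[List[int], List[List[float]]]:
    """Return (pcm_samples, blendshape_frames) for one call_chunk message."""
    msg = json.loads(raw)
    if msg.get("type") != "call_chunk":
        raise ValueError("not a call_chunk message")

    pcm = base64.b64decode(msg["audio"])
    if len(pcm) % 2 != 0:
        raise ValueError("PCM16 payload has an odd number of bytes")
    # Assumption: signed 16-bit little-endian mono samples.
    samples = list(struct.unpack("<%dh" % (len(pcm) // 2), pcm))

    frames = msg["blendshapes"]
    for index, frame in enumerate(frames):
        if len(frame) != BLENDSHAPE_COUNT:
            raise ValueError(
                "frame %d has %d weights, expected %d"
                % (index, len(frame), BLENDSHAPE_COUNT)
            )
    return samples, frames


def audio_duration_seconds(samples: List[int]) -> float:
    """Playback length of the decoded audio at the chunk sample rate."""
    return len(samples) / CHUNK_SAMPLE_RATE
```

As a rough check on sizes: a batch of 5 frames at 30 FPS covers 5/30 of a second, which would correspond to 4000 samples at 24 kHz if a chunk's audio spans exactly its frames. The source does not guarantee that alignment, so treat it as an expectation to verify rather than an invariant.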
TYPICAL TURN ORDER
call_speech_started
call_interim (repeated as ASR refines)
spec_start                      # speculative pipeline begins
call_llm_start → call_llm_ttft → call_llm_ttfs
call_tts_start → call_tts_ttfu
call_a2f_start → call_a2f_ttfu
call_interim (newer transcript)
spec_cancel                     # old spec cancelled
spec_start                      # new spec with updated transcript
call_fire_eou                   # VAD silence threshold met
call_utterance_end
call_transcript                 # final transcript
call_speculative_accepted
call_response                   # full AI response text
call_chunk (repeated)           # audio + blendshapes streamed
call_response_complete
call_buffer_end
turn_metrics
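Because the order above is only approximate, a client that wants its own view of a turn may prefer to record when each event type first arrives instead of hard-coding the sequence. The following sketch is one hypothetical way to do that. `TurnTracker` is not part of the protocol, timestamps are client-side receipt times supplied by the caller, and treating `turn_metrics` as the end of a turn is an assumption based on its position in the list.

```python
from typing import Dict, List, Optional


class TurnTracker:
    """Hypothetical client-side record of one conversation turn."""

    def __init__(self) -> None:
        self.order: List[str] = []             # event types in first-arrival order
        self.first_seen: Dict[str, float] = {}  # event type -> first receipt time (s)
        self.chunk_count = 0

    def record(self, event_type: str, received_at: float) -> bool:
        """Record one event; return True once turn_metrics closes the turn."""
        if event_type == "call_chunk":
            self.chunk_count += 1
        # Repeated events (call_interim, spec_start, call_chunk) keep their
        # first timestamp only.
        if event_type not in self.first_seen:
            self.first_seen[event_type] = received_at
            self.order.append(event_type)
        return event_type == "turn_metrics"

    def latency(self, start: str, end: str) -> Optional[float]:
        """Seconds between the first arrivals of two event types, if both seen."""
        if start not in self.first_seen or end not in self.first_seen:
            return None
        return self.first_seen[end] - self.first_seen[start]
```

For instance, `tracker.latency("call_utterance_end", "call_chunk")` gives a client-side estimate of how long the user waited between finishing an utterance and the first playable chunk, which can be compared against the server's `turn_metrics` summary.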
SERVICE BOUNDARIES
ASR, VAD, A2F, LLM, TTS, and renderer control endpoints are not customer WebSocket APIs. The customer-facing gateway protocol is the call lifecycle above: authenticate, start or stop a call, stream audio/text/image input, and receive transcript, response, chunk, lifecycle, and metrics events.
The renderer player is a separate signed URL returned as rendererPlayerUrl. Pixel Streaming telemetry and WebRTC media state stay inside that player surface, not this call WebSocket.
ERROR TYPES
auth_error       Authentication failure
call_error       Call session error
call_asr_error   Call ASR processing error
error            Generic / malformed message error

All error messages include a message string and timestamp.
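Since all four error types share the same `message` and `timestamp` fields, a client can route them through a single handler. The sketch below is illustrative: the `parse_error` name and the returned tuple shape are assumptions, and the timestamp is passed through untouched because the source does not state its format.

```python
import json
from typing import Any, Optional, Tuple

# Error types and descriptions as listed in the table above.
ERROR_TYPES = {
    "auth_error": "Authentication failure",
    "call_error": "Call session error",
    "call_asr_error": "Call ASR processing error",
    "error": "Generic / malformed message error",
}


def parse_error(raw: str) -> Optional[Tuple[str, str, Any]]:
    """Return (type, message, timestamp) for error messages, else None."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind not in ERROR_TYPES:
        return None
    return kind, msg.get("message", ""), msg.get("timestamp")
```

A message loop could call `parse_error` first and fall through to normal event handling whenever it returns `None`.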