WEBSOCKET PROTOCOL

Premium sessions connect to the gatewayWsUrl returned by the session API. The UE runtime owns exactly one customer gateway connection and one signed renderer player for the session.

CONNECTION LIFECYCLE

Client                          Server
  |--- WS connect -------------->|
  |<-- { type: "connected",      |
  |      clientId }              |
  |                               |
  |--- { type: "authenticate",   |
  |      sessionToken }          |
  |<-- { type: "authenticated",  |
  |      sessionId, ... }        |
  |         OR                    |
  |<-- { type: "auth_error",     |
  |      message }               |
  |                               |
  |<------- ping (30s) ----------|
  |-------- pong --------------->|

The gateway sends a WebSocket ping every 30 seconds. If the client has not replied with a pong by the time the next ping is due, the gateway terminates the connection. A second active or pending connection for the same runtime is rejected with close code 4001.
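
The handshake above can be sketched as a small pure function that decides what to do with each server message. This is a minimal sketch, not the gateway's client library: the function name handshake_step and the state labels are assumptions made for illustration, and the transport (any WebSocket client) is left out.

```python
import json
from typing import Optional, Tuple


def handshake_step(raw_message: str, session_token: str) -> Tuple[str, Optional[str]]:
    """Return (state, reply) for one server message during the handshake.

    state is a hypothetical label: "awaiting_auth", "ready", "failed", or "ignored".
    reply is the JSON text to send back, or None when nothing should be sent.
    """
    msg = json.loads(raw_message)
    kind = msg.get("type")
    if kind == "connected":
        # The server has greeted us; answer with the session token.
        reply = json.dumps({"type": "authenticate", "sessionToken": session_token})
        return "awaiting_auth", reply
    if kind == "authenticated":
        return "ready", None
    if kind == "auth_error":
        return "failed", None
    # Anything else is not part of the handshake.
    return "ignored", None
```

The caller sends the returned reply, if any, over the socket and stops treating messages as handshake traffic once the state is "ready" or "failed". Most WebSocket libraries answer ping frames themselves, so pong handling is not shown.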

AUTHENTICATION

CLIENT → SERVER

{
  "type": "authenticate",
  "sessionToken": "st_abc123..."
}

SERVER → CLIENT (SUCCESS)

Field          Type                    Description
type           "authenticated"
sessionId      string                  Assigned session ID
channelName    string | null           Supabase realtime channel (prod)
expiresAt      string | null           Session expiration timestamp
signalingMode  "supabase" | "gateway"  Signaling transport
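
A client may want to validate the success response before using it. The following is a minimal sketch under the field list above; the class name AuthenticatedSession and the function parse_authenticated are hypothetical, and any extra fields the gateway sends are ignored.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class AuthenticatedSession:
    session_id: str
    channel_name: Optional[str]
    expires_at: Optional[str]
    signaling_mode: str


def parse_authenticated(msg: Dict[str, Any]) -> AuthenticatedSession:
    """Validate an "authenticated" message and copy out the documented fields."""
    if msg.get("type") != "authenticated":
        raise ValueError("not an authenticated message: %r" % msg.get("type"))
    mode = msg.get("signalingMode")
    if mode not in ("supabase", "gateway"):
        raise ValueError("unexpected signalingMode: %r" % mode)
    return AuthenticatedSession(
        session_id=msg["sessionId"],
        channel_name=msg.get("channelName"),  # may be null
        expires_at=msg.get("expiresAt"),      # may be null
        signaling_mode=mode,
    )
```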

CALL SESSION

call_start

{
  "type": "call_start",
  "fps": 30,
  "sampleRate": 16000,
  "providers": { "asr": "self", "llm": "self", "tts": "self" },
  "systemPrompt": "You are a helpful assistant.",
  "priorContext": [{ "role": "user", "content": "Hello" }],
  "configOverrides": { "EOU_SILENCE_MS": 800 }
}

Field            Default  Description
fps              30       Animation frame rate
sampleRate       16000    Audio sample rate (Hz)
providers        {}       ASR/LLM/TTS provider overrides
systemPrompt     null     System prompt for LLM
priorContext     []       Prior conversation messages
configOverrides  {}       Runtime config overrides
disableA2F       false    Skip A2F, use amplitude blendshapes
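
The defaults above can be mirrored in a small message builder. This is a sketch only: build_call_start is a hypothetical helper, and it assumes the gateway treats an explicitly sent default the same as an omitted field.

```python
from typing import Any, Dict, List, Optional


def build_call_start(
    fps: int = 30,
    sample_rate: int = 16000,
    providers: Optional[Dict[str, str]] = None,
    system_prompt: Optional[str] = None,
    prior_context: Optional[List[Dict[str, str]]] = None,
    config_overrides: Optional[Dict[str, Any]] = None,
    disable_a2f: bool = False,
) -> Dict[str, Any]:
    """Build a call_start message, filling in the documented defaults."""
    return {
        "type": "call_start",
        "fps": fps,
        "sampleRate": sample_rate,
        "providers": dict(providers or {}),
        "systemPrompt": system_prompt,
        "priorContext": list(prior_context or []),
        "configOverrides": dict(config_overrides or {}),
        "disableA2F": disable_a2f,
    }
```

The result is a plain dict, so it can be passed to json.dumps before being sent on the socket.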

call_text_input

// Send text input instead of voice
{ "type": "call_text_input", "text": "Hello, how are you?" }

Send text input to the AI during an active call session. Triggers the same LLM → TTS → A2F pipeline as voice input. Does not require an active microphone.

call_audio / call_image / call_stop

// Stream audio (base64 PCM)
{ "type": "call_audio", "audio": "<base64>" }

// Send camera frame (base64 JPEG)
{ "type": "call_image", "image": "<base64>" }

// Stop the call
{ "type": "call_stop" }
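
Streaming microphone audio amounts to slicing raw PCM and wrapping each slice in a call_audio message. The sketch below assumes 16-bit mono PCM and a 100 ms slice length; neither is stated above, and call_audio_messages is a hypothetical helper name.

```python
import base64
import json
from typing import Iterator


def call_audio_messages(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100) -> Iterator[str]:
    """Yield call_audio JSON strings for successive slices of raw PCM."""
    # Assumption: 2 bytes per sample (16-bit), one channel.
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000
    for start in range(0, len(pcm), bytes_per_chunk):
        piece = pcm[start:start + bytes_per_chunk]
        yield json.dumps({
            "type": "call_audio",
            "audio": base64.b64encode(piece).decode("ascii"),
        })
```

The sample_rate argument should match the sampleRate sent in call_start.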

SERVER EVENTS

Events emitted by the server during a conversation turn, in approximate order.

ASR EVENTS

Event                Description
call_speech_started  User speech detected
call_interim         Interim ASR transcript (text, raw, timing)
call_utterance_end   End of utterance detected
call_transcript      Final accepted transcript

PIPELINE EVENTS

Event           Description
call_llm_start  LLM generation started
call_llm_ttft   LLM first token generated
call_llm_ttfs   LLM first complete sentence
call_llm_end    LLM generation complete
call_tts_start  TTS synthesis started
call_tts_ttfu   TTS first usable audio chunk
call_tts_end    TTS synthesis complete
call_a2f_start  Audio2Face processing started
call_a2f_ttfu   A2F first usable blendshape frame
call_a2f_end    A2F processing complete

RESPONSE EVENTS

Event                   Description
call_response           Full LLM response text
call_chunk              Audio + blendshape data for playback
call_response_complete  All chunks sent
call_buffer_end         Client audio playback estimated complete
turn_metrics            End-of-turn latency summary

CALL CHUNK SCHEMA

{
  "type": "call_chunk",
  "audio": "<base64 PCM16 audio at 24kHz>",
  "blendshapes": [
    [0.0, 0.1, ...],  // frame 1: 52 ARKit blendshape weights
    [0.0, 0.2, ...],  // frame 2
    ...
  ],
  "timestamp": 1234567890
}

Blendshape frames are sent in batches of 5 at the configured FPS. Each frame contains 52 ARKit blendshape weights, each in the range 0 to 1, for facial animation.
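
Decoding a chunk on the client side might look like the sketch below. It assumes the audio is mono, which the schema does not state, and decode_call_chunk is a hypothetical helper name; the 24 kHz rate and the 52-weight frame size come from the schema above.

```python
import base64
from typing import Any, Dict, List, Tuple

CHUNK_SAMPLE_RATE = 24000  # from the schema: PCM16 audio at 24kHz
BLENDSHAPE_COUNT = 52      # ARKit blendshape weights per frame


def decode_call_chunk(msg: Dict[str, Any]) -> Tuple[bytes, List[List[float]], float]:
    """Return (pcm_bytes, frames, audio_seconds) for one call_chunk message."""
    pcm = base64.b64decode(msg["audio"])
    frames = msg["blendshapes"]
    for index, frame in enumerate(frames):
        if len(frame) != BLENDSHAPE_COUNT:
            raise ValueError("frame %d has %d weights" % (index, len(frame)))
    # PCM16 is 2 bytes per sample; a single channel is assumed.
    audio_seconds = len(pcm) / 2 / CHUNK_SAMPLE_RATE
    return pcm, frames, audio_seconds
```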

TYPICAL TURN ORDER

call_speech_started
call_interim (repeated as ASR refines)
  spec_start                      # speculative pipeline begins
    call_llm_start → call_llm_ttft → call_llm_ttfs
    call_tts_start → call_tts_ttfu
    call_a2f_start → call_a2f_ttfu
  call_interim (newer transcript)
  spec_cancel                     # old spec cancelled
  spec_start                      # new spec with updated transcript
call_fire_eou                     # VAD silence threshold met
call_utterance_end
call_transcript                   # final transcript
call_speculative_accepted
call_response                     # full AI response text
call_chunk (repeated)             # audio + blendshapes streamed
call_response_complete
call_buffer_end
turn_metrics
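
One way a client could use this ordering is to drive a coarse UI state from a handful of the events and ignore the rest. The mapping below is an illustration only: the state names and the choice of trigger events are assumptions, not part of the protocol.

```python
from typing import Iterable, List

# Hypothetical UI states keyed by the event that enters them.
TRANSITIONS = {
    "call_speech_started": "listening",
    "call_transcript": "thinking",
    "call_chunk": "speaking",
    "call_buffer_end": "idle",
}


def ui_states(event_types: Iterable[str], initial: str = "idle") -> List[str]:
    """Return the UI state after each event; unmapped events keep the current state."""
    state = initial
    states = []
    for kind in event_types:
        state = TRANSITIONS.get(kind, state)
        states.append(state)
    return states
```

Because speculative events such as spec_start and spec_cancel are not in the mapping, a cancelled speculative run does not change what the user sees.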

SERVICE BOUNDARIES

ASR, VAD, A2F, LLM, TTS, and renderer control endpoints are not customer WebSocket APIs. The customer-facing gateway protocol is the call lifecycle above: authenticate, start or stop a call, stream audio/text/image input, and receive transcript, response, chunk, lifecycle, and metrics events.

The renderer player is a separate signed URL returned as rendererPlayerUrl. Pixel Streaming telemetry and WebRTC media state stay inside that player surface and are not exposed on this call WebSocket.

ERROR TYPES

Type            Description
auth_error      Authentication failure
call_error      Call session error
call_asr_error  Call ASR processing error
error           Generic / malformed message error

All error messages include a message string and a timestamp.
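
Since every error type carries the same two fields, a client can route them through one check. This is a minimal sketch; as_error is a hypothetical helper, and the timestamp is passed through untouched because its format is not specified here.

```python
from typing import Any, Dict, Optional, Tuple

ERROR_TYPES = {"auth_error", "call_error", "call_asr_error", "error"}


def as_error(msg: Dict[str, Any]) -> Optional[Tuple[str, str, Any]]:
    """Return (type, message, timestamp) for an error message, else None."""
    kind = msg.get("type")
    if kind not in ERROR_TYPES:
        return None
    return kind, msg.get("message", ""), msg.get("timestamp")
```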