WEBSOCKET PROTOCOL
AStack uses a 1:1 container model — each container accepts exactly one WebSocket client. All messages are JSON over text WebSocket frames.
CONNECTION LIFECYCLE
Client                        Server
  |--- WS connect -------------->|
  |<-- { type: "connected",      |
  |      clientId }              |
  |                              |
  |--- { type: "authenticate",   |
  |      sessionToken }          |
  |<-- { type: "authenticated",  |
  |      sessionId, ... }        |
  |              OR              |
  |<-- { type: "auth_error",     |
  |      message }               |
  |                              |
  |<------- ping (30s) ----------|
  |-------- pong --------------->|
The server sends a WebSocket ping every 30 seconds. If the client does not respond before the next cycle, the connection is terminated. While a client is connected, any additional connection attempt is rejected with close code 4001.
AUTHENTICATION
CLIENT → SERVER
{
  "type": "authenticate",
  "sessionToken": "st_abc123..."
}
SERVER → CLIENT (SUCCESS)
Field          Type                    Description
type           "authenticated"
sessionId      string                  Assigned session ID
channelName    string | null           Supabase realtime channel (prod)
expiresAt      string | null           Session expiration timestamp
signalingMode  "supabase" | "gateway"  Signaling transport

CALL SESSION
call_start
{
  "type": "call_start",
  "fps": 30,
  "sampleRate": 16000,
  "providers": { "asr": "self", "llm": "self", "tts": "self" },
  "systemPrompt": "You are a helpful assistant.",
  "priorContext": [{ "role": "user", "content": "Hello" }],
  "configOverrides": { "EOU_SILENCE_MS": 800 }
}
Field            Default  Description
fps              30       Animation frame rate
sampleRate       16000    Audio sample rate (Hz)
providers        {}       ASR/LLM/TTS provider overrides
systemPrompt     null     System prompt for LLM
priorContext     []       Prior conversation messages
configOverrides  {}       Runtime config overrides
disableA2F       false    Skip A2F, use amplitude blendshapes

call_audio / call_image / call_stop
// Stream audio (base64 PCM)
{ "type": "call_audio", "audio": "<base64>" }

// Send camera frame (base64 JPEG)
{ "type": "call_image", "image": "<base64>" }

// Stop the call
{ "type": "call_stop" }
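The call messages can be assembled with small helpers. A sketch; field names and defaults come from the tables above, while the helper names themselves are illustrative:

```python
import base64
import json

def call_start(fps: int = 30, sample_rate: int = 16000, **extra) -> str:
    # extra may carry providers, systemPrompt, priorContext,
    # configOverrides, or disableA2F; omitted fields use server defaults.
    return json.dumps({"type": "call_start", "fps": fps,
                       "sampleRate": sample_rate, **extra})

def call_audio(pcm: bytes) -> str:
    # Audio is streamed as base64-encoded PCM at the negotiated sample rate.
    return json.dumps({"type": "call_audio",
                       "audio": base64.b64encode(pcm).decode("ascii")})

def call_image(jpeg: bytes) -> str:
    return json.dumps({"type": "call_image",
                       "image": base64.b64encode(jpeg).decode("ascii")})

def call_stop() -> str:
    return json.dumps({"type": "call_stop"})
```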
SERVER EVENTS
Events emitted by the server during a conversation turn, in approximate order.
ASR EVENTS
Event                Description
call_speech_started  User speech detected
call_interim         Interim ASR transcript (text, raw, timing)
call_utterance_end   End of utterance detected
call_transcript      Final accepted transcript

PIPELINE EVENTS
Event           Description
call_llm_start  LLM generation started
call_llm_ttft   LLM first token generated
call_llm_ttfs   LLM first complete sentence
call_llm_end    LLM generation complete
call_tts_start  TTS synthesis started
call_tts_ttfu   TTS first usable audio chunk
call_tts_end    TTS synthesis complete
call_a2f_start  Audio2Face processing started
call_a2f_ttfu   A2F first usable blendshape frame
call_a2f_end    A2F processing complete

RESPONSE EVENTS
Event                   Description
call_response           Full LLM response text
call_chunk              Audio + blendshape data for playback
call_response_complete  All chunks sent
call_buffer_end         Client audio playback estimated complete
turn_metrics            End-of-turn latency summary

CALL CHUNK SCHEMA
{
  "type": "call_chunk",
  "audio": "<base64 PCM16 audio at 24kHz>",
  "blendshapes": [
    [0.0, 0.1, ...],  // frame 1: 52 ARKit blendshape weights
    [0.0, 0.2, ...],  // frame 2
    ...
  ],
  "timestamp": 1234567890
}
Frames are sent in batches of 5 at the configured FPS. Each frame contains 52 ARKit blendshape weights (0-1 range) for facial animation.
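A client can validate and decode each chunk before playback. A sketch using only the Python standard library, assuming 24 kHz mono little-endian PCM16 and 52 weights per frame as stated above:

```python
import array
import base64

SAMPLE_RATE = 24000   # call_chunk audio sample rate (Hz)
NUM_BLENDSHAPES = 52  # ARKit blendshape weights per frame

def decode_chunk(chunk: dict) -> tuple[list[int], list[list[float]]]:
    """Return (PCM16 samples, blendshape frames) from a call_chunk message."""
    pcm = array.array("h")  # signed 16-bit samples (native/little-endian assumed)
    pcm.frombytes(base64.b64decode(chunk["audio"]))
    frames = chunk["blendshapes"]
    for frame in frames:
        # Each frame must hold exactly 52 weights in the 0-1 range.
        if len(frame) != NUM_BLENDSHAPES or not all(0.0 <= w <= 1.0 for w in frame):
            raise ValueError("malformed blendshape frame")
    return list(pcm), frames

def audio_seconds(samples: list[int]) -> float:
    """Playback duration of a decoded sample buffer."""
    return len(samples) / SAMPLE_RATE
```

The duration check is useful for scheduling blendshape frames against the audio clock at the configured FPS.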
TYPICAL TURN ORDER
call_speech_started
call_interim (repeated as ASR refines)
spec_start                  # speculative pipeline begins
call_llm_start → call_llm_ttft → call_llm_ttfs
call_tts_start → call_tts_ttfu
call_a2f_start → call_a2f_ttfu
call_interim (newer transcript)
spec_cancel                 # old spec cancelled
spec_start                  # new spec with updated transcript
call_fire_eou               # VAD silence threshold met
call_utterance_end
call_transcript             # final transcript
call_speculative_accepted
call_response               # full AI response text
call_chunk (repeated)       # audio + blendshapes streamed
call_response_complete
call_buffer_end
turn_metrics
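A minimal per-turn tracker might accumulate state across this sequence. A sketch; the event names come from the list above, but the payload field names (e.g. "text") are assumptions:

```python
class TurnTracker:
    """Collects the user transcript and AI response across one turn's events."""

    def __init__(self) -> None:
        self.interim = ""        # latest interim ASR text
        self.transcript = None   # final accepted transcript
        self.response = None     # full AI response text
        self.chunks = 0          # call_chunk messages received
        self.done = False        # turn closed by turn_metrics

    def on_event(self, msg: dict) -> None:
        t = msg["type"]
        if t == "call_interim":
            self.interim = msg.get("text", "")
        elif t == "call_transcript":
            self.transcript = msg.get("text")
        elif t == "call_response":
            self.response = msg.get("text")
        elif t == "call_chunk":
            self.chunks += 1
        elif t == "turn_metrics":
            self.done = True  # the end-of-turn latency summary closes the turn
```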
STANDALONE SERVICES
Each AI service (ASR, LLM, TTS) can be used independently outside of a call session.
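Request builders for the three services follow directly from the message tables below; optional fields (marked with "?") are included only when set. A sketch:

```python
import json

def asr_start(provider=None, sample_rate=None) -> str:
    msg = {"type": "asr_start"}
    if provider is not None:
        msg["provider"] = provider
    if sample_rate is not None:
        msg["sampleRate"] = sample_rate
    return json.dumps(msg)

def llm(prompt: str, provider=None) -> str:
    msg = {"type": "llm", "prompt": prompt}
    if provider is not None:
        msg["provider"] = provider
    return json.dumps(msg)

def tts(text: str, voice=None, provider=None) -> str:
    msg = {"type": "tts", "text": text}
    if voice is not None:
        msg["voice"] = voice
    if provider is not None:
        msg["provider"] = provider
    return json.dumps(msg)
```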
ASR (SPEECH-TO-TEXT)
Message     Direction  Description
asr_start   C → S      Start ASR session (provider?, sampleRate?)
asr_audio   C → S      Stream audio (base64)
asr_result  S → C      Transcription result (transcript, confidence, isInterim)
asr_stop    C → S      Stop ASR session

LLM (TEXT GENERATION)
Message    Direction  Description
llm        C → S      Send prompt (prompt, provider?)
llm_start  S → C      Generation started
llm_chunk  S → C      Token chunk (content)
llm_end    S → C      Generation complete

TTS (TEXT-TO-SPEECH)
Message    Direction  Description
tts        C → S      Send text (text, voice?, provider?)
tts_start  S → C      Synthesis started
tts_chunk  S → C      Audio chunk (base64)
tts_end    S → C      Synthesis complete

ERROR TYPES
Type            Description
auth_error      Authentication failure
call_error      Call session error
call_asr_error  Call ASR processing error
asr_error       Standalone ASR error
llm_error       Standalone LLM error
tts_error       Standalone TTS error
error           Generic / malformed message error

All error messages include a message string and timestamp.
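Since every error carries a message and timestamp, a single handler can route all seven types. A sketch; the coarse categories are illustrative, not part of the protocol:

```python
import json

CALL_ERRORS = {"call_error", "call_asr_error"}
SERVICE_ERRORS = {"asr_error", "llm_error", "tts_error"}

def classify_error(raw: str) -> str:
    """Return a coarse category for an error message, or 'not_error'."""
    msg = json.loads(raw)
    t = msg.get("type", "")
    if t == "auth_error":
        return "auth"      # re-authenticate or refresh the session token
    if t in CALL_ERRORS:
        return "call"      # the active call session failed
    if t in SERVICE_ERRORS:
        return "service"   # a standalone ASR/LLM/TTS request failed
    if t == "error":
        return "protocol"  # generic / malformed message
    return "not_error"
```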