WEBSOCKET PROTOCOL

AStack uses a 1:1 container model: each container accepts exactly one WebSocket client. All messages are JSON sent over text WebSocket frames.

CONNECTION LIFECYCLE

Client                    Server
|--- WS connect -------------->|
|<-- { type: "connected",      |
|      clientId }              |
|                              |
|--- { type: "authenticate",   |
|      sessionToken }          |
|<-- { type: "authenticated",  |
|      sessionId, ... }        |
|              OR              |
|<-- { type: "auth_error",     |
|      message }               |
|                              |
|<------- ping (30s) ----------|
|-------- pong --------------->|

The server sends a WebSocket ping every 30 seconds. If the client has not responded with a pong by the next ping cycle, the server terminates the connection. Because each container accepts only one client, a second connection attempt is rejected with close code 4001.
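The handshake above can be sketched as a small state function. This is a minimal illustration, not the official client: the function name `next_handshake_message` and the state labels are hypothetical, and only the message types and the `sessionToken` field come from the diagram. Ping/pong is not handled here because it happens at the WebSocket frame level, where most client libraries answer pings automatically.

```python
import json


def next_handshake_message(incoming: str, session_token: str):
    """Return (state, reply) for one server frame received during the handshake.

    `reply` is the JSON text to send back, or None when nothing should be sent.
    The state labels are illustrative and not part of the protocol.
    """
    msg = json.loads(incoming)
    kind = msg.get("type")
    if kind == "connected":
        # The server greeted us; answer with the session token.
        reply = json.dumps({"type": "authenticate", "sessionToken": session_token})
        return "authenticating", reply
    if kind == "authenticated":
        return "ready", None
    if kind == "auth_error":
        return "failed", None
    return "unknown", None
```

A caller would feed each received text frame to this function and send `reply` whenever it is not None.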

AUTHENTICATION

CLIENT → SERVER

{
  "type": "authenticate",
  "sessionToken": "st_abc123..."
}

SERVER → CLIENT (SUCCESS)

Field          Type                    Description
type           "authenticated"
sessionId      string                  Assigned session ID
channelName    string | null           Supabase realtime channel (prod)
expiresAt      string | null           Session expiration timestamp
signalingMode  "supabase" | "gateway"  Signaling transport
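As a rough sketch, a client might validate the success response against the fields listed above before using it. The `AuthResult` class and `parse_authenticated` function are hypothetical names introduced for illustration. Treating an unrecognized `signalingMode` as an error is an assumption about how strict a client wants to be.

```python
from dataclasses import dataclass
from typing import Optional

# The two signaling transports named in the field list.
SIGNALING_MODES = {"supabase", "gateway"}


@dataclass
class AuthResult:
    session_id: str
    channel_name: Optional[str]
    expires_at: Optional[str]
    signaling_mode: str


def parse_authenticated(msg: dict) -> AuthResult:
    """Convert a decoded 'authenticated' message into a typed record."""
    if msg.get("type") != "authenticated":
        raise ValueError(f"expected 'authenticated', got {msg.get('type')!r}")
    mode = msg.get("signalingMode")
    if mode not in SIGNALING_MODES:
        raise ValueError(f"unexpected signalingMode: {mode!r}")
    return AuthResult(
        session_id=msg["sessionId"],
        channel_name=msg.get("channelName"),  # may be null
        expires_at=msg.get("expiresAt"),      # may be null
        signaling_mode=mode,
    )
```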

CALL SESSION

call_start

{
  "type": "call_start",
  "fps": 30,
  "sampleRate": 16000,
  "providers": { "asr": "self", "llm": "self", "tts": "self" },
  "systemPrompt": "You are a helpful assistant.",
  "priorContext": [{ "role": "user", "content": "Hello" }],
  "configOverrides": { "EOU_SILENCE_MS": 800 }
}

Field            Default  Description
fps              30       Animation frame rate
sampleRate       16000    Audio sample rate (Hz)
providers        {}       ASR/LLM/TTS provider overrides
systemPrompt     null     System prompt for LLM
priorContext     []       Prior conversation messages
configOverrides  {}       Runtime config overrides
disableA2F       false    Skip A2F, use amplitude blendshapes
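Since every `call_start` field has a default, a client only needs to send the fields it wants to change. The helper below is a sketch under that reading. `build_call_start` is a hypothetical name, and rejecting unknown keys on the client side is a design choice of this example rather than documented server behavior.

```python
import json

# Field names taken from the call_start field list.
CALL_START_FIELDS = {
    "fps", "sampleRate", "providers", "systemPrompt",
    "priorContext", "configOverrides", "disableA2F",
}


def build_call_start(**options) -> str:
    """Serialize a call_start message containing only the given overrides."""
    unknown = set(options) - CALL_START_FIELDS
    if unknown:
        raise ValueError(f"unknown call_start fields: {sorted(unknown)}")
    # Omitted fields fall back to the documented server-side defaults.
    return json.dumps({"type": "call_start", **options})
```

For example, `build_call_start(fps=24, disableA2F=True)` produces a message that changes the frame rate and skips A2F while leaving everything else at its default.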

call_audio / call_image / call_stop

// Stream audio (base64 PCM)
{ "type": "call_audio", "audio": "<base64>" }
// Send camera frame (base64 JPEG)
{ "type": "call_image", "image": "<base64>" }
// Stop the call
{ "type": "call_stop" }
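A sketch of how raw audio could be split into `call_audio` messages follows. The protocol text only says "base64 PCM", so the 16-bit mono sample format and the 20 ms chunk size used here are assumptions, and `call_audio_messages` is a hypothetical helper name.

```python
import base64
import json


def call_audio_messages(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 20):
    """Yield call_audio JSON messages for consecutive slices of PCM audio.

    Assumes 16-bit mono samples (2 bytes each); the chunk duration is arbitrary.
    """
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2
    for start in range(0, len(pcm), bytes_per_chunk):
        chunk = pcm[start:start + bytes_per_chunk]
        encoded = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({"type": "call_audio", "audio": encoded})
```

At 16000 Hz, each 20 ms chunk carries 320 samples, or 640 bytes before base64 encoding; the final chunk may be shorter.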

SERVER EVENTS

Events emitted by the server during a conversation turn, in approximate order.

ASR EVENTS

Event                Description
call_speech_started  User speech detected
call_interim         Interim ASR transcript (text, raw, timing)
call_utterance_end   End of utterance detected
call_transcript      Final accepted transcript

PIPELINE EVENTS

Event           Description
call_llm_start  LLM generation started
call_llm_ttft   LLM first token generated
call_llm_ttfs   LLM first complete sentence
call_llm_end    LLM generation complete
call_tts_start  TTS synthesis started
call_tts_ttfu   TTS first usable audio chunk
call_tts_end    TTS synthesis complete
call_a2f_start  Audio2Face processing started
call_a2f_ttfu   A2F first usable blendshape frame
call_a2f_end    A2F processing complete

RESPONSE EVENTS

Event                   Description
call_response           Full LLM response text
call_chunk              Audio + blendshape data for playback
call_response_complete  All chunks sent
call_buffer_end         Client audio playback estimated complete
turn_metrics            End-of-turn latency summary

CALL CHUNK SCHEMA

{
  "type": "call_chunk",
  "audio": "<base64 PCM16 audio at 24kHz>",
  "blendshapes": [
    [0.0, 0.1, ...], // frame 1: 52 ARKit blendshape weights
    [0.0, 0.2, ...], // frame 2
    ...
  ],
  "timestamp": 1234567890
}

Frames are sent in batches of 5 at the configured FPS. Each frame contains 52 ARKit blendshape weights (0-1 range) for facial animation.
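To make the chunk layout concrete, the sketch below decodes one `call_chunk` and reports how much audio and animation it carries. It assumes mono audio (the schema states PCM16 at 24 kHz but not the channel count), and `summarize_chunk` is a hypothetical helper name.

```python
import base64

SAMPLE_RATE_HZ = 24000   # from the schema: PCM16 audio at 24 kHz
BLENDSHAPE_COUNT = 52    # ARKit blendshape weights per frame


def summarize_chunk(msg: dict, fps: int = 30) -> dict:
    """Validate a decoded call_chunk and compute its audio and animation durations."""
    pcm = base64.b64decode(msg["audio"])
    frames = msg["blendshapes"]
    for index, frame in enumerate(frames):
        if len(frame) != BLENDSHAPE_COUNT:
            raise ValueError(f"frame {index} has {len(frame)} weights, expected {BLENDSHAPE_COUNT}")
        if any(not 0.0 <= weight <= 1.0 for weight in frame):
            raise ValueError(f"frame {index} has a weight outside the 0-1 range")
    return {
        "audio_seconds": len(pcm) / 2 / SAMPLE_RATE_HZ,  # 2 bytes per sample, mono assumed
        "frame_count": len(frames),
        "animation_seconds": len(frames) / fps,
    }
```

For instance, a batch of 5 frames at 30 FPS spans one sixth of a second, which corresponds to 4000 mono samples (8000 bytes) of 24 kHz PCM16 audio.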

TYPICAL TURN ORDER

call_speech_started
call_interim (repeated as ASR refines)
spec_start                      # speculative pipeline begins
call_llm_start → call_llm_ttft → call_llm_ttfs
call_tts_start → call_tts_ttfu
call_a2f_start → call_a2f_ttfu
call_interim (newer transcript)
spec_cancel                     # old spec cancelled
spec_start                      # new spec with updated transcript
call_fire_eou                   # VAD silence threshold met
call_utterance_end
call_transcript                 # final transcript
call_speculative_accepted
call_response                   # full AI response text
call_chunk (repeated)           # audio + blendshapes streamed
call_response_complete
call_buffer_end
turn_metrics
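Because `turn_metrics` closes the sequence above, a client could use it as a turn boundary when grouping incoming events. The sketch below assumes exactly that, which the listing suggests but does not guarantee; `split_turns` is a hypothetical helper name.

```python
def split_turns(messages):
    """Group decoded server messages into turns, each ending at turn_metrics."""
    turn = []
    for msg in messages:
        turn.append(msg)
        if msg.get("type") == "turn_metrics":
            yield turn
            turn = []
    if turn:
        # Trailing events belong to a turn that has not finished yet.
        yield turn
```

Each yielded list can then be inspected for `spec_cancel`, `call_chunk`, or other events without relying on a strict ordering inside the turn.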

STANDALONE SERVICES

Each AI service (ASR, LLM, TTS) can be used independently outside of a call session.

ASR (SPEECH-TO-TEXT)

Message     Direction  Description
asr_start   C → S      Start ASR session (provider?, sampleRate?)
asr_audio   C → S      Stream audio (base64)
asr_result  S → C      Transcription result (transcript, confidence, isInterim)
asr_stop    C → S      Stop ASR session
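One way a client might consume `asr_result` messages is to keep only the non-interim transcripts. This sketch relies on the `transcript` and `isInterim` fields named above; treating a missing `isInterim` as false is an assumption, and `final_transcripts` is a hypothetical helper name.

```python
def final_transcripts(messages):
    """Return the transcripts of all non-interim asr_result messages, in order."""
    return [
        msg["transcript"]
        for msg in messages
        if msg.get("type") == "asr_result" and not msg.get("isInterim", False)
    ]
```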

LLM (TEXT GENERATION)

Message    Direction  Description
llm        C → S      Send prompt (prompt, provider?)
llm_start  S → C      Generation started
llm_chunk  S → C      Token chunk (content)
llm_end    S → C      Generation complete
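The streaming LLM exchange can be consumed by concatenating `llm_chunk` contents until `llm_end` arrives. The sketch below is illustrative only: `collect_llm_text` is a hypothetical name, and raising on `llm_error` is one possible policy rather than required behavior.

```python
def collect_llm_text(messages) -> str:
    """Join llm_chunk contents from decoded messages until llm_end is seen."""
    parts = []
    for msg in messages:
        kind = msg.get("type")
        if kind == "llm_chunk":
            parts.append(msg.get("content", ""))
        elif kind == "llm_error":
            raise RuntimeError(msg.get("message", "LLM error"))
        elif kind == "llm_end":
            break
    return "".join(parts)
```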

TTS (TEXT-TO-SPEECH)

Message    Direction  Description
tts        C → S      Send text (text, voice?, provider?)
tts_start  S → C      Synthesis started
tts_chunk  S → C      Audio chunk (base64)
tts_end    S → C      Synthesis complete

ERROR TYPES

Type            Description
auth_error      Authentication failure
call_error      Call session error
call_asr_error  Call ASR processing error
asr_error       Standalone ASR error
llm_error       Standalone LLM error
tts_error       Standalone TTS error
error           Generic / malformed message error

All error messages include a message string and timestamp.
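Since all error types share the `message` and `timestamp` fields, a client can funnel them through one check. The following is a sketch with hypothetical names (`ProtocolError`, `raise_if_error`); converting errors into exceptions is a client-side design choice, not something the protocol prescribes.

```python
# Error message types listed in the table above.
ERROR_TYPES = {
    "auth_error", "call_error", "call_asr_error",
    "asr_error", "llm_error", "tts_error", "error",
}


class ProtocolError(Exception):
    """Raised when the server sends one of the documented error messages."""

    def __init__(self, kind, message, timestamp):
        super().__init__(f"{kind}: {message}")
        self.kind = kind
        self.timestamp = timestamp


def raise_if_error(msg: dict) -> dict:
    """Return msg unchanged, or raise ProtocolError if it is an error message."""
    kind = msg.get("type")
    if kind in ERROR_TYPES:
        raise ProtocolError(kind, msg.get("message"), msg.get("timestamp"))
    return msg
```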