IMMERSA Voice Chat API

Real-time voice conversation — Whisper STT · GPT LLM · ElevenLabs TTS

wss://immersa-voice-chat-api.up.railway.app/ws/voice-chat · AsyncAPI 2.6

How it works

Connect one persistent WebSocket to wss://immersa-voice-chat-api.up.railway.app/ws/voice-chat. Every message in both directions is a JSON object with a type field that identifies it.

The pipeline is:

1. Connect

Open the WebSocket. The server immediately sends connection_established with your session_id.

2. Start session

Send start_session with a character_id and audio parameters. The server acknowledges and the session enters LISTENING state.

3. Stream audio

Encode each audio chunk as base64 and send audio_chunk messages. The server acknowledges each one and — every 5 chunks — emits a partial_transcript so you can show live captions.

4. End of utterance

When the user stops speaking, send end_of_utterance. The server finalises transcription, generates the character reply with GPT, and streams audio back from ElevenLabs.

5. Receive reply

You get: final_transcript → reply_text_done → multiple tts_audio_chunk → tts_done. After tts_done the session is back in LISTENING state, ready for the next turn.

6. Close

Send close_session to end gracefully.

All audio is WAV when going client → server, and MP3 (44 100 Hz, 128 kbps) when going server → client.
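Because every message carries a type field, a client can funnel all server traffic through one dispatcher. A minimal sketch (the routeServerMessage helper and the handler map are illustrative, not part of the API):

```javascript
// Route an incoming server message to a handler keyed by its `type` field.
// Returns the message type so callers can observe what was dispatched.
function routeServerMessage(raw, handlers) {
  const msg = JSON.parse(raw);
  const handler = handlers[msg.type];
  if (handler) handler(msg);
  return msg.type;
}

// Example handler map covering the pipeline above (bodies are placeholders):
const handlers = {
  connection_established: (m) => console.log('session', m.session_id),
  ack:                    (m) => console.log('acked', m.event),
  partial_transcript:     (m) => console.log('caption:', m.text),
  final_transcript:       (m) => console.log('you said:', m.text),
  reply_text_done:        (m) => console.log('reply:', m.text),
  tts_audio_chunk:        (m) => { /* buffer m.audio for playback */ },
  tts_done:               ()  => { /* play buffered audio, next turn */ },
  error:                  (m) => console.error('server error:', m.message),
};
```

Wire it up with `ws.onmessage = (e) => routeServerMessage(e.data, handlers);`.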

Message flow

Client                                              Server
  |--- WebSocket connect -------------------------->|
  |<-- connection_established ----------------------|
  |--- start_session ------------------------------>|
  |<-- ack (event: start_session) ------------------|
  |                                                 |
  |   (repeat for each audio chunk)                 |
  |--- audio_chunk (chunk_index: 0) --------------->|
  |<-- ack (event: audio_chunk) --------------------|
  |--- audio_chunk (chunk_index: 1) --------------->|
  |<-- ack (event: audio_chunk) --------------------|
  |   ...                                           |
  |--- audio_chunk (chunk_index: 4) --------------->|
  |<-- ack (event: audio_chunk) --------------------|
  |<-- partial_transcript --------------------------|  ← every 5 chunks
  |   ...                                           |
  |--- end_of_utterance --------------------------->|
  |<-- ack (event: end_of_utterance) ---------------|
  |<-- final_transcript ----------------------------|
  |<-- reply_text_done -----------------------------|
  |<-- tts_audio_chunk (chunk_index: 0) ------------|
  |<-- tts_audio_chunk (chunk_index: 1) ------------|
  |   ...                                           |
  |<-- tts_done ------------------------------------|  ← back to LISTENING
  |                                                 |
  |   (next turn: send audio_chunk again)           |
  |   ...                                           |
  |--- close_session ------------------------------>|
  |<-- ack (event: close_session) ------------------|

Session states

The server tracks a state machine per session. State is included in the start_session ack.

CONNECTED → LISTENING → FINALIZING_TRANSCRIPT → GENERATING_REPLY → STREAMING_TTS → LISTENING
LISTENING → CLOSED (via close_session)

State                   Meaning
CONNECTED               WebSocket open, waiting for start_session.
LISTENING               Session active, ready to receive audio_chunk messages.
FINALIZING_TRANSCRIPT   end_of_utterance received; running final Whisper STT.
GENERATING_REPLY        STT done; calling GPT to generate the character reply.
STREAMING_TTS           LLM done; streaming ElevenLabs audio back to the client.
CLOSED                  Session terminated.
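The documented flow can be expressed as a transition table, which is handy for keeping client UI in sync. The client-triggered events (start_session, end_of_utterance, close_session) come from this page; the internal trigger names (stt_done, llm_done, tts_done) are illustrative labels for the server's automatic transitions, not real API events:

```javascript
// Transition table for the documented session state machine.
const TRANSITIONS = {
  CONNECTED:             { start_session: 'LISTENING' },
  LISTENING:             { end_of_utterance: 'FINALIZING_TRANSCRIPT',
                           close_session: 'CLOSED' },
  FINALIZING_TRANSCRIPT: { stt_done: 'GENERATING_REPLY' },   // hypothetical label
  GENERATING_REPLY:      { llm_done: 'STREAMING_TTS' },      // hypothetical label
  STREAMING_TTS:         { tts_done: 'LISTENING' },
  CLOSED:                {},
};

// Unknown events leave the state unchanged.
function nextState(state, event) {
  return (TRANSITIONS[state] || {})[event] || state;
}
```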

CLIENT → SERVER start_session

Initialise the conversation. Send once after connection_established.

JSON payload

{
  "type": "start_session",
  "character_id": "s1",
  "sample_rate": 16000,
  "audio_format": "wav_base64_chunks"
}

Fields

Field          Type      Required   Description
type           string    required   Always "start_session"
character_id   string    required   "s1", "s2", or "p1" (see Characters)
sample_rate    integer   optional   Sample rate of the audio you will send. Default: 16000 Hz
audio_format   string    optional   Encoding format. Default: "wav_base64_chunks"
The server responds with ack (event: start_session) that includes your session_id and current state.
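A small helper that fills in the documented defaults keeps the payload construction in one place (the function name and options object are illustrative):

```javascript
// Build a start_session payload, applying the documented defaults.
function buildStartSession(characterId, opts = {}) {
  return {
    type: 'start_session',
    character_id: characterId,
    sample_rate: opts.sampleRate ?? 16000,                 // default per the docs
    audio_format: opts.audioFormat ?? 'wav_base64_chunks', // default per the docs
  };
}
```

Usage: `ws.send(JSON.stringify(buildStartSession('s1')));`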

CLIENT → SERVER audio_chunk

One chunk of the user's audio, base64-encoded. Send in order, starting at chunk_index: 0.

JSON payload

{
  "type": "audio_chunk",
  "chunk_index": 0,
  "audio": "<base64-encoded WAV data>"
}

Fields

Field         Type              Required   Description
type          string            required   Always "audio_chunk"
chunk_index   integer ≥ 0       required   Zero-based index. Increment by 1 for each chunk.
audio         string (base64)   required   Base64-encoded WAV audio bytes.
The server performs rolling STT after every 5 chunks and emits a partial_transcript. Keep chunks small (e.g. 0.5–1 s of audio) for best latency.
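Wrapping raw WAV bytes into the message is a one-liner with Buffer in Node (browsers would use FileReader or btoa instead; see the encoding example at the bottom of this page). A sketch:

```javascript
// Wrap one raw WAV chunk as an audio_chunk message (Node sketch).
function makeAudioChunk(index, wavBytes) {
  return JSON.stringify({
    type: 'audio_chunk',
    chunk_index: index,
    audio: Buffer.from(wavBytes).toString('base64'),
  });
}
```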

CLIENT → SERVER end_of_utterance

Signal that the user has finished speaking. This triggers the full pipeline: final STT → LLM → TTS streaming.

JSON payload

{
  "type": "end_of_utterance"
}

CLIENT → SERVER close_session

Gracefully end the session. The server sends a final ack then closes the WebSocket.

JSON payload

{
  "type": "close_session"
}

SERVER → CLIENT connection_established

Sent immediately after the WebSocket handshake. Contains the session ID you should log for debugging.

JSON payload

{
  "type": "connection_established",
  "session_id": "3f4a1b2c-8d9e-4f5a-b6c7-d8e9f0a1b2c3",
  "message": "WebSocket connected successfully"
}

Fields

Field        Type            Description
type         string          Always "connection_established"
session_id   string (UUID)   Unique identifier for this WebSocket session.
message      string          Human-readable status.

SERVER → CLIENT ack

Generic acknowledgement for every client message. The event field tells you which message is being acked. Extra fields depend on the event.

ack for start_session

{
  "type": "ack",
  "event": "start_session",
  "message": "Session started successfully",
  "session_id": "3f4a1b2c-...",
  "character_id": "s1",
  "sample_rate": 16000,
  "audio_format": "wav_base64_chunks",
  "state": "LISTENING"
}

ack for audio_chunk

{
  "type": "ack",
  "event": "audio_chunk",
  "message": "Audio chunk received",
  "chunk_index": 0,
  "total_chunks": 1
}

ack for end_of_utterance

{
  "type": "ack",
  "event": "end_of_utterance",
  "message": "Processing started"
}

ack for close_session

{
  "type": "ack",
  "event": "close_session",
  "message": "Session closed successfully"
}

SERVER → CLIENT partial_transcript

Intermediate STT result emitted every 5 audio chunks while the user is still speaking. Use for live captions.

JSON payload

{
  "type": "partial_transcript",
  "text": "What is hydraulic pressure",
  "chunk_index": 4,
  "window_size": 5
}

Fields

Field         Type              Description
type          string            Always "partial_transcript"
text          string            Partial transcription of the rolling audio window.
chunk_index   integer           Index of the chunk that triggered this event.
window_size   integer (max 5)   Number of chunks in the rolling window.
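Since chunk_index is zero-based, "every 5 chunks" means the event fires after indices 4, 9, 14, and so on. If you want to anticipate caption updates client-side, the arithmetic is (assuming the documented window of 5 and no other trigger conditions):

```javascript
// True when this zero-based chunk index is the last of a 5-chunk window,
// i.e. the one expected to trigger a partial_transcript.
function triggersPartialTranscript(chunkIndex, window = 5) {
  return (chunkIndex + 1) % window === 0;
}
```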

SERVER → CLIENT final_transcript

Complete transcription of the user's full utterance, produced after end_of_utterance.

JSON payload

{
  "type": "final_transcript",
  "text": "What is hydraulic pressure and how is it measured?"
}

Fields

Field   Type     Description
type    string   Always "final_transcript"
text    string   Full transcription of the user's utterance.

SERVER → CLIENT reply_text_done

The character's complete text reply from GPT. Emitted before TTS audio starts — show a text bubble immediately without waiting for audio.

JSON payload

{
  "type": "reply_text_done",
  "text": "Well, hydraulic pressure is the force exerted by a confined fluid...",
  "length": 147
}

Fields

Field    Type      Description
type     string    Always "reply_text_done"
text     string    The character's full reply text.
length   integer   Character count of the reply.

SERVER → CLIENT tts_audio_chunk

One chunk of the character's voice audio, base64-encoded MP3. Buffer and play sequentially.

JSON payload

{
  "type": "tts_audio_chunk",
  "chunk_index": 0,
  "audio": "<base64-encoded MP3 data>"
}

Fields

Field         Type              Description
type          string            Always "tts_audio_chunk"
chunk_index   integer ≥ 0       Sequential index; play chunks in this order.
audio         string (base64)   Base64-encoded MP3 bytes. Format: MP3 · 44 100 Hz · 128 kbps.
Decode with atob() (browser) or Buffer.from(audio, 'base64') (Node). Concatenate all chunks into one MP3 blob or pipe into a streaming audio element.
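On the Node side, concatenating the buffered chunks looks like this (a sketch; the server sends chunks in order, but sorting by chunk_index guards against any out-of-order buffering in your own code):

```javascript
// Reassemble buffered tts_audio_chunk messages into one MP3 buffer (Node).
function assembleTtsAudio(chunks) {
  return Buffer.concat(
    [...chunks]
      .sort((a, b) => a.chunk_index - b.chunk_index)
      .map((c) => Buffer.from(c.audio, 'base64'))
  );
}
```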

SERVER → CLIENT tts_done

All TTS audio has been sent. The session returns to LISTENING state — you can start the next turn.

JSON payload

{
  "type": "tts_done"
}

SERVER → CLIENT error

Sent when something goes wrong. The session may still be alive after a non-fatal error — check the message and decide whether to retry or reconnect.

JSON payload

{
  "type": "error",
  "message": "Failed to decode base64 audio",
  "chunk_index": 3
}

Fields

Field         Type             Description
type          string           Always "error"
message       string           Human-readable error description.
chunk_index   integer | null   Present when the error relates to a specific audio chunk.
session_id    string (UUID)    Present when the error is session-level.
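One possible client policy, based on which optional field is present (this is an assumption about sensible handling, not behaviour specified by the API): resend the offending chunk for chunk-level errors, reconnect on session-level errors, and inspect anything else.

```javascript
// Classify a server error message into a hypothetical client action.
function classifyError(err) {
  if (Number.isInteger(err.chunk_index)) return 'retry_chunk'; // chunk-level
  if (err.session_id) return 'reconnect';                      // session-level
  return 'inspect';                                            // unknown scope
}
```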

Characters

Pass character_id in start_session to select which character you are speaking with.

ID   Name                     Role        Department
s1   Kareem Ali El-Attar      Student     Irrigation Engineering
s2   Morad Hassan El-Shazly   Student     Mechanical Engineering
p1   Amin Saleh Shawky        Professor   Mechanical Engineering

Audio format reference

Direction         Format   Sample rate           Encoding
Client → Server   WAV      16 000 Hz (default)   Base64 string in JSON
Server → Client   MP3      44 100 Hz             Base64 string in JSON (128 kbps)

Encoding audio in the browser

// Record with MediaRecorder, then send each chunk as it becomes available:
let index = 0;

function sendChunk(ws, audioBlob) {
  const reader = new FileReader();
  reader.onload = () => {
    // reader.result is "data:<mime>;base64,<data>"; keep only the base64 part
    const base64 = reader.result.split(',')[1];
    ws.send(JSON.stringify({
      type: 'audio_chunk',
      chunk_index: index++,
      audio: base64
    }));
  };
  reader.readAsDataURL(audioBlob);
}

Decoding TTS audio in the browser

// Collect chunks, then play:
const chunks = [];
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'tts_audio_chunk') {
    const binary = atob(msg.audio);
    const bytes  = Uint8Array.from(binary, c => c.charCodeAt(0));
    chunks.push(bytes.buffer);
  }
  if (msg.type === 'tts_done') {
    const blob = new Blob(chunks, { type: 'audio/mpeg' });
    const url  = URL.createObjectURL(blob);
    new Audio(url).play();
    chunks.length = 0;
  }
};