How it works
Connect one persistent WebSocket to wss://immersa-voice-chat-api.up.railway.app/ws/voice-chat. Every message in both directions is a JSON object with a type field that identifies it.
The pipeline is:
1. Connect
Open the WebSocket. The server immediately sends connection_established with your session_id.
2. Start session
Send start_session with a character_id and audio parameters. The server acknowledges and the session enters LISTENING state.
3. Stream audio
Encode each audio chunk as base64 and send audio_chunk messages. The server acknowledges each one and — every 5 chunks — emits a partial_transcript so you can show live captions.
4. End of utterance
When the user stops speaking, send end_of_utterance. The server finalises transcription, generates the character reply with GPT, and streams audio back from ElevenLabs.
5. Receive reply
You get: final_transcript → reply_text_done → multiple tts_audio_chunk → tts_done. After tts_done the session is back in LISTENING state, ready for the next turn.
6. Close
Send close_session to end gracefully.
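Assuming a standard browser or Node WebSocket client, the six steps above boil down to routing each incoming message on its type field. The function below is an illustrative sketch (the returned action labels are mine, not part of the protocol):

```javascript
// Route an incoming server message by its `type` field and return a
// label describing what the client should do next. The labels are
// illustrative; only the message types come from the protocol.
function routeServerMessage(msg) {
  switch (msg.type) {
    case 'connection_established':
      return `log session ${msg.session_id}, then send start_session`;
    case 'ack':
      return `server acked ${msg.event}`;
    case 'partial_transcript':
      return `show live caption: ${msg.text}`;
    case 'final_transcript':
      return `show final user text: ${msg.text}`;
    case 'reply_text_done':
      return `show character reply: ${msg.text}`;
    case 'tts_audio_chunk':
      return `buffer audio chunk ${msg.chunk_index}`;
    case 'tts_done':
      return 'play audio; session is LISTENING again';
    case 'error':
      return `handle error: ${msg.message}`;
    default:
      return 'unknown message type';
  }
}
```

Wire it up as `ws.onmessage = (e) => routeServerMessage(JSON.parse(e.data));`, replacing the returned labels with real handlers.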
Message flow
connect → connection_established → start_session → ack → audio_chunk × N (ack each; partial_transcript every 5 chunks) → end_of_utterance → ack → final_transcript → reply_text_done → tts_audio_chunk × N → tts_done → next turn (or close_session → ack)
Session states
The server tracks a state machine per session. The current state is included in the start_session ack.
CONNECTED → LISTENING → FINALIZING_TRANSCRIPT → GENERATING_REPLY → STREAMING_TTS → LISTENING (loop each turn)
LISTENING → CLOSED (via close_session)
| State | Meaning |
|---|---|
| CONNECTED | WebSocket open, waiting for start_session. |
| LISTENING | Session active, ready to receive audio_chunk messages. |
| FINALIZING_TRANSCRIPT | end_of_utterance received; running final Whisper STT. |
| GENERATING_REPLY | STT done; calling GPT to generate character reply. |
| STREAMING_TTS | LLM done; streaming ElevenLabs audio back to client. |
| CLOSED | Session terminated. |
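The table above can be captured client-side as a transition map, which is handy for catching out-of-order sends. The trigger names below are inferred from the message types in this document (the exact server-internal triggers for the middle transitions are my assumption):

```javascript
// Allowed session-state transitions, keyed as `${state}:${trigger}`.
// Triggers are the messages this document associates with each state
// change; the middle three are inferred, not confirmed by the API.
const TRANSITIONS = {
  'CONNECTED:start_session': 'LISTENING',
  'LISTENING:end_of_utterance': 'FINALIZING_TRANSCRIPT',
  'FINALIZING_TRANSCRIPT:final_transcript': 'GENERATING_REPLY',
  'GENERATING_REPLY:reply_text_done': 'STREAMING_TTS',
  'STREAMING_TTS:tts_done': 'LISTENING',
  'LISTENING:close_session': 'CLOSED',
};

function nextState(state, trigger) {
  const next = TRANSITIONS[`${state}:${trigger}`];
  if (!next) throw new Error(`invalid transition: ${state} + ${trigger}`);
  return next;
}
```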
CLIENT → SERVER start_session
Initialise the conversation. Send once after connection_established.
JSON payload
{
  "type": "start_session",
  "character_id": "s1",
  "sample_rate": 16000,
  "audio_format": "wav_base64_chunks"
}
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | required | Always "start_session" |
| character_id | string | required | "s1", "s2", or "p1" — see Characters page |
| sample_rate | integer | optional | Sample rate of audio you will send. Default: 16000 Hz |
| audio_format | string | optional | Encoding format. Default: "wav_base64_chunks" |
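A small helper can apply the documented defaults for the optional fields; `buildStartSession` is a name invented here for illustration:

```javascript
// Build a start_session message, filling in the documented defaults
// (16000 Hz sample rate, "wav_base64_chunks" format) when omitted.
function buildStartSession(characterId, { sampleRate = 16000, audioFormat = 'wav_base64_chunks' } = {}) {
  if (!characterId) throw new Error('character_id is required');
  return JSON.stringify({
    type: 'start_session',
    character_id: characterId,
    sample_rate: sampleRate,
    audio_format: audioFormat,
  });
}
```

Usage: `ws.send(buildStartSession('s1'));`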
CLIENT → SERVER audio_chunk
One chunk of the user's audio, base64-encoded. Send in order, starting at chunk_index: 0.
JSON payload
{
  "type": "audio_chunk",
  "chunk_index": 0,
  "audio": "<base64-encoded WAV data>"
}
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | required | Always "audio_chunk" |
| chunk_index | integer ≥ 0 | required | Zero-based index. Increment by 1 for each chunk. |
| audio | string (base64) | required | Base64-encoded WAV audio bytes. |
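In Node, a small closure can own both the base64 encoding and the zero-based index so chunks are never sent out of order. `makeChunkSender` is an illustrative name, not part of any SDK:

```javascript
// Returns a function that wraps raw WAV bytes (Buffer or Uint8Array)
// into sequential audio_chunk messages, starting at chunk_index 0.
function makeChunkSender(send) {
  let index = 0;
  return (wavBytes) => {
    send(JSON.stringify({
      type: 'audio_chunk',
      chunk_index: index++,
      audio: Buffer.from(wavBytes).toString('base64'),
    }));
  };
}
```

Usage: `const sendChunk = makeChunkSender((m) => ws.send(m));` then call `sendChunk(wavBuffer)` per recorded chunk.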
CLIENT → SERVER end_of_utterance
Signal that the user has finished speaking. This triggers the full pipeline: final STT → LLM → TTS streaming.
JSON payload
{
  "type": "end_of_utterance"
}
CLIENT → SERVER close_session
Gracefully end the session. The server sends a final ack then closes the WebSocket.
JSON payload
{
  "type": "close_session"
}
SERVER → CLIENT connection_established
Sent immediately after the WebSocket handshake. Contains the session ID you should log for debugging.
JSON payload
{
  "type": "connection_established",
  "session_id": "3f4a1b2c-8d9e-4f5a-b6c7-d8e9f0a1b2c3",
  "message": "WebSocket connected successfully"
}
Fields
| Field | Type | Description |
|---|---|---|
| type | string | Always "connection_established" |
| session_id | string (UUID) | Unique identifier for this WebSocket session. |
| message | string | Human-readable status. |
SERVER → CLIENT ack
Generic acknowledgement for every client message. The event field tells you which message is being acked. Extra fields depend on the event.
ack for start_session
{
  "type": "ack",
  "event": "start_session",
  "message": "Session started successfully",
  "session_id": "3f4a1b2c-...",
  "character_id": "s1",
  "sample_rate": 16000,
  "audio_format": "wav_base64_chunks",
  "state": "LISTENING"
}
ack for audio_chunk
{
  "type": "ack",
  "event": "audio_chunk",
  "message": "Audio chunk received",
  "chunk_index": 0,
  "total_chunks": 1
}
ack for end_of_utterance
{
  "type": "ack",
  "event": "end_of_utterance",
  "message": "Processing started"
}
ack for close_session
{
  "type": "ack",
  "event": "close_session",
  "message": "Session closed successfully"
}
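Because every ack shares the same type, a client can branch on the event field. The handler-map pattern below is one illustrative way to do that; the names are mine:

```javascript
// Dispatch an ack to a per-event handler. Unknown events fall through
// to `default` so new server events don't crash the client.
function handleAck(msg, handlers) {
  const handler = handlers[msg.event] || handlers.default;
  return handler ? handler(msg) : undefined;
}
```

Usage: pass `{ start_session: (m) => saveState(m.state), default: () => {} }` as the handler map (where `saveState` is your own code).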
SERVER → CLIENT partial_transcript
Intermediate STT result emitted every 5 audio chunks while the user is still speaking. Use for live captions.
JSON payload
{
  "type": "partial_transcript",
  "text": "What is hydraulic pressure",
  "chunk_index": 4,
  "window_size": 5
}
Fields
| Field | Type | Description |
|---|---|---|
| type | string | Always "partial_transcript" |
| text | string | Partial transcription of the rolling audio window. |
| chunk_index | integer | Index of the chunk that triggered this event. |
| window_size | integer (max 5) | Number of chunks in the rolling window. |
SERVER → CLIENT final_transcript
Complete transcription of the user's full utterance, produced after end_of_utterance.
JSON payload
{
  "type": "final_transcript",
  "text": "What is hydraulic pressure and how is it measured?"
}
Fields
| Field | Type | Description |
|---|---|---|
| type | string | Always "final_transcript" |
| text | string | Full transcription of the user's utterance. |
SERVER → CLIENT reply_text_done
The character's complete text reply from GPT. Emitted before TTS audio starts — show a text bubble immediately without waiting for audio.
JSON payload
{
  "type": "reply_text_done",
  "text": "Well, hydraulic pressure is the force exerted by a confined fluid...",
  "length": 147
}
Fields
| Field | Type | Description |
|---|---|---|
| type | string | Always "reply_text_done" |
| text | string | The character's full reply text. |
| length | integer | Character count of the reply. |
SERVER → CLIENT tts_audio_chunk
One chunk of the character's voice audio, base64-encoded MP3. Buffer and play sequentially.
JSON payload
{
  "type": "tts_audio_chunk",
  "chunk_index": 0,
  "audio": "<base64-encoded MP3 data>"
}
Fields
| Field | Type | Description |
|---|---|---|
| type | string | Always "tts_audio_chunk" |
| chunk_index | integer ≥ 0 | Sequential index; play chunks in this order. |
| audio | string (base64) | Base64-encoded MP3 bytes. Format: MP3 · 44 100 Hz · 128 kbps. |
Decode each chunk with atob() (browser) or Buffer.from(audio, 'base64') (Node). Concatenate all chunks into one MP3 blob or pipe them into a streaming audio element.
SERVER → CLIENT tts_done
All TTS audio has been sent. The session returns to LISTENING state — you can start the next turn.
JSON payload
{
  "type": "tts_done"
}
SERVER → CLIENT error
Sent when something goes wrong. The session may still be alive after a non-fatal error — check the message and decide whether to retry or reconnect.
JSON payload
{
  "type": "error",
  "message": "Failed to decode base64 audio",
  "chunk_index": 3
}
Fields
| Field | Type | Description |
|---|---|---|
| type | string | Always "error" |
| message | string | Human-readable error description. |
| chunk_index | integer or null | Present when the error relates to a specific audio chunk. |
| session_id | string (UUID) | Present when error is session-level. |
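One pragmatic recovery policy (a client-side heuristic, not something the API mandates) is to treat chunk-level errors as retryable within the live session and session-level errors as a cue to reconnect:

```javascript
// Classify an error message by which optional field it carries.
// The retry/reconnect split is an assumed client policy, not API policy.
function classifyError(msg) {
  if (Number.isInteger(msg.chunk_index)) {
    return { scope: 'chunk', action: `resend chunk ${msg.chunk_index}` };
  }
  if (msg.session_id) {
    return { scope: 'session', action: 'reconnect' };
  }
  return { scope: 'unknown', action: 'log and decide manually' };
}
```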
Characters
Pass character_id in start_session to select which character you are speaking with.
| ID | Name | Role | Department |
|---|---|---|---|
| s1 | Kareem Ali El-Attar | Student | Irrigation Engineering |
| s2 | Morad Hassan El-Shazly | Student | Mechanical Engineering |
| p1 | Amin Saleh Shawky | Professor | Mechanical Engineering |
Audio format reference
| Direction | Format | Sample rate | Encoding |
|---|---|---|---|
| Client → Server | WAV | 16 000 Hz (default) | Base64 string in JSON |
| Server → Client | MP3 | 44 100 Hz | Base64 string in JSON (128 kbps) |
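For sizing upstream chunks, raw PCM occupies sampleRate × bytesPerSample × channels × duration bytes. Only the 16 000 Hz default comes from this API; the 16-bit mono assumption below describes a typical recorder setup, not a server requirement:

```javascript
// Bytes of raw PCM audio for a given duration.
// Defaults assume the API's 16000 Hz rate plus 16-bit mono samples
// (bit depth and channel count are assumptions about the recorder).
function pcmBytes(durationMs, sampleRate = 16000, bytesPerSample = 2, channels = 1) {
  return Math.round((durationMs / 1000) * sampleRate * bytesPerSample * channels);
}
```

So a 100 ms chunk at the defaults is 3 200 bytes of PCM before WAV framing and base64 expansion (base64 adds roughly a third).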
Encoding audio in the browser
// Record with MediaRecorder, then send each chunk:
let index = 0;
const reader = new FileReader();
reader.onload = () => {
  const base64 = reader.result.split(',')[1];
  ws.send(JSON.stringify({
    type: 'audio_chunk',
    chunk_index: index++,
    audio: base64
  }));
};
reader.readAsDataURL(audioBlob);
Decoding TTS audio in the browser
// Collect chunks, then play:
const chunks = [];
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'tts_audio_chunk') {
    const binary = atob(msg.audio);
    const bytes = Uint8Array.from(binary, c => c.charCodeAt(0));
    chunks.push(bytes.buffer);
  }
  if (msg.type === 'tts_done') {
    const blob = new Blob(chunks, { type: 'audio/mpeg' });
    const url = URL.createObjectURL(blob);
    new Audio(url).play();
    chunks.length = 0;
  }
};