IMMERSA Voice Chat API

Real-time voice conversation — Whisper STT · GPT LLM · ElevenLabs TTS

wss://immersa-voice-chat-api.up.railway.app/ws/voice-chat · AsyncAPI 2.6

How it works

Connect one persistent WebSocket to wss://immersa-voice-chat-api.up.railway.app/ws/voice-chat. Every message in both directions is a JSON object with a type field that identifies it.

The pipeline is:

1. Connect

Open the WebSocket. The server immediately sends connection_established with your session_id.

2. Start session

Send start_session with a character_id and audio parameters. The server acknowledges and the session enters LISTENING state.

3. Stream audio

Encode each audio chunk as base64 and send audio_chunk messages. The server acknowledges each one and — every 5 chunks — emits a partial_transcript so you can show live captions.

4. End of utterance

When the user stops speaking, send end_of_utterance. The server finalises transcription, generates the character reply with GPT, and streams audio back from ElevenLabs.

5. Receive reply

You get: final_transcript → reply_text_done → multiple tts_audio_chunk → tts_done. After tts_done the session is back in LISTENING state, ready for the next turn.

6. Close

Send close_session to end gracefully.

All audio is WAV when going client → server, and MP3 (44 100 Hz, 128 kbps) when going server → client.
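Because every message carries a type field, a client can funnel all server traffic through one dispatcher. A minimal sketch (the routeServerMessage helper and the handler map are illustrative, not part of the API):

```javascript
// Route an incoming server message to a handler keyed by its `type` field.
// Returns the message type so callers can observe what was dispatched.
function routeServerMessage(raw, handlers) {
  const msg = JSON.parse(raw);
  const handler = handlers[msg.type];
  if (handler) handler(msg);
  return msg.type;
}

// Example handler map covering the pipeline above (bodies are placeholders):
const handlers = {
  connection_established: (m) => console.log('session', m.session_id),
  ack:                    (m) => console.log('acked', m.event),
  partial_transcript:     (m) => console.log('caption:', m.text),
  final_transcript:       (m) => console.log('you said:', m.text),
  reply_text_done:        (m) => console.log('reply:', m.text),
  tts_audio_chunk:        (m) => { /* buffer m.audio for playback */ },
  tts_done:               ()  => { /* play buffered audio, next turn */ },
  error:                  (m) => console.error('server error:', m.message),
};
```

Wire it up with `ws.onmessage = (e) => routeServerMessage(e.data, handlers);`.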

Message flow

Client                                              Server
  |--- WebSocket connect -------------------------->|
  |<-- connection_established ----------------------|
  |--- start_session ------------------------------>|
  |<-- ack (event: start_session) ------------------|
  |                                                 |
  |   (repeat for each audio chunk)                 |
  |--- audio_chunk (chunk_index: 0) --------------->|
  |<-- ack (event: audio_chunk) --------------------|
  |--- audio_chunk (chunk_index: 1) --------------->|
  |<-- ack (event: audio_chunk) --------------------|
  |   ...                                           |
  |--- audio_chunk (chunk_index: 4) --------------->|
  |<-- ack (event: audio_chunk) --------------------|
  |<-- partial_transcript --------------------------|  ← every 5 chunks
  |   ...                                           |
  |--- end_of_utterance --------------------------->|
  |<-- ack (event: end_of_utterance) ---------------|
  |<-- final_transcript ----------------------------|
  |<-- reply_text_done -----------------------------|
  |<-- tts_audio_chunk (chunk_index: 0) ------------|
  |<-- tts_audio_chunk (chunk_index: 1) ------------|
  |   ...                                           |
  |<-- tts_done ------------------------------------|  ← back to LISTENING
  |                                                 |
  |   (next turn: send audio_chunk again)           |
  |   ...                                           |
  |--- close_session ------------------------------>|
  |<-- ack (event: close_session) ------------------|

Session states

The server tracks a state machine per session. State is included in the start_session ack.

CONNECTED → LISTENING → FINALIZING_TRANSCRIPT → GENERATING_REPLY → STREAMING_TTS → LISTENING
LISTENING → CLOSED (via close_session)

State                   Meaning
CONNECTED               WebSocket open, waiting for start_session.
LISTENING               Session active, ready to receive audio_chunk messages.
FINALIZING_TRANSCRIPT   end_of_utterance received; running final Whisper STT.
GENERATING_REPLY        STT done; calling GPT to generate the character reply.
STREAMING_TTS           LLM done; streaming ElevenLabs audio back to the client.
CLOSED                  Session terminated.
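The documented flow can be expressed as a transition table, which is handy for keeping client UI in sync. The client-triggered events (start_session, end_of_utterance, close_session) come from this page; the internal trigger names (stt_done, llm_done, tts_done) are illustrative labels for the server's automatic transitions, not real API events:

```javascript
// Transition table for the documented session state machine.
const TRANSITIONS = {
  CONNECTED:             { start_session: 'LISTENING' },
  LISTENING:             { end_of_utterance: 'FINALIZING_TRANSCRIPT',
                           close_session: 'CLOSED' },
  FINALIZING_TRANSCRIPT: { stt_done: 'GENERATING_REPLY' },   // hypothetical label
  GENERATING_REPLY:      { llm_done: 'STREAMING_TTS' },      // hypothetical label
  STREAMING_TTS:         { tts_done: 'LISTENING' },
  CLOSED:                {},
};

// Unknown events leave the state unchanged.
function nextState(state, event) {
  return (TRANSITIONS[state] || {})[event] || state;
}
```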

CLIENT → SERVER start_session

Initialise the conversation. Send once after connection_established.

JSON payload

{
  "type": "start_session",
  "character_id": "s1",
  "sample_rate": 16000,
  "audio_format": "wav_base64_chunks"
}

Fields

Field          Type      Required   Description
type           string    required   Always "start_session"
character_id   string    required   "s1", "s2", or "p1" (see Characters)
sample_rate    integer   optional   Sample rate of the audio you will send. Default: 16000 Hz
audio_format   string    optional   Encoding format. Default: "wav_base64_chunks"
The server responds with ack (event: start_session) that includes your session_id and current state.
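A small helper that fills in the documented defaults keeps the payload construction in one place (the function name and options object are illustrative):

```javascript
// Build a start_session payload, applying the documented defaults.
function buildStartSession(characterId, opts = {}) {
  return {
    type: 'start_session',
    character_id: characterId,
    sample_rate: opts.sampleRate ?? 16000,                 // default per the docs
    audio_format: opts.audioFormat ?? 'wav_base64_chunks', // default per the docs
  };
}
```

Usage: `ws.send(JSON.stringify(buildStartSession('s1')));`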

CLIENT → SERVER audio_chunk

One chunk of the user's audio, base64-encoded. Send in order, starting at chunk_index: 0.

JSON payload

{
  "type": "audio_chunk",
  "chunk_index": 0,
  "audio": "<base64-encoded WAV data>"
}

Fields

Field         Type              Required   Description
type          string            required   Always "audio_chunk"
chunk_index   integer ≥ 0       required   Zero-based index. Increment by 1 for each chunk.
audio         string (base64)   required   Base64-encoded WAV audio bytes.
The server performs rolling STT after every 5 chunks and emits a partial_transcript. Keep chunks small (e.g. 0.5–1 s of audio) for best latency.
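Wrapping raw WAV bytes into the message is a one-liner with Buffer in Node (browsers would use FileReader or btoa instead; see the encoding example at the bottom of this page). A sketch:

```javascript
// Wrap one raw WAV chunk as an audio_chunk message (Node sketch).
function makeAudioChunk(index, wavBytes) {
  return JSON.stringify({
    type: 'audio_chunk',
    chunk_index: index,
    audio: Buffer.from(wavBytes).toString('base64'),
  });
}
```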

CLIENT → SERVER end_of_utterance

Signal that the user has finished speaking. This triggers the full pipeline: final STT → LLM → TTS streaming.

JSON payload

{
  "type": "end_of_utterance"
}

CLIENT → SERVER close_session

Gracefully end the session. The server sends a final ack then closes the WebSocket.

JSON payload

{
  "type": "close_session"
}

SERVER → CLIENT connection_established

Sent immediately after the WebSocket handshake. Contains the session ID you should log for debugging.

JSON payload

{
  "type": "connection_established",
  "session_id": "3f4a1b2c-8d9e-4f5a-b6c7-d8e9f0a1b2c3",
  "message": "WebSocket connected successfully"
}

Fields

Field        Type            Description
type         string          Always "connection_established"
session_id   string (UUID)   Unique identifier for this WebSocket session.
message      string          Human-readable status.

SERVER → CLIENT ack

Generic acknowledgement for every client message. The event field tells you which message is being acked. Extra fields depend on the event.

ack for start_session

{
  "type": "ack",
  "event": "start_session",
  "message": "Session started successfully",
  "session_id": "3f4a1b2c-...",
  "character_id": "s1",
  "sample_rate": 16000,
  "audio_format": "wav_base64_chunks",
  "state": "LISTENING"
}

ack for audio_chunk

{
  "type": "ack",
  "event": "audio_chunk",
  "message": "Audio chunk received",
  "chunk_index": 0,
  "total_chunks": 1
}

ack for end_of_utterance

{
  "type": "ack",
  "event": "end_of_utterance",
  "message": "Processing started"
}

ack for close_session

{
  "type": "ack",
  "event": "close_session",
  "message": "Session closed successfully"
}

SERVER → CLIENT partial_transcript

Intermediate STT result emitted every 5 audio chunks while the user is still speaking. Use for live captions.

JSON payload

{
  "type": "partial_transcript",
  "text": "What is hydraulic pressure",
  "chunk_index": 4,
  "window_size": 5
}

Fields

Field         Type              Description
type          string            Always "partial_transcript"
text          string            Partial transcription of the rolling audio window.
chunk_index   integer           Index of the chunk that triggered this event.
window_size   integer (max 5)   Number of chunks in the rolling window.
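Since chunk_index is zero-based, "every 5 chunks" means the event fires after indices 4, 9, 14, and so on. If you want to anticipate caption updates client-side, the arithmetic is (assuming the documented window of 5 and no other trigger conditions):

```javascript
// True when this zero-based chunk index is the last of a 5-chunk window,
// i.e. the one expected to trigger a partial_transcript.
function triggersPartialTranscript(chunkIndex, window = 5) {
  return (chunkIndex + 1) % window === 0;
}
```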

SERVER → CLIENT final_transcript

Complete transcription of the user's full utterance, produced after end_of_utterance.

JSON payload

{
  "type": "final_transcript",
  "text": "What is hydraulic pressure and how is it measured?"
}

Fields

Field   Type     Description
type    string   Always "final_transcript"
text    string   Full transcription of the user's utterance.

SERVER → CLIENT reply_text_done

The character's complete text reply from GPT. Emitted before TTS audio starts — show a text bubble immediately without waiting for audio.

JSON payload

{
  "type": "reply_text_done",
  "text": "Well, hydraulic pressure is the force exerted by a confined fluid...",
  "length": 147
}

Fields

Field    Type      Description
type     string    Always "reply_text_done"
text     string    The character's full reply text.
length   integer   Character count of the reply.

SERVER → CLIENT tts_audio_chunk

One chunk of the character's voice audio, base64-encoded MP3. Buffer and play sequentially.

JSON payload

{
  "type": "tts_audio_chunk",
  "chunk_index": 0,
  "audio": "<base64-encoded MP3 data>"
}

Fields

Field         Type              Description
type          string            Always "tts_audio_chunk"
chunk_index   integer ≥ 0       Sequential index; play chunks in this order.
audio         string (base64)   Base64-encoded MP3 bytes. Format: MP3 · 44 100 Hz · 128 kbps.
Decode with atob() (browser) or Buffer.from(audio, 'base64') (Node). Concatenate all chunks into one MP3 blob or pipe into a streaming audio element.
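On the Node side, concatenating the buffered chunks looks like this (a sketch; the server sends chunks in order, but sorting by chunk_index guards against any out-of-order buffering in your own code):

```javascript
// Reassemble buffered tts_audio_chunk messages into one MP3 buffer (Node).
function assembleTtsAudio(chunks) {
  return Buffer.concat(
    [...chunks]
      .sort((a, b) => a.chunk_index - b.chunk_index)
      .map((c) => Buffer.from(c.audio, 'base64'))
  );
}
```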

SERVER → CLIENT tts_done

All TTS audio has been sent. The session returns to LISTENING state — you can start the next turn.

JSON payload

{
  "type": "tts_done"
}

SERVER → CLIENT error

Sent when something goes wrong. The session may still be alive after a non-fatal error — check the message and decide whether to retry or reconnect.

JSON payload

{
  "type": "error",
  "message": "Failed to decode base64 audio",
  "chunk_index": 3
}

Fields

Field         Type             Description
type          string           Always "error"
message       string           Human-readable error description.
chunk_index   integer | null   Present when the error relates to a specific audio chunk.
session_id    string (UUID)    Present when the error is session-level.
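One possible client policy, based on which optional field is present (this is an assumption about sensible handling, not behaviour specified by the API): resend the offending chunk for chunk-level errors, reconnect on session-level errors, and inspect anything else.

```javascript
// Classify a server error message into a hypothetical client action.
function classifyError(err) {
  if (Number.isInteger(err.chunk_index)) return 'retry_chunk'; // chunk-level
  if (err.session_id) return 'reconnect';                      // session-level
  return 'inspect';                                            // unknown scope
}
```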

Characters

Pass character_id in start_session to select which character you are speaking with.

ID   Name                     Role        Department
s1   Kareem Ali El-Attar      Student     Irrigation Engineering
s2   Morad Hassan El-Shazly   Student     Mechanical Engineering
p1   Amin Saleh Shawky        Professor   Mechanical Engineering

Audio format reference

Direction         Format   Sample rate           Encoding
Client → Server   WAV      16 000 Hz (default)   Base64 string in JSON
Server → Client   MP3      44 100 Hz             Base64 string in JSON (128 kbps)

Encoding audio in the browser

// Record with MediaRecorder, then send each chunk as it becomes available:
let index = 0;

function sendChunk(ws, audioBlob) {
  const reader = new FileReader();
  reader.onload = () => {
    // reader.result is "data:<mime>;base64,<data>"; keep only the base64 part
    const base64 = reader.result.split(',')[1];
    ws.send(JSON.stringify({
      type: 'audio_chunk',
      chunk_index: index++,
      audio: base64
    }));
  };
  reader.readAsDataURL(audioBlob);
}

Decoding TTS audio in the browser

// Collect chunks, then play:
const chunks = [];
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'tts_audio_chunk') {
    const binary = atob(msg.audio);
    const bytes  = Uint8Array.from(binary, c => c.charCodeAt(0));
    chunks.push(bytes.buffer);
  }
  if (msg.type === 'tts_done') {
    const blob = new Blob(chunks, { type: 'audio/mpeg' });
    const url  = URL.createObjectURL(blob);
    new Audio(url).play();
    chunks.length = 0;
  }
};