BawtHub · voice

Microphone bytes to spoken words.

The voice pipeline is the part of BawtHub with the hardest real-time constraints. Audio in, transcript out, LLM in the middle, audio back. The whole loop has to feel like a conversation — sub-second to first audio, mid-word interruptions, no buffer glitches when a sentence breaks at a comma. The architecture is forked from Kyutai's unmute reference stack and rebuilt around a four-provider TTS registry, a Quest-managed state machine, and the OpenAI Realtime API wire protocol.

Runtime: Python 3.12 · FastAPI · fastrtc 0.0.23 STT: Kyutai stt-1b-en_fr-candle (Moshi) Default TTS: Kyutai tts-1.6b-en_fr (Moshi · 24 kHz)

01 Heritage: unmute, rebuilt.

The bones of the pipeline come from Kyutai's unmute — their open-source reference for a fully self-hosted speech-to-speech assistant. The state machine (waiting_for_user / user_speaking / bot_speaking), the msgpack wire protocol for the Moshi STT/TTS servers, and the VAD-pause-detection model are all from there. What BawtHub layers on top is:

A multi-provider TTS registry — Moshi is no longer the only TTS option; Azure, xAI Grok, and Kokoro can serve different voices in the same session.
An llm-bawt back end instead of unmute's bundled chatbot — meaning real conversation history, real memory, real tools, real personalities.
A Quest-based supervisor for the STT/TTS/LLM workers — co-managed lifecycles, retries with backoff, structured cancellation.
Production niceties — health endpoints, admin restart, voice browser, provider auto-discovery from env, container restart through a mounted Docker socket.

02 The pipeline, top to bottom.

A single turn · 24 kHz mono throughout

browsermic → Opus encoder worker → WebSocket frames

↓ input_audio_buffer.append

bawthub backendfastrtc handler · 480-sample frames · 24 kHz float32

↓ msgpack over ws · send_audio()

moshi stt :8090Word + EndWord + Step (pause prediction) messages

↓ when pause_prediction > 0.6 · flush + EOS

llm-bawt :8642OpenAI-compat completions · word-by-word stream

↓ tts.send(word) per delta · 40-char prebuffer

tts providerMoshi · Azure · Grok · Kokoro — chosen by voice id

↓ response.audio.delta · Opus chunks

browserOpus decoder worker → AudioWorklet → speakers + AnalyserNode tap

03 The handler is the state machine.

BawtHubHandler in bawthub/handler.py is an AsyncStreamHandler from fastrtc. It runs one instance per connected client and owns three things:

A QuestManager with three named workers: stt, tts, llm. Each Quest has init / run / close phases and survives partial failures.
A Chatbot with a small chat_history array. Conversation state is derived: waiting_for_user / user_speaking / bot_speaking based on whose turn it is in the history.
An asyncio output_queue that fastrtc drains to emit either audio frames, OpenAI Realtime events, or a CloseStream.

The receive loop is the heart of it. Every frame from the browser (~80 ms of audio) gets:

Forwarded to STT via stt.send_audio(array)
Checked against the long-silence timer for nudge behavior
Tested for pause: if stt.pause_prediction.value > PAUSE_THRESHOLD (default 0.6), the handler emits InputAudioBufferSpeechStopped, flushes 500 ms of silence into STT, and starts the LLM response
Tested for interrupt: if the bot is speaking and STT-VAD shows the user has resumed (and we're past the 3-second grace window), interrupt_bot() cancels the running LLM + TTS quests

✦

The 3-second uninterruptible window matters.

UNINTERRUPTIBLE_BY_VAD_TIME_SEC = 3. On Mac in particular, browser echo cancellation takes a beat to engage at turn start — so for the first 3 seconds of bot speech, the VAD signal alone can't interrupt. A real word from the STT still can. This was an empirical fix; the comment in handler.py explains it candidly.

04 STT: Kyutai Moshi over msgpack.

The STT runs in its own GPU container — Kyutai's moshi-server with the stt.toml config:

Field	Value
Model	`kyutai/stt-1b-en_fr-candle` · ~1B params, English + French
Audio tokenizer	Mimi (PyTorch 24 kHz)
ASR delay	6 tokens (~0.5 s)
Batch size	2 (helps with reconnect races)
Transformer	16 layers · 2048 d_model · 16 heads · 750-token context
Extra heads	4 heads × 6 dim — pause prediction + speaker turns

The wire is msgpack over a websocket at /api/asr-streaming. The Python client (bawthub/stt/speech_to_text.py) sends raw float32 audio frames and receives a typed stream of:

Word — a partial transcription token with start_time
EndWord — explicit end-of-word marker
Step — per-step pause-prediction probabilities (the VAD signal)
Marker — for time-aligning events
Ready / Error

The handler tracks stt.pause_prediction as an exponential moving average so the threshold check is stable across noisy frames. The "real" pause threshold is configurable via BAWTHUB_PAUSE_THRESHOLD; defaults to 0.6.

05 TTS: a provider registry, not a single engine.

BawtHub's TTS layer evolved from "Moshi only" to a pluggable registry. TTSProviderRegistry in bawthub/tts/registry.py manages a set of BaseTTSProvider instances and dispatches each request to the right one based on voice id.

Provider	Where it runs	Trigger	Notes
`moshi`	Local GPU container at :8089	Always — default fallback	1.6B params · 24 kHz · 18 voices in `voices.yaml` (VCTK + Expresso)
`azure`	Azure Speech cloud	`AZURE_SPEECH_KEY` + region/endpoint set	Filterable voice catalog (default `curated` = en-US Dragon HD) to avoid 600+ voice picker
`grok`	xAI cloud	`XAI_API_KEY` set	Default voice `eve` · latency mode toggle (quality vs ~60 ms savings)
`kokoro`	Adapter container	Compose overlay `docker-compose.kokoro.yml`	Replaces Moshi container; Python adapter translates Kokoro to the Moshi protocol

Provider selection is voice-driven. The registry inspects each provider's available_voices catalog and routes a request for voice_id to the provider that owns it. BAWTHUB_TTS_DEFAULT_PROVIDER overrides the tiebreaker for voice ids that aren't in any catalog. Auto-discovery is conservative: a provider that fails to construct (e.g. missing SDK, bad credentials) is dropped silently so the app still boots in a stripped-down environment.

Health probing is lazy and cached

Construction does not block on a network probe. The first request that asks "is this provider healthy?" pays the probe cost; subsequent requests within HEALTH_TTL_SEC (5 s default) read the cached result. Concurrent probes share a single in-flight check via an asyncio lock — so a thundering herd at startup doesn't slam Azure with twenty list_voices calls.

06 Word-by-word TTS with a prebuffer.

llm-bawt streams the LLM response word-by-word (one delta per word, no upstream sentence buffering). The handler's _generate_response_task pipes each delta into the TTS provider with one twist:

⚠

Audio underrun fix: a 40-character prebuffer.

When llm-bawt dropped upstream buffering (TASK-216), Moshi TTS could drain its audio queue faster than new words arrived, producing a 1–2 s gap after the opening phrase. The handler now collects the first ~40 chars (~6–8 short words) before flushing them to TTS in a burst, then streams word-by-word from there. TTS_PREBUFFER_CHARS = 40 in handler.py.

The TTS provider streams back two message types: TTSAudioChunk (PCM at 24 kHz, mono) and TTSWordTiming (which word just got spoken, with start/stop seconds). The handler routes audio chunks to the output queue as (sample_rate, np.ndarray) tuples, and word timings as ResponseTextDelta events that the browser uses to update the floating transcript. Each word is also appended to chat_history only after the TTS confirms it — so the displayed transcript advances in lockstep with the audio.

07 Mid-word interrupts.

The user can interrupt the bot at any time. Two signals trigger an interrupt:

STT word arrival while bot is speaking. A confident transcription token from the user during bot output → interrupt_bot().
VAD probability dropping past a threshold while bot is speaking, after the 3-second grace window.

interrupt_bot() cancels the LLM stream task, swaps the handler's output_queue to a fresh queue (so any in-flight TTS workers can't accidentally push to it), and sends BawtHubInterruptedByVAD to the browser. The browser stops playback by signalling the AudioWorklet to drop its buffered samples. Audio cuts mid-word — which is what feels conversational.

08 Wire protocol: OpenAI Realtime, deliberately.

The browser ↔ backend WebSocket speaks OpenAI Realtime API events, mapped to BawtHub's own ora module (openai_realtime_api_events.py). The choice is intentional — it keeps the door open to swap the backend for OpenAI's actual Realtime API or any compatible implementation.

Direction	Event	Use
client → server	`input_audio_buffer.append`	Base64 Opus frame from mic encoder worker
client → server	`session.update`	Voice id, bot id, user id, tts_mode toggle, instructions
server → client	`response.audio.delta`	Base64 Opus chunk for the decoder worker
server → client	`response.text.delta`	Word-aligned transcript update for the floating UI
server → client	`conversation.item.input_audio_transcription.delta`	User-side STT word arrival
server → client	`response.audio.done` / `response.text.done`	End-of-turn handshake
server → client	`bawthub.response_text_delta_ready`	BawtHub-specific: LLM word arrived (vs TTS word emitted)
server → client	`bawthub.interrupted_by_vad`	Browser flushes audio worklet
server → client	`bawthub.service_animation`	Avatar animation cue parsed from LLM response

09 About wake words.

BawtHub doesn't ship a wake-word detector. The voice page assumes the user has pressed the call button — the WebSocket is opened intentionally, the screen is wake-locked, and the mic is live until the call ends. Pause detection is purely VAD-driven via Moshi STT's extra-head pause_prediction scores.

That said, the downstream consumers of BawtHub's voice (Home Assistant via the bawthub_stt / bawthub_tts custom components in integrations/home_assistant/) are typically paired with ESPHome Voice PE devices, which run their own on-device microWakeWord models. So while wake-word handling isn't inside BawtHub itself, the pipeline is designed to be a downstream of a wake-word gate when used as a Home Assistant voice provider.

10 Health and admin.

The voice backend exposes a handful of REST endpoints distinct from the WebSocket:

Endpoint	Purpose
`GET /v1/health`	`{tts_up, stt_up, llm_up, ok}` — green/red lights in the UI
`GET /v1/voices`	Aggregated catalog from every registered provider
`GET /v1/tts/admin/config`	Current TTS service config + capabilities
`PUT /v1/tts/admin/config`	Mutate temperature, CFG, padding, default voice — written to `tts_admin_state.json`
`POST /v1/tts/preview`	Synthesize a short sample with the given voice — used by the voice browser modal
`POST /v1/tts/admin/restart`	Run `restart_tts_container.sh` via the mounted Docker socket; report progress with operation ids
`GET /v1/ha/weather`	Home Assistant weather pull for the home page widget — kept here so the credential never reaches the frontend

The TTS admin state lives at bawthub/tts_admin_state.json — a tiny JSON file on disk so changes survive container restarts. The frontend's /tools/tts page is the editor.

11 Key files.

bawthub/main_websocket.py

FastAPI app. ~1,500 lines. Lifespan hook initializes the TTS registry; CORS + health endpoints; /v1/ws WebSocket route; TTS admin endpoints; Home Assistant weather pass-through.

bawthub/handler.py

The voice state machine. 885 lines. BawtHubHandler, Quest-managed STT/TTS/LLM, interrupt logic, pause detection, the prebuffer fix, the OpenAI Realtime event emission.

bawthub/stt/speech_to_text.py

Moshi STT client. 217 lines. msgpack protocol, websocket lifecycle, EMA pause prediction, retry on capacity errors.

bawthub/tts/registry.py

The provider dispatcher. Voice-driven routing, lazy probing, TTL-cached health checks, in-flight probe coalescing.

bawthub/tts/base_provider.py

The provider contract. TTSAudioChunk + TTSWordTiming message types; prepare_text_for_tts for normalization; VoiceInfo for catalog entries.

bawthub/tts/moshi_provider.py

Moshi TTS over msgpack. Wraps the Kyutai websocket protocol behind the provider-agnostic interface. Realtime queue for chunk pacing.

bawthub/tts/azure_provider.py · bawthub/tts/grok_provider.py

Cloud TTS providers. Each one self-discovers from env vars; only registers when credentials are present.

bawthub/tts/kokoro_adapter_server.py

Kokoro shim. An adapter that exposes Kokoro behind a Moshi-compatible WebSocket — opted in via the docker-compose.kokoro.yml overlay.

bawthub/quest_manager.py

The supervisor. Quest = (name, init, run, close). Concurrent workers, cancellation safety, await-once semantics.

bawthub/openai_realtime_api_events.py

Wire protocol module. Pydantic discriminated unions for every ClientEvent / ServerEvent that crosses the WebSocket.

bawthub/llm/llm_utils.py

The llm-bawt client. StatelessLLMStream wraps an OpenAI-compat client; rechunk_to_words turns chunked tokens into a word-aligned stream.

services/moshi-server/configs/stt.toml · tts.toml

Moshi model configs. Hugging Face URIs for weights, transformer dimensions, ASR delay, default voice, padding bonus.

voices.yaml

Moshi voice catalog. ~18 voices from VCTK and Expresso datasets. Restart the backend after edits.

PreviousFrontend NextAvatar

Validated against main on 2026-05-13 Source: bawthub repo (private)