Microphone bytes to spoken words.
The voice pipeline is the part of BawtHub with the hardest real-time constraints. Audio in, transcript out, LLM in the middle, audio back. The whole loop has to feel like a conversation — sub-second to first audio, mid-word interruptions, no buffer glitches when a sentence breaks at a comma. The architecture is forked from Kyutai's unmute reference stack and rebuilt around a four-provider TTS registry, a Quest-managed state machine, and the OpenAI Realtime API wire protocol.
01 Heritage: unmute, rebuilt.
The bones of the pipeline come from Kyutai's unmute — their open-source reference for a fully self-hosted speech-to-speech assistant. The state machine (waiting_for_user / user_speaking / bot_speaking), the msgpack wire protocol for the Moshi STT/TTS servers, and the VAD-pause-detection model are all from there. What BawtHub layers on top is:
- A multi-provider TTS registry — Moshi is no longer the only TTS option; Azure, xAI Grok, and Kokoro can serve different voices in the same session.
- An llm-bawt back end instead of unmute's bundled chatbot — meaning real conversation history, real memory, real tools, real personalities.
- A Quest-based supervisor for the STT/TTS/LLM workers — co-managed lifecycles, retries with backoff, structured cancellation.
- Production niceties — health endpoints, admin restart, voice browser, provider auto-discovery from env, container restart through a mounted Docker socket.
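The supervisor idea in the list above (co-managed lifecycles, retries with backoff, structured cancellation) can be sketched roughly as follows. This is not the Quest API; run_with_backoff and its parameters are hypothetical names, and the retry policy (exponential backoff on ConnectionError during init only) is an assumption made for illustration.

```python
import asyncio


async def run_with_backoff(init, run, close, *, max_attempts=3, base_delay=0.01):
    """Hypothetical supervisor: init -> run -> close, retrying init with backoff."""
    resource = None
    for attempt in range(max_attempts):
        try:
            resource = await init()
            break
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    try:
        return await run(resource)
    finally:
        # close runs on success, on failure, and on task cancellation alike
        await close(resource)
```

Putting close in a finally block is what makes cancellation structured in this sketch: cancelling the task that awaits run still releases the resource.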
02 The pipeline, top to bottom.
[Pipeline diagram. Stage labels, in order: input_audio_buffer.append → send_audio() → pause_prediction > 0.6 · flush + EOS → tts.send(word) per delta · 40-char prebuffer → response.audio.delta · Opus chunks]
03 The handler is the state machine.
BawtHubHandler in bawthub/handler.py is an AsyncStreamHandler from fastrtc. It runs one instance per connected client and owns three things:
- A QuestManager with three named workers: stt, tts, llm. Each Quest has init / run / close phases and survives partial failures.
- A Chatbot with a small chat_history array. Conversation state is derived: waiting_for_user / user_speaking / bot_speaking based on whose turn it is in the history.
- An asyncio output_queue that fastrtc drains to emit either audio frames, OpenAI Realtime events, or a CloseStream.
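Deriving the conversation state from the history, rather than storing it, might look like the sketch below. The message shape (dicts with role and content keys) and the rule that an empty user message still counts as waiting are assumptions, not details confirmed by the source.

```python
from typing import Literal

ConversationState = Literal["waiting_for_user", "user_speaking", "bot_speaking"]


def conversation_state(chat_history: list[dict[str, str]]) -> ConversationState:
    """Derive whose turn it is from the last message in the history."""
    if not chat_history:
        return "waiting_for_user"
    last = chat_history[-1]
    if last["role"] == "assistant":
        return "bot_speaking"
    # Assumption: an empty user message is a placeholder, not speech yet.
    if last["role"] == "user" and last["content"].strip():
        return "user_speaking"
    return "waiting_for_user"
```

Because the state is a pure function of chat_history, there is no separate flag that can drift out of sync with the transcript.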
The receive loop is the heart of it. Every frame from the browser (~80 ms of audio) gets:
- Forwarded to STT via stt.send_audio(array)
- Checked against the long-silence timer for nudge behavior
- Tested for pause: if stt.pause_prediction.value > PAUSE_THRESHOLD (default 0.6), the handler emits InputAudioBufferSpeechStopped, flushes 500 ms of silence into STT, and starts the LLM response
- Tested for interrupt: if the bot is speaking and STT-VAD shows the user has resumed (and we're past the 3-second grace window), interrupt_bot() cancels the running LLM + TTS quests
UNINTERRUPTIBLE_BY_VAD_TIME_SEC = 3. On Mac in particular, browser echo cancellation takes a beat to engage at turn start — so for the first 3 seconds of bot speech, the VAD signal alone can't interrupt. A real word from the STT still can. This was an empirical fix; the comment in handler.py explains it candidly.
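The per-frame pause and interrupt checks, including the grace window, can be condensed into a single decision function. This is a sketch only: frame_action and RESUME_THRESHOLD are hypothetical, and the 0.4 cutoff for "the user has resumed" is an assumed value, not one taken from handler.py. The word-arrival interrupt path is deliberately left out.

```python
PAUSE_THRESHOLD = 0.6                  # default stated in the text
UNINTERRUPTIBLE_BY_VAD_TIME_SEC = 3.0  # grace window stated in the text
RESUME_THRESHOLD = 0.4                 # hypothetical cutoff for "user has resumed"


def frame_action(state: str, pause_prediction: float, bot_speaking_for_sec: float) -> str:
    """Decide what one incoming audio frame should trigger."""
    # User turn: a high pause probability ends the turn and starts the LLM.
    if state == "user_speaking" and pause_prediction > PAUSE_THRESHOLD:
        return "end_of_turn"
    # Bot turn: the VAD signal may only interrupt once the grace window has passed.
    if (
        state == "bot_speaking"
        and bot_speaking_for_sec > UNINTERRUPTIBLE_BY_VAD_TIME_SEC
        and pause_prediction < RESUME_THRESHOLD
    ):
        return "interrupt"
    return "continue"
```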
04 STT: Kyutai Moshi over msgpack.
The STT runs in its own GPU container — Kyutai's moshi-server with the stt.toml config:
| Field | Value |
|---|---|
| Model | kyutai/stt-1b-en_fr-candle · ~1B params, English + French |
| Audio tokenizer | Mimi (PyTorch 24 kHz) |
| ASR delay | 6 tokens (~0.5 s) |
| Batch size | 2 (helps with reconnect races) |
| Transformer | 16 layers · 2048 d_model · 16 heads · 750-token context |
| Extra heads | 4 heads × 6 dim — pause prediction + speaker turns |
The wire is msgpack over a websocket at /api/asr-streaming. The Python client (bawthub/stt/speech_to_text.py) sends raw float32 audio frames and receives a typed stream of:
- Word: a partial transcription token with start_time
- EndWord: explicit end-of-word marker
- Step: per-step pause-prediction probabilities (the VAD signal)
- Marker: for time-aligning events
- Ready / Error
The handler tracks stt.pause_prediction as an exponential moving average so the threshold check is stable across noisy frames. The "real" pause threshold is configurable via BAWTHUB_PAUSE_THRESHOLD; defaults to 0.6.
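To illustrate why a moving average stabilizes the threshold check, here is a minimal exponential moving average. The class below is a generic single-coefficient EMA written for this example; the smoothing used in the actual handler may differ (for instance, separate attack and release times), and alpha = 0.5 is an arbitrary illustrative value.

```python
class ExponentialMovingAverage:
    """Generic EMA: value <- alpha * sample + (1 - alpha) * value."""

    def __init__(self, alpha: float, initial: float = 0.0) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = initial

    def update(self, sample: float) -> float:
        self.value = self.alpha * sample + (1.0 - self.alpha) * self.value
        return self.value


# A single noisy frame at 1.0 does not cross the 0.6 threshold,
# but a second consecutive one does.
ema = ExponentialMovingAverage(alpha=0.5)
first = ema.update(1.0)   # 0.5
second = ema.update(1.0)  # 0.75
```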
05 TTS: a provider registry, not a single engine.
BawtHub's TTS layer evolved from "Moshi only" to a pluggable registry. TTSProviderRegistry in bawthub/tts/registry.py manages a set of BaseTTSProvider instances and dispatches each request to the right one based on voice id.
| Provider | Where it runs | Trigger | Notes |
|---|---|---|---|
| moshi | Local GPU container at :8089 | Always (default fallback) | 1.6B params · 24 kHz · 18 voices in voices.yaml (VCTK + Expresso) |
| azure | Azure Speech cloud | AZURE_SPEECH_KEY + region/endpoint set | Filterable voice catalog (default curated = en-US Dragon HD) to avoid 600+ voice picker |
| grok | xAI cloud | XAI_API_KEY set | Default voice eve · latency mode toggle (quality vs ~60 ms savings) |
| kokoro | Adapter container | Compose overlay docker-compose.kokoro.yml | Replaces Moshi container; Python adapter translates Kokoro to the Moshi protocol |
Provider selection is voice-driven. The registry inspects each provider's available_voices catalog and routes a request for voice_id to the provider that owns it. BAWTHUB_TTS_DEFAULT_PROVIDER overrides the tiebreaker for voice ids that aren't in any catalog. Auto-discovery is conservative: a provider that fails to construct (e.g. missing SDK, bad credentials) is dropped silently so the app still boots in a stripped-down environment.
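Voice-driven routing with conservative auto-discovery could be sketched as below. The functions discover and provider_for are hypothetical stand-ins, catalogs are reduced to plain sets of voice ids, and the voice id example-moshi-voice is made up; only eve (the Grok default) comes from the text.

```python
from typing import Callable


def discover(factories: dict[str, Callable[[], set[str]]]) -> dict[str, set[str]]:
    """Build each provider's voice catalog, dropping providers that fail to construct."""
    catalogs: dict[str, set[str]] = {}
    for name, factory in factories.items():
        try:
            catalogs[name] = factory()
        except Exception:
            # Missing SDK, bad credentials, etc.: skip so the app still boots.
            continue
    return catalogs


def provider_for(voice_id: str, catalogs: dict[str, set[str]], default_provider: str) -> str:
    """Route to the provider whose catalog owns voice_id, else the default."""
    for name, voices in catalogs.items():
        if voice_id in voices:
            return name
    return default_provider


def _broken_azure() -> set[str]:
    raise RuntimeError("AZURE_SPEECH_KEY not set")  # simulated construction failure


catalogs = discover({
    "moshi": lambda: {"example-moshi-voice"},
    "grok": lambda: {"eve"},
    "azure": _broken_azure,
})
```

With these catalogs, a request for eve routes to grok, while an unknown voice id falls back to whatever default_provider is passed in.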
Health probing is lazy and cached
Construction does not block on a network probe. The first request that asks "is this provider healthy?" pays the probe cost; subsequent requests within HEALTH_TTL_SEC (5 s default) read the cached result. Concurrent probes share a single in-flight check via an asyncio lock — so a thundering herd at startup doesn't slam Azure with twenty list_voices calls.
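The lazy, cached, single-flight probe described above can be sketched with an asyncio.Lock and a double check of the cache. CachedHealth and its probe callback are hypothetical names; only HEALTH_TTL_SEC and its 5 s default come from the text.

```python
import asyncio
import time

HEALTH_TTL_SEC = 5.0  # default stated in the text


class CachedHealth:
    """Hypothetical wrapper: probe lazily, cache for a TTL, share in-flight probes."""

    def __init__(self, probe, ttl: float = HEALTH_TTL_SEC) -> None:
        self._probe = probe            # async callable returning bool
        self._ttl = ttl
        self._lock = asyncio.Lock()
        self._checked_at = None        # no probe has run yet
        self._healthy = False

    def _fresh(self) -> bool:
        return (
            self._checked_at is not None
            and time.monotonic() - self._checked_at < self._ttl
        )

    async def is_healthy(self) -> bool:
        if self._fresh():
            return self._healthy
        async with self._lock:
            # Re-check: another caller may have finished the probe while we waited.
            if self._fresh():
                return self._healthy
            self._healthy = await self._probe()
            self._checked_at = time.monotonic()
            return self._healthy
```

Twenty concurrent callers of is_healthy() on a cold cache result in one probe call: the first holder of the lock runs it, and the rest find a fresh result when they acquire the lock.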
06 Word-by-word TTS with a prebuffer.
llm-bawt streams the LLM response word-by-word (one delta per word, no upstream sentence buffering). The handler's _generate_response_task pipes each delta into the TTS provider with one twist:
When llm-bawt dropped upstream buffering (TASK-216), Moshi TTS could drain its audio queue faster than new words arrived, producing a 1–2 s gap after the opening phrase. The handler now collects the first ~40 chars (~6–8 short words) before flushing them to TTS in a burst, then streams word-by-word from there. TTS_PREBUFFER_CHARS = 40 in handler.py.
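A minimal sketch of the prebuffer idea follows, assuming the deltas arrive as an iterable of words and that the character count ignores separators. The generator prebuffer is a hypothetical helper, not the code in _generate_response_task.

```python
from typing import Iterable, Iterator

TTS_PREBUFFER_CHARS = 40  # value stated in the text


def prebuffer(deltas: Iterable[str], min_chars: int = TTS_PREBUFFER_CHARS) -> Iterator[list[str]]:
    """Yield one opening burst of words, then one word per batch."""
    pending: list[str] = []
    pending_chars = 0
    flushed = False
    for delta in deltas:
        if flushed:
            yield [delta]  # steady state: stream word by word
            continue
        pending.append(delta)
        pending_chars += len(delta)
        if pending_chars >= min_chars:
            yield pending  # opening burst gives TTS a head start
            flushed = True
    if not flushed and pending:
        # Short reply that never reached the threshold: send what we have.
        yield pending


batches = list(prebuffer(["the", "quick", "brown", "fox"], min_chars=10))
# batches == [["the", "quick", "brown"], ["fox"]]
```

The final flush matters: without it, a reply shorter than the threshold would never reach TTS.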
The TTS provider streams back two message types: TTSAudioChunk (PCM at 24 kHz, mono) and TTSWordTiming (which word just got spoken, with start/stop seconds). The handler routes audio chunks to the output queue as (sample_rate, np.ndarray) tuples, and word timings as ResponseTextDelta events that the browser uses to update the floating transcript. Each word is also appended to chat_history only after the TTS confirms it — so the displayed transcript advances in lockstep with the audio.
07 Mid-word interrupts.
The user can interrupt the bot at any time. Two signals trigger an interrupt:
- STT word arrival while bot is speaking. A confident transcription token from the user during bot output → interrupt_bot().
- VAD pause probability dropping below a threshold (the user has resumed speaking) while bot is speaking, after the 3-second grace window.
interrupt_bot() cancels the LLM stream task, swaps the handler's output_queue to a fresh queue (so any in-flight TTS workers can't accidentally push to it), and sends BawtHubInterruptedByVAD to the browser. The browser stops playback by signalling the AudioWorklet to drop its buffered samples. Audio cuts mid-word — which is what feels conversational.
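The queue-swap trick can be illustrated with a stripped-down handler. MiniHandler is a hypothetical stand-in for BawtHubHandler, the event is shown as a plain dict rather than a typed event object, and cancellation of the TTS quest is omitted for brevity.

```python
import asyncio


class MiniHandler:
    """Hypothetical, stripped-down stand-in for the real handler."""

    def __init__(self) -> None:
        self.output_queue: asyncio.Queue = asyncio.Queue()
        self.llm_task = None  # set to an asyncio.Task while the bot is responding

    async def interrupt_bot(self) -> None:
        # 1. Cancel the running LLM stream and wait for it to unwind.
        if self.llm_task is not None and not self.llm_task.done():
            self.llm_task.cancel()
            try:
                await self.llm_task
            except asyncio.CancelledError:
                pass
        # 2. Swap in a fresh queue. Workers that captured the old queue
        #    keep writing to it, and nothing drains it any more.
        self.output_queue = asyncio.Queue()
        # 3. Tell the browser to drop its buffered audio.
        await self.output_queue.put({"type": "bawthub.interrupted_by_vad"})
```

Swapping the queue avoids having to drain stale audio item by item: late chunks land in an orphaned queue that nothing drains.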
08 Wire protocol: OpenAI Realtime, deliberately.
The browser ↔ backend WebSocket speaks OpenAI Realtime API events, mapped to BawtHub's own ora module (openai_realtime_api_events.py). The choice is intentional — it keeps the door open to swap the backend for OpenAI's actual Realtime API or any compatible implementation.
| Direction | Event | Use |
|---|---|---|
| client → server | input_audio_buffer.append | Base64 Opus frame from mic encoder worker |
| client → server | session.update | Voice id, bot id, user id, tts_mode toggle, instructions |
| server → client | response.audio.delta | Base64 Opus chunk for the decoder worker |
| server → client | response.text.delta | Word-aligned transcript update for the floating UI |
| server → client | conversation.item.input_audio_transcription.delta | User-side STT word arrival |
| server → client | response.audio.done / response.text.done | End-of-turn handshake |
| server → client | bawthub.response_text_delta_ready | BawtHub-specific: LLM word arrived (vs TTS word emitted) |
| server → client | bawthub.interrupted_by_vad | Browser flushes audio worklet |
| server → client | bawthub.service_animation | Avatar animation cue parsed from LLM response |
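As a rough illustration of the audio events in the table, the helpers below encode a mic frame and decode a playback chunk. The field names audio and delta follow the OpenAI Realtime API event shapes; the helper names are hypothetical, and the Opus bytes are treated as opaque (no real encoding happens here).

```python
import base64
import json


def encode_append(opus_frame: bytes) -> str:
    """Client -> server: wrap one Opus frame in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(opus_frame).decode("ascii"),
    })


def decode_audio_delta(raw: str) -> bytes:
    """Server -> client: extract the Opus chunk from a response.audio.delta event."""
    event = json.loads(raw)
    if event["type"] != "response.audio.delta":
        raise ValueError(f"unexpected event type: {event['type']}")
    return base64.b64decode(event["delta"])


frame = b"\x00\x01\x02opus-bytes"  # placeholder payload, not real Opus data
wire = encode_append(frame)
```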
09 About wake words.
BawtHub doesn't ship a wake-word detector. The voice page assumes the user has pressed the call button — the WebSocket is opened intentionally, the screen is wake-locked, and the mic is live until the call ends. Pause detection is purely VAD-driven via Moshi STT's extra-head pause_prediction scores.
That said, the downstream consumers of BawtHub's voice (Home Assistant via the bawthub_stt / bawthub_tts custom components in integrations/home_assistant/) are typically paired with ESPHome Voice PE devices, which run their own on-device microWakeWord models. So while wake-word handling isn't inside BawtHub itself, the pipeline is designed to sit downstream of a wake-word gate when used as a Home Assistant voice provider.
10 Health and admin.
The voice backend exposes a handful of REST endpoints distinct from the WebSocket:
| Endpoint | Purpose |
|---|---|
| GET /v1/health | {tts_up, stt_up, llm_up, ok}; drives the green/red lights in the UI |
| GET /v1/voices | Aggregated catalog from every registered provider |
| GET /v1/tts/admin/config | Current TTS service config + capabilities |
| PUT /v1/tts/admin/config | Mutate temperature, CFG, padding, default voice; written to tts_admin_state.json |
| POST /v1/tts/preview | Synthesize a short sample with the given voice; used by the voice browser modal |
| POST /v1/tts/admin/restart | Run restart_tts_container.sh via the mounted Docker socket; report progress with operation ids |
| GET /v1/ha/weather | Home Assistant weather pull for the home page widget; kept here so the credential never reaches the frontend |
The TTS admin state lives at bawthub/tts_admin_state.json — a tiny JSON file on disk so changes survive container restarts. The frontend's /tools/tts page is the editor.
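Persisting a small JSON state file so it survives restarts might look like the sketch below. The function names, the key names in the usage, and the write-to-temp-then-rename strategy are all assumptions for illustration; the source only says the state is a JSON file on disk.

```python
import json
import os
import tempfile
from pathlib import Path


def save_admin_state(path: Path, state: dict) -> None:
    """Write state via a temp file and an atomic rename (assumed strategy)."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
        os.replace(tmp_name, path)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp_name)  # do not leave a stray temp file behind
        raise


def load_admin_state(path: Path, defaults: dict) -> dict:
    """Return defaults overlaid with whatever was saved; tolerate a missing file."""
    try:
        saved = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return dict(defaults)
    return {**defaults, **saved}
```

A caller would typically load with defaults at startup and save after each successful PUT to the admin config endpoint.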
11 Key files.
- bawthub/main_websocket.py: /v1/ws WebSocket route; TTS admin endpoints; Home Assistant weather pass-through.
- bawthub/handler.py: BawtHubHandler, Quest-managed STT/TTS/LLM, interrupt logic, pause detection, the prebuffer fix, the OpenAI Realtime event emission.
- bawthub/stt/speech_to_text.py
- bawthub/tts/registry.py
- bawthub/tts/base_provider.py: TTSAudioChunk + TTSWordTiming message types; prepare_text_for_tts for normalization; VoiceInfo for catalog entries.
- bawthub/tts/moshi_provider.py
- bawthub/tts/azure_provider.py · bawthub/tts/grok_provider.py
- bawthub/tts/kokoro_adapter_server.py: docker-compose.kokoro.yml overlay.
- bawthub/quest_manager.py
- bawthub/openai_realtime_api_events.py
- bawthub/llm/llm_utils.py: StatelessLLMStream wraps an OpenAI-compat client; rechunk_to_words turns chunked tokens into a word-aligned stream.
- services/moshi-server/configs/stt.toml · tts.toml
- voices.yaml
main on 2026-05-13
Source: bawthub repo (private)