BawtHub
BawtHub · voice

Microphone bytes to spoken words.

The voice pipeline is the part of BawtHub with the hardest real-time constraints. Audio in, transcript out, LLM in the middle, audio back. The whole loop has to feel like a conversation — sub-second to first audio, mid-word interruptions, no buffer glitches when a sentence breaks at a comma. The architecture is forked from Kyutai's unmute reference stack and rebuilt around a four-provider TTS registry, a Quest-managed state machine, and the OpenAI Realtime API wire protocol.

Runtime: Python 3.12 · FastAPI · fastrtc 0.0.23
STT: Kyutai stt-1b-en_fr-candle (Moshi)
Default TTS: Kyutai tts-1.6b-en_fr (Moshi · 24 kHz)

01 Heritage: unmute, rebuilt.

The bones of the pipeline come from Kyutai's unmute — their open-source reference for a fully self-hosted speech-to-speech assistant. The state machine (waiting_for_user / user_speaking / bot_speaking), the msgpack wire protocol for the Moshi STT/TTS servers, and the VAD-pause-detection model are all from there. What BawtHub layers on top is:

02 The pipeline, top to bottom.

A single turn · 24 kHz mono throughout

[browser] mic → Opus encoder worker → WebSocket frames
    ↓ input_audio_buffer.append
[bawthub backend] fastrtc handler · 480-sample frames · 24 kHz float32
    ↓ msgpack over ws · send_audio()
[moshi stt :8090] Word + EndWord + Step (pause prediction) messages
    ↓ when pause_prediction > 0.6 · flush + EOS
[llm-bawt :8642] OpenAI-compat completions · word-by-word stream
    ↓ tts.send(word) per delta · 40-char prebuffer
[tts provider] Moshi · Azure · Grok · Kokoro, chosen by voice id
    ↓ response.audio.delta · Opus chunks
[browser] Opus decoder worker → AudioWorklet → speakers + AnalyserNode tap

03 The handler is the state machine.

BawtHubHandler in bawthub/handler.py is an AsyncStreamHandler from fastrtc. It runs one instance per connected client and owns three things:

The receive loop is the heart of it. Every frame from the browser (~80 ms of audio) gets:

  1. Forwarded to STT via stt.send_audio(array)
  2. Checked against the long-silence timer for nudge behavior
  3. Tested for pause: if stt.pause_prediction.value > PAUSE_THRESHOLD (default 0.6), the handler emits InputAudioBufferSpeechStopped, flushes 500 ms of silence into STT, and starts the LLM response
  4. Tested for interrupt: if the bot is speaking and STT-VAD shows the user has resumed (and we're past the 3-second grace window), interrupt_bot() cancels the running LLM + TTS quests

The 3-second uninterruptible window matters.

UNINTERRUPTIBLE_BY_VAD_TIME_SEC = 3. On Mac in particular, browser echo cancellation takes a beat to engage at turn start — so for the first 3 seconds of bot speech, the VAD signal alone can't interrupt. A real word from the STT still can. This was an empirical fix; the comment in handler.py explains it candidly.
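
The per-frame checks above can be sketched as a pure decision function. This is a minimal sketch and not the code in handler.py: FrameContext, decide, and RESUME_THRESHOLD are hypothetical names, and the 0.4 resume threshold is an assumption, since the text only gives PAUSE_THRESHOLD and the 3-second window. It also assumes the pause check only applies while the user is speaking.

```python
from dataclasses import dataclass

PAUSE_THRESHOLD = 0.6                   # default from the text
UNINTERRUPTIBLE_BY_VAD_TIME_SEC = 3.0   # grace window from the text
RESUME_THRESHOLD = 0.4                  # assumption: not stated in the source


@dataclass
class FrameContext:
    state: str                   # "waiting_for_user" | "user_speaking" | "bot_speaking"
    pause_prediction: float      # smoothed pause score reported by STT
    bot_speaking_for_sec: float  # time since the bot started this turn
    stt_word_received: bool      # True if STT produced a real word on this frame


def decide(ctx: FrameContext) -> str:
    """Return "start_response", "interrupt", or "continue" for one frame."""
    if ctx.state == "bot_speaking":
        # A transcribed word interrupts at any time, even inside the grace window.
        if ctx.stt_word_received:
            return "interrupt"
        # The VAD signal alone only counts once the grace window has passed.
        past_grace = ctx.bot_speaking_for_sec >= UNINTERRUPTIBLE_BY_VAD_TIME_SEC
        if past_grace and ctx.pause_prediction < RESUME_THRESHOLD:
            return "interrupt"
        return "continue"
    if ctx.state == "user_speaking" and ctx.pause_prediction > PAUSE_THRESHOLD:
        return "start_response"
    return "continue"
```

Keeping the decision free of I/O makes the grace-window rule easy to unit-test: a low pause score one second into bot speech yields "continue", while the same score at 3.5 seconds yields "interrupt".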

04 STT: Kyutai Moshi over msgpack.

The STT runs in its own GPU container — Kyutai's moshi-server with the stt.toml config:

| Field | Value |
| --- | --- |
| Model | kyutai/stt-1b-en_fr-candle · ~1B params, English + French |
| Audio tokenizer | Mimi (PyTorch 24 kHz) |
| ASR delay | 6 tokens (~0.5 s) |
| Batch size | 2 (helps with reconnect races) |
| Transformer | 16 layers · 2048 d_model · 16 heads · 750-token context |
| Extra heads | 4 heads × 6 dim (pause prediction + speaker turns) |

The wire is msgpack over a websocket at /api/asr-streaming. The Python client (bawthub/stt/speech_to_text.py) sends raw float32 audio frames and receives a typed stream of Word, EndWord, and Step (pause prediction) messages.

The handler tracks stt.pause_prediction as an exponential moving average so the threshold check is stable across noisy frames. The pause threshold is configurable via BAWTHUB_PAUSE_THRESHOLD and defaults to 0.6.
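
To illustrate why smoothing stabilizes the threshold check, here is a minimal exponential moving average. It is a simplified sketch with a fixed smoothing factor; the class name and the alpha parameter are assumptions, and the real client may use a different smoothing scheme.

```python
class ExponentialMovingAverage:
    """Fixed-alpha EMA: value moves a fraction alpha toward each new sample."""

    def __init__(self, alpha: float, initial: float = 0.0) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = initial

    def update(self, sample: float) -> float:
        # Equivalent to alpha * sample + (1 - alpha) * previous value.
        self.value += self.alpha * (sample - self.value)
        return self.value
```

With alpha = 0.3 and a starting value of 0, a single spurious raw score of 1.0 only lifts the smoothed value to 0.3, which stays below the 0.6 threshold. Several consecutive high scores are needed before the handler would treat the frame as a pause.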

05 TTS: a provider registry, not a single engine.

BawtHub's TTS layer evolved from "Moshi only" to a pluggable registry. TTSProviderRegistry in bawthub/tts/registry.py manages a set of BaseTTSProvider instances and dispatches each request to the right one based on voice id.

| Provider | Where it runs | Trigger | Notes |
| --- | --- | --- | --- |
| moshi | Local GPU container at :8089 | Always (default fallback) | 1.6B params · 24 kHz · 18 voices in voices.yaml (VCTK + Expresso) |
| azure | Azure Speech cloud | AZURE_SPEECH_KEY + region/endpoint set | Filterable voice catalog (default curated = en-US Dragon HD) to avoid 600+ voice picker |
| grok | xAI cloud | XAI_API_KEY set | Default voice eve · latency mode toggle (quality vs ~60 ms savings) |
| kokoro | Adapter container | Compose overlay docker-compose.kokoro.yml | Replaces Moshi container; Python adapter translates Kokoro to the Moshi protocol |

Provider selection is voice-driven. The registry inspects each provider's available_voices catalog and routes a request for voice_id to the provider that owns it. BAWTHUB_TTS_DEFAULT_PROVIDER overrides the tiebreaker for voice ids that aren't in any catalog. Auto-discovery is conservative: a provider that fails to construct (e.g. missing SDK, bad credentials) is dropped silently so the app still boots in a stripped-down environment.
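
The routing and conservative auto-discovery described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of registry.py: VoiceRouter, discover, and resolve are hypothetical names, providers are reduced to plain sets of voice ids, and the Moshi voice id used in the usage note is made up.

```python
from typing import Callable, Dict, Iterable, Set


class VoiceRouter:
    """Routes a voice id to the provider whose catalog contains it."""

    def __init__(self, default_provider: str) -> None:
        # Stands in for BAWTHUB_TTS_DEFAULT_PROVIDER.
        self.default_provider = default_provider
        self.catalogs: Dict[str, Set[str]] = {}

    def discover(self, factories: Dict[str, Callable[[], Iterable[str]]]) -> None:
        """Try every provider factory; drop the ones that fail to construct."""
        for name, load_voices in factories.items():
            try:
                self.catalogs[name] = set(load_voices())
            except Exception:
                # Missing SDK or bad credentials: skip so the app still boots.
                continue

    def resolve(self, voice_id: str) -> str:
        """Return the owning provider, or the default for unknown voice ids."""
        for name, voices in self.catalogs.items():
            if voice_id in voices:
                return name
        return self.default_provider
```

For example, after discovering a Moshi catalog containing a hypothetical voice "vctk_p225" and a Grok catalog containing "eve", resolve("eve") returns "grok" and an unknown id falls back to the default provider.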

Health probing is lazy and cached

Construction does not block on a network probe. The first request that asks "is this provider healthy?" pays the probe cost; subsequent requests within HEALTH_TTL_SEC (5 s default) read the cached result. Concurrent probes share a single in-flight check via an asyncio lock — so a thundering herd at startup doesn't slam Azure with twenty list_voices calls.
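
A lazy, TTL-cached probe with in-flight coalescing might look like the sketch below. The class and method names are hypothetical and the probe is any awaitable returning a bool; only HEALTH_TTL_SEC and the use of an asyncio lock come from the text.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional

HEALTH_TTL_SEC = 5.0  # default from the text


class CachedHealth:
    """Caches a health probe result and shares one in-flight probe."""

    def __init__(self, probe: Callable[[], Awaitable[bool]],
                 ttl: float = HEALTH_TTL_SEC) -> None:
        self._probe = probe
        self._ttl = ttl
        self._lock = asyncio.Lock()
        self._value: Optional[bool] = None
        self._checked_at = 0.0

    def _fresh(self) -> bool:
        age = time.monotonic() - self._checked_at
        return self._value is not None and age < self._ttl

    async def is_healthy(self) -> bool:
        if self._fresh():
            return bool(self._value)
        async with self._lock:
            # Re-check after acquiring: a concurrent caller may have refreshed.
            if not self._fresh():
                self._value = await self._probe()
                self._checked_at = time.monotonic()
            return bool(self._value)
```

The second freshness check inside the lock is what collapses a burst of concurrent callers into a single probe: the first caller runs it, and the rest find a fresh cached value once they acquire the lock.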

06 Word-by-word TTS with a prebuffer.

llm-bawt streams the LLM response word-by-word (one delta per word, no upstream sentence buffering). The handler's _generate_response_task pipes each delta into the TTS provider with one twist:

Audio underrun fix: a 40-character prebuffer.

When llm-bawt dropped upstream buffering (TASK-216), Moshi TTS could drain its audio queue faster than new words arrived, producing a 1–2 s gap after the opening phrase. The handler now collects the first ~40 chars (~6–8 short words) before flushing them to TTS in a burst, then streams word-by-word from there. TTS_PREBUFFER_CHARS = 40 in handler.py.
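
The prebuffer behavior can be sketched as a small generator. This is not the handler's code: prebuffer_words is a hypothetical name, and counting one separator character per word is an assumption about how the 40-character budget is measured.

```python
from typing import Iterable, Iterator, List

TTS_PREBUFFER_CHARS = 40  # value from the text


def prebuffer_words(words: Iterable[str],
                    min_chars: int = TTS_PREBUFFER_CHARS) -> Iterator[str]:
    """Yield one joined burst once min_chars is reached, then word by word."""
    pending: List[str] = []
    pending_chars = 0
    flushed = False
    for word in words:
        if flushed:
            yield word
            continue
        pending.append(word)
        pending_chars += len(word) + 1  # +1 for the separating space
        if pending_chars >= min_chars:
            yield " ".join(pending)
            pending.clear()
            flushed = True
    if pending:
        # Short reply that never reached the threshold: flush what we have.
        yield " ".join(pending)
```

The trailing flush matters for short replies such as "Hi there", which would otherwise never be sent to TTS.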

The TTS provider streams back two message types: TTSAudioChunk (PCM at 24 kHz, mono) and TTSWordTiming (which word just got spoken, with start/stop seconds). The handler routes audio chunks to the output queue as (sample_rate, np.ndarray) tuples, and word timings as ResponseTextDelta events that the browser uses to update the floating transcript. Each word is also appended to chat_history only after the TTS confirms it — so the displayed transcript advances in lockstep with the audio.
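
The routing of the two message types can be sketched like this. The dataclass fields and the route function are assumptions for illustration (the real TTSAudioChunk carries a NumPy array, replaced here by a plain list to keep the sketch dependency-free), and the event is shown as a plain dict with the type and delta fields used by OpenAI Realtime text deltas.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple, Union

SAMPLE_RATE = 24_000  # 24 kHz mono, from the text


@dataclass
class TTSAudioChunk:
    pcm: List[float]  # assumption: stands in for an np.ndarray of samples


@dataclass
class TTSWordTiming:
    word: str
    start_s: float
    stop_s: float


def route(messages: Iterable[Union[TTSAudioChunk, TTSWordTiming]],
          chat_history: List[str]
          ) -> Tuple[List[Tuple[int, List[float]]], List[Dict[str, str]]]:
    """Split TTS output into audio tuples and transcript events."""
    audio_out: List[Tuple[int, List[float]]] = []
    events: List[Dict[str, str]] = []
    for msg in messages:
        if isinstance(msg, TTSAudioChunk):
            audio_out.append((SAMPLE_RATE, msg.pcm))
        elif isinstance(msg, TTSWordTiming):
            events.append({"type": "response.text.delta", "delta": msg.word})
            # Only words the TTS has confirmed are added to the history.
            chat_history.append(msg.word)
    return audio_out, events
```

Because history is appended only on TTSWordTiming, an interrupted turn leaves chat_history holding exactly the words that were spoken.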

07 Mid-word interrupts.

The user can interrupt the bot at any time. Two signals trigger an interrupt:

  1. STT word arrival while bot is speaking. A confident transcription token from the user during bot output → interrupt_bot().
  2. VAD probability dropping past a threshold while bot is speaking, after the 3-second grace window.

interrupt_bot() cancels the LLM stream task, swaps the handler's output_queue to a fresh queue (so any in-flight TTS workers can't accidentally push to it), and sends BawtHubInterruptedByVAD to the browser. The browser stops playback by signalling the AudioWorklet to drop its buffered samples. Audio cuts mid-word — which is what feels conversational.
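
The queue swap is the subtle part, so here is a minimal sketch of it. MiniHandler is a hypothetical stand-in for the real handler, the event is a plain dict rather than a typed ora event, and task cancellation is reduced to a single task.

```python
import asyncio
from typing import Optional


class MiniHandler:
    """Hypothetical stand-in showing cancel + queue swap on interrupt."""

    def __init__(self) -> None:
        self.output_queue: asyncio.Queue = asyncio.Queue()
        self.llm_task: Optional[asyncio.Task] = None

    async def interrupt_bot(self) -> None:
        task = self.llm_task
        if task is not None and not task.done():
            task.cancel()
            # Wait for the cancellation to land without re-raising it here.
            await asyncio.gather(task, return_exceptions=True)
        # Swap instead of draining: anything still holding the old queue
        # pushes into an orphan that nobody reads.
        self.output_queue = asyncio.Queue()
        await self.output_queue.put({"type": "bawthub.interrupted_by_vad"})
```

Swapping is cheaper and safer than draining, since a drain can race with a worker that enqueues one more chunk after the drain finishes.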

08 Wire protocol: OpenAI Realtime, deliberately.

The browser ↔ backend WebSocket speaks OpenAI Realtime API events, modeled in BawtHub's own ora module (openai_realtime_api_events.py). The choice is intentional: it keeps the door open to swapping the backend for OpenAI's actual Realtime API or any compatible implementation.

| Direction | Event | Use |
| --- | --- | --- |
| client → server | input_audio_buffer.append | Base64 Opus frame from mic encoder worker |
| client → server | session.update | Voice id, bot id, user id, tts_mode toggle, instructions |
| server → client | response.audio.delta | Base64 Opus chunk for the decoder worker |
| server → client | response.text.delta | Word-aligned transcript update for the floating UI |
| server → client | conversation.item.input_audio_transcription.delta | User-side STT word arrival |
| server → client | response.audio.done / response.text.done | End-of-turn handshake |
| server → client | bawthub.response_text_delta_ready | BawtHub-specific: LLM word arrived (vs TTS word emitted) |
| server → client | bawthub.interrupted_by_vad | Browser flushes audio worklet |
| server → client | bawthub.service_animation | Avatar animation cue parsed from LLM response |
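
As a rough illustration of the audio events in the table, the sketch below encodes a mic frame and decodes an audio delta using only the standard library. The helper names are hypothetical. The audio and delta field names follow the OpenAI Realtime API event shapes; that BawtHub's events use the same field names is an assumption.

```python
import base64
import json
from typing import Optional


def encode_mic_frame(opus_frame: bytes) -> str:
    """Wrap one Opus frame as an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(opus_frame).decode("ascii"),
    })


def decode_audio_delta(raw: str) -> Optional[bytes]:
    """Return the Opus bytes of a response.audio.delta event, else None."""
    event = json.loads(raw)
    if event.get("type") != "response.audio.delta":
        return None
    return base64.b64decode(event["delta"])
```

Dispatching on the type field is what lets the bawthub.* extension events coexist with the standard ones: a client that does not recognize a type can ignore it.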

09 About wake words.

BawtHub doesn't ship a wake-word detector. The voice page assumes the user has pressed the call button — the WebSocket is opened intentionally, the screen is wake-locked, and the mic is live until the call ends. Pause detection is purely VAD-driven via Moshi STT's extra-head pause_prediction scores.

That said, the downstream consumers of BawtHub's voice (Home Assistant via the bawthub_stt / bawthub_tts custom components in integrations/home_assistant/) are typically paired with ESPHome Voice PE devices, which run their own on-device microWakeWord models. So while wake-word handling isn't inside BawtHub itself, the pipeline is designed to sit downstream of a wake-word gate when used as a Home Assistant voice provider.

10 Health and admin.

The voice backend exposes a handful of REST endpoints distinct from the WebSocket:

| Endpoint | Purpose |
| --- | --- |
| GET /v1/health | {tts_up, stt_up, llm_up, ok}, shown as green/red lights in the UI |
| GET /v1/voices | Aggregated catalog from every registered provider |
| GET /v1/tts/admin/config | Current TTS service config + capabilities |
| PUT /v1/tts/admin/config | Mutate temperature, CFG, padding, default voice; written to tts_admin_state.json |
| POST /v1/tts/preview | Synthesize a short sample with the given voice; used by the voice browser modal |
| POST /v1/tts/admin/restart | Run restart_tts_container.sh via the mounted Docker socket; report progress with operation ids |
| GET /v1/ha/weather | Home Assistant weather pull for the home page widget; kept here so the credential never reaches the frontend |

The TTS admin state lives at bawthub/tts_admin_state.json — a tiny JSON file on disk so changes survive container restarts. The frontend's /tools/tts page is the editor.
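
One way to persist a small JSON state file so that a crash mid-write cannot corrupt it is to write to a temporary file and rename it into place. This is a sketch of that general technique, not BawtHub's implementation: load_state, update_state, and the example keys are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_state(path: Path) -> Dict[str, Any]:
    """Read the state file, treating a missing file as empty state."""
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        return {}


def update_state(path: Path, changes: Dict[str, Any]) -> Dict[str, Any]:
    """Merge changes into the stored state and write it back atomically."""
    state = {**load_state(path), **changes}
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle, indent=2)
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_name, path)
    return state
```

Merging rather than overwriting means a request that changes only one setting leaves the others in the file untouched.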

11 Key files.

bawthub/main_websocket.py
FastAPI app. ~1,500 lines. Lifespan hook initializes the TTS registry; CORS + health endpoints; /v1/ws WebSocket route; TTS admin endpoints; Home Assistant weather pass-through.
bawthub/handler.py
The voice state machine. 885 lines. BawtHubHandler, Quest-managed STT/TTS/LLM, interrupt logic, pause detection, the prebuffer fix, the OpenAI Realtime event emission.
bawthub/stt/speech_to_text.py
Moshi STT client. 217 lines. msgpack protocol, websocket lifecycle, EMA pause prediction, retry on capacity errors.
bawthub/tts/registry.py
The provider dispatcher. Voice-driven routing, lazy probing, TTL-cached health checks, in-flight probe coalescing.
bawthub/tts/base_provider.py
The provider contract. TTSAudioChunk + TTSWordTiming message types; prepare_text_for_tts for normalization; VoiceInfo for catalog entries.
bawthub/tts/moshi_provider.py
Moshi TTS over msgpack. Wraps the Kyutai websocket protocol behind the provider-agnostic interface. Realtime queue for chunk pacing.
bawthub/tts/azure_provider.py · bawthub/tts/grok_provider.py
Cloud TTS providers. Each one self-discovers from env vars; only registers when credentials are present.
bawthub/tts/kokoro_adapter_server.py
Kokoro shim. An adapter that exposes Kokoro behind a Moshi-compatible WebSocket — opted in via the docker-compose.kokoro.yml overlay.
bawthub/quest_manager.py
The supervisor. Quest = (name, init, run, close). Concurrent workers, cancellation safety, await-once semantics.
bawthub/openai_realtime_api_events.py
Wire protocol module. Pydantic discriminated unions for every ClientEvent / ServerEvent that crosses the WebSocket.
bawthub/llm/llm_utils.py
The llm-bawt client. StatelessLLMStream wraps an OpenAI-compat client; rechunk_to_words turns chunked tokens into a word-aligned stream.
services/moshi-server/configs/stt.toml · tts.toml
Moshi model configs. Hugging Face URIs for weights, transformer dimensions, ASR delay, default voice, padding bonus.
voices.yaml
Moshi voice catalog. ~18 voices from VCTK and Expresso datasets. Restart the backend after edits.

Validated against main on 2026-05-13 · Source: bawthub repo (private)