OpenAI-compatible, but more.
The FastAPI service in src/llm_bawt/service/ speaks the standard OpenAI chat-completions protocol on the wire — POST a messages array, get streaming SSE chunks back. That gets you compatibility with anything that knows how to talk to OpenAI. Underneath, the same service exposes 18 route modules covering bot CRUD, persistent memory inspection, conversation history, prompt templates, the agent task system, turn-log replay, media generation, runtime settings, and a WebSocket bridge to the OpenClaw gateway.
01 Entry point.
The app factory lives in service/api.py. The lifespan handler at module load does five things in order:
- Wire
MCP_SERVER_URLso memory operations are routed via MCP tools (loggable, traceable) rather than direct DB access. - Seed and merge model definitions from
ModelDefinitionStore— DB always wins over the YAML. - Warm the sentence-transformer embedding model on a background thread. The first inference takes ~6 s; warming it off-thread keeps the first chat after a restart fast.
- Start the
BackgroundServiceworker, theJobScheduler(ifLLM_BAWT_SCHEDULER_ENABLED), and the OpenClaw Redis subscriber if configured. - Mount every router from
service/routes/__init__.py:health, ha_weather, nextcloud, models, openclaw_ws, botchat, chat, tasks, turn_logs, jobs, history, memory, prompts, settings, profiles, llm, media
02 The chat-completions surface.
The OpenAI-compatible endpoint is small. routes/chat.py is 214 lines and defines three POSTs:
| Endpoint | Purpose |
|---|---|
POST /v1/chat/completions | The OpenAI chat completion. Supports stream, all standard fields, plus the llm-bawt extensions described below. |
POST /v1/chat/abort | Abort an in-flight turn by turn_id. Routes a chat.abort RPC to the agent bridge if applicable; marks the turn-log row as aborted regardless. |
POST /v1/chat/session/reset | For agent-backend bots: send session.reset to the bridge to clear the SDK-side thread and start fresh on next message. |
The request schema (ChatCompletionRequest in service/schemas.py) accepts the standard OpenAI fields and adds:
| Field | Default | Effect |
|---|---|---|
bot_id | null → service default | Selects which bot personality, memory namespace, tool set, and agent backend to use. |
augment_memory | true | Whether the pipeline injects retrieved memories into the system prompt. |
extract_memory | true | Whether to enqueue fact extraction from this turn after the response completes. |
include_summaries | true | Whether to inject rolled-up session summaries into context. |
tts_mode | false | Append TTS-friendly formatting rules to the system prompt (short sentences, no markdown, etc.). |
user_message_id | null | Frontend-generated UUID used as trigger_message_id in turn logs and tool events. |
animations | null | Avatar animation catalog (name + description) when the bot may invoke an animation tool. Owned by the bawthub frontend, not the backend. |
avatar_visible | null | Gate flag for animation work — don't run avatar classification when no avatar is rendered. |
03 Streaming the response.
Non-streaming requests return a normal ChatCompletionResponse. Streaming requests return SSE: a StreamingResponse wrapping BackgroundService.chat_completion_stream, which is implemented across service/chat_streaming.py and service/chat_stream_worker.py.
Three things happen in parallel during a streaming turn:
- The pipeline runs in a worker thread. It's CPU- and IO-bound (memory search, tool calls, eventual LLM call); offloading keeps the event loop responsive.
- Token deltas flow into an
asyncio.Queue. The streaming generator awaits the queue and formats each chunk as a standard OpenAI{"choices":[{"delta":{"content":"..."}}]}SSE event. - Tool events fan out through a separate channel. Each tool invocation emits a
tool_callevent before execution and atool_resultevent after; the streaming generator interleaves these with the content deltas so frontends can render tool progress in real time.
Beyond the OpenAI-compatible data: {choices:[...]} chunks, the stream emits tool_call, tool_result, turn_metadata, and turn_complete event types. Frontends that only know about OpenAI deltas still get a compliant stream; richer frontends (BawtHub) parse the extras to render tool calls, model badges, and token usage pills.
04 The 18 route modules.
Each file under service/routes/ registers one APIRouter. routes/__init__.py imports them and exposes all_routers for the app factory.
| Module | Lines | Prefix | What it does |
|---|---|---|---|
chat.py | 214 | /v1/chat/* | OpenAI chat completions, abort, session reset. |
botchat.py | 155 | /v1/bots/{id}/chat | Bot-scoped chat with isolated memory; lighter wrapper around the chat surface. |
models.py | 376 | /v1/models, /v1/bots | OpenAI GET /v1/models, upstream provider model list, runtime model switching, DB-resident model definitions CRUD. |
settings.py | 923 | /v1/settings, /v1/bots/*, /v1/admin/* | Runtime settings, bot profile CRUD (PUT/PATCH/DELETE per slug), bot data purge, soul sync/push, orphan cleanup. The biggest router by far. |
history.py | 763 | /v1/history/* | Conversation history search, summarization preview / run / rebuild, summary listing and deletion. |
memory.py | 399 | /v1/memory/* | Memory stats, search, get-by-message, delete, patch, forget/restore, preview windows, regenerate embeddings, consolidate. |
profiles.py | 372 | /v1/profiles/* | User + bot profile attribute CRUD with confidence scoring, by-entity lookup, attribute-level patch. |
prompts.py | 354 | /v1/prompts/* | Versioned prompt templates: list, fetch, PUT/PATCH, version history, validate, preview, reset to default. |
tasks.py | 177 | /v1/tasks/* | Task submission, get-by-id, list. Bridge to the agent task pipeline. |
turn_logs.py | 338 | /v1/turn-logs, /v1/tool-calls | Time-travel debugging: list turns, fetch full turn detail with assembled messages and tool calls, query tool call events. |
jobs.py | 247 | /v1/jobs/* | Background job inspection — list scheduled jobs, list job runs, manual trigger by type. |
media.py | 404 | /v1/media/* | Image/audio generation CRUD; binary content + thumbnail serving from the media_generations store. |
llm.py | 80 | /v1/llm/complete | Raw single-shot completion bypassing the pipeline — for tooling that needs the LLM but not the orchestration. |
nextcloud.py | 88 | /webhook/nextcloud, /admin/nextcloud-talk/* | Inbound Nextcloud Talk webhook; provisioning + reload for the talk-bot integration. |
openclaw_ws.py | 154 | /v1/ws | WebSocket endpoint for the OpenClaw browser/native bridge. Bidirectional event passing for in-browser agent UIs. |
ha_weather.py | 86 | /v1/ha/weather | Home Assistant weather pass-through (used by voice-mode bots that want forecast data without a tool call). |
health.py | 59 | /health, /status, /v1/status | Three health surfaces: liveness, service status (loaded models, defaults), system status (DB, scheduler, bridges). |
__init__.py | 44 | — | Aggregates all_routers for mounting. |
05 Models and bots are CRUDable at runtime.
Two of the heaviest routers (settings.py and models.py) implement live editing of the things that bots.yaml and the model definition YAMLs seed. The DB always wins:
- Bot profiles:
PUT /v1/bots/{slug}/profilefor full replace,PATCHfor partial. System prompt, default model, tool flags, agent-backend config, runtime settings — every field thatbots.yamlseeds is editable here. APOST /v1/admin/reload-botsforces a re-read after manual DB edits. - Model definitions:
PUT /v1/models/definitions/{alias}sets a model alias'stype,model_id,base_url, context window, max tokens, and adapter. APOST /v1/models/definitions/seedre-seeds from YAML. - Prompt templates: Every system-prompt fragment used by extraction, summarization, consolidation, and animation classification is a versioned
PromptTemplateeditable viaroutes/prompts.py. Each PUT creates a new version;GET /v1/prompts/{key}/versionslists them;POST .../resetrolls back to the default.
06 Turn logs are first-class.
Every chat turn writes a row to turn_logs capturing the request, the assembled message list (system prompt + history + retrieved memories + tool results), the final response, the model used, timing, status, and the full set of tool calls with their arguments and results. The streaming pipeline flushes partial response text periodically so a client reconnecting mid-stream can show progress.
routes/turn_logs.py exposes GET /v1/turn-logs (filterable list) and GET /v1/turn-logs/{turn_id} (full reconstruction). The BawtHub UI's debug pane uses these to replay any turn — to see exactly what the LLM saw — which is invaluable when a bot misbehaves.
Because every retrieved memory, every injected summary, and every tool result is captured in the assembled-message snapshot, the turn_logs table can contain anything the bot has access to — including private user facts. The endpoint is unauthenticated by default; production deployments should put it behind the same auth layer as the rest of the API.
07 The OpenClaw lifespan integration.
When OPENCLAW_WS_ENABLED and REDIS_URL are set, the lifespan handler constructs a RedisSubscriber from the openclaw-bridge package and starts two background tasks:
- Stale consumer-group cleanup. Every 5 minutes, destroys idle
ui:*Redis consumer groups so reconnecting browsers don't accumulate orphans. - Tool-event persistence. Subscribes to the bridge's
tool_start/tool_endevents on Redis streams and writes each one totool_call_recordsfor later inspection viaGET /v1/tool-calls.
A bot-id-to-session-key mapping is built at startup by walking every bot with agent_backend: openclaw and harvesting the session_key from its agent_backend_config. This is logged at startup so you can see exactly which sessions the service is listening for.
08 Running it.
The service is started by llm-service (entry point at service/server.py:main). Flags:
| Flag | Effect |
|---|---|
--host / --port | Bind address. Defaults from LLM_BAWT_SERVICE_HOST / _PORT. |
--reload | uvicorn auto-reload for development. Excludes __pycache__, .logs, .run, and models from the watcher to prevent feedback loops. |
--restart | If a service is already listening on the port, SIGTERM it (then SIGKILL) before starting fresh. |
--stop | Kill the running service and exit. Looks up the PID via lsof -ti tcp:<port>. |
--verbose / --debug | Verbose enables payload logging; debug enables raw uvicorn DEBUG output. |
09 Key files.
service/api.pyservice/server.pyapp object so uvicorn llm_bawt.service.server:app resolves correctly.service/background_service.pyBackgroundService. The long-running orchestrator class. Composes ChatStreamingMixin + TurnLifecycleMixin; caches one LLMBawt per bot; manages the worker thread.service/chat_streaming.pyservice/chat_stream_worker.pyservice/turn_lifecycle.pyservice/schemas.pyChatCompletionRequest, response shapes, OpenAI-compatible types plus llm-bawt extensions.service/scheduler.pySCHEDULER_CHECK_INTERVAL_SECONDS; runs extraction, consolidation, decay pruning, recurrence detection.service/routes/__init__.py aggregates them as all_routers.main on 2026-05-13
Source: llm-bawt/src/llm_bawt/service