llm-bawt · api

OpenAI-compatible, but more.

The FastAPI service in src/llm_bawt/service/ speaks the standard OpenAI chat-completions protocol on the wire — POST a messages array, get streaming SSE chunks back. That gets you compatibility with anything that knows how to talk to OpenAI. Underneath, the same service exposes 18 route modules covering bot CRUD, persistent memory inspection, conversation history, prompt templates, the agent task system, turn-log replay, media generation, runtime settings, and a WebSocket bridge to the OpenClaw gateway.

Framework: FastAPI + uvicorn Port: 8642 Streaming: SSE (text/event-stream) Route modules: 18 (~5,200 lines)

01 Entry point.

The app factory lives in service/api.py. The lifespan handler at module load does five things in order:

Wire MCP_SERVER_URL so memory operations are routed via MCP tools (loggable, traceable) rather than direct DB access.
Seed and merge model definitions from ModelDefinitionStore — DB always wins over the YAML.
Warm the sentence-transformer embedding model on a background thread. The first inference takes ~6 s; warming it off-thread keeps the first chat after a restart fast.
Start the BackgroundService worker, the JobScheduler (if LLM_BAWT_SCHEDULER_ENABLED), and the OpenClaw Redis subscriber if configured.
Mount every router from service/routes/__init__.py:
health, ha_weather, nextcloud, models, openclaw_ws, botchat, chat, tasks, turn_logs, jobs, history, memory, prompts, settings, profiles, llm, media

02 The chat-completions surface.

The OpenAI-compatible endpoint is small. routes/chat.py is 214 lines and defines three POSTs:

Endpoint	Purpose
`POST /v1/chat/completions`	The OpenAI chat completion. Supports `stream`, all standard fields, plus the llm-bawt extensions described below.
`POST /v1/chat/abort`	Abort an in-flight turn by `turn_id`. Routes a `chat.abort` RPC to the agent bridge if applicable; marks the turn-log row as `aborted` regardless.
`POST /v1/chat/session/reset`	For agent-backend bots: send `session.reset` to the bridge to clear the SDK-side thread and start fresh on next message.

The request schema (ChatCompletionRequest in service/schemas.py) accepts the standard OpenAI fields and adds:

Field	Default	Effect
`bot_id`	`null` → service default	Selects which bot personality, memory namespace, tool set, and agent backend to use.
`augment_memory`	`true`	Whether the pipeline injects retrieved memories into the system prompt.
`extract_memory`	`true`	Whether to enqueue fact extraction from this turn after the response completes.
`include_summaries`	`true`	Whether to inject rolled-up session summaries into context.
`tts_mode`	`false`	Append TTS-friendly formatting rules to the system prompt (short sentences, no markdown, etc.).
`user_message_id`	`null`	Frontend-generated UUID used as `trigger_message_id` in turn logs and tool events.
`animations`	`null`	Avatar animation catalog (name + description) when the bot may invoke an animation tool. Owned by the bawthub frontend, not the backend.
`avatar_visible`	`null`	Gate flag for animation work — don't run avatar classification when no avatar is rendered.

03 Streaming the response.

Non-streaming requests return a normal ChatCompletionResponse. Streaming requests return SSE: a StreamingResponse wrapping BackgroundService.chat_completion_stream, which is implemented across service/chat_streaming.py and service/chat_stream_worker.py.

Three things happen in parallel during a streaming turn:

The pipeline runs in a worker thread. It's CPU- and IO-bound (memory search, tool calls, eventual LLM call); offloading keeps the event loop responsive.
Token deltas flow into an asyncio.Queue. The streaming generator awaits the queue and formats each chunk as a standard OpenAI {"choices":[{"delta":{"content":"..."}}]} SSE event.
Tool events fan out through a separate channel. Each tool invocation emits a tool_call event before execution and a tool_result event after; the streaming generator interleaves these with the content deltas so frontends can render tool progress in real time.

✦

Custom SSE events on top of the OpenAI stream.

Beyond the OpenAI-compatible data: {choices:[...]} chunks, the stream emits tool_call, tool_result, turn_metadata, and turn_complete event types. Frontends that only know about OpenAI deltas still get a compliant stream; richer frontends (BawtHub) parse the extras to render tool calls, model badges, and token usage pills.

04 The 18 route modules.

Each file under service/routes/ registers one APIRouter. routes/__init__.py imports them and exposes all_routers for the app factory.

Module	Lines	Prefix	What it does
`chat.py`	214	`/v1/chat/*`	OpenAI chat completions, abort, session reset.
`botchat.py`	155	`/v1/bots/{id}/chat`	Bot-scoped chat with isolated memory; lighter wrapper around the chat surface.
`models.py`	376	`/v1/models`, `/v1/bots`	OpenAI `GET /v1/models`, upstream provider model list, runtime model switching, DB-resident model definitions CRUD.
`settings.py`	923	`/v1/settings`, `/v1/bots/`, `/v1/admin/`	Runtime settings, bot profile CRUD (PUT/PATCH/DELETE per slug), bot data purge, soul sync/push, orphan cleanup. The biggest router by far.
`history.py`	763	`/v1/history/*`	Conversation history search, summarization preview / run / rebuild, summary listing and deletion.
`memory.py`	399	`/v1/memory/*`	Memory stats, search, get-by-message, delete, patch, forget/restore, preview windows, regenerate embeddings, consolidate.
`profiles.py`	372	`/v1/profiles/*`	User + bot profile attribute CRUD with confidence scoring, by-entity lookup, attribute-level patch.
`prompts.py`	354	`/v1/prompts/*`	Versioned prompt templates: list, fetch, PUT/PATCH, version history, validate, preview, reset to default.
`tasks.py`	177	`/v1/tasks/*`	Task submission, get-by-id, list. Bridge to the agent task pipeline.
`turn_logs.py`	338	`/v1/turn-logs`, `/v1/tool-calls`	Time-travel debugging: list turns, fetch full turn detail with assembled messages and tool calls, query tool call events.
`jobs.py`	247	`/v1/jobs/*`	Background job inspection — list scheduled jobs, list job runs, manual trigger by type.
`media.py`	404	`/v1/media/*`	Image/audio generation CRUD; binary content + thumbnail serving from the `media_generations` store.
`llm.py`	80	`/v1/llm/complete`	Raw single-shot completion bypassing the pipeline — for tooling that needs the LLM but not the orchestration.
`nextcloud.py`	88	`/webhook/nextcloud`, `/admin/nextcloud-talk/*`	Inbound Nextcloud Talk webhook; provisioning + reload for the talk-bot integration.
`openclaw_ws.py`	154	`/v1/ws`	WebSocket endpoint for the OpenClaw browser/native bridge. Bidirectional event passing for in-browser agent UIs.
`ha_weather.py`	86	`/v1/ha/weather`	Home Assistant weather pass-through (used by voice-mode bots that want forecast data without a tool call).
`health.py`	59	`/health`, `/status`, `/v1/status`	Three health surfaces: liveness, service status (loaded models, defaults), system status (DB, scheduler, bridges).
`__init__.py`	44	—	Aggregates `all_routers` for mounting.

05 Models and bots are CRUDable at runtime.

Two of the heaviest routers (settings.py and models.py) implement live editing of the things that bots.yaml and the model definition YAMLs seed. The DB always wins:

Bot profiles: PUT /v1/bots/{slug}/profile for full replace, PATCH for partial. System prompt, default model, tool flags, agent-backend config, runtime settings — every field that bots.yaml seeds is editable here. A POST /v1/admin/reload-bots forces a re-read after manual DB edits.
Model definitions: PUT /v1/models/definitions/{alias} sets a model alias's type, model_id, base_url, context window, max tokens, and adapter. A POST /v1/models/definitions/seed re-seeds from YAML.
Prompt templates: Every system-prompt fragment used by extraction, summarization, consolidation, and animation classification is a versioned PromptTemplate editable via routes/prompts.py. Each PUT creates a new version; GET /v1/prompts/{key}/versions lists them; POST .../reset rolls back to the default.

06 Turn logs are first-class.

Every chat turn writes a row to turn_logs capturing the request, the assembled message list (system prompt + history + retrieved memories + tool results), the final response, the model used, timing, status, and the full set of tool calls with their arguments and results. The streaming pipeline flushes partial response text periodically so a client reconnecting mid-stream can show progress.

routes/turn_logs.py exposes GET /v1/turn-logs (filterable list) and GET /v1/turn-logs/{turn_id} (full reconstruction). The BawtHub UI's debug pane uses these to replay any turn — to see exactly what the LLM saw — which is invaluable when a bot misbehaves.

⚠

Turn logs contain prompt assembly output.

Because every retrieved memory, every injected summary, and every tool result is captured in the assembled-message snapshot, the turn_logs table can contain anything the bot has access to — including private user facts. The endpoint is unauthenticated by default; production deployments should put it behind the same auth layer as the rest of the API.

07 The OpenClaw lifespan integration.

When OPENCLAW_WS_ENABLED and REDIS_URL are set, the lifespan handler constructs a RedisSubscriber from the openclaw-bridge package and starts two background tasks:

Stale consumer-group cleanup. Every 5 minutes, destroys idle ui:* Redis consumer groups so reconnecting browsers don't accumulate orphans.
Tool-event persistence. Subscribes to the bridge's tool_start / tool_end events on Redis streams and writes each one to tool_call_records for later inspection via GET /v1/tool-calls.

A bot-id-to-session-key mapping is built at startup by walking every bot with agent_backend: openclaw and harvesting the session_key from its agent_backend_config. This is logged at startup so you can see exactly which sessions the service is listening for.

08 Running it.

The service is started by llm-service (entry point at service/server.py:main). Flags:

Flag	Effect
`--host` / `--port`	Bind address. Defaults from `LLM_BAWT_SERVICE_HOST` / `_PORT`.
`--reload`	uvicorn auto-reload for development. Excludes `__pycache__`, `.logs`, `.run`, and `models` from the watcher to prevent feedback loops.
`--restart`	If a service is already listening on the port, SIGTERM it (then SIGKILL) before starting fresh.
`--stop`	Kill the running service and exit. Looks up the PID via `lsof -ti tcp:<port>`.
`--verbose` / `--debug`	Verbose enables payload logging; debug enables raw uvicorn DEBUG output.

09 Key files.

service/api.py

App factory + lifespan. 438 lines. Mounts all routers, warms embeddings, starts scheduler + Redis subscriber, drains tool events, owns SIGTERM/SIGKILL stop logic.

service/server.py

uvicorn entry point. Re-exports the FastAPI app object so uvicorn llm_bawt.service.server:app resolves correctly.

service/background_service.py

BackgroundService. The long-running orchestrator class. Composes ChatStreamingMixin + TurnLifecycleMixin; caches one LLMBawt per bot; manages the worker thread.

service/chat_streaming.py

SSE generator. Native pipeline streaming + OpenClaw bridge streaming. Periodic partial-text flushes so reconnecting clients see progress.

service/chat_stream_worker.py

Thread → async bridge. The pipeline runs in a worker thread; this module shuttles chunks across the thread boundary into the async SSE generator.

service/turn_lifecycle.py

Turn-log persistence + cancellation. Mixin handling DB writes for every turn, the cancel/done event pair for in-flight aborts, and SSE event buffering.

service/schemas.py

All Pydantic models. ChatCompletionRequest, response shapes, OpenAI-compatible types plus llm-bawt extensions.

service/scheduler.py

Background job runner. Polls scheduled jobs at SCHEDULER_CHECK_INTERVAL_SECONDS; runs extraction, consolidation, decay pruning, recurrence detection.

service/routes/

18 route modules. Listed in detail above. __init__.py aggregates them as all_routers.

PreviousOverview NextRequest pipeline

Validated against main on 2026-05-13 Source: llm-bawt/src/llm_bawt/service