One backend, every model.
llm-bawt is the brain. It speaks the OpenAI chat-completions protocol on the wire so any frontend can use it, but inside it's a layered system: a seven-stage request pipeline, a pluggable model-client abstraction that fronts OpenAI, xAI, Anthropic, llama.cpp, vLLM and external agent bridges, a 60-tool MCP server, and the five-layer memory described separately. This page is the map.
01 What it is.
llm-bawt is a self-hosted Python package — the llm_bawt module under src/ — that runs as a FastAPI service exposing an OpenAI-compatible chat API. Externally it looks like the OpenAI /v1/chat/completions endpoint: a frontend POSTs a messages array and gets back streaming SSE chunks. Internally, every request is decorated with persistent memory, tool calls, profile attributes, and (optionally) routed to a remote agent over a Redis-mediated bridge instead of a hosted LLM.
The package is monolithic but the surfaces are clean. The cli/ module is a Rich-based terminal client that talks to the service or runs a model in-process. The service/ module is the FastAPI app that owns the HTTP surface. The mcp_server/ module is a separate FastMCP server that exposes the same memory and task subsystems to external MCP clients. The core/ module is what binds them together — the request pipeline, the system-prompt builder, the model lifecycle.
02 The subsystem map.
Every page in this cluster documents one of these subdirectories under src/llm_bawt/:
service/ — the OpenAI-compatible HTTP layer. 18 route modules, streaming SSE, bot CRUD, memory and history endpoints, turn-log inspection, scheduler triggers, OpenClaw WS bridge.core/ — seven discrete stages from PRE_PROCESS to POST_PROCESS, a composable PromptBuilder, and a singleton ModelLifecycleManager for swapping primary models at runtime.clients/ — one abstract LLMClient base class with concrete implementations for OpenAI, xAI Grok (Responses API), llama.cpp, vLLM, and the agent-backend wrapper that delegates to external SDKs.memory/ — five layers: raw messages, distilled semantic memories with pgvector, rolling summaries, profile attributes, sessions. Decay, supersede chains, and local-only consolidation.tools/ — the multi-turn tool loop, native OpenAI function-calling, ReAct fallback parser, and a streaming variant that emits tool events as SSE.agent_backends/ — pluggable bridges for Claude Code, OpenAI Codex, and the OpenClaw gateway. Each appears to the rest of llm-bawt as a model alias.mcp_server/ — 60 tools over streamable-http (port 8001) covering memory, messages, sessions, profiles, bot-to-bot messaging, and the full agent task pipeline.adapters/ — per-model chat templates and output sanitizers. Pygmalion, Dolphin, default. Auto-detected from model alias or repo id.03 Request lifecycle (high level).
A chat completion request lands on FastAPI's POST /v1/chat/completions in service/routes/chat.py. The route delegates to BackgroundService.chat_completion_stream, which builds an LLMBawt instance scoped to the requested bot_id and model alias, attaches a history manager and memory client, and runs the request through RequestPipeline in core/pipeline.py.
POST /v1/chat/completions → routes/chat.py → BackgroundServiceLLMClient subclass — OpenAI / Grok / vLLM / llama.cpp / AgentBackendClient — streams text chunksToolLoop dispatches via ToolExecutor (native or ReAct), feeds result backThe pipeline executes synchronously inside an asyncio task. Streaming output is fed through an asyncio.Queue back to the SSE generator in service/chat_streaming.py, which formats each chunk as an OpenAI-compatible data: {...} event. Tool calls, tool results, turn metadata, and final usage are emitted as additional SSE event types alongside the standard delta chunks — frontends that ignore the extensions still see a compliant stream.
04 Bots are personalities, not models.
A request is parameterized by two independent things: model (which client to use) and bot_id (which personality, memory namespace, tool set, and system prompt to apply). Bots are defined in bots.yaml as seeds and then persisted to Postgres so they can be edited at runtime via PUT/PATCH /v1/bots/{slug}/profile. Each bot owns its own memory tables (see the memory page) and declares flags for what subsystems it touches:
| Flag | Effect |
|---|---|
requires_memory | Enables the memory client; without it the bot is stateless |
uses_tools | Enables the tool loop; otherwise LLMClient.query is called directly |
uses_search | Enables the search-provider tool family (Brave / Tavily / DDGS / Reddit) |
voice_optimized | Hints to the prompt builder to bias toward TTS-friendly output |
agent_backend | Replaces native model dispatch with a bridge — claude-code, codex, openclaw |
Three seed bots ship in bots.yaml: nova (full-featured, memory + tools + search), spark (no database, stateless), and mira (voice-optimized, conversational, high memory budget). The bots.yaml header explicitly warns that the DB is authoritative — the YAML is a seed and a fallback, not a source of truth.
05 Two protocols, one service.
The service process opens two HTTP listeners:
The MCP server runs in-process inside the same uvicorn worker (see service/api.py lifespan and service/background_service._ensure_mcp_server) but is reachable as if it were a separate service. This is deliberate: when a bot invokes a tool internally, the tool loop dispatches via ToolExecutor directly against the Python functions; when an external agent invokes the same tool, it goes over JSON-RPC. Same code path, different transport.
06 External agents are first-class bots.
Claude Code and OpenAI Codex are not separate products — they're agent backends registered under llm_bawt.agent_backends. A bot with agent_backend: claude-code appears in /v1/models as a model named claude-code. When a request targets it, the pipeline still runs (history, memory injection, system prompt, post-processing) but the EXECUTE stage calls AgentBackendClient.query instead of an HTTP API. That client sends a Redis command to the corresponding bridge process (claude-code-bridge, codex-bridge, or openclaw-bridge), which proxies the prompt to the agent SDK and streams events back over a Redis stream.
Claude Code running through the bridge gets the same persistent memory, summary injection, and post-turn fact extraction that a Grok-powered nova turn gets — because the agent backend is just another LLMClient. The agent's own tool calls (Bash, Read, Write, etc.) get captured as turn-log tool events from the bridge's event stream.
07 Where state lives.
All persistent state is in PostgreSQL with the pgvector extension. The service expects an externally-managed Postgres instance; the docker-compose stack does not bundle one. SQLAlchemy and SQLModel handle ORM. Tables are created on first run.
- Per-bot tables:
{bot}_messages,{bot}_memories,{bot}_summaries,{bot}_forgotten. Sanitized bot slugs become namespace prefixes. - Shared tables:
profile_attributes,sessions,turn_logs,tool_call_records,model_definitions,prompt_templates,runtime_settings,media_generations,scheduled_jobs,scheduled_job_runs. - External services: Redis (for agent bridges and the SSE turn-log fan-out), Crawl4AI (for the
web_fetchtool), Home Assistant MCP (for thehometool).
08 Top-level layout.
main.pyllm CLI entry point. Wraps cli/main.py. Argument parsing, model resolution, interactive mode loop. Talks to the service or runs in-process.service/api.pyservice/background_service.pyChatStreamingMixin + TurnLifecycleMixin.core/base.pyBaseLLMBawt. 877 lines. Shared CLI + service logic: bot resolution, memory client init, history manager, system prompt assembly, pipeline invocation.core/pipeline.pyRequestPipeline. 605 lines. Seven stages, per-stage hooks, decision-point overrides, timing instrumentation.mcp_server/server.pytask_tools.py.bots.pyBotManager. YAML seed → DB persistence; per-bot system prompt, tool flags, agent backend config; runtime CRUD via the API.bots.yamlruntime_settings.pyRuntimeSettingsResolver for per-bot setting overlays; ModelDefinitionStore for DB-resident model aliases that override the YAML.model_manager.pydefined_models.yaml, merges DB overrides, exposes aliases like grok-4-fast → {type: grok, model_id: grok-4-fast}.09 Suggested reading order.
If you're trying to understand the codebase end-to-end, this cluster is best read in roughly the order of a request: API (where requests land) → pipeline (the seven stages) → clients (model dispatch) → memory (what gets injected and stored) → tools (in-turn dispatch) → agent backends (the bridge pattern) → MCP server (the external surface) → adapters (output cleanup, model quirks).
main on 2026-05-13
Source: llm-bawt/src/llm_bawt