llm-bawt · overview

One backend, every model.

llm-bawt is the brain. It speaks the OpenAI chat-completions protocol on the wire so any frontend can use it, but inside it's a layered system: a seven-stage request pipeline, a pluggable model-client abstraction that fronts OpenAI, xAI, Anthropic, llama.cpp, vLLM and external agent bridges, a 60-tool MCP server, and the five-layer memory described separately. This page is the map.

Language: Python 3.12+ HTTP: FastAPI on port 8642 MCP: FastMCP on port 8001 Storage: PostgreSQL + pgvector

01 What it is.

llm-bawt is a self-hosted Python package — the llm_bawt module under src/ — that runs as a FastAPI service exposing an OpenAI-compatible chat API. Externally it looks like the OpenAI /v1/chat/completions endpoint: a frontend POSTs a messages array and gets back streaming SSE chunks. Internally, every request is decorated with persistent memory, tool calls, profile attributes, and (optionally) routed to a remote agent over a Redis-mediated bridge instead of a hosted LLM.

The package is monolithic but the surfaces are clean. The cli/ module is a Rich-based terminal client that talks to the service or runs a model in-process. The service/ module is the FastAPI app that owns the HTTP surface. The mcp_server/ module is a separate FastMCP server that exposes the same memory and task subsystems to external MCP clients. The core/ module is what binds them together — the request pipeline, the system-prompt builder, the model lifecycle.

02 The subsystem map.

Every page in this cluster documents one of these subdirectories under src/llm_bawt/:

FastAPI service

service/ — the OpenAI-compatible HTTP layer. 18 route modules, streaming SSE, bot CRUD, memory and history endpoints, turn-log inspection, scheduler triggers, OpenClaw WS bridge.

Request pipeline

core/ — seven discrete stages from PRE_PROCESS to POST_PROCESS, a composable PromptBuilder, and a singleton ModelLifecycleManager for swapping primary models at runtime.

Model clients

clients/ — one abstract LLMClient base class with concrete implementations for OpenAI, xAI Grok (Responses API), llama.cpp, vLLM, and the agent-backend wrapper that delegates to external SDKs.

Memory system

memory/ — five layers: raw messages, distilled semantic memories with pgvector, rolling summaries, profile attributes, sessions. Decay, supersede chains, and local-only consolidation.

Tool execution

tools/ — the multi-turn tool loop, native OpenAI function-calling, ReAct fallback parser, and a streaming variant that emits tool events as SSE.

Agent backends

agent_backends/ — pluggable bridges for Claude Code, OpenAI Codex, and the OpenClaw gateway. Each appears to the rest of llm-bawt as a model alias.

MCP server

mcp_server/ — 60 tools over streamable-http (port 8001) covering memory, messages, sessions, profiles, bot-to-bot messaging, and the full agent task pipeline.

Model adapters

adapters/ — per-model chat templates and output sanitizers. Pygmalion, Dolphin, default. Auto-detected from model alias or repo id.

03 Request lifecycle (high level).

A chat completion request lands on FastAPI's POST /v1/chat/completions in service/routes/chat.py. The route delegates to BackgroundService.chat_completion_stream, which builds an LLMBawt instance scoped to the requested bot_id and model alias, attaches a history manager and memory client, and runs the request through RequestPipeline in core/pipeline.py.

Request flow · single chat turn

HTTP

POST /v1/chat/completions → routes/chat.py → BackgroundService

Pipeline

pre-process → context build → memory retrieval → history filter → message assembly → execute → post-process

Client

LLMClient subclass — OpenAI / Grok / vLLM / llama.cpp / AgentBackendClient — streams text chunks

Tools

if model calls a tool, ToolLoop dispatches via ToolExecutor (native or ReAct), feeds result back

Post

history persistence, tool-result snapshot, turn-log write, optional scheduler-driven fact extraction

The pipeline executes synchronously inside an asyncio task. Streaming output is fed through an asyncio.Queue back to the SSE generator in service/chat_streaming.py, which formats each chunk as an OpenAI-compatible data: {...} event. Tool calls, tool results, turn metadata, and final usage are emitted as additional SSE event types alongside the standard delta chunks — frontends that ignore the extensions still see a compliant stream.

04 Bots are personalities, not models.

A request is parameterized by two independent things: model (which client to use) and bot_id (which personality, memory namespace, tool set, and system prompt to apply). Bots are defined in bots.yaml as seeds and then persisted to Postgres so they can be edited at runtime via PUT/PATCH /v1/bots/{slug}/profile. Each bot owns its own memory tables (see the memory page) and declares flags for what subsystems it touches:

Flag	Effect
`requires_memory`	Enables the memory client; without it the bot is stateless
`uses_tools`	Enables the tool loop; otherwise `LLMClient.query` is called directly
`uses_search`	Enables the search-provider tool family (Brave / Tavily / DDGS / Reddit)
`voice_optimized`	Hints to the prompt builder to bias toward TTS-friendly output
`agent_backend`	Replaces native model dispatch with a bridge — `claude-code`, `codex`, `openclaw`

Three seed bots ship in bots.yaml: nova (full-featured, memory + tools + search), spark (no database, stateless), and mira (voice-optimized, conversational, high memory budget). The bots.yaml header explicitly warns that the DB is authoritative — the YAML is a seed and a fallback, not a source of truth.

05 Two protocols, one service.

The service process opens two HTTP listeners:

Listeners · same process, different protocols

Port 8642 · FastAPI / OpenAI-compatible

Chat completions, model list, bot CRUD, memory CRUD, history, prompts, turn logs, scheduler triggers, media generation. Consumed by frontends (BawtHub, unmute, custom UIs).

Port 8001 · FastMCP / streamable-http

60 tools for memory, messages, sessions, profile, inter-bot messaging, and the full agent task pipeline. Consumed by external agents (VS Code MCP, Claude Desktop, agent bridges).

The MCP server runs in-process inside the same uvicorn worker (see service/api.py lifespan and service/background_service._ensure_mcp_server) but is reachable as if it were a separate service. This is deliberate: when a bot invokes a tool internally, the tool loop dispatches via ToolExecutor directly against the Python functions; when an external agent invokes the same tool, it goes over JSON-RPC. Same code path, different transport.

06 External agents are first-class bots.

Claude Code and OpenAI Codex are not separate products — they're agent backends registered under llm_bawt.agent_backends. A bot with agent_backend: claude-code appears in /v1/models as a model named claude-code. When a request targets it, the pipeline still runs (history, memory injection, system prompt, post-processing) but the EXECUTE stage calls AgentBackendClient.query instead of an HTTP API. That client sends a Redis command to the corresponding bridge process (claude-code-bridge, codex-bridge, or openclaw-bridge), which proxies the prompt to the agent SDK and streams events back over a Redis stream.

✦

The bridge pattern means every memory subsystem applies to remote agents.

Claude Code running through the bridge gets the same persistent memory, summary injection, and post-turn fact extraction that a Grok-powered nova turn gets — because the agent backend is just another LLMClient. The agent's own tool calls (Bash, Read, Write, etc.) get captured as turn-log tool events from the bridge's event stream.

07 Where state lives.

All persistent state is in PostgreSQL with the pgvector extension. The service expects an externally-managed Postgres instance; the docker-compose stack does not bundle one. SQLAlchemy and SQLModel handle ORM. Tables are created on first run.

Per-bot tables: {bot}_messages, {bot}_memories, {bot}_summaries, {bot}_forgotten. Sanitized bot slugs become namespace prefixes.
Shared tables: profile_attributes, sessions, turn_logs, tool_call_records, model_definitions, prompt_templates, runtime_settings, media_generations, scheduled_jobs, scheduled_job_runs.
External services: Redis (for agent bridges and the SSE turn-log fan-out), Crawl4AI (for the web_fetch tool), Home Assistant MCP (for the home tool).

08 Top-level layout.

main.py

The llm CLI entry point. Wraps cli/main.py. Argument parsing, model resolution, interactive mode loop. Talks to the service or runs in-process.

service/api.py

FastAPI app factory + lifespan. Mounts all routers, warms the embedding model in a background thread, starts the scheduler, wires the OpenClaw Redis subscriber.

service/background_service.py

Long-running service orchestrator. Owns the LLMBawt-per-bot cache, the turn log store, the Redis subscriber. Composes the ChatStreamingMixin + TurnLifecycleMixin.

core/base.py

BaseLLMBawt. 877 lines. Shared CLI + service logic: bot resolution, memory client init, history manager, system prompt assembly, pipeline invocation.

core/pipeline.py

RequestPipeline. 605 lines. Seven stages, per-stage hooks, decision-point overrides, timing instrumentation.

mcp_server/server.py

FastMCP server. 1,242 lines. 41 tools for memory/messages/sessions/profile/bot dispatch; another 19 task-system tools live in task_tools.py.

bots.py

BotManager. YAML seed → DB persistence; per-bot system prompt, tool flags, agent backend config; runtime CRUD via the API.

bots.yaml

Seed bot definitions. nova, spark, mira and friends. The header warns: the DB is the source of truth — this file is fallback only.

runtime_settings.py

Runtime config layer. RuntimeSettingsResolver for per-bot setting overlays; ModelDefinitionStore for DB-resident model aliases that override the YAML.

model_manager.py

Model alias resolution. Loads defined_models.yaml, merges DB overrides, exposes aliases like grok-4-fast → {type: grok, model_id: grok-4-fast}.

09 Suggested reading order.

If you're trying to understand the codebase end-to-end, this cluster is best read in roughly the order of a request: API (where requests land) → pipeline (the seven stages) → clients (model dispatch) → memory (what gets injected and stored) → tools (in-turn dispatch) → agent backends (the bridge pattern) → MCP server (the external surface) → adapters (output cleanup, model quirks).

PreviousSystem map NextFastAPI service

Validated against main on 2026-05-13 Source: llm-bawt/src/llm_bawt