Seven stages, one turn.
Every chat turn — CLI, API, or agent bridge — flows through the same seven-stage RequestPipeline in core/pipeline.py. Each stage has a clearly defined input/output contract, exposes a hook point for extensions, and records timing and debug output into a shared PipelineContext dataclass. The result is that "what did the LLM actually see" is always reconstructable.
01 The shape of a pipeline.
A RequestPipeline is instantiated once per turn (cheap — no global state) with a bot, an LLM client, an adapter, and a fistful of optional helpers: memory client, profile manager, search client, home client, history manager, model lifecycle manager. It then receives a PipelineContext and walks the seven stages in order:
    stages = [
        (PipelineStage.PRE_PROCESS, self._stage_pre_process),
        (PipelineStage.CONTEXT_BUILD, self._stage_context_build),
        (PipelineStage.MEMORY_RETRIEVAL, self._stage_memory_retrieval),
        (PipelineStage.HISTORY_FILTER, self._stage_history_filter),
        (PipelineStage.MESSAGE_ASSEMBLY, self._stage_message_assembly),
        (PipelineStage.EXECUTE, self._stage_execute),
        (PipelineStage.POST_PROCESS, self._stage_post_process),
    ]
Each stage:
- Reads from PipelineContext (input fields + outputs from earlier stages).
- Writes to PipelineContext (its own stage outputs).
- Runs any user-registered hooks via add_hook(stage, callable) after the default logic.
- Records elapsed time into ctx.stage_timings[stage.name].
- Records a JSON-friendly snapshot into ctx.stage_outputs[stage.name] when in --debug mode.
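The per-stage contract above can be sketched roughly as follows. This is a minimal illustration, not the code in core/pipeline.py: MiniPipeline, MiniContext, and the two-stage Stage enum are hypothetical stand-ins, and whether hook time counts toward the stage timing is an assumption. Only add_hook, stage_timings, and stage_outputs are names taken from the text.

```python
import time
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Callable


class Stage(Enum):  # hypothetical, trimmed to two stages for brevity
    PRE_PROCESS = auto()
    EXECUTE = auto()


@dataclass
class MiniContext:  # hypothetical stand-in for PipelineContext
    prompt: str
    debug: bool = False
    response: str = ""
    stage_timings: dict[str, float] = field(default_factory=dict)
    stage_outputs: dict[str, Any] = field(default_factory=dict)


class MiniPipeline:
    def __init__(self) -> None:
        self._hooks: dict[Stage, list[Callable[[MiniContext], None]]] = {}

    def add_hook(self, stage: Stage, hook: Callable[[MiniContext], None]) -> None:
        self._hooks.setdefault(stage, []).append(hook)

    def _stage_pre_process(self, ctx: MiniContext) -> None:
        ctx.prompt = ctx.prompt.strip()

    def _stage_execute(self, ctx: MiniContext) -> None:
        ctx.response = f"echo: {ctx.prompt}"  # placeholder for the LLM call

    def run(self, ctx: MiniContext) -> MiniContext:
        stages = [
            (Stage.PRE_PROCESS, self._stage_pre_process),
            (Stage.EXECUTE, self._stage_execute),
        ]
        for stage, handler in stages:
            started = time.perf_counter()
            handler(ctx)  # default stage logic runs first
            for hook in self._hooks.get(stage, []):
                hook(ctx)  # then any user-registered hooks
            # Assumption: hook time is included in the stage timing.
            ctx.stage_timings[stage.name] = time.perf_counter() - started
            if ctx.debug:
                ctx.stage_outputs[stage.name] = {
                    "prompt": ctx.prompt,
                    "response": ctx.response,
                }
        return ctx


pipeline = MiniPipeline()
pipeline.add_hook(
    Stage.EXECUTE, lambda ctx: setattr(ctx, "response", ctx.response.upper())
)
result = pipeline.run(MiniContext(prompt="  hello  ", debug=True))
```

Because the hook runs after the default logic, it sees (and here rewrites) the response the stage just produced, and the debug snapshot records the post-hook state.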
02 The shared context object.
PipelineContext is a frozen-shape dataclass; every stage knows exactly which keys are populated by the time it runs. Input fields: prompt, user_id, bot_id, stream, plaintext_output. Stage outputs accumulate:
| Field | Populated by | What it holds |
|---|---|---|
| prompt_builder | CONTEXT_BUILD | The composable system-prompt builder with all named sections. |
| memory_results | MEMORY_RETRIEVAL | Cold-start memory hits (only when history is sparse). |
| include_history | HISTORY_FILTER | Whether to include conversation history this turn. |
| messages | MESSAGE_ASSEMBLY | The final ordered list of Message objects sent to the LLM. |
| tool_definitions | CONTEXT_BUILD | The tool catalog the model will see this turn. |
| tool_format | CONTEXT_BUILD | native_openai, react, or xml, picked from model + bot config. |
| response | EXECUTE | The LLM's final text response. |
| tool_context | EXECUTE | Summary of any tool interactions for post-process history persistence. |
| tool_call_details | EXECUTE | Per-call detail: name, arguments, result, iteration. Used by debug + turn logs. |
Three decision flags govern conditional logic and can be overridden by the caller (handy for tests and for the "raw completion" endpoint):
- use_memory: set by default from bot.requires_memory and memory_client is not None.
- use_tools: set from bot.uses_tools and any tool-providing client is present.
- use_search: set from bot.uses_search and search_client is not None.
03 Stage 1 — pre-process.
The lightest stage. Resolves the three decision flags from bot config + available clients, applies any explicit overrides from the caller, and logs the decisions if --debug. No I/O.
This is also where any future input validation, prompt rewriting, or PII redaction would slot in. Today it's intentionally minimal.
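The flag resolution described for this stage might look roughly like the sketch below. BotConfig, resolve_flags, and the tool_clients and overrides parameters are hypothetical names chosen for illustration; only the three flag names and the bot attributes come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotConfig:  # hypothetical stand-in for the bot object
    requires_memory: bool = False
    uses_tools: bool = False
    uses_search: bool = False


def resolve_flags(
    bot: BotConfig,
    memory_client: Optional[object] = None,
    search_client: Optional[object] = None,
    tool_clients: tuple = (),
    overrides: Optional[dict[str, bool]] = None,
) -> dict[str, bool]:
    """Derive defaults from bot config + available clients, then apply overrides."""
    flags = {
        "use_memory": bot.requires_memory and memory_client is not None,
        "use_tools": bot.uses_tools and any(c is not None for c in tool_clients),
        "use_search": bot.uses_search and search_client is not None,
    }
    # Explicit caller overrides win over anything derived from config.
    flags.update(overrides or {})
    return flags


bot = BotConfig(requires_memory=True, uses_tools=True)
defaults = resolve_flags(bot, memory_client=object(), tool_clients=(object(),))
raw = resolve_flags(
    bot,
    memory_client=object(),
    tool_clients=(object(),),
    overrides={"use_tools": False, "use_memory": False},
)
```

The second call mimics a raw single-shot completion: the bot is configured for memory and tools, but the caller switches both off for this turn. The function touches no network or disk, which matches the "No I/O" property of the stage.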
04 Stage 2 — context build.
The most architecturally interesting stage. The PromptBuilder in core/prompt_builder.py is a positional, named-section assembler — every section has a slot in the final system prompt:
SectionPosition constants:
| Position | Section | What it holds |
|---|---|---|
| -1 | temporal_context | Current date/time + recent activity gist. |
| 0 | user_context | profile_manager.get_user_profile_summary(user_id): "About the User". |
| 1 | bot_traits | The bot's developed personality from profile_attributes where entity_type=BOT. |
| 2 | base_prompt | The bot's system_prompt from bots.yaml / DB. |
| 3 | memory_context | Cold-start memory hits (only when history < 4 messages). |
| 4 | tools | Tool catalog rendered for the active tool_format (or just the memory tool for non-tool memory bots). |
| 5 | client_context | System context passed by the calling client (e.g. avatar info from bawthub). |
| 6 | global_instructions | GLOBAL_SYSTEM_PROMPT: conversation recall + cross-bot memory guidance. |
Sections are added with add_section(name, content, position=...). Empty content is silently dropped. Sections can be disabled (kept in the builder but excluded from the build) — useful for hooks that want to suppress a default section without removing it. Calling builder.build() joins enabled sections by \n\n in position order.
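A toy version of a positional, named-section assembler is sketched below to make the add/disable/build behaviour concrete. MiniPromptBuilder and the disable_section method name are assumptions for illustration; the real PromptBuilder in core/prompt_builder.py may differ in its API and details.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    content: str
    position: int
    enabled: bool = True


class MiniPromptBuilder:  # hypothetical, not the real PromptBuilder
    def __init__(self) -> None:
        self._sections: dict[str, Section] = {}

    def add_section(self, name: str, content: str, position: int) -> None:
        if not content.strip():
            return  # empty content is silently dropped
        self._sections[name] = Section(name, content, position)

    def disable_section(self, name: str) -> None:  # method name is an assumption
        # The section stays in the builder but is excluded from build().
        if name in self._sections:
            self._sections[name].enabled = False

    def build(self) -> str:
        ordered = sorted(self._sections.values(), key=lambda s: s.position)
        return "\n\n".join(s.content for s in ordered if s.enabled)


builder = MiniPromptBuilder()
builder.add_section("base_prompt", "You are a helpful bot.", position=2)
builder.add_section("temporal_context", "Today is Tuesday.", position=-1)
builder.add_section("memory_context", "", position=3)  # dropped: empty
builder.add_section("tools", "Tools: memory", position=4)
builder.disable_section("tools")  # kept in the builder, left out of the prompt
system_prompt = builder.build()
```

Insertion order does not matter here: temporal_context is added second but still leads the prompt because its position is lowest.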
Tools deserve a note here. Three branches:
- Tool bot with memory: full tool catalog, including memory, history, profile, self, plus the optional search, home, model, news, web_fetch, and HA-native tools where their clients are present.
- Tool bot without memory: same catalog minus memory, history, profile, self (filtered out by name).
- Memory bot without uses_tools: the pipeline still injects a single read-only memory tool so the bot can search memories on demand, and flips use_tools=True so the tool loop runs. This is how spark-style "no tools but still smart about memory" bots work.
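The three branches can be expressed as one small selection function. This is a sketch under assumptions: select_tools is a hypothetical helper that works on tool names only, and the final "neither tools nor memory" branch is not described in the text and is included just to make the function total.

```python
MEMORY_TOOLS = {"memory", "history", "profile", "self"}


def select_tools(
    catalog: list[str], uses_tools: bool, requires_memory: bool
) -> tuple[list[str], bool]:
    """Return (tool names exposed this turn, effective use_tools flag)."""
    if uses_tools and requires_memory:
        return list(catalog), True  # full catalog
    if uses_tools:
        # Same catalog with the memory-backed tools filtered out by name.
        return [name for name in catalog if name not in MEMORY_TOOLS], True
    if requires_memory:
        # Single read-only memory tool; use_tools is flipped on so the loop runs.
        return ["memory"], True
    return [], False  # assumption: no tools and no tool loop


catalog = ["memory", "history", "profile", "self", "search", "web_fetch"]
full = select_tools(catalog, uses_tools=True, requires_memory=True)
no_memory = select_tools(catalog, uses_tools=True, requires_memory=False)
memory_only = select_tools(catalog, uses_tools=False, requires_memory=True)
```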
05 Stage 3 — memory retrieval (cold-start only).
This stage is deliberately small. With history present, the model can call memory_search on demand via the tool loop — which gives it explicit, query-shaped access rather than a static dump. The retrieval stage only fires for cold starts:
If history_manager.messages has 3 or fewer entries, the pipeline runs a single semantic search for the user's prompt with n_results=3 and min_relevance=config.MEMORY_MIN_RELEVANCE and renders the hits via memory.context_builder.build_memory_context_string, then injects them as a cold_start_memory section in the prompt builder. With more history, this stage does nothing — the model will recall on demand via the tool loop. No prophylactic injection.
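The cold-start gate amounts to a length check in front of a single search call. In the sketch below, cold_start_memories and the search callable's signature are hypothetical; the thresholds (3 or fewer history entries, n_results=3) are taken from the paragraph above, and rendering the hits into a prompt section is left out.

```python
from typing import Callable, Sequence

COLD_START_MAX_HISTORY = 3  # "3 or fewer entries"
COLD_START_N_RESULTS = 3


def cold_start_memories(
    prompt: str,
    history: Sequence[object],
    search: Callable[..., list[str]],  # hypothetical search signature
    min_relevance: float,
) -> list[str]:
    """Run one semantic search on cold starts; do nothing once history exists."""
    if len(history) > COLD_START_MAX_HISTORY:
        return []  # the model can recall on demand via the tool loop
    return search(prompt, n_results=COLD_START_N_RESULTS, min_relevance=min_relevance)


calls = []


def fake_search(query: str, n_results: int, min_relevance: float) -> list[str]:
    calls.append((query, n_results, min_relevance))
    return ["user likes tea"]


cold = cold_start_memories("what do I drink?", ["hi"], fake_search, 0.5)
warm = cold_start_memories("what do I drink?", ["m"] * 4, fake_search, 0.5)
```

Only the first call reaches the search backend; the second returns early because four history entries exceed the cold-start threshold.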
06 Stage 4 — history filter.
Very simple by design: include_history defaults to True. The token-budget logic for trimming history lives one stage later, in message assembly, where we know the client's effective context window. The filter stage is the hook point if you want to override per-turn (e.g. for raw single-shot completions).
07 Stage 5 — message assembly.
Builds the final list[Message] the client will send to the LLM:
- System message. One Message(role="system", content=prompt_builder.build()).
- History. If included, fetched via history_manager.get_context_messages(max_tokens=budget). The budget defaults to config.MAX_CONTEXT_TOKENS, but if that's 0 (auto), it's computed as llm_client.effective_context_window - llm_client.effective_max_tokens, i.e. the input budget is whatever's left after reserving the output budget.
- User prompt. Only appended explicitly if history is not being included (otherwise the history manager has already received the user message via add_message before pipeline execution).
History messages are filtered on the way through: only the roles user, assistant, and summary survive. The summary role is converted to system at the API boundary, but the pipeline keeps it tagged so summaries are visually distinguishable in turn logs.
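The budget arithmetic and the assembly order can be sketched as below. The Message dataclass here is a simplified stand-in, input_budget and assemble are hypothetical helpers, and token-based trimming (which the text attributes to get_context_messages) is deliberately left out.

```python
from dataclasses import dataclass

ALLOWED_HISTORY_ROLES = {"user", "assistant", "summary"}


@dataclass
class Message:  # simplified stand-in for the project's Message type
    role: str
    content: str


def input_budget(max_context_tokens: int, context_window: int, max_output_tokens: int) -> int:
    """0 means auto: reserve the output budget and use what is left for input."""
    if max_context_tokens > 0:
        return max_context_tokens
    return context_window - max_output_tokens


def assemble(
    system_prompt: str, history: list[Message], prompt: str, include_history: bool
) -> list[Message]:
    messages = [Message("system", system_prompt)]
    if include_history:
        # History already ends with the current user message.
        messages += [m for m in history if m.role in ALLOWED_HISTORY_ROLES]
    else:
        messages.append(Message("user", prompt))
    return messages


history = [
    Message("summary", "Earlier: greetings."),
    Message("tool", "raw tool payload"),  # role not in the allowed set
    Message("user", "What's next?"),
]
with_history = assemble("You are a bot.", history, "What's next?", include_history=True)
single_shot = assemble("You are a bot.", history, "What's next?", include_history=False)
auto_budget = input_budget(0, context_window=8192, max_output_tokens=1024)
```

With an 8192-token window and 1024 tokens reserved for output, the auto budget leaves 7168 tokens for the system prompt and history.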
08 Stage 6 — execute.
The branching point. If use_tools is set and any tool-providing client (memory, home, HA-native, news, web-fetch) is present, the stage calls tools.query_with_tools(...) — that constructs a ToolLoop (see the tools page) and runs the multi-turn dispatch until the model produces a non-tool response or hits the iteration limit. Otherwise it calls llm_client.query(...) directly.
Either way, the response is stored on ctx.response; tool interactions accumulate in ctx.tool_context (summary string for history persistence) and ctx.tool_call_details (per-call dicts for the turn log).
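The branch itself is small. In the sketch below the two callables stand in for llm_client.query and tools.query_with_tools; their signatures, and the three-tuple returned by the tool path, are assumptions made for illustration rather than the project's actual interfaces.

```python
from typing import Callable


def execute(
    messages: list,
    use_tools: bool,
    tool_clients: list,
    query: Callable[[list], str],
    query_with_tools: Callable[[list], tuple],
) -> dict:
    """Pick the tool loop or a direct query and collect the stage outputs."""
    has_tool_client = any(client is not None for client in tool_clients)
    if use_tools and has_tool_client:
        # Assumed return shape: (response, summary string, per-call dicts).
        response, tool_context, tool_call_details = query_with_tools(messages)
    else:
        response, tool_context, tool_call_details = query(messages), "", []
    return {
        "response": response,
        "tool_context": tool_context,
        "tool_call_details": tool_call_details,
    }


def fake_query(messages: list) -> str:
    return "plain answer"


def fake_tool_loop(messages: list) -> tuple:
    detail = {
        "name": "memory_search",
        "arguments": {"query": "tea"},
        "result": "1 hit",
        "iteration": 1,
    }
    return "tool answer", "memory_search -> 1 hit", [detail]


with_tools = execute([], True, [object()], fake_query, fake_tool_loop)
no_client = execute([], True, [None], fake_query, fake_tool_loop)
```

The second call shows why both conditions matter: use_tools is set, but with no tool-providing client available the stage falls back to a direct query.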
09 Stage 7 — post-process.
Writes the assistant response to history via history_manager.add_message("assistant", response). If tools ran, appends a [Tool Results @ {timestamp}] system message immediately after, so the next turn's history view sees the tool context inline. Records final stage output for the turn log.
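The two history writes can be sketched as follows. MiniHistory and post_process are hypothetical stand-ins, and the ISO-8601 timestamp format is an assumption, since the text only shows the [Tool Results @ {timestamp}] template.

```python
from datetime import datetime, timezone
from typing import Optional


class MiniHistory:  # hypothetical stand-in for the history manager
    def __init__(self) -> None:
        self.messages: list[tuple[str, str]] = []

    def add_message(self, role: str, content: str) -> None:
        self.messages.append((role, content))


def post_process(
    history: MiniHistory,
    response: str,
    tool_context: str,
    now: Optional[datetime] = None,
) -> None:
    history.add_message("assistant", response)
    if tool_context:
        # Assumption: ISO-8601 timestamp; the real format is not specified.
        stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
        history.add_message("system", f"[Tool Results @ {stamp}]\n{tool_context}")


history = MiniHistory()
fixed_now = datetime(2026, 5, 13, 12, 0, tzinfo=timezone.utc)
post_process(history, "Done.", "memory_search -> 1 hit", now=fixed_now)

plain = MiniHistory()
post_process(plain, "Hello!", "")  # no tools ran, so only one message is written
```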
Notably, fact extraction does not happen here. Despite the docstring's hint, the extraction pipeline is owned by the background scheduler (service/scheduler.py) and runs asynchronously against finished turns. This keeps the response latency clean — extraction can take seconds and uses an external LLM call.
10 Hooks and decision overrides.
Two extension surfaces exist:
- pipeline.add_hook(stage, callable): register a function that receives PipelineContext after the default stage logic. Used internally for, e.g., the botchat route adding extra system context.
- pipeline.override_decision(name, value): force use_tools / use_memory / use_search / skip_history independent of bot config. Used by the raw /v1/llm/complete endpoint and by tests.
11 Model lifecycle.
A peer to the pipeline lives in core/model_lifecycle.py: ModelLifecycleManager is a thread-safe singleton that tracks the currently loaded primary model and provides switching primitives:
- One primary model loaded at a time. This matters for GGUF and vLLM, because both hold significant GPU memory.
- unload_current_model() calls the client's own unload(), then forces gc.collect(). LlamaCppClient overrides unload to release the C++ Llama instance and clear CUDA caches.
- The model tool exposed to bots lets them request a switch: "load me the bigger one for this task". The lifecycle manager owns the actual swap; the next request picks up the new client.
- OpenAI / Grok / agent-backend clients are lightweight wrappers around HTTP and don't participate in unloading, so switching between them is free.
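A thread-safe singleton with an unload-then-swap primitive could be sketched like this. MiniLifecycleManager, its switch method, and FakeLocalClient are hypothetical; only unload_current_model, the client's unload(), and the gc.collect() call come from the text, and switch callbacks are omitted.

```python
import gc
import threading


class MiniLifecycleManager:  # hypothetical sketch, not ModelLifecycleManager itself
    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        with cls._instance_lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._lock = threading.RLock()  # re-entrant: switch() calls unload
                inst._client = None
                inst._model = None
                cls._instance = inst
        return cls._instance

    @property
    def current_model(self):
        with self._lock:
            return self._model

    def unload_current_model(self) -> None:
        with self._lock:
            if self._client is None:
                return
            unload = getattr(self._client, "unload", None)
            if callable(unload):  # HTTP-only clients may have nothing to unload
                unload()
            self._client = self._model = None
            gc.collect()  # push Python to release the freed model promptly

    def switch(self, model_name: str, client: object) -> None:
        with self._lock:
            self.unload_current_model()  # only one primary model at a time
            self._client, self._model = client, model_name


class FakeLocalClient:
    def __init__(self) -> None:
        self.unloaded = False

    def unload(self) -> None:
        self.unloaded = True


manager = MiniLifecycleManager()
small = FakeLocalClient()
manager.switch("small-gguf", small)
manager.switch("big-gguf", FakeLocalClient())
```

The re-entrant lock lets switch hold the lock across both the unload and the assignment, so no other thread can observe a half-swapped state.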
12 Status surface.
core/status.py (695 lines) is the llm --status machinery. It checks every dependency: Postgres + pgvector reachability, MCP server health, Redis ping, Crawl4AI HTTP, HA MCP, search providers, model alias availability, embedding model presence, and the entry-point loading of agent backends. Its output drives both the CLI status pane and the GET /v1/status JSON response.
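One common way to structure this kind of dependency sweep is a table of named checks whose failures are captured rather than raised, so a single report can feed both a CLI pane and a JSON endpoint. The sketch below illustrates that pattern only; run_checks and the check functions are hypothetical and are not taken from core/status.py.

```python
from typing import Callable


def run_checks(checks: dict[str, Callable[[], str]]) -> dict:
    """Run every check; a failing dependency is recorded, not propagated."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = {"ok": True, "detail": check()}
        except Exception as exc:  # one broken dependency must not abort the report
            report[name] = {"ok": False, "detail": str(exc)}
    return {"ok": all(entry["ok"] for entry in report.values()), "checks": report}


def redis_down() -> str:  # hypothetical failing check
    raise ConnectionError("connection refused")


status = run_checks({"postgres": lambda: "reachable", "redis": redis_down})
```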
13 Key files.
- core/pipeline.py: PipelineStage enum, PipelineContext dataclass, RequestPipeline with one method per stage and a verbose log summary.
- core/prompt_builder.py: PromptBuilder. 276 lines. Named, positioned sections. SectionPosition constants. build(), build_with_debug(), get_verbose_summary(). The GLOBAL_SYSTEM_PROMPT string with conversation-recall and cross-bot guidance.
- core/base.py: BaseLLMBawt. 877 lines. Shared CLI/service init: bot resolution, memory client, profile manager, search client, history manager, adapter selection, pipeline assembly. Subclassed by LLMBawt (CLI) and ServiceLLMBawt (service).
- core/client.py: LLMBawt. 107 lines. OpenAI + Grok only; local models route through the service. Picks GrokClient vs OpenAIClient from model type.
- core/model_lifecycle.py: ModelLifecycleManager. 288 lines. Thread-safe singleton. Tracks current model, handles unloading, fires callbacks on switch.
- core/status.py: llm --status and /v1/status.

main on 2026-05-13
Source: llm-bawt/src/llm_bawt/core