llm-bawt · pipeline

Seven stages, one turn.

Every chat turn — CLI, API, or agent bridge — flows through the same seven-stage RequestPipeline in core/pipeline.py. Each stage has a clearly defined input/output contract, exposes a hook point for extensions, and records timing and debug output into a shared PipelineContext dataclass. The result is that "what did the LLM actually see" is always reconstructable.

File: core/pipeline.py (605 lines) Stages: 7 Hooks per stage: N (registered, run after default logic)

01 The shape of a pipeline.

A RequestPipeline is instantiated once per turn (cheap — no global state) with a bot, an LLM client, an adapter, and a fistful of optional helpers: memory client, profile manager, search client, home client, history manager, model lifecycle manager. It then receives a PipelineContext and walks the seven stages in order:

stages = [
    (PipelineStage.PRE_PROCESS,      self._stage_pre_process),
    (PipelineStage.CONTEXT_BUILD,    self._stage_context_build),
    (PipelineStage.MEMORY_RETRIEVAL, self._stage_memory_retrieval),
    (PipelineStage.HISTORY_FILTER,   self._stage_history_filter),
    (PipelineStage.MESSAGE_ASSEMBLY, self._stage_message_assembly),
    (PipelineStage.EXECUTE,          self._stage_execute),
    (PipelineStage.POST_PROCESS,     self._stage_post_process),
]

Each stage:

Reads from PipelineContext (input fields + outputs from earlier stages).
Writes to PipelineContext (its own stage outputs).
Runs any user-registered hooks via add_hook(stage, callable) after the default logic.
Records elapsed time into ctx.stage_timings[stage.name].
Records a JSON-friendly snapshot into ctx.stage_outputs[stage.name] when in --debug mode.

02 The shared context object.

PipelineContext is a frozen-shape dataclass; every stage knows exactly which keys are populated by the time it runs. Input fields: prompt, user_id, bot_id, stream, plaintext_output. Stage outputs accumulate:

Field	Populated by	What it holds
`prompt_builder`	CONTEXT_BUILD	The composable system-prompt builder with all named sections.
`memory_results`	MEMORY_RETRIEVAL	Cold-start memory hits (only when history is sparse).
`include_history`	HISTORY_FILTER	Whether to include conversation history this turn.
`messages`	MESSAGE_ASSEMBLY	The final ordered list of `Message` objects sent to the LLM.
`tool_definitions`	CONTEXT_BUILD	The tool catalog the model will see this turn.
`tool_format`	CONTEXT_BUILD	`native_openai`, `react`, or `xml` — picked from model + bot config.
`response`	EXECUTE	The LLM's final text response.
`tool_context`	EXECUTE	Summary of any tool interactions for post-process history persistence.
`tool_call_details`	EXECUTE	Per-call detail: name, arguments, result, iteration — used by debug + turn logs.

Three decision flags govern conditional logic and can be overridden by the caller (handy for tests and for the "raw completion" endpoint):

use_memory — set by default from bot.requires_memory and memory_client is not None.
use_tools — set from bot.uses_tools and any tool-providing client is present.
use_search — set from bot.uses_search and search_client is not None.

03 Stage 1 — pre-process.

The lightest stage. Resolves the three decision flags from bot config + available clients, applies any explicit overrides from the caller, and logs the decisions if --debug. No I/O.

This is also where any future input validation, prompt rewriting, or PII redaction would slot in. Today it's intentionally minimal.

04 Stage 2 — context build.

The most architecturally interesting stage. The PromptBuilder in core/prompt_builder.py is a positional, named-section assembler — every section has a slot in the final system prompt:

System prompt section order · SectionPosition constants

-1 temporal_context Current date/time + recent activity gist.

0 user_context profile_manager.get_user_profile_summary(user_id) — "About the User".

1 bot_traits The bot's developed personality from profile_attributes where entity_type=BOT.

2 base_prompt The bot's system_prompt from bots.yaml / DB.

3 memory_context Cold-start memory hits (only when history < 4 messages).

4 tools Tool catalog rendered for the active tool_format (or just the memory tool for non-tool memory bots).

5 client_context System context passed by the calling client (e.g. avatar info from bawthub).

6 global_instructions GLOBAL_SYSTEM_PROMPT — conversation recall + cross-bot memory guidance.

Sections are added with add_section(name, content, position=...). Empty content is silently dropped. Sections can be disabled (kept in the builder but excluded from the build) — useful for hooks that want to suppress a default section without removing it. Calling builder.build() joins enabled sections by \n\n in position order.

Tools deserve a note here. Three branches:

Tool bot with memory: full tool catalog, including memory, history, profile, self, plus the optional search, home, model, news, web_fetch, and HA-native tools where their clients are present.
Tool bot without memory: same catalog minus memory, history, profile, self (filtered out by name).
Memory bot without uses_tools: the pipeline still injects a single read-only memory tool so the bot can search memories on demand, and flips use_tools=True so the tool loop runs. This is how spark-style "no tools but still smart about memory" bots work.

05 Stage 3 — memory retrieval (cold-start only).

This stage is deliberately small. With history present, the model can call memory_search on demand via the tool loop — which gives it explicit, query-shaped access rather than a static dump. The retrieval stage only fires for cold starts:

✦

Cold-start priming.

If history_manager.messages has 3 or fewer entries, the pipeline runs a single semantic search for the user's prompt with n_results=3 and min_relevance=config.MEMORY_MIN_RELEVANCE and renders the hits via memory.context_builder.build_memory_context_string, then injects them as a cold_start_memory section in the prompt builder. With more history, this stage does nothing — the model will recall on demand via the tool loop. No prophylactic injection.

06 Stage 4 — history filter.

Very simple by design: include_history defaults to True. The token-budget logic for trimming history lives one stage later, in message assembly, where we know the client's effective context window. The filter stage is the hook point if you want to override per-turn (e.g. for raw single-shot completions).

07 Stage 5 — message assembly.

Builds the final list[Message] the client will send to the LLM:

System message. One Message(role="system", content=prompt_builder.build()).
History. If included, fetched via history_manager.get_context_messages(max_tokens=budget). The budget defaults to config.MAX_CONTEXT_TOKENS, but if that's 0 (auto), it's computed as llm_client.effective_context_window - llm_client.effective_max_tokens — i.e. the input budget is whatever's left after reserving the output budget.
User prompt. Only appended explicitly if history is not being included (otherwise the history manager has already received the user message via add_message before pipeline execution).

History messages flow through filtered: only roles user, assistant, and summary survive. The summary role is converted to system at the API boundary — but the pipeline keeps it tagged so summaries are visually distinguishable in turn logs.

08 Stage 6 — execute.

The branching point. If use_tools is set and any tool-providing client (memory, home, HA-native, news, web-fetch) is present, the stage calls tools.query_with_tools(...) — that constructs a ToolLoop (see the tools page) and runs the multi-turn dispatch until the model produces a non-tool response or hits the iteration limit. Otherwise it calls llm_client.query(...) directly.

Either way, the response is stored on ctx.response; tool interactions accumulate in ctx.tool_context (summary string for history persistence) and ctx.tool_call_details (per-call dicts for the turn log).

09 Stage 7 — post-process.

Writes the assistant response to history via history_manager.add_message("assistant", response). If tools ran, appends a [Tool Results @ {timestamp}] system message immediately after, so the next turn's history view sees the tool context inline. Records final stage output for the turn log.

Notably, fact extraction does not happen here. Despite the docstring's hint, the extraction pipeline is owned by the background scheduler (service/scheduler.py) and runs asynchronously against finished turns. This keeps the response latency clean — extraction can take seconds and uses an external LLM call.

10 Hooks and decision overrides.

Both extension surfaces exist:

pipeline.add_hook(stage, callable) — register a function that receives PipelineContext after the default stage logic. Used internally for, e.g., the botchat route adding extra system context.
pipeline.override_decision(name, value) — force use_tools / use_memory / use_search / skip_history independent of bot config. Used by the raw /v1/llm/complete endpoint and by tests.

11 Model lifecycle.

A peer to the pipeline lives in core/model_lifecycle.py: ModelLifecycleManager is a thread-safe singleton that tracks the currently loaded primary model and provides switching primitives:

One primary model loaded at a time. This matters for GGUF and vLLM — both hold significant GPU memory.
unload_current_model() calls the client's own unload(), then forces gc.collect(). LlamaCppClient overrides unload to release the C++ Llama instance and clear CUDA caches.
The model tool exposed to bots lets them request a switch: "load me the bigger one for this task". The lifecycle manager owns the actual swap; the next request picks up the new client.
OpenAI / Grok / agent-backend clients are lightweight wrappers around HTTP and don't participate in unloading — switching between them is free.

12 Status surface.

core/status.py (695 lines) is the llm --status machinery. It checks every dependency: Postgres + pgvector reachability, MCP server health, Redis ping, Crawl4AI HTTP, HA MCP, search providers, model alias availability, embedding model presence, and the entry-point loading of agent backends. Its output drives both the CLI status pane and the GET /v1/status JSON response.

13 Key files.

core/pipeline.py

The seven stages. 605 lines. PipelineStage enum, PipelineContext dataclass, RequestPipeline with one method per stage and a verbose log summary.

core/prompt_builder.py

PromptBuilder. 276 lines. Named, positioned sections. SectionPosition constants. build(), build_with_debug(), get_verbose_summary(). The GLOBAL_SYSTEM_PROMPT string with conversation-recall and cross-bot guidance.

core/base.py

BaseLLMBawt. 877 lines. Shared CLI/service init: bot resolution, memory client, profile manager, search client, history manager, adapter selection, pipeline assembly. Subclassed by LLMBawt (CLI) and ServiceLLMBawt (service).

core/client.py

CLI LLMBawt. 107 lines. OpenAI + Grok only — local models route through the service. Picks GrokClient vs OpenAIClient from model type.

core/model_lifecycle.py

ModelLifecycleManager. 288 lines. Thread-safe singleton. Tracks current model, handles unloading, fires callbacks on switch.

core/status.py

Status engine. 695 lines. Health-check fan-out across every dependency, used by both llm --status and /v1/status.

PreviousFastAPI service NextModel clients

Validated against main on 2026-05-13 Source: llm-bawt/src/llm_bawt/core