BawtHub
⌕ Search ⌘K Source ↗ Open app →
llm-bawt · pipeline

Seven stages, one turn.

Every chat turn — CLI, API, or agent bridge — flows through the same seven-stage RequestPipeline in core/pipeline.py. Each stage has a clearly defined input/output contract, exposes a hook point for extensions, and records timing and debug output into a shared PipelineContext dataclass. The result is that "what did the LLM actually see" is always reconstructable.

File: core/pipeline.py (605 lines) Stages: 7 Hooks per stage: N (registered, run after default logic)

01 The shape of a pipeline.

A RequestPipeline is instantiated once per turn (cheap — no global state) with a bot, an LLM client, an adapter, and a fistful of optional helpers: memory client, profile manager, search client, home client, history manager, model lifecycle manager. It then receives a PipelineContext and walks the seven stages in order:

stages = [
    (PipelineStage.PRE_PROCESS,      self._stage_pre_process),
    (PipelineStage.CONTEXT_BUILD,    self._stage_context_build),
    (PipelineStage.MEMORY_RETRIEVAL, self._stage_memory_retrieval),
    (PipelineStage.HISTORY_FILTER,   self._stage_history_filter),
    (PipelineStage.MESSAGE_ASSEMBLY, self._stage_message_assembly),
    (PipelineStage.EXECUTE,          self._stage_execute),
    (PipelineStage.POST_PROCESS,     self._stage_post_process),
]

Each stage:

02 The shared context object.

PipelineContext is a frozen-shape dataclass; every stage knows exactly which keys are populated by the time it runs. Input fields: prompt, user_id, bot_id, stream, plaintext_output. Stage outputs accumulate:

FieldPopulated byWhat it holds
prompt_builderCONTEXT_BUILDThe composable system-prompt builder with all named sections.
memory_resultsMEMORY_RETRIEVALCold-start memory hits (only when history is sparse).
include_historyHISTORY_FILTERWhether to include conversation history this turn.
messagesMESSAGE_ASSEMBLYThe final ordered list of Message objects sent to the LLM.
tool_definitionsCONTEXT_BUILDThe tool catalog the model will see this turn.
tool_formatCONTEXT_BUILDnative_openai, react, or xml — picked from model + bot config.
responseEXECUTEThe LLM's final text response.
tool_contextEXECUTESummary of any tool interactions for post-process history persistence.
tool_call_detailsEXECUTEPer-call detail: name, arguments, result, iteration — used by debug + turn logs.

Three decision flags govern conditional logic and can be overridden by the caller (handy for tests and for the "raw completion" endpoint):

03 Stage 1 — pre-process.

The lightest stage. Resolves the three decision flags from bot config + available clients, applies any explicit overrides from the caller, and logs the decisions if --debug. No I/O.

This is also where any future input validation, prompt rewriting, or PII redaction would slot in. Today it's intentionally minimal.

04 Stage 2 — context build.

The most architecturally interesting stage. The PromptBuilder in core/prompt_builder.py is a positional, named-section assembler — every section has a slot in the final system prompt:

System prompt section order · SectionPosition constants
-1 temporal_context Current date/time + recent activity gist.
0 user_context profile_manager.get_user_profile_summary(user_id) — "About the User".
1 bot_traits The bot's developed personality from profile_attributes where entity_type=BOT.
2 base_prompt The bot's system_prompt from bots.yaml / DB.
3 memory_context Cold-start memory hits (only when history < 4 messages).
4 tools Tool catalog rendered for the active tool_format (or just the memory tool for non-tool memory bots).
5 client_context System context passed by the calling client (e.g. avatar info from bawthub).
6 global_instructions GLOBAL_SYSTEM_PROMPT — conversation recall + cross-bot memory guidance.

Sections are added with add_section(name, content, position=...). Empty content is silently dropped. Sections can be disabled (kept in the builder but excluded from the build) — useful for hooks that want to suppress a default section without removing it. Calling builder.build() joins enabled sections by \n\n in position order.

Tools deserve a note here. Three branches:

05 Stage 3 — memory retrieval (cold-start only).

This stage is deliberately small. With history present, the model can call memory_search on demand via the tool loop — which gives it explicit, query-shaped access rather than a static dump. The retrieval stage only fires for cold starts:

Cold-start priming.

If history_manager.messages has 3 or fewer entries, the pipeline runs a single semantic search for the user's prompt with n_results=3 and min_relevance=config.MEMORY_MIN_RELEVANCE and renders the hits via memory.context_builder.build_memory_context_string, then injects them as a cold_start_memory section in the prompt builder. With more history, this stage does nothing — the model will recall on demand via the tool loop. No prophylactic injection.

06 Stage 4 — history filter.

Very simple by design: include_history defaults to True. The token-budget logic for trimming history lives one stage later, in message assembly, where we know the client's effective context window. The filter stage is the hook point if you want to override per-turn (e.g. for raw single-shot completions).

07 Stage 5 — message assembly.

Builds the final list[Message] the client will send to the LLM:

  1. System message. One Message(role="system", content=prompt_builder.build()).
  2. History. If included, fetched via history_manager.get_context_messages(max_tokens=budget). The budget defaults to config.MAX_CONTEXT_TOKENS, but if that's 0 (auto), it's computed as llm_client.effective_context_window - llm_client.effective_max_tokens — i.e. the input budget is whatever's left after reserving the output budget.
  3. User prompt. Only appended explicitly if history is not being included (otherwise the history manager has already received the user message via add_message before pipeline execution).

History messages flow through filtered: only roles user, assistant, and summary survive. The summary role is converted to system at the API boundary — but the pipeline keeps it tagged so summaries are visually distinguishable in turn logs.

08 Stage 6 — execute.

The branching point. If use_tools is set and any tool-providing client (memory, home, HA-native, news, web-fetch) is present, the stage calls tools.query_with_tools(...) — that constructs a ToolLoop (see the tools page) and runs the multi-turn dispatch until the model produces a non-tool response or hits the iteration limit. Otherwise it calls llm_client.query(...) directly.

Either way, the response is stored on ctx.response; tool interactions accumulate in ctx.tool_context (summary string for history persistence) and ctx.tool_call_details (per-call dicts for the turn log).

09 Stage 7 — post-process.

Writes the assistant response to history via history_manager.add_message("assistant", response). If tools ran, appends a [Tool Results @ {timestamp}] system message immediately after, so the next turn's history view sees the tool context inline. Records final stage output for the turn log.

Notably, fact extraction does not happen here. Despite the docstring's hint, the extraction pipeline is owned by the background scheduler (service/scheduler.py) and runs asynchronously against finished turns. This keeps the response latency clean — extraction can take seconds and uses an external LLM call.

10 Hooks and decision overrides.

Both extension surfaces exist:

11 Model lifecycle.

A peer to the pipeline lives in core/model_lifecycle.py: ModelLifecycleManager is a thread-safe singleton that tracks the currently loaded primary model and provides switching primitives:

12 Status surface.

core/status.py (695 lines) is the llm --status machinery. It checks every dependency: Postgres + pgvector reachability, MCP server health, Redis ping, Crawl4AI HTTP, HA MCP, search providers, model alias availability, embedding model presence, and the entry-point loading of agent backends. Its output drives both the CLI status pane and the GET /v1/status JSON response.

13 Key files.

core/pipeline.py
The seven stages. 605 lines. PipelineStage enum, PipelineContext dataclass, RequestPipeline with one method per stage and a verbose log summary.
core/prompt_builder.py
PromptBuilder. 276 lines. Named, positioned sections. SectionPosition constants. build(), build_with_debug(), get_verbose_summary(). The GLOBAL_SYSTEM_PROMPT string with conversation-recall and cross-bot guidance.
core/base.py
BaseLLMBawt. 877 lines. Shared CLI/service init: bot resolution, memory client, profile manager, search client, history manager, adapter selection, pipeline assembly. Subclassed by LLMBawt (CLI) and ServiceLLMBawt (service).
core/client.py
CLI LLMBawt. 107 lines. OpenAI + Grok only — local models route through the service. Picks GrokClient vs OpenAIClient from model type.
core/model_lifecycle.py
ModelLifecycleManager. 288 lines. Thread-safe singleton. Tracks current model, handles unloading, fires callbacks on switch.
core/status.py
Status engine. 695 lines. Health-check fan-out across every dependency, used by both llm --status and /v1/status.
Validated against main on 2026-05-13 Source: llm-bawt/src/llm_bawt/core