One interface, every provider.
Every model — hosted, local, or external-agent — sits behind the same abstract LLMClient base class. The pipeline doesn't know whether it's streaming from OpenAI's /v1/responses API, a llama-cpp Llama instance loaded in-process, or an OpenClaw bridge proxying a Claude Code session. It just calls client.query(messages) or client.stream_raw(messages) and gets text. Tool calling adds one more method on the same surface.
01 The contract.
LLMClient in clients/base.py is a small abstract class. Two abstract methods, a couple of optional overrides, and some Rich-based terminal rendering helpers that the CLI uses for non-streaming output:
| Method | Required? | Purpose |
|---|---|---|
query(messages, plaintext_output=False, **kwargs) -> str | Abstract | Single-shot synchronous query. Returns the full text response. |
get_styling() -> (title, border_style) | Abstract | How the CLI renders this client's panel. |
supports_native_tools() -> bool | Override | Returns True for clients that support an OpenAI-style tools schema. Default: False. |
query_with_tools(messages, tools_schema, tool_choice, ...) -> (content, tool_calls) | Override | Tool-aware query. Default falls back to plain query. |
stream_raw(messages, **kwargs) -> Iterator[str] | Override | Yield text chunks for SSE streaming. Default yields the whole response as one chunk. |
unload() -> None | Override | Free model resources. Override for GGUF / vLLM; no-op for HTTP clients. |
effective_context_window / effective_max_tokens | Override | Properties used by message-assembly token budgeting. Default reads from model_definition. |
Every client carries a model_definition dict — the entry from the model alias table (YAML seeded, DB authoritative). That dict drives behavior: context_window, max_tokens, temperature, chat_format, adapter, n_gpu_layers, base_url, api_key. Each client reads whatever fields apply to it. The SUPPORTS_STREAMING class attribute is a coarse capability flag the service consults before deciding to use stream_raw.
02 OpenAI and OpenAI-compatible endpoints.
OpenAIClient in clients/openai_client.py (574 lines) is the workhorse. It uses the OpenAI Python SDK and supports two modes:
- OpenAI proper. No
base_url; usesOPENAI_API_KEYfrom the env. - OpenAI-compatible. Any server that speaks the OpenAI Chat API — Ollama, vLLM in OpenAI-compat mode, llama.cpp server, LM Studio, your favorite local stack. The
base_urlfrom the model definition is passed toOpenAI(base_url=...). API keys are optional for local servers; a dummy"not-needed"placeholder is supplied.
The client knows a few quirks about specific OpenAI models. o1, o3, o4, the -chat-latest / -search-preview / -audio-preview variants don't accept temperature or top_p; the client suppresses those fields for matching model ids. Native tool calling is enabled (supports_native_tools() = True).
03 Grok via the Responses API.
xAI's API is OpenAI-compatible but exposes the newer Responses API at /v1/responses (not /v1/chat/completions). llm-bawt models this in two layers:
ResponsesClientinclients/responses_client.py(410 lines) — generic Responses-API client built on the OpenAI SDK. Handles the message-to-Responses-input conversion (system messages become theinstructionsparam; user/assistant turns becomeinput_items), tool calling, streaming, and result extraction.GrokClientinclients/grok_client.py— a 56-line wrapper that inheritsResponsesClient, setsbase_url=https://api.x.ai/v1, and resolvesXAI_API_KEY/LLM_BAWT_XAI_API_KEY.
Grok is also the default MAINTENANCE_MODEL in the seed config — the model used by the scheduler for fact extraction and (when configured) by the consolidation safety check.
The Responses API separates system instructions from the message array (instructions is a dedicated top-level field, not a message). This matches the way the pipeline already builds its system prompt — one block from PromptBuilder — and avoids ambiguity around system-message ordering with some providers. It also gives access to xAI-specific features that aren't in the legacy Chat Completions shape.
04 Local GGUF via llama-cpp-python.
LlamaCppClient in clients/llama_cpp_client.py (472 lines) loads GGUF models in-process via the llama-cpp-python binding. Optional dependency — the import is guarded so the service stays runnable without GPU dependencies installed.
Key behaviors:
- VRAM-aware sizing. Before instantiating the
Llamaobject,utils.vram.auto_size_context_windowinspects available VRAM, the model file size, and the configured globalLLAMA_CPP_N_CTXto pick a context window that won't OOM. The result includes which input drove the decision. - Configurable chat formats.
chat_formatcan come from the model definition or be auto-detected from GGUF metadata. For models with unusual formats (MythoMax, etc.) you set it explicitly; otherwise the binding handles it. - GPU offload.
n_gpu_layersdefaults to-1(all).flash_attn=Truereduces VRAM for long contexts. - Native tool calling: off. Even though llama-cpp-python has a
chatml-function-callingmode, the tool loop deliberately doesn't use it — empty-response bugs in multi-turn tool conversations make ReAct-style text parsing more reliable for local models. See the tools page for the rationale. - Proper unload. Overrides
LLMClient.unloadto release theLlamainstance and trigger garbage collection so the next model load doesn't double up VRAM.
05 vLLM for HuggingFace inference.
VLLMClient in clients/vllm_client.py (610 lines) is the alternative local path: vLLM running in-process with PagedAttention for high-throughput serving. Also an optional dependency.
The client sets VLLM_ENABLE_V1_MULTIPROCESSING=0 at import time — the multiprocess engine spawns engine cores via ZMQ/shared memory and that fails inside Docker and WSL2. With the flag off, vLLM runs the engine in-process. Model load is slow (30–150 s for CUDA graph compilation), so the lifecycle manager pays attention to whether you're switching between vLLM models versus loading one for the first time.
Mistral-style tool calling is supported. The client generates 9-character alphanumeric tool call IDs to match Mistral's format. Streaming uses vLLM's async SamplingParams with sampling configured from the model definition.
06 Agent backends as virtual clients.
AgentBackendClient in clients/agent_backend_client.py (229 lines) is the most important client to understand for the BawtHub architecture: it lets external agent SDKs (Claude Agent SDK, OpenAI Codex SDK, OpenClaw gateway) appear to the rest of llm-bawt as just another model.
The mechanism:
- The service startup looks at every bot's
agent_backendfield. For each one it sees, it injects a syntheticmodel_definitionwithtype: "agent_backend"andbackend: <name>intoconfig.defined_models. The model alias is the backend name itself:openclaw,claude-code,codex. - When a request targets that alias, the pipeline constructs an
AgentBackendClient. The constructor callsregistry.get_backend(name)fromagent_backends/registry.pyto fetch the backend implementation. query()extracts the latest user message from the messages array and synchronously runs the backend's asyncchat()coroutine viaasyncio.runor the running loop's executor.- The streaming path (
stream_raw) iterates the backend's ownstream_rawgenerator — which yields a mix of text-delta strings and dict-shaped events (tool_call,tool_result,token_usage,metadata) — and shuttles them through to the SSE layer. - Structured metadata (model used, provider, duration, usage, tool calls) is stashed in
last_resultfor the caller to consume after the query finishes.
Because this is a regular LLMClient, every other subsystem still applies: history persistence, memory injection, post-turn extraction, turn-log writes. A Claude Code session running through the bridge gets the same five-layer memory that a Grok-powered nova turn gets. See the agent backends page for the bridge-side protocol details.
07 Streaming semantics.
stream_raw is the canonical streaming surface. The default implementation in LLMClient falls back to query() and yields a single chunk — sufficient for non-streaming clients to work in a streaming pipeline. Real streaming clients override:
OpenAIClient.stream_raw— OpenAI SDK'sstream=True, yielding each delta's content as it arrives.ResponsesClient.stream_raw/GrokClient— Responses-API stream events, with separate handling for text deltas and tool-call deltas.VLLMClient.stream_raw— vLLM's async generator, decoded per token.LlamaCppClient.stream_raw— llama-cpp-python's iterator interface.AgentBackendClient.stream_raw— delegates to the backend's own iterator and shapes its dict events.
The service's SSE generator (service/chat_streaming.py) calls stream_raw through the chat-stream worker, which bridges the sync iterator into an asyncio.Queue so the async SSE generator can consume it without blocking the event loop.
08 Native vs ReAct tool calling — picked here.
The pipeline's config.get_tool_format(...) picks the tool format per model: native_openai, react, or xml. Whether the client actually uses native calling at runtime is the AND of two things:
- The picked format is
native_openai. - The client returns
Truefromsupports_native_tools().
Today that means: OpenAIClient and ResponsesClient (and so GrokClient) support native; everything else falls back to text-based ReAct parsing even though llama-cpp-python and vLLM have native-tools implementations of their own. The text-parsing fallback is dramatically more reliable for multi-turn tool conversations against local models — see the tools page.
09 Context-window math.
Every client exposes two properties that the message-assembly stage uses for token budgeting:
effective_max_tokens— output budget. Readsmodel_definition.max_tokensif set, else falls back toconfig.MAX_OUTPUT_TOKENS(default 4096).effective_context_window— total context. Readsmodel_definition.context_windowif set, else uses sensible defaults: 128 000 for OpenAI and Grok;config.LLAMA_CPP_N_CTX(default 32 768) for everything else.
When config.MAX_CONTEXT_TOKENS is 0 (auto), the input budget is computed as effective_context_window - effective_max_tokens. That's the number of tokens the history manager is allowed to fill with conversation history before the user prompt is appended.
10 Key files.
clients/base.pyLLMClient. 185 lines. Abstract base with the two-method contract, the context-window math, the Rich console rendering helpers, and the StubClient used for history-only operations that don't need a real LLM.clients/openai_client.pyOpenAIClient. 574 lines. Chat Completions API. OpenAI proper + any OpenAI-compatible endpoint via base_url. Handles the o1/o3/o4 temperature suppression quirk. Native tools enabled.clients/responses_client.pyResponsesClient. 410 lines. The newer /v1/responses API. Message → input/instructions conversion, streaming, tool calling. The base for GrokClient.clients/grok_client.pyGrokClient. 56 lines. Thin ResponsesClient subclass for xAI's API. Resolves XAI_API_KEY.clients/llama_cpp_client.pyLlamaCppClient. 472 lines. Local GGUF via llama-cpp-python. VRAM-aware context sizing, configurable chat format, GPU offload, proper unload.clients/vllm_client.pyVLLMClient. 610 lines. vLLM in-process. Multiprocess engine forcibly disabled for Docker/WSL2 compatibility. Mistral-format tool IDs.clients/agent_backend_client.pyAgentBackendClient. 229 lines. The bridge to agent_backends/*. Lets Claude Code, Codex, and OpenClaw appear as model aliases. last_result for structured metadata.main on 2026-05-13
Source: llm-bawt/src/llm_bawt/clients