llm-bawt · clients

One interface, every provider.

Every model — hosted, local, or external-agent — sits behind the same abstract LLMClient base class. The pipeline doesn't know whether it's streaming from OpenAI's /v1/responses API, a llama-cpp Llama instance loaded in-process, or an OpenClaw bridge proxying a Claude Code session. It just calls client.query(messages) or client.stream_raw(messages) and gets text. Tool calling adds one more method on the same surface.

Files: 7 clients + 1 base (clients/) · Total: ~2,500 lines · Providers: OpenAI, xAI, vLLM, llama.cpp, Anthropic, Codex, OpenClaw

01 The contract.

LLMClient in clients/base.py is a small abstract class. It has two abstract methods, five optional overrides (the last a pair of properties), and some Rich-based terminal rendering helpers that the CLI uses for non-streaming output:

Method	Required?	Purpose
`query(messages, plaintext_output=False, **kwargs) -> str`	Abstract	Single-shot synchronous query. Returns the full text response.
`get_styling() -> (title, border_style)`	Abstract	How the CLI renders this client's panel.
`supports_native_tools() -> bool`	Override	Returns True for clients that support an OpenAI-style tools schema. Default: False.
`query_with_tools(messages, tools_schema, tool_choice, ...) -> (content, tool_calls)`	Override	Tool-aware query. Default falls back to plain `query`.
`stream_raw(messages, **kwargs) -> Iterator[str]`	Override	Yield text chunks for SSE streaming. Default yields the whole response as one chunk.
`unload() -> None`	Override	Free model resources. Override for GGUF / vLLM; no-op for HTTP clients.
`effective_context_window` / `effective_max_tokens`	Override	Properties used by message-assembly token budgeting. Default reads from `model_definition`.

Every client carries a model_definition dict — the entry from the model alias table (YAML seeded, DB authoritative). That dict drives behavior: context_window, max_tokens, temperature, chat_format, adapter, n_gpu_layers, base_url, api_key. Each client reads whatever fields apply to it. The SUPPORTS_STREAMING class attribute is a coarse capability flag the service consults before deciding to use stream_raw.
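To make the contract concrete, here is a minimal sketch of what such a base class could look like. The method names and defaults follow the table above; the constructor signature, the empty tool-call list returned by the fallback, and the EchoClient subclass are assumptions for illustration, not the code in clients/base.py.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, Iterator


class LLMClient(ABC):
    """Illustrative sketch of the contract; method bodies are assumptions."""

    # Coarse capability flag the service consults before using stream_raw.
    SUPPORTS_STREAMING = False

    def __init__(self, model_definition: dict[str, Any]):
        self.model_definition = model_definition

    @abstractmethod
    def query(self, messages: list[dict], plaintext_output: bool = False, **kwargs) -> str:
        """Single-shot synchronous query returning the full text response."""

    @abstractmethod
    def get_styling(self) -> tuple[str, str]:
        """Return (title, border_style) for the CLI panel."""

    def supports_native_tools(self) -> bool:
        return False

    def query_with_tools(self, messages, tools_schema=None, tool_choice="auto", **kwargs):
        # Default: fall back to plain query and report no tool calls (assumed shape).
        return self.query(messages, **kwargs), []

    def stream_raw(self, messages, **kwargs) -> Iterator[str]:
        # Default: the whole response as a single chunk.
        yield self.query(messages, **kwargs)

    def unload(self) -> None:
        # No-op for HTTP clients; local clients would free model resources here.
        return None


class EchoClient(LLMClient):
    """Hypothetical client that returns the last message's content."""

    def query(self, messages, plaintext_output=False, **kwargs):
        return messages[-1]["content"]

    def get_styling(self):
        return ("Echo", "green")
```

A subclass only has to supply query and get_styling; everything else degrades gracefully, which is what lets a non-streaming client run inside a streaming pipeline.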

02 OpenAI and OpenAI-compatible endpoints.

OpenAIClient in clients/openai_client.py (574 lines) is the workhorse. It uses the OpenAI Python SDK and supports two modes:

OpenAI proper. No base_url; uses OPENAI_API_KEY from the env.
OpenAI-compatible. Any server that speaks the OpenAI Chat API — Ollama, vLLM in OpenAI-compat mode, llama.cpp server, LM Studio, your favorite local stack. The base_url from the model definition is passed to OpenAI(base_url=...). API keys are optional for local servers; a dummy "not-needed" placeholder is supplied.

The client knows a few quirks about specific OpenAI models. The o1, o3, and o4 families and the -chat-latest / -search-preview / -audio-preview variants don't accept temperature or top_p, so the client omits those fields for matching model ids. Native tool calling is enabled (supports_native_tools() returns True).
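The two behaviors described in this section can be sketched as a pair of small helpers. The matching rule (prefix match for o1/o3/o4, suffix match for the variants), the helper names, and the env mapping parameter are assumptions for illustration; the actual matching logic in clients/openai_client.py may differ.

```python
from __future__ import annotations

from typing import Mapping

# Assumed matching rule: families by prefix, variants by suffix.
NO_SAMPLING_PREFIXES = ("o1", "o3", "o4")
NO_SAMPLING_SUFFIXES = ("-chat-latest", "-search-preview", "-audio-preview")


def accepts_sampling_params(model_id: str) -> bool:
    """True if the model id is expected to accept temperature / top_p."""
    return not (
        model_id.startswith(NO_SAMPLING_PREFIXES)
        or model_id.endswith(NO_SAMPLING_SUFFIXES)
    )


def build_request_kwargs(model_id, messages, temperature=None, top_p=None) -> dict:
    """Assemble request kwargs, omitting sampling fields where unsupported."""
    kwargs = {"model": model_id, "messages": messages}
    if accepts_sampling_params(model_id):
        if temperature is not None:
            kwargs["temperature"] = temperature
        if top_p is not None:
            kwargs["top_p"] = top_p
    return kwargs


def resolve_api_key(model_definition: Mapping, env: Mapping) -> str | None:
    """Pick an API key: placeholder for keyless local servers, env var otherwise."""
    if model_definition.get("base_url"):
        return model_definition.get("api_key") or "not-needed"
    return env.get("OPENAI_API_KEY")
```

Passing the environment in as a mapping (for example os.environ) keeps the helper easy to test without touching real credentials.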

03 Grok via the Responses API.

xAI's API is OpenAI-compatible, and llm-bawt talks to it through the newer Responses API at /v1/responses rather than /v1/chat/completions. llm-bawt models this in two layers:

ResponsesClient in clients/responses_client.py (410 lines) — generic Responses-API client built on the OpenAI SDK. Handles the message-to-Responses-input conversion (system messages become the instructions param; user/assistant turns become input_items), tool calling, streaming, and result extraction.
GrokClient in clients/grok_client.py — a 56-line wrapper that inherits ResponsesClient, sets base_url=https://api.x.ai/v1, and resolves XAI_API_KEY / LLM_BAWT_XAI_API_KEY.
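The message conversion that ResponsesClient performs can be sketched as a pure function. The function name and the choice to join multiple system messages with a blank line are assumptions; only the split itself (system text to instructions, other turns to input items) comes from the description above.

```python
from __future__ import annotations


def to_responses_input(messages: list[dict]) -> tuple[str | None, list[dict]]:
    """Split chat-style messages into (instructions, input_items)."""
    instruction_parts: list[str] = []
    input_items: list[dict] = []
    for message in messages:
        if message["role"] == "system":
            # System text moves to the dedicated top-level field.
            instruction_parts.append(message["content"])
        else:
            # User and assistant turns stay in order as input items.
            input_items.append({"role": message["role"], "content": message["content"]})
    # Assumption: several system messages are joined with a blank line.
    instructions = "\n\n".join(instruction_parts) or None
    return instructions, input_items
```

The returned pair would then be passed as the instructions and input arguments of a Responses-API call.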

Grok is also the default MAINTENANCE_MODEL in the seed config — the model used by the scheduler for fact extraction and (when configured) by the consolidation safety check.

✦

Why the Responses API and not Chat Completions.

The Responses API separates system instructions from the message array (instructions is a dedicated top-level field, not a message). This matches the way the pipeline already builds its system prompt — one block from PromptBuilder — and avoids ambiguity around system-message ordering with some providers. It also gives access to xAI-specific features that aren't in the legacy Chat Completions shape.

04 Local GGUF via llama-cpp-python.

LlamaCppClient in clients/llama_cpp_client.py (472 lines) loads GGUF models in-process via the llama-cpp-python binding. Optional dependency — the import is guarded so the service stays runnable without GPU dependencies installed.

Key behaviors:

VRAM-aware sizing. Before instantiating the Llama object, utils.vram.auto_size_context_window inspects available VRAM, the model file size, and the configured global LLAMA_CPP_N_CTX to pick a context window that won't OOM. The result includes which input drove the decision.
Configurable chat formats. chat_format can come from the model definition or be auto-detected from GGUF metadata. For models with unusual formats (MythoMax, etc.) you set it explicitly; otherwise the binding handles it.
GPU offload. n_gpu_layers defaults to -1 (all). flash_attn=True reduces VRAM for long contexts.
Native tool calling: off. Even though llama-cpp-python has a chatml-function-calling mode, the tool loop deliberately doesn't use it — empty-response bugs in multi-turn tool conversations make ReAct-style text parsing more reliable for local models. See the tools page for the rationale.
Proper unload. Overrides LLMClient.unload to release the Llama instance and trigger garbage collection so the next model load doesn't double up VRAM.
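As a rough illustration of the VRAM-aware sizing idea, the sketch below picks a context window from free VRAM, model file size, and the configured limit, and reports which input drove the decision. It is not utils.vram.auto_size_context_window: the function name, the kv_bytes_per_token parameter, the linear cost model, and the rounding to multiples of 256 are all assumptions made for the example.

```python
def sketch_context_size(
    free_vram_bytes: int,
    model_file_bytes: int,
    configured_n_ctx: int,
    kv_bytes_per_token: int,
) -> tuple[int, str]:
    """Return (n_ctx, reason) under a simplified, hypothetical cost model."""
    # Assume the weights occupy roughly the file size once loaded.
    headroom = max(free_vram_bytes - model_file_bytes, 0)
    # Assume each context token costs a fixed number of KV-cache bytes.
    tokens_that_fit = headroom // kv_bytes_per_token
    if tokens_that_fit >= configured_n_ctx:
        return configured_n_ctx, "configured"
    # Round down to a multiple of 256 so the result is a tidy size.
    return (tokens_that_fit // 256) * 256, "vram"


GIB = 1024 ** 3
KIB = 1024

# Plenty of headroom: the configured limit wins.
roomy = sketch_context_size(24 * GIB, 8 * GIB, 32768, 128 * KIB)
# Tight headroom: available VRAM drives the decision.
tight = sketch_context_size(10 * GIB, 8 * GIB, 32768, 128 * KIB)
```

Returning the reason alongside the number mirrors the behavior described above, where the result records which input determined the final size.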

05 vLLM for HuggingFace inference.

VLLMClient in clients/vllm_client.py (610 lines) is the alternative local path: vLLM running in-process with PagedAttention for high-throughput serving. Also an optional dependency.

The client sets VLLM_ENABLE_V1_MULTIPROCESSING=0 at import time — the multiprocess engine spawns engine cores via ZMQ/shared memory and that fails inside Docker and WSL2. With the flag off, vLLM runs the engine in-process. Model load is slow (30–150 s for CUDA graph compilation), so the lifecycle manager pays attention to whether you're switching between vLLM models versus loading one for the first time.

Mistral-style tool calling is supported. The client generates 9-character alphanumeric tool call IDs to match Mistral's format. Streaming uses vLLM's async generator, with SamplingParams configured from the model definition.
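Two of the details above are small enough to sketch directly: setting the environment flag before vLLM is imported, and generating a 9-character alphanumeric tool call ID. The helper name and the use of the secrets module are assumptions; the source only states the ID length and character class.

```python
import os
import secrets
import string

# Must run before vLLM is imported so the engine stays in-process.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

_ID_ALPHABET = string.ascii_letters + string.digits


def make_tool_call_id(length: int = 9) -> str:
    """Return a random alphanumeric ID of the length Mistral-style calls expect."""
    return "".join(secrets.choice(_ID_ALPHABET) for _ in range(length))


example_id = make_tool_call_id()
```

Placing the environment assignment at module top level is what "at import time" means in practice: any later import of vLLM in the same process sees the flag.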

06 Agent backends as virtual clients.

AgentBackendClient in clients/agent_backend_client.py (229 lines) is the most important client to understand for the BawtHub architecture: it lets external agent SDKs (Claude Agent SDK, OpenAI Codex SDK, OpenClaw gateway) appear to the rest of llm-bawt as just another model.

The mechanism:

The service startup looks at every bot's agent_backend field. For each one it sees, it injects a synthetic model_definition with type: "agent_backend" and backend: <name> into config.defined_models. The model alias is the backend name itself: openclaw, claude-code, codex.
When a request targets that alias, the pipeline constructs an AgentBackendClient. The constructor calls registry.get_backend(name) from agent_backends/registry.py to fetch the backend implementation.
query() extracts the latest user message from the messages array and synchronously runs the backend's async chat() coroutine via asyncio.run or the running loop's executor.
The streaming path (stream_raw) iterates the backend's own stream_raw generator — which yields a mix of text-delta strings and dict-shaped events (tool_call, tool_result, token_usage, metadata) — and shuttles them through to the SSE layer.
Structured metadata (model used, provider, duration, usage, tool calls) is stashed in last_result for the caller to consume after the query finishes.
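The synchronous query() path described in the steps above can be sketched as follows. FakeBackend, the helper names, and the worker-thread fallback are assumptions for illustration; the real client may hand the coroutine to the running loop differently.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def latest_user_message(messages: list) -> str:
    """Return the content of the most recent user turn, or an empty string."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message.get("content", "")
    return ""


def run_coroutine_sync(make_coro):
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running in this thread: asyncio.run is safe.
        return asyncio.run(make_coro())
    # A loop is already running here, so run the coroutine on a fresh loop
    # in a worker thread. Note that this blocks the caller until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, make_coro()).result()


class FakeBackend:
    """Hypothetical stand-in for an agent backend with an async chat()."""

    async def chat(self, prompt: str) -> str:
        await asyncio.sleep(0)
        return f"echo: {prompt}"


def query(messages: list) -> str:
    prompt = latest_user_message(messages)
    return run_coroutine_sync(lambda: FakeBackend().chat(prompt))
```

Taking a factory (make_coro) rather than a coroutine object keeps the helper reusable, since a coroutine can only be awaited once.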

Because this is a regular LLMClient, every other subsystem still applies: history persistence, memory injection, post-turn extraction, turn-log writes. A Claude Code session running through the bridge gets the same five-layer memory that a Grok-powered nova turn gets. See the agent backends page for the bridge-side protocol details.

07 Streaming semantics.

stream_raw is the canonical streaming surface. The default implementation in LLMClient falls back to query() and yields a single chunk — sufficient for non-streaming clients to work in a streaming pipeline. Real streaming clients override:

OpenAIClient.stream_raw — OpenAI SDK's stream=True, yielding each delta's content as it arrives.
ResponsesClient.stream_raw / GrokClient — Responses-API stream events, with separate handling for text deltas and tool-call deltas.
VLLMClient.stream_raw — vLLM's async generator, decoded per token.
LlamaCppClient.stream_raw — llama-cpp-python's iterator interface.
AgentBackendClient.stream_raw — delegates to the backend's own iterator and shapes its dict events.

The service's SSE generator (service/chat_streaming.py) calls stream_raw through the chat-stream worker, which bridges the sync iterator into an asyncio.Queue so the async SSE generator can consume it without blocking the event loop.
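The sync-to-async bridge described above can be sketched with a worker thread feeding an asyncio.Queue. The function name, the sentinel object, and the use of a plain thread are assumptions; the actual chat-stream worker in service/chat_streaming.py may be structured differently.

```python
import asyncio
import threading
from typing import AsyncIterator, Callable, Iterator

_DONE = object()  # Sentinel marking the end of the stream.


async def bridge_to_async(make_iterator: Callable[[], Iterator]) -> AsyncIterator:
    """Consume a blocking iterator in a thread and yield its items asynchronously."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for chunk in make_iterator():
                # asyncio.Queue is not thread-safe, so hop onto the loop thread.
                loop.call_soon_threadsafe(queue.put_nowait, chunk)
        except Exception as exc:
            # Forward failures so the consumer can re-raise them.
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = await queue.get()
        if item is _DONE:
            break
        if isinstance(item, Exception):
            raise item
        yield item


async def collect(make_iterator: Callable[[], Iterator]) -> list:
    return [chunk async for chunk in bridge_to_async(make_iterator)]
```

Because the blocking iteration happens off the event loop, the async consumer only ever awaits queue.get(), so other requests keep being served while a model is generating.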

08 Native vs ReAct tool calling — picked here.

The pipeline's config.get_tool_format(...) picks the tool format per model: native_openai, react, or xml. Whether the client actually uses native calling at runtime is the AND of two things:

The picked format is native_openai.
The client returns True from supports_native_tools().

Today that means: OpenAIClient and ResponsesClient (and so GrokClient) support native; everything else falls back to text-based ReAct parsing even though llama-cpp-python and vLLM have native-tools implementations of their own. The text-parsing fallback is dramatically more reliable for multi-turn tool conversations against local models — see the tools page.
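The two-condition rule above reduces to a single boolean expression. The function name and the two stub classes are hypothetical; only the format names and supports_native_tools() come from the text.

```python
def uses_native_tools(tool_format: str, client) -> bool:
    """Native calling requires both the picked format and client support."""
    return tool_format == "native_openai" and client.supports_native_tools()


class NativeCapableStub:
    """Stand-in for a client like OpenAIClient or ResponsesClient."""

    def supports_native_tools(self) -> bool:
        return True


class TextOnlyStub:
    """Stand-in for a client that relies on text-based parsing."""

    def supports_native_tools(self) -> bool:
        return False
```

If either condition fails, the tool loop takes the text-parsing path, so a misconfigured native_openai format on a local model degrades safely instead of erroring.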

09 Context-window math.

Every client exposes two properties that the message-assembly stage uses for token budgeting:

effective_max_tokens — output budget. Reads model_definition.max_tokens if set, else falls back to config.MAX_OUTPUT_TOKENS (default 4096).
effective_context_window — total context. Reads model_definition.context_window if set, else uses sensible defaults: 128 000 for OpenAI and Grok; config.LLAMA_CPP_N_CTX (default 32 768) for everything else.

When config.MAX_CONTEXT_TOKENS is 0 (auto), the input budget is computed as effective_context_window - effective_max_tokens. That's the number of tokens the history manager is allowed to fill with conversation history before the user prompt is appended.
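A worked version of this arithmetic, using the defaults quoted above. The function signatures, the is_hosted flag, and the treatment of a non-zero MAX_CONTEXT_TOKENS as an explicit budget are assumptions for illustration.

```python
def effective_max_tokens(model_definition: dict, default_max_output: int = 4096) -> int:
    """Output budget: model-specific value if set, else the global default."""
    return model_definition.get("max_tokens") or default_max_output


def effective_context_window(
    model_definition: dict,
    is_hosted: bool,
    llama_cpp_n_ctx: int = 32768,
) -> int:
    """Total context: model-specific value if set, else a per-provider default."""
    explicit = model_definition.get("context_window")
    if explicit:
        return explicit
    # 128,000 for OpenAI and Grok; the llama.cpp setting for everything else.
    return 128_000 if is_hosted else llama_cpp_n_ctx


def input_budget(context_window: int, max_tokens: int, max_context_tokens: int = 0) -> int:
    """Tokens available for history when MAX_CONTEXT_TOKENS is 0 (auto)."""
    if max_context_tokens > 0:
        # Assumption: a non-zero setting is used directly as the budget.
        return max_context_tokens
    return context_window - max_tokens


hosted_budget = input_budget(effective_context_window({}, True), effective_max_tokens({}))
local_budget = input_budget(effective_context_window({}, False), effective_max_tokens({}))
```

With no overrides, a hosted model gets 128,000 - 4,096 = 123,904 input tokens and a local model gets 32,768 - 4,096 = 28,672.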

10 Key files.

clients/base.py

LLMClient. 185 lines. Abstract base with the two-method contract, the context-window math, the Rich console rendering helpers, and the StubClient used for history-only operations that don't need a real LLM.

clients/openai_client.py

OpenAIClient. 574 lines. Chat Completions API. OpenAI proper + any OpenAI-compatible endpoint via base_url. Handles the o1/o3/o4 temperature suppression quirk. Native tools enabled.

clients/responses_client.py

ResponsesClient. 410 lines. The newer /v1/responses API. Message → input/instructions conversion, streaming, tool calling. The base for GrokClient.

clients/grok_client.py

GrokClient. 56 lines. Thin ResponsesClient subclass for xAI's API. Resolves XAI_API_KEY.

clients/llama_cpp_client.py

LlamaCppClient. 472 lines. Local GGUF via llama-cpp-python. VRAM-aware context sizing, configurable chat format, GPU offload, proper unload.

clients/vllm_client.py

VLLMClient. 610 lines. vLLM in-process. Multiprocess engine forcibly disabled for Docker/WSL2 compatibility. Mistral-format tool IDs.

clients/agent_backend_client.py

AgentBackendClient. 229 lines. The bridge to agent_backends/*. Lets Claude Code, Codex, and OpenClaw appear as model aliases. last_result for structured metadata.


Validated against main on 2026-05-13 Source: llm-bawt/src/llm_bawt/clients