BawtHub

One interface, every provider.

Every model — hosted, local, or external-agent — sits behind the same abstract LLMClient base class. The pipeline doesn't know whether it's streaming from OpenAI's /v1/responses API, a llama-cpp Llama instance loaded in-process, or an OpenClaw bridge proxying a Claude Code session. It just calls client.query(messages) or client.stream_raw(messages) and gets text. Tool calling adds one more method on the same surface.

Files: 7 clients + 1 base (clients/)
Total: ~2,500 lines
Providers: OpenAI, xAI, vLLM, llama.cpp, Anthropic, Codex, OpenClaw

01 The contract.

LLMClient in clients/base.py is a small abstract class. It has two abstract methods, a handful of optional overrides, and some Rich-based terminal rendering helpers that the CLI uses for non-streaming output:

| Method | Required? | Purpose |
|---|---|---|
| query(messages, plaintext_output=False, **kwargs) -> str | Abstract | Single-shot synchronous query. Returns the full text response. |
| get_styling() -> (title, border_style) | Abstract | How the CLI renders this client's panel. |
| supports_native_tools() -> bool | Override | Returns True for clients that support an OpenAI-style tools schema. Default: False. |
| query_with_tools(messages, tools_schema, tool_choice, ...) -> (content, tool_calls) | Override | Tool-aware query. Default falls back to plain query. |
| stream_raw(messages, **kwargs) -> Iterator[str] | Override | Yield text chunks for SSE streaming. Default yields the whole response as one chunk. |
| unload() -> None | Override | Free model resources. Override for GGUF / vLLM; no-op for HTTP clients. |
| effective_context_window / effective_max_tokens | Override | Properties used by message-assembly token budgeting. Default reads from model_definition. |

Every client carries a model_definition dict — the entry from the model alias table (YAML seeded, DB authoritative). That dict drives behavior: context_window, max_tokens, temperature, chat_format, adapter, n_gpu_layers, base_url, api_key. Each client reads whatever fields apply to it. The SUPPORTS_STREAMING class attribute is a coarse capability flag the service consults before deciding to use stream_raw.
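
The contract above can be sketched roughly as follows. This is a minimal illustration, not the contents of clients/base.py: the constructor signature, the fallback values 4096 and 1024, the empty tool_calls list, the default value of SUPPORTS_STREAMING, and the EchoClient subclass are all assumptions made for the example, and the Rich rendering helpers are omitted.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List, Optional, Tuple


class LLMClient(ABC):
    """Sketch of the abstract client contract (not the real base.py)."""

    # Coarse capability flag the service consults before using stream_raw.
    # The default shown here is an assumption.
    SUPPORTS_STREAMING = False

    def __init__(self, model: str, model_definition: Optional[Dict[str, Any]] = None):
        self.model = model
        self.model_definition = model_definition or {}

    @abstractmethod
    def query(self, messages: List[dict], plaintext_output: bool = False, **kwargs) -> str:
        """Single-shot synchronous query returning the full text."""

    @abstractmethod
    def get_styling(self) -> Tuple[str, str]:
        """Return (title, border_style) for the CLI panel."""

    def supports_native_tools(self) -> bool:
        return False

    def query_with_tools(self, messages, tools_schema=None, tool_choice="auto", **kwargs):
        # Default: ignore the tools and fall back to a plain query.
        return self.query(messages, **kwargs), []

    def stream_raw(self, messages, **kwargs) -> Iterator[str]:
        # Default: yield the whole response as a single chunk.
        yield self.query(messages, **kwargs)

    def unload(self) -> None:
        # No-op for HTTP clients; in-process clients override this.
        pass

    @property
    def effective_context_window(self) -> int:
        return int(self.model_definition.get("context_window", 4096))  # fallback is hypothetical

    @property
    def effective_max_tokens(self) -> int:
        return int(self.model_definition.get("max_tokens", 1024))  # fallback is hypothetical


class EchoClient(LLMClient):
    """Hypothetical client that returns the last message verbatim."""

    def query(self, messages, plaintext_output=False, **kwargs):
        return messages[-1]["content"]

    def get_styling(self):
        return ("Echo", "green")
```

A subclass only has to supply query and get_styling; streaming, tool calling, and token budgeting then work through the inherited defaults.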

02 OpenAI and OpenAI-compatible endpoints.

OpenAIClient in clients/openai_client.py (574 lines) is the workhorse. It uses the OpenAI Python SDK and supports two modes:

  1. OpenAI proper. No base_url; uses OPENAI_API_KEY from the env.
  2. OpenAI-compatible. Any server that speaks the OpenAI Chat API — Ollama, vLLM in OpenAI-compat mode, llama.cpp server, LM Studio, your favorite local stack. The base_url from the model definition is passed to OpenAI(base_url=...). API keys are optional for local servers; a dummy "not-needed" placeholder is supplied.
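
The two modes could be resolved along these lines. This is a sketch under stated assumptions: openai_client_kwargs is a hypothetical helper, not a function from openai_client.py, and the localhost URL in the usage note is only an example endpoint.

```python
import os
from typing import Any, Dict, Mapping, Optional


def openai_client_kwargs(
    model_definition: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for OpenAI(...) from a model definition."""
    env = os.environ if env is None else env
    base_url = model_definition.get("base_url")
    if base_url:
        # OpenAI-compatible server: the key is optional, so supply a
        # placeholder when the definition does not carry one.
        api_key = model_definition.get("api_key") or "not-needed"
        return {"base_url": base_url, "api_key": api_key}
    # OpenAI proper: no base_url, key comes from the environment.
    return {"api_key": env.get("OPENAI_API_KEY")}
```

With the OpenAI SDK installed, the result would be passed straight through, for example OpenAI(**openai_client_kwargs({"base_url": "http://localhost:11434/v1"})).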

The client knows a few quirks about specific OpenAI models. The o1, o3, and o4 families and the -chat-latest, -search-preview, and -audio-preview variants don't accept temperature or top_p, so the client suppresses those fields for matching model ids. Native tool calling is enabled (supports_native_tools() returns True).
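
A minimal sketch of that suppression follows. The matching rule is an assumption (prefix match for the o-series, substring match for the variant markers), and both function names are hypothetical; the real client may match model ids differently.

```python
from typing import Any, Dict, List, Optional

# Assumed matching rules; the real client may use different ones.
NO_SAMPLING_PREFIXES = ("o1", "o3", "o4")
NO_SAMPLING_MARKERS = ("-chat-latest", "-search-preview", "-audio-preview")


def accepts_sampling_params(model_id: str) -> bool:
    """Return False for model ids that reject temperature / top_p."""
    lowered = model_id.lower()
    if lowered.startswith(NO_SAMPLING_PREFIXES):
        return False
    return not any(marker in lowered for marker in NO_SAMPLING_MARKERS)


def build_request(
    model_id: str,
    messages: List[dict],
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
) -> Dict[str, Any]:
    """Assemble request kwargs, dropping sampling fields where unsupported."""
    payload: Dict[str, Any] = {"model": model_id, "messages": messages}
    if accepts_sampling_params(model_id):
        if temperature is not None:
            payload["temperature"] = temperature
        if top_p is not None:
            payload["top_p"] = top_p
    return payload
```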

03 Grok via the Responses API.

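xAI's API is OpenAI-compatible but exposes the newer Responses API at /v1/responses (not /v1/chat/completions). llm-bawt models this in two layers:

  1. ResponsesClient in clients/responses_client.py (410 lines) speaks the /v1/responses API: it converts the message array into input and instructions, and handles streaming and tool calling.
  2. GrokClient in clients/grok_client.py (56 lines) is a thin ResponsesClient subclass for xAI's API that resolves XAI_API_KEY.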

Grok is also the default MAINTENANCE_MODEL in the seed config — the model used by the scheduler for fact extraction and (when configured) by the consolidation safety check.

Why the Responses API and not Chat Completions.

The Responses API separates system instructions from the message array (instructions is a dedicated top-level field, not a message). This matches the way the pipeline already builds its system prompt — one block from PromptBuilder — and avoids ambiguity around system-message ordering with some providers. It also gives access to xAI-specific features that aren't in the legacy Chat Completions shape.
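
The message-to-payload conversion can be illustrated as below. This is a sketch, not the code in responses_client.py: to_responses_payload is a hypothetical name, and joining multiple system messages with blank lines is an assumption.

```python
from typing import Any, Dict, List


def to_responses_payload(model: str, messages: List[Dict[str, str]]) -> Dict[str, Any]:
    """Split a chat-style message list into Responses API fields.

    System messages become the top-level `instructions` string; every
    other message is kept, in order, in the `input` list.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    input_items = [
        {"role": m["role"], "content": m["content"]}
        for m in messages
        if m["role"] != "system"
    ]
    payload: Dict[str, Any] = {"model": model, "input": input_items}
    if system_parts:
        payload["instructions"] = "\n\n".join(system_parts)
    return payload
```

Because the system prompt travels in its own field, its position relative to the conversation turns no longer matters.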

04 Local GGUF via llama-cpp-python.

LlamaCppClient in clients/llama_cpp_client.py (472 lines) loads GGUF models in-process via the llama-cpp-python binding. Optional dependency — the import is guarded so the service stays runnable without GPU dependencies installed.
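
A guarded import of this kind usually looks like the following. The flag name LLAMA_CPP_AVAILABLE and the require_llama_cpp helper are hypothetical; only the llama_cpp module and its Llama class come from the llama-cpp-python package.

```python
try:
    # Provided by the optional llama-cpp-python package.
    from llama_cpp import Llama
    LLAMA_CPP_AVAILABLE = True
except ImportError:
    # Keep the module importable when the optional dependency is absent.
    Llama = None
    LLAMA_CPP_AVAILABLE = False


def require_llama_cpp() -> None:
    """Raise a clear error only when a GGUF model is actually requested."""
    if not LLAMA_CPP_AVAILABLE:
        raise RuntimeError(
            "llama-cpp-python is not installed; local GGUF models are unavailable"
        )
```

Deferring the failure to the point of use means HTTP-only deployments never pay for the missing dependency.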

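Key behaviors: VRAM-aware context sizing, a configurable chat format (chat_format in the model definition), GPU offload (n_gpu_layers), and an unload() override that frees the model's resources.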

05 vLLM for HuggingFace inference.

VLLMClient in clients/vllm_client.py (610 lines) is the alternative local path: vLLM running in-process with PagedAttention for high-throughput serving. Also an optional dependency.

The client sets VLLM_ENABLE_V1_MULTIPROCESSING=0 at import time — the multiprocess engine spawns engine cores via ZMQ/shared memory and that fails inside Docker and WSL2. With the flag off, vLLM runs the engine in-process. Model load is slow (30–150 s for CUDA graph compilation), so the lifecycle manager pays attention to whether you're switching between vLLM models versus loading one for the first time.
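
In practice the flag has to be in the environment before vLLM is imported. A minimal sketch of that ordering (the placement at the top of the module is an assumption about how vllm_client.py arranges it):

```python
import os

# Set before any `import vllm` so the engine reads the flag at startup
# and runs in-process instead of spawning engine-core subprocesses.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

# The guarded `import vllm` would follow here in the real module.
```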

Mistral-style tool calling is supported. The client generates 9-character alphanumeric tool call IDs to match Mistral's format. Streaming uses vLLM's async SamplingParams with sampling configured from the model definition.
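
Generating an ID in that shape takes only a few lines. This is a sketch with a hypothetical function name; the real client may draw its characters differently.

```python
import random
import string

_ALPHABET = string.ascii_letters + string.digits


def new_tool_call_id(length: int = 9) -> str:
    """Return a random alphanumeric ID, 9 characters by default."""
    return "".join(random.choices(_ALPHABET, k=length))
```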

06 Agent backends as virtual clients.

AgentBackendClient in clients/agent_backend_client.py (229 lines) is the most important client to understand for the BawtHub architecture: it lets external agent SDKs (Claude Agent SDK, OpenAI Codex SDK, OpenClaw gateway) appear to the rest of llm-bawt as just another model.

The mechanism:

  1. The service startup looks at every bot's agent_backend field. For each one it sees, it injects a synthetic model_definition with type: "agent_backend" and backend: <name> into config.defined_models. The model alias is the backend name itself: openclaw, claude-code, codex.
  2. When a request targets that alias, the pipeline constructs an AgentBackendClient. The constructor calls registry.get_backend(name) from agent_backends/registry.py to fetch the backend implementation.
  3. query() extracts the latest user message from the messages array and synchronously runs the backend's async chat() coroutine via asyncio.run or the running loop's executor.
  4. The streaming path (stream_raw) iterates the backend's own stream_raw generator — which yields a mix of text-delta strings and dict-shaped events (tool_call, tool_result, token_usage, metadata) — and shuttles them through to the SSE layer.
  5. Structured metadata (model used, provider, duration, usage, tool calls) is stashed in last_result for the caller to consume after the query finishes.
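
Steps 1, 3, and 5 above can be sketched as follows. Everything here is illustrative: EchoBackend, the chat(prompt) signature, the last_result keys, the helper names, and the worker-thread reading of "the running loop's executor" are assumptions, not the code in agent_backend_client.py.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def inject_agent_backend_models(bots: List[dict], defined_models: Dict[str, dict]) -> None:
    """Step 1: register a synthetic model definition per agent backend."""
    for bot in bots:
        name = bot.get("agent_backend")
        if name:
            # The alias is the backend name itself. Not overwriting an
            # existing entry is an assumption of this sketch.
            defined_models.setdefault(name, {"type": "agent_backend", "backend": name})


def latest_user_message(messages: List[dict]) -> str:
    """Return the content of the most recent user turn."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message.get("content", "")
    raise ValueError("messages contain no user turn")


def run_sync(coro) -> Any:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop in this thread
    # A loop is already running here: use a worker thread with its own loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


class EchoBackend:
    """Hypothetical backend standing in for a real agent bridge."""

    async def chat(self, prompt: str) -> str:
        await asyncio.sleep(0)
        return f"echo: {prompt}"


class AgentBackendClientSketch:
    def __init__(self, backend: Any):
        self.backend = backend
        self.last_result: Dict[str, Any] = {}

    def query(self, messages: List[dict], **kwargs) -> str:
        # Step 3: latest user message in, backend text out.
        text = run_sync(self.backend.chat(latest_user_message(messages)))
        # Step 5: stash metadata for the caller (keys are hypothetical).
        self.last_result = {"provider": "echo", "chars": len(text)}
        return text
```

Note that the worker-thread branch blocks the calling thread until the coroutine finishes, which is acceptable only when the caller is itself already off the event loop's critical path.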

Because this is a regular LLMClient, every other subsystem still applies: history persistence, memory injection, post-turn extraction, turn-log writes. A Claude Code session running through the bridge gets the same five-layer memory that a Grok-powered nova turn gets. See the agent backends page for the bridge-side protocol details.

07 Streaming semantics.

stream_raw is the canonical streaming surface. The default implementation in LLMClient falls back to query() and yields a single chunk, which is enough for non-streaming clients to work in a streaming pipeline. Real streaming clients override it to yield chunks as they arrive.

The service's SSE generator (service/chat_streaming.py) calls stream_raw through the chat-stream worker, which bridges the sync iterator into an asyncio.Queue so the async SSE generator can consume it without blocking the event loop.
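
The sync-to-async bridge described here follows a common pattern, sketched below. The function name, the sentinel, and the use of a plain thread are assumptions; the worker in service/chat_streaming.py may be organized differently.

```python
import asyncio
import threading
from typing import Any, AsyncIterator, Callable, Iterator

_DONE = object()  # sentinel marking the end of the stream


async def bridge_sync_iterator(make_iter: Callable[[], Iterator[Any]]) -> AsyncIterator[Any]:
    """Consume a blocking iterator in a thread and re-yield it asynchronously."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for chunk in make_iter():
                # asyncio.Queue is not thread-safe, so hand each chunk
                # to the event loop thread instead of calling put directly.
                loop.call_soon_threadsafe(queue.put_nowait, chunk)
        except Exception as exc:
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = await queue.get()
        if item is _DONE:
            break
        if isinstance(item, Exception):
            raise item  # surface producer errors to the consumer
        yield item
```

The event loop stays free between chunks because the only awaiting happens on queue.get(); text deltas and dict-shaped events pass through unchanged.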

08 Native vs ReAct tool calling — picked here.

The pipeline's config.get_tool_format(...) picks the tool format per model: native_openai, react, or xml. Whether the client actually uses native calling at runtime is the AND of two things:

  1. The picked format is native_openai.
  2. The client returns True from supports_native_tools().

Today that means: OpenAIClient and ResponsesClient (and so GrokClient) support native; everything else falls back to text-based ReAct parsing even though llama-cpp-python and vLLM have native-tools implementations of their own. The text-parsing fallback is dramatically more reliable for multi-turn tool conversations against local models — see the tools page.
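
Expressed as code, the runtime decision is a single conjunction. The helper name and the two stub classes are hypothetical, used only to make the example self-contained.

```python
def use_native_tools(tool_format: str, client) -> bool:
    """Native calling requires both the picked format and client support."""
    return tool_format == "native_openai" and client.supports_native_tools()


class NativeStub:
    """Stands in for a client such as OpenAIClient."""

    def supports_native_tools(self) -> bool:
        return True


class TextOnlyStub:
    """Stands in for a client that keeps the base-class default."""

    def supports_native_tools(self) -> bool:
        return False
```

Either condition failing sends the request down the text-parsing path.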

09 Context-window math.

Every client exposes two properties that the message-assembly stage uses for token budgeting: effective_context_window and effective_max_tokens. By default, both read from model_definition.

When config.MAX_CONTEXT_TOKENS is 0 (auto), the input budget is computed as effective_context_window - effective_max_tokens. That's the number of tokens the history manager is allowed to fill with conversation history before the user prompt is appended.
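
A worked version of that arithmetic follows. The function name is hypothetical, and treating a non-zero MAX_CONTEXT_TOKENS as the budget itself is an assumption; the text above only specifies the auto case.

```python
def input_token_budget(
    effective_context_window: int,
    effective_max_tokens: int,
    max_context_tokens: int = 0,
) -> int:
    """Tokens available for conversation history."""
    if max_context_tokens == 0:
        # Auto mode: reserve room for the response, give the rest to input.
        return effective_context_window - effective_max_tokens
    # Assumption: an explicit setting is used as the budget directly.
    return max_context_tokens
```

For example, a model with a 32,768-token window and a 4,096-token response cap leaves 32,768 - 4,096 = 28,672 tokens for history in auto mode.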

10 Key files.

clients/base.py
LLMClient. 185 lines. Abstract base with the two-method contract, the context-window math, the Rich console rendering helpers, and the StubClient used for history-only operations that don't need a real LLM.
clients/openai_client.py
OpenAIClient. 574 lines. Chat Completions API. OpenAI proper + any OpenAI-compatible endpoint via base_url. Handles the o1/o3/o4 temperature suppression quirk. Native tools enabled.
clients/responses_client.py
ResponsesClient. 410 lines. The newer /v1/responses API. Message → input/instructions conversion, streaming, tool calling. The base for GrokClient.
clients/grok_client.py
GrokClient. 56 lines. Thin ResponsesClient subclass for xAI's API. Resolves XAI_API_KEY.
clients/llama_cpp_client.py
LlamaCppClient. 472 lines. Local GGUF via llama-cpp-python. VRAM-aware context sizing, configurable chat format, GPU offload, proper unload.
clients/vllm_client.py
VLLMClient. 610 lines. vLLM in-process. Multiprocess engine forcibly disabled for Docker/WSL2 compatibility. Mistral-format tool IDs.
clients/agent_backend_client.py
AgentBackendClient. 229 lines. The bridge to agent_backends/*. Lets Claude Code, Codex, and OpenClaw appear as model aliases. last_result for structured metadata.
Validated against main on 2026-05-13
Source: llm-bawt/src/llm_bawt/clients