Coding agent → gateway → local model · every token observable · runs on your own machine
Pick either. Both are provider-agnostic — memory · skills · cron
Why: a harness that points at one endpoint and keeps context across sessions. OpenCode is the easy on-ramp; Hermes adds scheduled autonomous jobs. Swap freely — both speak the same OpenAI API.
One OpenAI-compatible endpoint · logs tokens & cost
Why: the single choke point — every request, token and dollar passes through here, so nothing is invisible. Routes to any backend.
High-throughput local inference
Why: continuous batching on CUDA cores, which gives higher throughput under concurrent requests than a llama.cpp backend typically delivers. The model you can run scales with your GPU's VRAM.
A second machine's model as localhost:1234
Why: optional — surfaces a model running on another box locally over an encrypted Tailscale mesh, no open ports. Skip this if you only have one machine.
Self-hosted · MIT · free
Why: agent-step traces, evals & metrics in one stack you own — the deep view LiteLLM logs alone can't give.
Copy · paste · run — bottom-up, one endpoint at a time
vLLM gives continuous batching on your NVIDIA GPU's CUDA cores. It runs inside WSL2 (Linux on Windows). Do steps 1–2 once, then 3–4 to serve. Model choice depends on VRAM: a 7B coder fits in ~8 GB only when quantized to 4-bit (its unquantized 16-bit weights alone take roughly 14 to 15 GB); bigger MoE coders need more (quantize to fit).
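As a rough aid for the VRAM sizing above, here is a minimal sketch of the usual back-of-the-envelope rule: weight memory is about parameter count times bytes per parameter. The function name and the dtype table are illustrative assumptions, and the estimate deliberately ignores KV cache, activations and CUDA overhead, so treat the result as a lower bound.

```python
# Rough weight-only VRAM estimate: parameters x bytes per parameter.
# Assumption: KV cache, activations and CUDA overhead are NOT included,
# so real usage is higher than what this returns.

BYTES_PER_PARAM = {
    "bf16": 2.0,  # unquantized 16-bit weights
    "int8": 1.0,  # 8-bit quantization
    "int4": 0.5,  # 4-bit quantization (the AWQ/GPTQ class)
}


def weight_gib(params_billion: float, dtype: str) -> float:
    """Approximate size of the model weights in GiB."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"7B @ {dtype}: ~{weight_gib(7, dtype):.1f} GiB of weights")
```

Compare the number against your card's VRAM, then leave headroom for the context window set by `--max-model-len`.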
# Install WSL2 + Ubuntu, then reboot
wsl --install -d Ubuntu

# 1. confirm your NVIDIA GPU is visible inside WSL2
nvidia-smi

# 2. fast python env manager
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# 3. create env + install vLLM
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install vllm

# OpenAI-compatible server on http://localhost:8000/v1
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
  --port 8000 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92
Leave this running. Backend endpoint = http://localhost:8000/v1. Got more VRAM? Swap in a bigger MoE coder like Qwen/Qwen3-Coder-30B-A3B-Instruct with an AWQ/GPTQ quant sized to your card.
mkdir -p ~/Documents/new-arch/langfuse && cd ~/Documents/new-arch/langfuse

# official compose: postgres + clickhouse + redis + minio + web
curl -o docker-compose.yml \
  https://raw.githubusercontent.com/langfuse/langfuse/main/docker-compose.yml

# headless init: pre-creates org/project/user + API keys (no UI clicking)
cat > .env <<'EOF'
LANGFUSE_INIT_ORG_ID=local
LANGFUSE_INIT_ORG_NAME=Local
LANGFUSE_INIT_PROJECT_ID=local-ai
LANGFUSE_INIT_PROJECT_NAME=local-ai
LANGFUSE_INIT_PROJECT_PUBLIC_KEY=pk-lf-1234567890abcdef
LANGFUSE_INIT_PROJECT_SECRET_KEY=sk-lf-1234567890abcdef
LANGFUSE_INIT_USER_EMAIL=you@example.com
LANGFUSE_INIT_USER_NAME=Local Admin
LANGFUSE_INIT_USER_PASSWORD=langfuse123
EOF

# boot it (first run pulls images, which takes a few minutes)
docker compose up -d
Log in at http://localhost:3000 with you@example.com / langfuse123 (the values you set above). Project local-ai is already there with the keys above.

⚠ The compose file ships insecure CHANGEME default secrets, which is fine for a local-only box. Regenerate them (openssl rand -hex 32) before exposing this to any network.
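If openssl is not at hand, the same kind of value can be produced with Python's standard library. This is a minimal sketch: `new_secret` is an illustrative helper, and the variable names printed at the bottom are hypothetical placeholders, not the names used in the compose file. Substitute whichever CHANGEME entries your docker-compose.yml actually contains.

```python
import secrets


def new_secret(num_bytes: int = 32) -> str:
    """Random hex string with the same shape as `openssl rand -hex 32`."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Hypothetical names for illustration only; replace them with the
    # CHANGEME entries found in your own docker-compose.yml.
    for name in ("EXAMPLE_SECRET_A", "EXAMPLE_SECRET_B"):
        print(f"{name}={new_secret()}")
```

Each call returns a fresh 64-character hex string, so generate one value per secret rather than reusing a single one.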
mkdir -p ~/Documents/new-arch/litellm && cd ~/Documents/new-arch/litellm
python3 -m venv .venv && source .venv/bin/activate

# pin langfuse<3: LiteLLM's callback imports the v2 SDK (talks fine to a v3 server)
# prisma is only needed for the dashboard UI (keys, spend); skip it for trace-only
pip install "litellm[proxy]" "langfuse<3" prisma

cat > config.yaml <<'EOF'
model_list:
  - model_name: qwen-coder            # what the agent asks for
    litellm_params:
      # --- Windows / vLLM ---
      model: openai/Qwen/Qwen2.5-Coder-7B-Instruct
      api_base: http://localhost:8000/v1
      # --- Mac / LM Studio (use these two instead) ---
      # model: openai/qwen2.5-coder
      # api_base: http://localhost:1234/v1
      api_key: dummy                  # local backends ignore it

litellm_settings:
  # the entire integration: ship every call to Langfuse
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
EOF

# these match the keys from step 2
export LANGFUSE_PUBLIC_KEY=pk-lf-1234567890abcdef
export LANGFUSE_SECRET_KEY=sk-lf-1234567890abcdef
export LANGFUSE_HOST=http://localhost:3000

# the proxy refuses anonymous calls, so set a key clients must send
export LITELLM_MASTER_KEY=sk-local-master

litellm --config config.yaml --port 4000
Gateway is now live at http://localhost:4000/v1. That's the whole pipeline — routing + traces. The dashboard UI (:4000/ui) is optional and needs a database, below.
Without a database, :4000/ui login fails and management calls return "No connected db". The fix: give LiteLLM a Postgres DB. You already have one: the Langfuse stack ships a Postgres container, so just create a separate litellm database inside it (no extra RAM).

# 1. create a litellm DB inside the Langfuse Postgres container
docker exec langfuse-postgres-1 \
  psql -U postgres -c "CREATE DATABASE litellm"

# 2. generate the prisma client + create the tables (one time)
export PATH="$PWD/.venv/bin:$PATH"
export DATABASE_URL="postgresql://postgres:postgres@localhost:5432/litellm"
prisma generate --schema .venv/lib/python*/site-packages/litellm_proxy_extras/schema.prisma
prisma db push --schema .venv/lib/python*/site-packages/litellm_proxy_extras/schema.prisma

export LITELLM_MASTER_KEY=sk-local-master
export DATABASE_URL="postgresql://postgres:postgres@localhost:5432/litellm"

# the username/password you'll type at :4000/ui
export UI_USERNAME=admin
export UI_PASSWORD=admin

litellm --config config.yaml --port 4000
Log in with admin / admin. First boot runs DB migrations (a minute or two) before it accepts logins.

curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-local-master" \
-d '{
"model": "qwen-coder",
"messages": [{"role":"user","content":"write a haiku about local LLMs"}]
}'
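The same smoke test can be written with Python's standard library only, which is handy when scripting checks. This is a sketch under the assumptions already used in this guide: the gateway listens on localhost:4000, the master key is sk-local-master, and the model alias is qwen-coder. The helper names `build_request` and `ask` are illustrative.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:4000/v1/chat/completions"
MASTER_KEY = "sk-local-master"  # the LITELLM_MASTER_KEY exported earlier


def build_request(prompt: str, model: str = "qwen-coder") -> urllib.request.Request:
    """Build the same POST the curl command sends, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {MASTER_KEY}",
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt through the gateway and return the reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    # OpenAI-style chat responses put the text under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the gateway and backend running, `print(ask("write a haiku about local LLMs"))` should print the model's reply, and the call should show up as a trace in Langfuse.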
The golden rule: the agent talks only to LiteLLM. That's how every token lands in Langfuse — the agent never knows (or cares) whether vLLM or LM Studio answered.
# install
curl -fsSL https://opencode.ai/install | bash

# tell it to use your local gateway (key = your LiteLLM master key)
export OPENAI_API_BASE=http://localhost:4000/v1
export OPENAI_API_KEY=sk-local-master
opencode

Use qwen-coder as the model (the model_name from your LiteLLM config).

Hermes: point its model at the gateway and register the master key in its credential pool (it won't read a plain env var if a pooled credential already exists):

# ~/.hermes/config.yaml → model.base_url: http://127.0.0.1:4000/v1

# register the gateway's master key in Hermes' credential pool
hermes auth add lmstudio --type api-key \
  --label litellm-master --api-key sk-local-master

# run a one-shot through the gateway (every call lands in Langfuse)
hermes -z "write a haiku about local LLMs" -m qwen-coder --yolo
Every call flows agent → LiteLLM → model, and LiteLLM fires a copy to Langfuse. So the same request shows up in both places, but each tool keeps a different slice of it.
:4000/ui · Log in (admin/admin), then:
:3000 · Log in (the email & password you set on first run), then:
Backend (vLLM / LM Studio) → :8000 / :1234 · LiteLLM gateway → :4000 · Langfuse UI → :3000
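Before debugging anything deeper, it helps to confirm that each of these ports is accepting connections. The following is a minimal sketch using a plain TCP connect; the service labels and the `is_listening` and `report` helpers are illustrative. A successful connect only shows that something is listening, not that it is the right service.

```python
import socket

# Ports taken from the list above.
SERVICES = {
    "vLLM backend": 8000,
    "LM Studio backend": 1234,
    "LiteLLM gateway": 4000,
    "Langfuse UI": 3000,
}


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report() -> dict:
    """Map each service label to whether its port accepts connections."""
    return {name: is_listening(port) for name, port in SERVICES.items()}
```

Calling `report()` returns a dictionary such as one entry per service set to True or False. On a single-backend setup, one of the two backend ports is expected to be closed.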
From your keystroke to the Langfuse trace · click a stop or press ▶ Play
Open your agent and send a prompt. It talks only to LiteLLM — never to the model directly.
opencode # then type your prompt and hit Enter
hermes -m qwen-coder \
  -z "Reverse a string in Python."

qwen-coder → Ollama (works now); qwen3.5-…-mlx → LM Studio (load a model in its app first).

The gateway reads the model name you sent and picks the backend: same request shape, different engine. This single choke point is why every token is observable: routing and the Langfuse callback both fire right here.

qwen-coder / seed-coder → Ollama (:11434)
qwen3.5-…-mlx → LM Studio (:1234)
The Langfuse callback fires here too (success_callback in config.yaml).

Your local model does the actual inference, on your own hardware. Nothing leaves the machine.
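To make the routing idea concrete, here is a toy model of name-based lookup. It is not LiteLLM's implementation. The `ROUTES` table is hand-copied from the config.yaml shown earlier in this guide, and `pick_backend` is a hypothetical helper that only illustrates the principle: the `model` field of the request selects the backend entry.

```python
# Toy model of name-based routing. This is NOT LiteLLM's code; the table is
# hand-written from the config.yaml earlier in this guide.
ROUTES = {
    "qwen-coder": {
        "model": "openai/Qwen/Qwen2.5-Coder-7B-Instruct",
        "api_base": "http://localhost:8000/v1",
    },
}


def pick_backend(request_body: dict) -> dict:
    """Look up the backend entry for the model name the client sent."""
    name = request_body["model"]
    try:
        return ROUTES[name]
    except KeyError:
        raise ValueError(f"no backend registered for model {name!r}") from None


if __name__ == "__main__":
    chosen = pick_backend({"model": "qwen-coder", "messages": []})
    print(chosen["api_base"])
```

Adding a second backend amounts to adding another entry to the table, which mirrors adding another item under `model_list` in config.yaml.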
Received POST /v1/chat/completions appears the instant you send the prompt. Ollama logs to its own console. This is the proof the gateway routed your request into the engine.

Open the gateway dashboard and log in (username admin, password admin; type exactly five letters).

prompt / completion / total tokens, latency, and which backend answered.

Open Langfuse, log in, pick your project, then click Tracing in the left sidebar.
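The prompt, completion and total token counts can also be read straight from a response body, since OpenAI-style chat responses carry a `usage` object. The sketch below assumes that shape. The `token_usage` helper is illustrative and the sample response is made up, so real numbers will differ.

```python
def token_usage(response: dict) -> dict:
    """Extract token counts from an OpenAI-style chat completion response."""
    usage = response.get("usage", {})
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    # Fall back to the sum if the backend omits total_tokens.
    total = usage.get("total_tokens", prompt + completion)
    return {"prompt": prompt, "completion": completion, "total": total}


# Made-up sample response for illustration only.
sample = {
    "model": "qwen-coder",
    "usage": {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42},
}

if __name__ == "__main__":
    print(token_usage(sample))
```

Comparing these values with what the two dashboards show for the same request is a quick way to confirm the trace you are looking at is the call you just made.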