Local AI Coding Stack — the 101

Coding agent → gateway → local model · every token observable · runs on your own machine

[Diagram: agent → gateway over the OpenAI API; the gateway routes to the model backend and fires a callback to Langfuse]
Agent

Coding Agent — OpenCode or Hermes

Pick either. Both are provider-agnostic — memory · skills · cron

Why: a harness that points at one endpoint and keeps context across sessions. OpenCode is the easy on-ramp; Hermes adds scheduled autonomous jobs. Swap freely — both speak the same OpenAI API.

Gateway

LiteLLM Gateway

One OpenAI-compatible endpoint · logs tokens & cost

Why: the single choke point — every request, token and dollar passes through here, so nothing is invisible. Routes to any backend.

Compute · WSL2 (CUDA)

vLLM

High-throughput local inference

Qwen2.5 / Qwen3 Coder · size the model to your VRAM (quantize to fit)

Why: continuous batching on CUDA cores keeps the GPU busy when several requests are in flight, which is where vLLM typically out-runs a llama.cpp backend. The model you can run scales with your GPU's VRAM.

Observability

Langfuse

Self-hosted · MIT · free

Postgres
ClickHouse
Redis
MinIO/S3

Why: agent-step traces, evals & metrics in one stack you own — the deep view LiteLLM logs alone can't give.

Build it from scratch

Copy · paste · run — bottom-up, one endpoint at a time

1

Pick your model backend

The thing that actually runs the LLM and speaks OpenAI /v1. Choose your machine.

per machine
🪟 Windows + NVIDIA → vLLM
🍎 macOS (Apple Silicon) → LM Studio

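vLLM gives continuous batching on your NVIDIA GPU's CUDA cores. It runs inside WSL2 (Linux on Windows). The WSL2 install and the vLLM setup below are one-time; after that you only rerun the serve command. Model choice depends on VRAM: a 7B coder needs roughly 15 GB for BF16 weights alone, or fits in ~8 GB with a 4-bit quant; bigger MoE coders need more (quantize to fit).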
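
To make the VRAM rule concrete, here is a rough back-of-the-envelope sketch. It assumes weight memory is simply parameters × bits per weight ÷ 8. The helper name weights_gb and the 7.6B parameter figure for a 7B-class coder are illustrative assumptions, and KV cache plus runtime overhead come on top.

```shell
# weights_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# Rough size of the model weights in GB (hypothetical helper, not part of vLLM).
# Formula: params * bits / 8. KV cache and CUDA overhead are NOT included,
# so read the result as a lower bound on the VRAM you need.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 7.6 16   # 7B-class coder, BF16 weights
weights_gb 7.6 4    # same model with a 4-bit quant
weights_gb 30 4     # 30B-class MoE coder with a 4-bit quant
```

Under these assumptions the BF16 7B model wants about 15 GB for weights alone, which is why an 8 GB card needs a 4-bit variant.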

Click first: install the latest NVIDIA Game Ready / Studio driver on Windows (the host). WSL2 borrows the host driver — you do not install CUDA drivers inside Ubuntu.
PowerShell (Admin) — one time
# Install WSL2 + Ubuntu, then reboot
wsl --install -d Ubuntu
Ubuntu (WSL2) — verify GPU + install vLLM
# 1. confirm your NVIDIA GPU is visible inside WSL2
nvidia-smi

# 2. fast python env manager
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# 3. create env + install vLLM
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install vllm
Ubuntu (WSL2) — serve the coder model
# OpenAI-compatible server on http://localhost:8000/v1
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
  --port 8000 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92

Leave this running. Backend endpoint = http://localhost:8000/v1. Got more VRAM? Swap in a bigger MoE coder like Qwen/Qwen3-Coder-30B-A3B-Instruct with an AWQ/GPTQ quant sized to your card.
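
vLLM can take a while to download and load the model before the endpoint answers. The following is a minimal sketch for waiting on it, assuming curl is installed. wait_for_http is a hypothetical helper name; /v1/models is the standard OpenAI-compatible model listing route.

```shell
# wait_for_http URL [TRIES] [DELAY_SECONDS]
# Polls URL until it returns a successful HTTP response,
# or gives up after TRIES attempts.
wait_for_http() {
  local url="$1" tries="${2:-30}" delay="${3:-2}" attempt=1
  while [ "$attempt" -le "$tries" ]; do
    # -f: treat HTTP errors as failure, -s: no progress meter
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Usage, once `vllm serve` is starting in another terminal:
#   wait_for_http http://localhost:8000/v1/models 60 5 && echo "backend ready"
```

The function only reports that the route answers; it does not check which model is loaded.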

2

Stand up Langfuse (observability)

Self-hosted, MIT, free. Stores every trace, token count & latency. Same on Mac or Windows.

shared
Click first: install & launch Docker Desktop and wait until the whale icon stops animating (engine running).
Terminal — get the stack + headless keys
mkdir -p ~/Documents/new-arch/langfuse && cd ~/Documents/new-arch/langfuse

# official compose: postgres + clickhouse + redis + minio + web
curl -o docker-compose.yml \
  https://raw.githubusercontent.com/langfuse/langfuse/main/docker-compose.yml

# headless init: pre-creates org/project/user + API keys (no UI clicking)
cat > .env <<'EOF'
LANGFUSE_INIT_ORG_ID=local
LANGFUSE_INIT_ORG_NAME=Local
LANGFUSE_INIT_PROJECT_ID=local-ai
LANGFUSE_INIT_PROJECT_NAME=local-ai
LANGFUSE_INIT_PROJECT_PUBLIC_KEY=pk-lf-1234567890abcdef
LANGFUSE_INIT_PROJECT_SECRET_KEY=sk-lf-1234567890abcdef
LANGFUSE_INIT_USER_EMAIL=you@example.com
LANGFUSE_INIT_USER_NAME=Local Admin
LANGFUSE_INIT_USER_PASSWORD=langfuse123
EOF

# boot it (first run pulls images — a few minutes)
docker compose up -d
Open http://localhost:3000 → log in with you@example.com / langfuse123 (the values you set above). Project local-ai is already there with the keys above.

⚠ The compose file ships insecure CHANGEME default secrets — fine for a local-only box. Regenerate them (openssl rand -hex 32) before exposing this to any network.
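
The same caution applies to the placeholder API keys written to .env in this step: they are easy to guess. Below is a small sketch for generating random ones instead, assuming either openssl or /dev/urandom is available. rand_hex is a hypothetical helper, and the pk-lf- / sk-lf- prefixes simply mirror the placeholders above.

```shell
# rand_hex N: print N random bytes as 2*N lowercase hex characters
rand_hex() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex "$1"
  else
    # fallback without openssl: read N bytes from the kernel RNG
    od -An -tx1 -N "$1" /dev/urandom | tr -d ' \t\n'
    echo
  fi
}

# Print replacement lines for the two key entries in .env
printf 'LANGFUSE_INIT_PROJECT_PUBLIC_KEY=pk-lf-%s\n' "$(rand_hex 16)"
printf 'LANGFUSE_INIT_PROJECT_SECRET_KEY=sk-lf-%s\n' "$(rand_hex 16)"
```

Paste the two printed lines over the placeholders in .env before the first docker compose up, and export the same values as LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY in step 3.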

3

LiteLLM proxy → with Langfuse callback

One OpenAI endpoint in front of your model; every call is shipped to Langfuse. This is the whole integration.

shared
Terminal — install
mkdir -p ~/Documents/new-arch/litellm && cd ~/Documents/new-arch/litellm
python3 -m venv .venv && source .venv/bin/activate
# pin langfuse<3 — LiteLLM's callback imports the v2 SDK (talks fine to a v3 server)
# prisma — only needed for the dashboard UI (keys, spend); skip it for trace-only
pip install "litellm[proxy]" "langfuse<3" prisma
config.yaml — uncomment the line for YOUR backend
cat > config.yaml <<'EOF'
model_list:
  - model_name: qwen-coder            # what the agent asks for
    litellm_params:
      # --- Windows / vLLM ---
      model: openai/Qwen/Qwen2.5-Coder-7B-Instruct
      api_base: http://localhost:8000/v1
      # --- Mac / LM Studio (use these two instead) ---
      # model: openai/qwen2.5-coder
      # api_base: http://localhost:1234/v1
      api_key: dummy                  # local backends ignore it

litellm_settings:
  # the entire integration: ship every call to Langfuse
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
EOF
Terminal — point at Langfuse & start the proxy
# these match the keys from step 2
export LANGFUSE_PUBLIC_KEY=pk-lf-1234567890abcdef
export LANGFUSE_SECRET_KEY=sk-lf-1234567890abcdef
export LANGFUSE_HOST=http://localhost:3000

# set a master key: clients must send it as their API key
export LITELLM_MASTER_KEY=sk-local-master

litellm --config config.yaml --port 4000

Gateway is now live at http://localhost:4000/v1. That's the whole pipeline — routing + traces. The dashboard UI (:4000/ui) is optional and needs a database, below.

Optional — the LiteLLM dashboard UI. Without a database, :4000/ui login fails and management calls return "No connected db". The fix: give LiteLLM a Postgres DB. You already have one — the Langfuse stack ships a Postgres container, so just create a separate litellm database inside it (no extra RAM).
Terminal — one-time DB setup for the UI
# 1. create a litellm DB inside the Langfuse Postgres container
docker exec langfuse-postgres-1 \
  psql -U postgres -c "CREATE DATABASE litellm"

# 2. generate the prisma client + create the tables (one time)
export PATH="$PWD/.venv/bin:$PATH"
export DATABASE_URL="postgresql://postgres:postgres@localhost:5432/litellm"
prisma generate --schema .venv/lib/python*/site-packages/litellm_proxy_extras/schema.prisma
prisma db push --schema .venv/lib/python*/site-packages/litellm_proxy_extras/schema.prisma
Terminal — restart the proxy WITH the database + UI login
export LITELLM_MASTER_KEY=sk-local-master
export DATABASE_URL="postgresql://postgres:postgres@localhost:5432/litellm"
# the username/password you'll type at :4000/ui
export UI_USERNAME=admin
export UI_PASSWORD=admin

litellm --config config.yaml --port 4000
Open http://localhost:4000/ui → log in with admin / admin. First boot runs DB migrations (a minute or two) before it accepts logins.
4

Send a test call → see the trace

Prove the whole chain works end to end.

verify
Terminal — hit the gateway
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-local-master" \
  -d '{
    "model": "qwen-coder",
    "messages": [{"role":"user","content":"write a haiku about local LLMs"}]
  }'
Open http://localhost:3000 → Tracing → your call appears with the prompt, response, token counts and latency. If it's there, the spine works. ✅
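
The JSON returned by the test call also carries a usage object, which is handy for cross-checking the token counts Langfuse shows. Here is a minimal sketch, assuming the standard OpenAI chat-completions response shape. usage_field is a hypothetical helper and the sample JSON is made up for illustration, not real output.

```shell
# usage_field NAME: read a chat-completions JSON response on stdin and
# print the integer value of NAME (for example total_tokens).
# A sed one-liner, good enough for a flat usage object; prefer jq if you have it.
usage_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Made-up sample response (illustrative values only)
sample='{"model":"qwen-coder","usage":{"prompt_tokens":14,"completion_tokens":20,"total_tokens":34}}'

for field in prompt_tokens completion_tokens total_tokens; do
  printf '%s=%s\n' "$field" "$(printf '%s\n' "$sample" | usage_field "$field")"
done
```

To use it on a real call, add -s to the curl command above, pipe its output into usage_field total_tokens, and compare the number with the trace.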
5

Wire up the coding agent

Any OpenAI-compatible agent works — just point its base URL at LiteLLM (:4000), not the model.

agent

The golden rule: the agent talks only to LiteLLM. That's how every token lands in Langfuse — the agent never knows (or cares) whether vLLM or LM Studio answered.

OpenCode — install & point at the gateway
# install
curl -fsSL https://opencode.ai/install | bash

# tell it to use your local gateway (key = your LiteLLM master key)
export OPENAI_API_BASE=http://localhost:4000/v1
export OPENAI_API_KEY=sk-local-master

opencode
In OpenCode, pick the model named qwen-coder (the model_name from your LiteLLM config).

Hermes — point its model at the gateway and register the master key in its credential pool (it won't read a plain env var if a pooled credential already exists):

Hermes — route through LiteLLM + auth
# ~/.hermes/config.yaml  → model.base_url: http://127.0.0.1:4000/v1

# register the gateway's master key in Hermes' credential pool
hermes auth add lmstudio --type api-key \
  --label litellm-master --api-key sk-local-master

# run a one-shot through the gateway (every call lands in Langfuse)
hermes -z "write a haiku about local LLMs" -m qwen-coder --yolo
Heads-up (resources): the full Langfuse v3 stack (ClickHouse + MinIO + Redis + Postgres) plus a local model is heavy. On a RAM-tight machine, run one model at a time and expect ClickHouse trace queries to lag if memory is maxed.
6

Where do I see my data? (two screens, two jobs)

You have two dashboards, and they answer different questions: LiteLLM is the place for spend, keys and budgets; Langfuse is the place for full step-by-step traces.

read it

Every call flows agent → LiteLLM → model, and LiteLLM fires a copy to Langfuse. So the same request shows up in both places, but each tool keeps a different slice of it.

LiteLLM UI — :4000/ui

The operations view: who spent what, which keys exist, requests per model.

Log in (admin/admin), then:

  • Usage — spend, request & token counts per model/key/day
  • Virtual Keys — issue scoped keys with budgets & rate limits
  • Models — what's registered & live health
  • Logs — a flat list of recent requests (status, latency, cost)

Langfuse — :3000

The deep view: the actual prompt, the actual response, step by step.

Log in (the email & password from your step 2 .env), then:

  • Tracing → Traces — every call: full prompt + completion text
  • Click a trace — tokens in/out, latency, model, cost per call
  • Sessions — multi-turn agent runs grouped together
  • Dashboards — latency & volume trends over time
Rule of thumb: “How much did I spend / which key?” → LiteLLM. “What exactly did the model say, and how long did it take?” → Langfuse. Traces only appear once you've sent a call (step 4).

Backend (vLLM / LM Studio) → :8000 / :1234  ·  LiteLLM gateway → :4000  ·  Langfuse UI → :3000
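
A quick way to see which of these ports is actually answering, sketched under the assumption that curl is installed. http_up is a hypothetical helper; it only checks that something speaks HTTP on the port, not that the service is healthy.

```shell
# http_up PORT: succeeds if anything answers HTTP on localhost:PORT
http_up() {
  curl -s -o /dev/null --max-time 2 "http://localhost:$1/"
}

# name:port pairs taken from the port summary above
for entry in "vLLM backend:8000" "LM Studio backend:1234" \
             "LiteLLM gateway:4000" "Langfuse UI:3000"; do
  name=${entry%:*}     # text before the last colon
  port=${entry##*:}    # text after the last colon
  if http_up "$port"; then state=up; else state=down; fi
  printf '%-24s :%-5s %s\n' "$name" "$port" "$state"
done
```

Only one backend runs per machine, so one of the first two rows is expected to read down.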

Follow one message, end to end

From your keystroke to the Langfuse trace

You send a message · agent → :4000

Open your agent and send a prompt. It talks only to LiteLLM — never to the model directly.

OpenCode
opencode
# then type your prompt and hit Enter
Hermes (one-shot)
hermes -m qwen-coder \
  -z "Reverse a string in Python."
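What you'll see: the answer streams back in your terminal. Which model? In this example setup, qwen-coder → Ollama (works now); qwen3.5-…-mlx → LM Studio (load a model in its app first). With the config.yaml from step 3, qwen-coder goes to vLLM (:8000) or LM Studio (:1234) instead.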

LiteLLM routes it · gateway · :4000

The gateway reads the model name you sent and picks the backend — same request shape, different engine. This single choke point is why every token is observable: routing and the Langfuse callback both fire right here.

  • qwen-coder / seed-coder  →  Ollama (:11434)
  • qwen3.5-…-mlx  →  LM Studio (:1234)
What you'll see: nothing to click — routing is instant. But the gateway has already fired a copy of this call to Langfuse (the success_callback in config.yaml).
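
Conceptually, routing is a lookup from the requested model name to a backend address. The sketch below mimics that idea in shell. It is not LiteLLM's code: route_backend is a hypothetical name, and the glob stands in for the LM Studio model name that this walkthrough elides.

```shell
# route_backend MODEL: print the backend base URL for a model name,
# or fail if no route is configured (illustration only).
route_backend() {
  case "$1" in
    qwen-coder|seed-coder) echo "http://localhost:11434" ;;  # Ollama
    qwen3.5-*-mlx)         echo "http://localhost:1234"  ;;  # LM Studio
    *)                     echo "no route for model: $1" >&2; return 1 ;;
  esac
}

route_backend qwen-coder
route_backend seed-coder
route_backend unknown-model || echo "rejected"
```

In the real gateway this table is the model_list in config.yaml, where each model_name maps to an api_base.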

The engine generates · Ollama :11434 · LM Studio :1234

Your local model does the actual inference, on your own hardware. Nothing leaves the machine.

What you'll see: in LM Studio, open the Developer (Local Server) tab — a line Received POST /v1/chat/completions appears the instant you sent the prompt. Ollama logs to its own console. This is the proof the gateway routed your request into the engine.

Watch it in LiteLLM · :4000/ui · admin / admin

Open the gateway dashboard and log in (username admin, password admin — type exactly five letters).

  • Logs → the newest row is your call. Click it for the full request & response + token counts.
  • Usage → spend, requests & tokens per model / key / day.
What you'll see: your prompt, the reply, prompt / completion / total tokens, latency, and which backend answered.

See the deep trace in Langfuse · :3000 · Tracing

Open Langfuse, log in, pick your project, then click Tracing in the left sidebar.

  • The top row (newest first) is your call — click it.
  • Full input + output text, model, latency, tokens in/out, cost.
What you'll see: the same call you saw in LiteLLM, but with the full prompt/response and a timing span. If it shows in both places, the spine works end to end. ✅ Cost reads ~$0 for local models — that's correct.