LM Studio

The LM Studio provider routes the spawned claude CLI through Claudette's in-process gateway. The gateway forwards Claude Code's Anthropic-shape requests to LM Studio's native /v1/messages endpoint (added in LM Studio 0.4.1+). There's no wire-format translation: the request body goes through verbatim, and 2xx response bytes stream straight back to the CLI.

The gateway is in the path because LM Studio classifies hard input errors like context-window overflow as HTTP 500 with an Anthropic-shape body whose error.type is api_error. That’s a transient classification — the Anthropic SDK retries it with exponential backoff — so a permanent input failure ends up as a multi-minute spinner with no surfaced error unless we demote the status code on the way back. Claudette’s gateway does exactly that: 2xx pass through unchanged, non-2xx with permanent-failure-shaped messages are rewritten to 4xx so the SDK fails fast and the user sees the actual upstream message.

The capability profile is the local-first one (no extended thinking / effort / fast mode / 1M-context auto-upgrade in the chat header) — the loaded model and LM Studio’s tokenizer do whatever they support, but Claudette doesn’t expose those Anthropic-only knobs. Tool use and vision work whenever the underlying model handles them.

  1. Make sure the Alternative Claude Code backends experimental flag is on.
  2. Install LM Studio from lmstudio.ai.
  3. Install the lms CLI (bundled with LM Studio; verify with lms --version).
  4. Start the local server:

```sh
lms server start --port 1234
```

The server defaults to port 1234; pass --port if you need a different one. Confirm it’s reachable:

```sh
curl http://localhost:1234/v1/models
```

You should get back a JSON {"data":[…]} payload listing whatever you’ve loaded into LM Studio.

LM Studio doesn’t auto-load models — pick one via the desktop UI’s My Models tab and load it, or use the CLI:

```sh
lms get qwen2.5-coder-7b-instruct    # tool-tuned coding model
lms load qwen2.5-coder-7b-instruct   # load it into VRAM
```

Quantized GGUF builds give the best speed-to-quality trade-off on consumer hardware. Tool-tuned models (Qwen Coder, Llama Instruct, Mistral Instruct) handle Claude-style tool calls; chat-only models will likely ignore tool definitions.

Pick a context length that fits Claude Code

LM Studio defaults newly-loaded models to 4096 tokens of context regardless of what the underlying model supports. That’s a conservative VRAM floor — but Claude Code’s system prompt + bundled tool definitions alone are around 40–55 k tokens before you’ve typed anything, so a 4 k context will hard-fail every send.

Drag the Context Length slider in LM Studio’s My Models panel for the loaded model and reload it. Recommended floors:

| Conversation type | Minimum loaded context |
| --- | --- |
| Ping-style smoke tests (system prompt + tools, ≤ 1 turn) | 64 k |
| Light coding tasks (a few file reads + edits) | 96 k |
| Heavy MCP tool usage (Playwright / browser sessions, lots of tool defs) | 128 k+ |
| Long sessions with substantial transcript history | 192 k+ |

The model’s max_context_length is the absolute ceiling (whatever the underlying weights support — often 256 k for newer Qwen / Llama builds). The loaded_context_length is what you’ve actually committed VRAM to, and that’s what Claudette’s pre-flight check measures against. If your machine can’t fit the recommended floor, switch to a smaller model with a longer-context build (e.g. a 7B Qwen at 128 k beats a 35B Qwen capped at 4 k for Claude Code workflows).
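
As a sketch of the comparison the pre-flight gate makes (names and error wording here are illustrative, not Claudette's real API):

```rust
/// Illustrative pre-flight gate: compare the estimated prompt size against the
/// context the model was actually loaded with, not its theoretical maximum.
fn preflight_context_check(
    estimated_prompt_tokens: u64,
    loaded_context_length: u64,
    max_context_length: u64,
) -> Result<(), String> {
    if estimated_prompt_tokens <= loaded_context_length {
        return Ok(());
    }
    // Distinguish "reload with a bigger slider value" from "this model can't do it".
    if estimated_prompt_tokens <= max_context_length {
        Err(format!(
            "prompt (~{estimated_prompt_tokens} tokens) exceeds the loaded context \
             ({loaded_context_length} tokens); raise the Context Length slider and reload"
        ))
    } else {
        Err(format!(
            "prompt (~{estimated_prompt_tokens} tokens) exceeds the model's maximum \
             context ({max_context_length} tokens); switch to a longer-context model"
        ))
    }
}
```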

You can change the slider mid-session — Claudette polls LM Studio every ~8 seconds while at least one LM Studio backend is enabled, so the composer’s token-capacity indicator and the pre-flight gate update automatically without needing a Settings → Refresh click.
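
Reduced to its shape, the poll looks roughly like this (a minimal sketch; fetch_loaded_context stands in for the real /api/v0/models query):

```rust
use std::{thread, time::Duration};

// Hypothetical stand-in for querying /api/v0/models and reading
// the current loaded_context_length.
fn fetch_loaded_context() -> Option<u64> {
    Some(131_072)
}

fn poll_lm_studio(backend_enabled: impl Fn() -> bool) {
    // Re-read the loaded context roughly every 8 seconds while at least one
    // LM Studio backend is enabled, so the composer's capacity indicator and
    // the pre-flight gate track slider changes without a manual refresh.
    while backend_enabled() {
        if let Some(ctx) = fetch_loaded_context() {
            println!("loaded_context_length is now {ctx}");
        }
        thread::sleep(Duration::from_secs(8));
    }
}
```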

  1. Open Settings > Models. The lm-studio backend appears in the list.
  2. The base URL is pre-filled to http://localhost:1234. Change it if your server listens elsewhere (different port, LAN host, etc.). Don’t include the /v1 suffix — Claudette appends it.
  3. (Optional) Add a bearer token in the API key field if you’ve put LM Studio behind an authenticating proxy. Local installs don’t need one — Claudette substitutes a placeholder so the upstream still receives an Authorization header.
  4. Click Test — Claudette pings the server to confirm it’s reachable and counts loaded models.
  5. Click Refresh models — Claudette queries LM Studio’s /api/v0/models (and falls back to /v1/models on older builds), populating the model list with each model’s loaded_context_length / max_context_length.
  6. (Optional) Set a default model.
  7. Toggle Enabled on. The provider is now selectable in the chat header.
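
The endpoint fallback from step 5, sketched with a hypothetical http_get_json helper (the real client and response types are Claudette internals):

```rust
/// Hypothetical helper: GET `url` and return the raw JSON body on a 2xx response.
fn http_get_json(url: &str) -> Result<String, String> {
    Err(format!("sketch only: no HTTP client wired up for {url}"))
}

/// Prefer /api/v0/models, which reports each model's loaded_context_length and
/// max_context_length; fall back to the OpenAI-compatible /v1/models on older
/// LM Studio builds that predate the v0 endpoint.
fn discover_models(base_url: &str) -> Result<String, String> {
    http_get_json(&format!("{base_url}/api/v0/models"))
        .or_else(|_| http_get_json(&format!("{base_url}/v1/models")))
}
```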

Once enabled, the model picker in the chat header includes LM Studio models alongside Claude. Pick one and the next turn runs through LM Studio.

The chat header hides the Extended thinking, Effort, Fast mode, and 1M-context toggles when an LM Studio model is active — AgentBackendConfig::builtin_lm_studio() declares those capabilities as false (gateway profile). Tool use and vision stay available, but actual support depends on the loaded model.

  • Tool use — works on tool-tuned models (Qwen Coder, Llama Instruct, etc.). Plain chat models may ignore tool definitions silently.
  • Vision — works on multimodal models (LLaVA, MiniCPM-V, Qwen2-VL). Text-only models drop image attachments without complaining.
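
The declaration behind this, shown illustratively (the struct and field names are assumptions; only builtin_lm_studio() comes from Claudette):

```rust
/// Illustrative capability profile for the LM Studio backend.
/// The struct and field names are assumptions, not Claudette's real types.
struct Capabilities {
    extended_thinking: bool,
    effort: bool,
    fast_mode: bool,
    context_1m_upgrade: bool,
    tool_use: bool,
    vision: bool,
}

fn builtin_lm_studio() -> Capabilities {
    Capabilities {
        // Anthropic-only knobs: declared off, so the chat header hides them.
        extended_thinking: false,
        effort: false,
        fast_mode: false,
        context_1m_upgrade: false,
        // Left on; actual support depends on the loaded model.
        tool_use: true,
        vision: true,
    }
}
```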

If a model misbehaves, try switching to a different one before assuming Claudette is broken.

LM Studio uses Claudette’s in-process gateway with an Anthropic-shape pass-through (NOT the OpenAI-Responses translation path the OpenAI / Codex backends use):

  1. Claudette spins up an HTTP listener on 127.0.0.1:0 per backend and exports ANTHROPIC_BASE_URL=<gateway>, ANTHROPIC_AUTH_TOKEN=<random>, CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1, and CLAUDE_CODE_ATTRIBUTION_HEADER=0 on the spawned claude subprocess.
  2. The CLI POSTs /v1/messages to the gateway. The gateway recognizes LM Studio backends and forwards the request body verbatim to http://localhost:1234/v1/messages — LM Studio 0.4.1+ speaks the Anthropic Messages API natively, so no translation is needed.
  3. Successful (2xx) responses stream through unchanged. The gateway mirrors LM Studio’s Content-Type and writes upstream bytes to the CLI as they arrive — per-chunk SSE events flow straight through without buffering, so first-token latency stays close to LM Studio’s own TTFT.
  4. Non-2xx responses are intercepted. The gateway parses the upstream body (LM Studio returns Anthropic-shape errors), checks the message text, and demotes any HTTP 5xx whose body matches a permanent-failure pattern (context length, model not loaded, etc.) to HTTP 400 with error.type = invalid_request_error. Without that demotion the Anthropic SDK retries 5xx with exponential backoff and the user sees a multi-minute spinner instead of the actual error.
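
The demotion rule from step 4, reduced to a sketch (the pattern list and function names are illustrative, not Claudette's real matcher):

```rust
/// Does the upstream error message describe a permanent input failure?
/// The pattern list here is illustrative, not Claudette's real matcher.
fn is_permanent_failure(upstream_message: &str) -> bool {
    const PATTERNS: &[&str] = &["context length", "model not loaded"];
    let msg = upstream_message.to_lowercase();
    PATTERNS.iter().any(|p| msg.contains(p))
}

/// Map the upstream status plus error message to what the CLI sees.
fn downstream_status(upstream_status: u16, upstream_message: &str) -> u16 {
    if upstream_status >= 500 && is_permanent_failure(upstream_message) {
        // A 5xx that is really a bad request: demote to 400 (with error.type
        // rewritten to invalid_request_error) so the SDK fails fast instead
        // of retrying with exponential backoff.
        400
    } else {
        // 2xx never reaches this path (bytes stream through untouched);
        // other errors are forwarded as-is.
        upstream_status
    }
}
```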

CLAUDE_CODE_ATTRIBUTION_HEADER=0 keeps Claude Code from adding a rotating per-request user-attribution header. With the header on, the request prefix changes every turn and LM Studio’s KV-cache prefix matching is invalidated — a documented ~90% performance regression on local backends. LM Studio doesn’t bill anything, so the header is pure overhead.
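
The KV-cache angle in one toy example: prefix reuse is longest-common-prefix matching over the token stream, so a header that changes every request invalidates the cache at token zero:

```rust
/// Toy model of prefix caching: the server can skip prompt processing for the
/// longest common prefix between the cached and incoming token streams.
fn reusable_prefix_len(cached: &[u32], incoming: &[u32]) -> usize {
    cached.iter().zip(incoming).take_while(|(a, b)| a == b).count()
}

fn main() {
    let cached = [7u32, 7, 100, 101, 102, 103]; // header tokens + system prompt
    let same_header = [7u32, 7, 100, 101, 102, 103, 104];
    let rotated_header = [9u32, 9, 100, 101, 102, 103, 104];

    assert_eq!(reusable_prefix_len(&cached, &same_header), 6); // full prefix reuse
    assert_eq!(reusable_prefix_len(&cached, &rotated_header), 0); // busted at token 0
}
```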

As noted above, the chat header hides Extended thinking, Effort, Fast mode, and 1M-context for LM Studio models because AgentBackendConfig::builtin_lm_studio() declares those capabilities as false. The toggles map cleanly onto Anthropic's API, but local models implement them inconsistently or not at all, and silently swallowing a toggle the user set would be worse than hiding the controls.

Ollama doesn’t use Claudette’s gateway: ANTHROPIC_BASE_URL points the CLI straight at Ollama’s /v1/messages. We tried the same setup for LM Studio briefly — the architecture is wire-compatible — but LM Studio classifies hard input errors like context-window overflow as HTTP 500 with an Anthropic-shape body whose error.type is api_error. The Anthropic SDK reads that as transient and retries with backoff; users get a multi-minute spinner with no surfaced error.

The thin gateway pass-through gets us the best of both: 2xx flows through with the same TTFT as direct routing (no translation, no buffering), and 5xx-with-permanent-failure-message gets demoted to 4xx so the error reaches the user immediately.

LM Studio also exposes /v1/responses (OpenAI’s newer Responses API). Earlier iterations of this integration used that path through Claudette’s gateway with full Anthropic ↔ OpenAI Responses translation, but it had two structural problems:

  • Translation overhead: the translation layer buffered the full response before sending anything back, so first-token latency equaled total generation time even when LM Studio's TTFT was sub-second.
  • Lost streaming: OpenAI Responses streaming uses event: response.output_text.delta events. Translating that incrementally into Anthropic content_block_delta events while preserving tool-call streaming, reasoning blocks, and stop reasons is non-trivial and a regression risk.

Pass-through against LM Studio’s native /v1/messages skips all of that and only intervenes where it has to (status-code translation on errors). The OpenAI Responses gateway path is still used for OpenAI API and Codex Subscription where there’s no Anthropic-shaped alternative.

  • LM Studio runs on GPU when available (Metal on macOS, CUDA on Linux/Windows with the right drivers); falls back to CPU otherwise.
  • Large models need substantial VRAM/RAM. Watch LM Studio’s resource gauge while a turn runs.
  • Concurrent agents amplify pressure. Keep parallelism low (1–2 workspaces at a time) when running locally, especially if the model isn’t fully GPU-resident.
  • LM Studio caches loaded models in memory; the first turn after a load is slower. Subsequent turns reuse the loaded weights and are much faster.

Test fails with “LM Studio is not reachable” — the server isn’t running, or it’s bound to a different port. Run lms server start --port 1234 and confirm curl http://localhost:1234/v1/models works first.

Refresh models returns nothing — you haven’t loaded any models in LM Studio yet. Open the desktop app’s My Models tab (or lms load <model>) and click Refresh again.

Agent immediately errors with “no models” — same as above; LM Studio reports an empty model list when nothing is loaded.

Tool calls get ignored — the loaded model isn’t tool-tuned. Switch to a model with built-in tool-use support (Qwen Coder, Llama Instruct).

Send fails with 400 Bad Request mentioning context length — LM Studio rejected the request because the model's loaded_context_length is smaller than Claude Code's prompt + tool defs. Bump the Context Length slider in LM Studio's My Models panel and reload the model; see Pick a context length that fits Claude Code.

Agent hangs on the first turn — LM Studio is loading the model into VRAM. First-turn latency on a cold model can be 10–60 seconds depending on size. Subsequent turns are much faster.

Composer shows a stale context size after reloading the LM Studio model — Claudette polls LM Studio’s /api/v0/models every ~8 seconds while an LM Studio backend is enabled, so the indicator should refresh on its own. If it doesn’t, click Refresh models in Settings > Models for an immediate update. The polling effect is gated behind the experimental flag — disabling alternative backends pauses the poll.

Local turns are slower than they should be — make sure CLAUDE_CODE_ATTRIBUTION_HEADER=0 is in effect (Claudette sets it automatically for LM Studio and Ollama; verify by checking the agent’s terminal tab env output). With the header on, every request gets a rotating attribution string that breaks LM Studio’s KV-cache prefix matching and re-runs prompt processing from scratch.