📖 The AI Tool Bible

oMLX

Native macOS LLM inference server built on MLX, with paged SSD KV caching for Apple Silicon agents.

Free· Free, Apache 2.0 open sourceCodingMulti-model (Qwen, Llama, Mistral, Gemma, DeepSeek, MiniMax, GLM)
Visit website →
Best for

Pick oMLX if you run Claude Code, Cursor or other long-context coding agents locally on a beefy Apple Silicon Mac and want sub-5s TTFT.

Skip if

Skip it if you are on Linux, Windows, NVIDIA, or just want a hosted API - oMLX is Apple Silicon-only and you bring the hardware.

oMLX is a native macOS inference server built on Apple's MLX framework, packaged as a signed menu-bar app with a web dashboard. Its headline trick is paged SSD KV caching: a two-tier hot-RAM / cold-SSD architecture in safetensors format that persists previously seen prefixes across requests and even server restarts, so coding agents on long contexts cut TTFT from the typical 30-90 seconds you see in Ollama or LM Studio down to under five seconds. It also adds continuous batching via mlx-lm's BatchGenerator (up to ~4.14x speedup at 8x concurrency) and multi-model hosting with LRU eviction for LLMs, VLMs, embeddings and rerankers loaded at once.

The target user is obvious: developers running Claude Code, Cursor, OpenClaw or other agentic coding tools locally on an M-series Mac (ideally an M3 Ultra, though M1+/macOS 15+/64GB is the realistic floor). It exposes both OpenAI-compatible /v1/chat/completions and a native Anthropic /v1/messages endpoint, with a one-click config generator that emits the exact CLI command for each downstream tool. Tool calling is broad - JSON, Qwen, Gemma, GLM, MiniMax formats plus MCP - and reasoning models get automatic <think> tag handling.

oMLX is Apache 2.0 and free to download as a DMG or build from source; there is no paid tier advertised. It reuses an existing LM Studio model directory and has a built-in HuggingFace downloader, so onboarding is painless if you already run local models. Caveat: it is Apple Silicon only - no Linux, no Windows, no NVIDIA - and the published benchmarks are skewed toward a 512GB M3 Ultra, which is not what most readers actually own.

Editor's take

The SSD-paged KV cache is the real story here - it directly fixes the recompute-on-context-shift pain that makes Ollama and LM Studio painful for agents. If you have an M-series Mac with 64GB+ and you live in Claude Code, oMLX is the most credible local backend we have seen this year. Just temper your expectations against the M3 Ultra benchmarks.

— The AI Tool Bible editorial team

Pros

  • Paged SSD KV cache slashes agent TTFT from 30-90s to <5s on long contexts
  • Drop-in OpenAI and native Anthropic /v1/messages endpoints for Claude Code, Cursor, OpenClaw
  • Continuous batching delivers ~4.14x generation speedup at 8x concurrency
  • Native signed/notarized menu-bar app (not Electron) with web dashboard
  • Apache 2.0, reuses your existing LM Studio model directory

Cons

  • ⚠️ Apple Silicon and macOS 15+ only - no Linux, Windows or NVIDIA
  • ⚠️ Best benchmarks assume an M3 Ultra 512GB few readers actually own
  • ⚠️ Young project (VLM support only since v0.2.0) - feature surface still maturing
  • ⚠️ No hosted/cloud option; you supply the hardware

Use cases

local-llm-inferencecoding-agentsapple-siliconopenai-compatible-apimlx

Explore related

Compare with similar tools

All in Coding