oMLX
Native macOS LLM inference server built on MLX, with paged SSD KV caching for Apple Silicon agents.
Pick oMLX if you run Claude Code, Cursor or other long-context coding agents locally on a beefy Apple Silicon Mac and want sub-5s TTFT.
Skip it if you are on Linux, Windows, NVIDIA, or just want a hosted API - oMLX is Apple Silicon-only and you bring the hardware.
oMLX is a native macOS inference server built on Apple's MLX framework, packaged as a signed menu-bar app with a web dashboard. Its headline trick is paged SSD KV caching: a two-tier hot-RAM / cold-SSD architecture in safetensors format that persists previously seen prefixes across requests and even server restarts, so coding agents on long contexts cut TTFT from the typical 30-90 seconds you see in Ollama or LM Studio down to under five seconds. It also adds continuous batching via mlx-lm's BatchGenerator (up to ~4.14x speedup at 8x concurrency) and multi-model hosting with LRU eviction for LLMs, VLMs, embeddings and rerankers loaded at once.
The target user is obvious: developers running Claude Code, Cursor, OpenClaw or other agentic coding tools locally on an M-series Mac (ideally an M3 Ultra, though M1+/macOS 15+/64GB is the realistic floor). It exposes both OpenAI-compatible /v1/chat/completions and a native Anthropic /v1/messages endpoint, with a one-click config generator that emits the exact CLI command for each downstream tool. Tool calling is broad - JSON, Qwen, Gemma, GLM, MiniMax formats plus MCP - and reasoning models get automatic <think> tag handling.
oMLX is Apache 2.0 and free to download as a DMG or build from source; there is no paid tier advertised. It reuses an existing LM Studio model directory and has a built-in HuggingFace downloader, so onboarding is painless if you already run local models. Caveat: it is Apple Silicon only - no Linux, no Windows, no NVIDIA - and the published benchmarks are skewed toward a 512GB M3 Ultra, which is not what most readers actually own.
The SSD-paged KV cache is the real story here - it directly fixes the recompute-on-context-shift pain that makes Ollama and LM Studio painful for agents. If you have an M-series Mac with 64GB+ and you live in Claude Code, oMLX is the most credible local backend we have seen this year. Just temper your expectations against the M3 Ultra benchmarks.
— The AI Tool Bible editorial team
Pros
- ✅ Paged SSD KV cache slashes agent TTFT from 30-90s to <5s on long contexts
- ✅ Drop-in OpenAI and native Anthropic /v1/messages endpoints for Claude Code, Cursor, OpenClaw
- ✅ Continuous batching delivers ~4.14x generation speedup at 8x concurrency
- ✅ Native signed/notarized menu-bar app (not Electron) with web dashboard
- ✅ Apache 2.0, reuses your existing LM Studio model directory
Cons
- ⚠️ Apple Silicon and macOS 15+ only - no Linux, Windows or NVIDIA
- ⚠️ Best benchmarks assume an M3 Ultra 512GB few readers actually own
- ⚠️ Young project (VLM support only since v0.2.0) - feature surface still maturing
- ⚠️ No hosted/cloud option; you supply the hardware
Use cases
Explore related
Compare with similar tools
All in Coding →Cursor
FeaturedAI-first VS Code fork — chat, edit, and agentic coding in one IDE.
GitHub Copilot
FeaturedThe original AI pair programmer, now with chat and agents.
Replit Agent
FeaturedBuild & deploy a full app from a single prompt.
Aider
Terminal-based AI pair programmer that writes commits.
Codeium
Free, fast AI autocomplete + chat across 70+ editors.
Cody
Sourcegraph's AI coding assistant — codebase-aware via their search index.