Voicebox
Open-source desktop voice studio for local cloning, dictation, and giving MCP agents a voice.
Pick Voicebox if you want a local, open-source voice studio that handles cloning and dictation and gives MCP coding agents a voice, all on your own GPU.
Skip it if you need a hosted API, lack a capable GPU, or want studio-grade TTS quality on par with mature commercial vendors.
Voicebox is an open-source desktop app (macOS, Windows, Linux) that bundles voice cloning, multi-engine TTS, Whisper-powered dictation, and an MCP server into one local-first package. It clones a target voice from as little as three seconds of audio (upload, mic, or system capture), runs inference on Metal, CUDA, ROCm, Intel Arc, or DirectML, and exposes a voicebox.speak tool so Claude Code, Cursor, Cline, and other MCP-aware agents can talk back in voices you own.
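For a sense of what an agent sends when it calls voicebox.speak, here is a minimal sketch of the MCP JSON-RPC request. The `tools/call` method and envelope follow the MCP specification; the argument names `text` and `voice` are assumptions for illustration, not Voicebox's documented schema, and `build_speak_call` is a hypothetical helper.

```python
import json


def build_speak_call(text: str, voice: str, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request for the voicebox.speak tool.

    The argument keys ("text", "voice") are hypothetical; the tool's real
    input schema is whatever the server reports via tools/list.
    """
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "voicebox.speak",
            "arguments": {"text": text, "voice": voice},
        },
    }
    return json.dumps(message)
```

In practice an MCP client such as Claude Code or Cursor builds and sends this message for you; the sketch only shows the shape of the call.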
It pitches itself directly as a free, local alternative to ElevenLabs and WisprFlow, and the feature surface backs that up: seven TTS engines, a Stories timeline editor for multi-voice narratives, an audio effects pipeline (pitch, reverb, delay, compression) with per-profile defaults, generations up to 50,000 characters with auto-chunking and crossfades, and a global push-to-talk dictation pill that drops cleaned transcripts into any app. A local LLM (Qwen) optionally scrubs ums and self-corrections without rephrasing.
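To make "auto-chunking and crossfades" concrete, here is a rough sketch of the general technique: split long text at sentence boundaries, synthesize each chunk, then blend the seams with a linear crossfade. This is not Voicebox's actual implementation; the function names, the 400-character limit, and the linear fade curve are all assumptions chosen for illustration.

```python
import re


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Group sentences into chunks of roughly max_chars or fewer.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample lists, linearly fading a out and b in over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the overlap
        mixed.append(tail[i] * (1 - w) + b[i] * w)
    return head + mixed + b[overlap:]
```

Each chunk would be passed to the TTS engine separately and the resulting audio buffers joined pairwise with `crossfade`, which is why the joined clip is shorter than the two inputs by `overlap` samples.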
Voicebox is free and open-source, with its source on GitHub and a $VOICEBOX token/donation option attached to the project. Beyond MCP it also exposes a POST /speak HTTP endpoint for ACP, A2A, shell scripts, and custom harnesses. The catch is that it's a desktop-only workflow that wants real GPU horsepower for the larger Whisper and TTS models, and the celebrity-voice presets it ships with raise the usual ethical and consent questions around cloning real people.
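For scripts and custom harnesses, a call to the POST /speak endpoint might look like the sketch below. Only the method and path come from the description above; the JSON body keys (`text`, `voice`) and the helper name are hypothetical, and the base URL is left as a parameter because the host and port depend on your install.

```python
import json
import urllib.request


def build_speak_request(base_url: str, text: str, voice: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to the /speak endpoint.

    The body keys ("text", "voice") are hypothetical placeholders; check the
    project's own docs for the real request schema.
    """
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/speak",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=30)` with whatever address your local Voicebox instance listens on.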
Voicebox is one of the more ambitious local-first voice projects we've seen — cloning, dictation, MCP agent speech, and a stories editor in one open-source desktop app. It's early (v0.2.0) and leans hard on your hardware, but the MCP integration alone makes it worth a look for anyone living in Claude Code or Cursor.
— The AI Tool Bible editorial team
Pros
- ✅ Fully local inference on Metal, CUDA, ROCm, Intel Arc, or DirectML
- ✅ Clones a voice from as little as 3 seconds of audio
- ✅ MCP server lets Claude Code, Cursor, and Cline speak in cloned voices
- ✅ Bundles seven TTS engines, Whisper dictation, and a multi-track editor
- ✅ Open source with Mac, Windows, and Linux builds
Cons
- ⚠️ Desktop-only — no hosted/cloud option for non-GPU users
- ⚠️ Quality scales with local hardware; smaller models give up some fidelity
- ⚠️ Shipped celebrity voice presets invite obvious consent concerns
- ⚠️ Young project (v0.2.0); expect rough edges
Compare with similar tools
ElevenLabs (Featured)
The gold standard for AI voice cloning and TTS.
Suno (Featured)
Text-to-song AI: full vocal tracks from a prompt.
Udio
Suno's main rival for AI-generated full songs.
AssemblyAI
Speech-to-text API with diarisation, summarisation, and topic detection.
Whisper
OpenAI's open-source speech-to-text — the de-facto baseline.
Resemble.ai
Enterprise voice cloning with deepfake-detection layer.