📖 The AI Tool Bible

Voicebox

Open-source desktop voice studio for local cloning, dictation, and giving MCP agents a voice.

Pricing: Free (free and open source; optional $VOICEBOX token donations)
Category: Audio
Models: Multi-model (Chatterbox, Qwen TTS, Whisper, etc.)
Best for

Pick Voicebox if you want a local, open-source voice studio that handles cloning, dictation, and giving MCP coding agents a voice — all on your own GPU.

Skip if

Skip it if you need a hosted API, lack a capable GPU, or want studio-grade TTS quality on par with mature commercial vendors.

Voicebox is an open-source desktop app (macOS, Windows, Linux) that bundles voice cloning, multi-engine TTS, Whisper-powered dictation, and an MCP server into one local-first package. It clones a target voice from as little as three seconds of audio (upload, mic, or system capture), runs inference on Metal, CUDA, ROCm, Intel Arc, or DirectML, and exposes a voicebox.speak tool so Claude Code, Cursor, Cline, and other MCP-aware agents can talk back in voices you own.
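For a sense of what an agent actually sends when it uses that tool, here is a minimal sketch of an MCP tool invocation. The `tools/call` method and the `name`/`arguments` envelope come from the MCP specification, and `voicebox.speak` is the tool name quoted above. The argument names `text` and `voice` are assumptions for illustration; the real tool schema may differ, so check what the server advertises via `tools/list`.

```python
import json


def build_speak_call(text: str, voice: str, request_id: int = 1) -> str:
    """Return a JSON-RPC 2.0 `tools/call` message as an MCP client would send it."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",  # standard MCP method for invoking a tool
        "params": {
            "name": "voicebox.speak",  # tool name as quoted in this review
            # Hypothetical argument names; the real schema is not documented here.
            "arguments": {"text": text, "voice": voice},
        },
    }
    return json.dumps(message)


if __name__ == "__main__":
    print(build_speak_call("Build finished.", "narrator"))
```

In practice the agent (Claude Code, Cursor, Cline) builds this message for you; the sketch only shows the shape of the request.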

It pitches itself directly as a free, local alternative to ElevenLabs and WisprFlow, and the feature surface backs that up: seven TTS engines, a Stories timeline editor for multi-voice narratives, an audio effects pipeline (pitch, reverb, delay, compression) with per-profile defaults, generations up to 50,000 characters with auto-chunking and crossfades, and a global push-to-talk dictation pill that drops cleaned transcripts into any app. A local LLM (Qwen) optionally scrubs ums and self-corrections without rephrasing.
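The auto-chunking and crossfade idea is easy to picture in code. The sketch below is a generic illustration of the technique, not Voicebox's implementation: it splits long text at sentence boundaries, then joins two audio segments with a linear crossfade. The function names, the 500-character default, and the use of plain float lists for samples are all assumptions made for this example.

```python
import re


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample lists, fading a out and b in linearly over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    start = len(a) - overlap
    mixed = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # ramps from near 0 toward 1
        mixed.append(a[start + i] * (1 - weight) + b[i] * weight)
    return a[:start] + mixed + b[overlap:]


if __name__ == "__main__":
    print(chunk_text("One. Two. Three.", max_chars=9))
    print(crossfade([1.0] * 4, [0.0] * 4, overlap=2))
```

Each chunk would be synthesized separately, and the resulting audio segments blended at the seams so the joins are less audible.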

Voicebox is free and open source, with a public GitHub repo and optional donations through a $VOICEBOX token. Beyond MCP, it also exposes a POST /speak HTTP endpoint for ACP, A2A, shell scripts, and custom harnesses. The catch is that it's a desktop-only workflow that needs a capable GPU for the larger Whisper and TTS models, and the celebrity-voice presets it ships with raise the usual ethical and consent questions around cloning real people.
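For scripts and custom harnesses, calling the HTTP endpoint might look something like the sketch below. Only the `POST /speak` route comes from the description above. The base URL and port, the JSON body, and the `text` and `voice` field names are hypothetical placeholders, so treat this as a starting shape and confirm the details against the project's own documentation.

```python
import json
import urllib.request


def build_speak_request(
    text: str,
    voice: str,
    base_url: str = "http://127.0.0.1:8000",  # hypothetical host and port
) -> urllib.request.Request:
    """Build (but do not send) a POST request to the /speak endpoint."""
    # Hypothetical payload; the real field names are not documented here.
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/speak",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_speak_request("Tests passed.", "narrator")
    print(request.get_method(), request.full_url)
    # To actually send it, the desktop app must be running locally:
    # with urllib.request.urlopen(request) as response:
    #     print(response.status)
```

Building the request separately from sending it keeps the sketch runnable without a live server.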

Editor's take

Voicebox is one of the more ambitious local-first voice projects we've seen — cloning, dictation, MCP agent speech, and a stories editor in one open-source desktop app. It's early (v0.2.0) and leans hard on your hardware, but the MCP integration alone makes it worth a look for anyone living in Claude Code or Cursor.

— The AI Tool Bible editorial team

Pros

  • Fully local inference on Metal, CUDA, ROCm, Intel Arc, or DirectML
  • Clones a voice from as little as 3 seconds of audio
  • MCP server lets Claude Code, Cursor, Cline speak in cloned voices
  • Bundles seven TTS engines, Whisper dictation, and a multi-track editor
  • Open source with Mac, Windows, and Linux builds

Cons

  • ⚠️ Desktop-only — no hosted/cloud option for non-GPU users
  • ⚠️ Quality scales with local hardware; small models trade fidelity
  • ⚠️ Shipped celebrity voice presets invite obvious consent concerns
  • ⚠️ Young project (v0.2.0) with rough edges likely

Use cases

voice-cloning, text-to-speech, dictation, agent-voices, multi-voice-narration
