Dia
Open-weights 1.6B text-to-dialogue model that generates ultra-realistic multi-speaker conversations in one pass.
Pick Dia if you want a self-hosted, openly licensed model for generating realistic two-speaker dialogue with laughter, sighs, and other nonverbals.
Skip it if you need non-English voices, deterministic single-speaker brand voice out of the box, or a managed API with SLAs.
Dia is a 1.6B-parameter text-to-speech model from Nari Labs, built specifically for generating multi-speaker dialogue rather than single-voice narration. You write a script using [S1] and [S2] tags to mark speaker turns, optionally drop in nonverbal cues like (laughs), (sighs), or (coughs), and Dia renders the whole exchange in one pass with believable prosody, timing, and emotional inflection. Weights are released openly on Hugging Face under Apache 2.0, and the model now ships with a first-party Hugging Face Transformers integration.
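The script format described above is easy to assemble programmatically. Here is a minimal sketch, assuming you hold the conversation as a list of (speaker number, line) pairs. The helper name `build_script` is hypothetical and not part of Dia. Only the [S1]/[S2] tags and the parenthesised nonverbal cues come from the description above.

```python
def build_script(turns):
    """Join (speaker_number, line) pairs into one tagged script string.

    Hypothetical helper for illustration; not part of the Dia package.
    """
    parts = []
    for speaker, line in turns:
        if speaker not in (1, 2):
            raise ValueError(f"speaker must be 1 or 2, got {speaker!r}")
        text = " ".join(line.split())  # collapse stray whitespace and newlines
        if not text:
            raise ValueError("each turn needs some text")
        parts.append(f"[S{speaker}] {text}")
    return " ".join(parts)


script = build_script([
    (1, "Did you hear the new demo?"),
    (2, "I did. (laughs) It fooled me completely."),
])
print(script)
```

The resulting string is what you would pass to the model as the text input. Nonverbal cues stay inline inside a speaker's line.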
What sets Dia apart is the dialogue-first design and the open license at a quality tier that the team benchmarks against ElevenLabs Studio and Sesame CSM-1B. It supports zero-shot voice cloning by prefixing an audio prompt plus its transcript, and a free ZeroGPU Hugging Face Space lets you try it without local hardware. On a single RTX 4090 it runs at roughly 2x realtime in bfloat16 with about 4.4GB of VRAM, which puts it comfortably within reach of consumer GPUs.
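Two of the points above lend themselves to a quick sketch: the transcript prefix used for voice cloning, and what "roughly 2x realtime" means for wall-clock time. The function names below are hypothetical, the 2.0 default is the RTX 4090 figure quoted above, and how the audio clip itself is handed to the model is left out because it depends on the interface you use.

```python
def with_clone_prefix(prompt_transcript, new_script):
    """Prepend the reference clip's transcript to the script to be spoken.

    Hypothetical helper; the reference audio is supplied to the model separately.
    """
    prompt_transcript = prompt_transcript.strip()
    new_script = new_script.strip()
    if not prompt_transcript or not new_script:
        raise ValueError("both the prompt transcript and the new script are required")
    return f"{prompt_transcript} {new_script}"


def estimate_render_seconds(audio_seconds, realtime_factor=2.0):
    """Estimate wall-clock render time for a given length of output audio."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


full_text = with_clone_prefix(
    "[S1] This is the reference clip. [S2] And this is the second voice.",
    "[S1] Now say something new. [S2] Gladly. (laughs)",
)
print(full_text)
print(estimate_render_seconds(90))  # 90 s of dialogue at ~2x realtime
```

At the quoted speed, a 90-second exchange would take about 45 seconds to render. Treat that as a rough planning number, since throughput varies with hardware and precision.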
The caveats are real. Dia is English only. It has not been fine-tuned on a specific voice, so the voice changes from run to run unless you pin a seed or supply an audio prompt. CPU inference is not yet supported, and a hosted commercial version is gated behind a waitlist. For research, podcast prototyping, game NPC dialogue, or any project where you want full local control of a conversational TTS stack, it is one of the strongest open releases in the category.
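Pinning a seed usually means seeding every random number generator in the process before generation. The sketch below assumes a PyTorch-based setup. `pin_seed` is a hypothetical helper, and whether seeding alone is enough to make the voice fully repeatable is an assumption you should check on your own hardware.

```python
import random


def pin_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed.

    Hypothetical helper for illustration; call it before each generation run.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


pin_seed(1234)
first = random.random()
pin_seed(1234)
second = random.random()
print(first == second)
```

Supplying an audio prompt is the other route to a stable voice, and the two can be combined.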
Dia is the most credible open-weights answer to ElevenLabs' dialogue mode we have tested, and the [S1]/[S2] plus nonverbal-tag grammar makes scripting natural. Treat it as a research-grade tool: voice consistency requires seeds or audio prompts, and the disclaimer against impersonation should be taken seriously.
— The AI Tool Bible editorial team
Pros
- ✅ Open weights under Apache 2.0 with first-party Transformers support
- ✅ Multi-speaker [S1]/[S2] dialogue and nonverbal tags in a single pass
- ✅ Zero-shot voice cloning from a short audio prompt plus transcript
- ✅ Runs ~2x realtime on a single RTX 4090 at ~4.4GB VRAM
- ✅ Free Hugging Face ZeroGPU Space to try without local GPU
Cons
- ⚠️ English only; no built-in multilingual support
- ⚠️ Voices drift between runs unless you fix a seed or supply a prompt
- ⚠️ GPU required; CPU inference not yet supported
- ⚠️ Tiny team (1.5 engineers); slower issue turnaround than commercial TTS
Compare with similar tools
- ElevenLabs: The gold standard for AI voice cloning and TTS.
- Suno: Text-to-song AI that produces full vocal tracks from a prompt.
- Udio: Suno's main rival for AI-generated full songs.
- AssemblyAI: Speech-to-text API with diarisation, summarisation, and topic detection.
- Whisper: OpenAI's open-source speech-to-text, the de facto baseline.
- Resemble.ai: Enterprise voice cloning with a deepfake-detection layer.