Dia

Open-weights 1.6B text-to-dialogue model that generates ultra-realistic multi-speaker conversations in one pass.

Free · Free, open weights (Apache 2.0); hosted larger version waitlisted · Audio · Dia-1.6B

Best for

Pick Dia if you want a self-hosted, openly licensed model for generating realistic two-speaker dialogue with laughter, sighs, and other nonverbals.

Skip if

Skip it if you need non-English voices, deterministic single-speaker brand voice out of the box, or a managed API with SLAs.

Dia is a 1.6B-parameter text-to-speech model from Nari Labs, built specifically for generating multi-speaker dialogue rather than single-voice narration. You write a script using [S1] and [S2] tags to mark speaker turns, optionally drop in nonverbal cues like (laughs), (sighs), or (coughs), and Dia renders the whole exchange in one pass with believable prosody, timing, and emotional inflection. Weights are released openly on Hugging Face under Apache 2.0, and the model now ships with a first-party Hugging Face Transformers integration.
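To make the script format concrete, here is a minimal sketch of a helper that assembles a tagged script from a list of turns. The function name `build_script` and the (speaker number, line) tuple layout are our own illustrative choices, not part of Dia. The helper only produces the text you would hand to the model and does not call the model itself.

```python
from typing import Iterable, Tuple


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker, line) pairs into one [S1]/[S2]-tagged script string.

    Hypothetical helper for illustration; Dia only needs the final text.
    """
    parts = []
    for speaker, line in turns:
        if speaker not in (1, 2):
            raise ValueError(f"speaker must be 1 or 2, got {speaker!r}")
        # Nonverbal cues such as (laughs) are written inline in the line text.
        parts.append(f"[S{speaker}] {line.strip()}")
    return " ".join(parts)


script = build_script([
    (1, "Did you try the new dialogue model?"),
    (2, "I did. (laughs) It is better than I expected."),
])
print(script)
```

Keeping script assembly separate from generation makes it easy to review or version the text before spending GPU time on it.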

What sets Dia apart is the dialogue-first design and the open license at a quality tier that the team benchmarks against ElevenLabs Studio and Sesame CSM-1B. It supports zero-shot voice cloning by prefixing an audio prompt plus its transcript, and a free ZeroGPU Hugging Face Space lets you try it without local hardware. On a single RTX 4090 it runs at roughly 2x real time in bfloat16 with about 4.4 GB of VRAM, which puts it comfortably within reach of consumer GPUs.
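On the text side, voice cloning comes down to placing the transcript of the reference clip in front of the new script. The sketch below shows only that concatenation step. The name `with_voice_prompt` is hypothetical, and the check that the transcript starts with a speaker tag is our assumption about sensible input; the reference audio itself is supplied to the model separately and is not shown here.

```python
def with_voice_prompt(prompt_transcript: str, new_script: str) -> str:
    """Prefix the reference clip's transcript to the script to be generated.

    Hypothetical helper; it prepares text only and never touches audio.
    """
    prompt_transcript = prompt_transcript.strip()
    new_script = new_script.strip()
    # Assumption: the transcript is tagged like any other Dia script.
    if not prompt_transcript.startswith(("[S1]", "[S2]")):
        raise ValueError("prompt transcript should start with [S1] or [S2]")
    return f"{prompt_transcript} {new_script}"


full_text = with_voice_prompt(
    "[S1] This is the sentence spoken in the reference clip.",
    "[S1] And this is the new line to render in the same voice.",
)
print(full_text)
```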

The caveats are real: output is English only; the model is not fine-tuned on a specific voice, so the voice drifts from run to run unless you pin a seed or supply an audio prompt; CPU inference is not yet supported; and the hosted commercial version is gated behind a waitlist. For research, podcast prototyping, game NPC dialogue, or any project where you want full local control of a conversational TTS stack, it is one of the strongest open releases in the category.
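Pinning a seed usually means seeding every random number generator in play before you generate. The sketch below is a generic pattern, not Dia's own code: `pin_seed` is a hypothetical name, and NumPy and PyTorch are seeded only if they happen to be installed. Seeding narrows run-to-run variation, but whether the output is fully repeatable still depends on your hardware and library versions.

```python
import random


def pin_seed(seed: int) -> None:
    """Seed the common RNGs so repeated runs start from the same state."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional dependency
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


pin_seed(42)  # call once before each generation you want to reproduce
```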

Editor's take

Dia is the most credible open-weights answer to ElevenLabs' dialogue mode we have tested, and the [S1]/[S2] plus nonverbal-tag grammar makes scripting natural. Treat it as a research-grade tool: voice consistency requires seeds or audio prompts, and the disclaimer against impersonation should be taken seriously.

— The AI Tool Bible editorial team