Sesame
Conversational voice AI aiming to cross the uncanny valley with context-aware, emotionally aware speech.
Pick Sesame if you want state-of-the-art open-weight conversational speech and are willing to self-host CSM for voice agents or research.
Skip it if you need a turnkey hosted TTS API with SLAs, multilingual coverage, or enterprise support today.
Sesame is an AI voice research company building conversational agents that sound like real people rather than the flat, robotic assistants most TTS systems still produce. Its centerpiece is the Conversational Speech Model (CSM), a multimodal transformer that jointly processes text and audio tokens through a semantic backbone and an acoustic decoder operating on RVQ (residual vector quantization) codes. The team describes three sizes (1B, 3B, and 8B backbone parameters) and has made key components, including the 1B checkpoint, available under Apache 2.0 on GitHub, alongside a research preview at app.sesame.com and a mobile signup for a broader consumer product.
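To make "RVQ codes" concrete, here is a toy sketch of residual vector quantization in plain Python: each stage picks the nearest entry in its codebook, and the next stage quantizes whatever error is left. The codebooks, the 2-D "frame", and the function names are made up for illustration; this is not Sesame's audio tokenizer, only the general idea behind stacked codebooks.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Return one code per stage; each stage quantizes the previous residual."""
    codes: List[int] = []
    residual = x
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = tuple(r - c for r, c in zip(residual, codebook[idx]))
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return tuple(out)


# Hypothetical two-stage codebooks: a coarse one, then a finer one.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)],
]

frame = (1.2, 0.1)
codes = rvq_encode(frame, CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
print(codes, approx)  # [1, 1] (1.25, 0.0)
```

The first code carries the coarse shape of the frame and later codes refine it, which is why a model can reasonably predict the first codebook with one network and the remaining codebooks with a smaller one.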
The pitch is 'voice presence': agents you can think out loud with, which pick up on context and respond with human-like prosody. Sesame is aiming this at everyday users rather than call-center automation, and it has a longer-term hardware bet in the form of AI eyewear slated for 2027. Pricing isn't published; the research preview is free and the mobile app is invite-first.
For developers, the interesting part is the open weights and the paper describing how CSM is trained and evaluated (compute amortization on 1/16th of frames; homograph and pronunciation-consistency benchmarks). There is no public commercial API yet, so if you want to build on Sesame today you're working from the open-source release, not a hosted endpoint.
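As a rough illustration of compute amortization, the sketch below assumes the scheme works the way Sesame's write-up is usually read: the first-codebook loss is computed on every frame, while the more expensive decoder loss is computed on a random 1/16 subset of frames. The function names, the per-frame loss callback, and the unweighted sum of the two terms are all hypothetical; this is not Sesame's training code.

```python
import random
from typing import Callable, List, Optional, Sequence


def sample_decoder_frames(
    num_frames: int, fraction: float = 1 / 16, seed: Optional[int] = None
) -> List[int]:
    """Pick a random subset of frame indices for the decoder loss."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    rng = random.Random(seed)
    k = max(1, round(num_frames * fraction))  # always keep at least one frame
    return sorted(rng.sample(range(num_frames), k))


def amortized_loss(
    codebook0_losses: Sequence[float],
    decoder_loss_for_frame: Callable[[int], float],
    seed: Optional[int] = None,
) -> float:
    """Mean first-codebook loss over all frames plus mean decoder loss over the subset."""
    frames = sample_decoder_frames(len(codebook0_losses), seed=seed)
    backbone_term = sum(codebook0_losses) / len(codebook0_losses)
    # The decoder is evaluated only on the sampled frames (assumed 1/16 of them).
    decoder_term = sum(decoder_loss_for_frame(t) for t in frames) / len(frames)
    return backbone_term + decoder_term


# Toy usage: 160 frames, so the decoder callback runs on 10 of them.
frames = sample_decoder_frames(160, seed=0)
print(len(frames))
```

The point of the trick is that the decoder's cost scales with the number of sampled frames rather than the full sequence length, while the backbone still receives a training signal from every frame.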
Sesame's demos are the first voice AI in a while that made us do a double take - the prosody and back-channel timing feel genuinely alive. It's early, and there's no billable API yet, but the open-source CSM release makes it one of the more credible bets in the voice-agent space.
— The AI Tool Bible editorial team
Pros
- ✅ Open-source weights under Apache 2.0 for the CSM speech model
- ✅ Distinctly natural, context-aware prosody compared to typical TTS
- ✅ Backed by serious original research with published benchmarks
- ✅ Free research preview available at app.sesame.com
Cons
- ⚠️ No public commercial API - you self-host the open weights
- ⚠️ Pricing and productisation still vague; consumer app is invite-only
- ⚠️ Hardware (AI glasses) not shipping until 2027
- ⚠️ Small model catalogue focused on English voice quality
Compare with similar tools
ElevenLabs
The gold standard for AI voice cloning and TTS.
Suno
Text-to-song AI: full vocal tracks from a prompt.
Udio
Suno's main rival for AI-generated full songs.
AssemblyAI
Speech-to-text API with diarisation, summarisation, and topic detection.
Whisper
OpenAI's open-source speech-to-text model and the de facto baseline.
Resemble.ai
Enterprise voice cloning with deepfake-detection layer.