Whisper

✓ Editorially verified

OpenAI's open-source speech-to-text — the de-facto baseline.

Free · Free open weights; $0.006/min via OpenAI API · Audio · Whisper large-v3 · 8.6 / 10

Best for

Pick Whisper when you can self-host (or the OpenAI API is fine) and want strong baseline transcription at near-zero per-hour cost.

Skip if

Skip it when you need turnkey diarisation, summarisation, or streaming — AssemblyAI is built for that.

Whisper is OpenAI's open-source speech recognition model. It's free to self-host, multilingual (99 languages), and the baseline against which every other STT model is measured. The large-v3 release is genuinely competitive with paid alternatives on accuracy.

For anyone with engineering capacity, Whisper is the default. Self-hosted, it costs effectively nothing per hour of audio. It is also available via OpenAI's API for those who don't want to operate a GPU, and Hugging Face Transformers makes integration straightforward in Python.
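To give a sense of what the Transformers route looks like, here is a minimal sketch rather than a reference implementation. It assumes the `transformers` and `torch` packages are installed, that the `openai/whisper-large-v3` checkpoint on the Hugging Face Hub is the one you want, and that ffmpeg is available for decoding audio files. The function names `transcribe` and `format_chunks` are illustrative, not part of any library.

```python
from typing import Any, Optional


def transcribe(audio_path: str, model_id: str = "openai/whisper-large-v3") -> dict[str, Any]:
    """Transcribe one audio file with the Transformers ASR pipeline."""
    # Imported lazily so the formatting helper below can be used without
    # the (large) transformers/torch dependencies installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long audio into 30-second chunks
    )
    # return_timestamps=True adds a "chunks" list alongside the full "text".
    return asr(audio_path, return_timestamps=True)


def _fmt(seconds: Optional[float]) -> str:
    # A chunk boundary can be None when the model emits no timestamp for it.
    return f"{seconds:.1f}" if seconds is not None else "?"


def format_chunks(result: dict[str, Any]) -> list[str]:
    """Turn the pipeline's timestamped chunks into readable lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{_fmt(start)}-{_fmt(end)}] {chunk['text'].strip()}")
    return lines
```

Calling `format_chunks(transcribe("meeting.wav"))` (the file name is a placeholder) would download the checkpoint on first use, so expect a multi-gigabyte download, and a GPU is strongly advisable for large-v3.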

The model has no built-in diarisation, so speaker labels need a separate pipeline (pyannote, etc.). Hallucinated text on silent segments is a known issue and requires post-processing to clean up. For production pipelines both problems are solvable, but out of the box they can come as a surprise.
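The two clean-up steps described above can be sketched in plain Python. This is one possible approach under stated assumptions, not the method any particular pipeline uses. It assumes each segment is a dictionary with `start`, `end`, `text`, `no_speech_prob` and `avg_logprob` keys, as in the segment output of the open-source `openai-whisper` package, and that speaker turns arrive as hypothetical `(start, end, speaker)` tuples extracted from a diarisation tool such as pyannote. The thresholds are illustrative starting points and should be tuned on your own audio.

```python
def drop_likely_hallucinations(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Remove segments that look like text invented over silence."""
    kept = []
    for seg in segments:
        # Assumed fields: no_speech_prob is high on silence, avg_logprob is
        # low when the decoder was unsure of its own output.
        silent = seg.get("no_speech_prob", 0.0) > no_speech_threshold
        unsure = seg.get("avg_logprob", 0.0) < logprob_threshold
        if silent and unsure:
            continue
        # Hallucinations often repeat the same sentence back to back.
        if kept and kept[-1]["text"].strip() == seg["text"].strip():
            continue
        kept.append(seg)
    return kept


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turn overlaps it most."""
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            # Overlap in seconds; negative means the intervals do not touch.
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # Segments with no overlapping turn keep speaker=None.
        labelled.append({**seg, "speaker": best_speaker})
    return labelled
```

Filtering before speaker assignment keeps phantom segments from being attributed to a real speaker. The back-to-back duplicate check is a blunt heuristic and will also drop genuine repetitions, so it may be worth disabling for material such as songs or chants.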

Editor's take

Whisper is the rare OpenAI release that's open-weight and excellent. It set the standard for what speech-to-text should cost, and it remains the right default for almost any team with engineering capacity.

— The AI Tool Bible editorial team