Open-Weight Text-to-Speech with Voxtral TTS
Mistral AI released Voxtral TTS on March 26, 2026, an open-weight text-to-speech model with 4 billion parameters capable of generating natural-sounding speech in nine languages. The model supports voice cloning from just three seconds of audio and is optimized for low-latency performance, making it suitable for real-time applications. While the model weights are available for non-commercial use under a CC BY-NC 4.0 license, commercial use requires a separate agreement or access via Mistral's API.
- Voxtral TTS is a 4-billion-parameter text-to-speech model developed by Mistral AI and released on March 26, 2026.
- It enables voice cloning from as little as three seconds of reference audio and supports nine languages, including English, Spanish, and Arabic.
- The model achieves a real-time factor of 9.7x with approximately 100 ms time-to-first-audio, making it suitable for real-time conversational applications.
- Voxtral TTS ships with open weights under the CC BY-NC 4.0 license for non-commercial use, while commercial use requires a licensing agreement or access through Mistral's API.
- In human evaluations, Voxtral TTS outperformed ElevenLabs Flash v2.5 in most supported languages, with a 68.4% win rate overall.
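
To make the latency figures concrete, here is a minimal sketch of the arithmetic. It assumes the 9.7x real-time factor means audio is generated 9.7 times faster than it plays back; the summary does not define the term, and some sources define real-time factor the other way round. The function names are illustrative and are not part of any Mistral API.

```python
# Assumption: "9.7x" is a speedup over playback time (audio seconds / compute seconds).
REAL_TIME_SPEEDUP = 9.7
# Approximate time-to-first-audio quoted in the summary, in seconds.
TIME_TO_FIRST_AUDIO_S = 0.1


def synthesis_seconds(audio_seconds: float, speedup: float = REAL_TIME_SPEEDUP) -> float:
    """Estimate compute time needed to generate `audio_seconds` of speech."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def keeps_up_with_playback(speedup: float = REAL_TIME_SPEEDUP) -> bool:
    """A stream can play without stalling only if generation outpaces playback."""
    return speedup > 1.0


if __name__ == "__main__":
    for duration in (5.0, 30.0, 60.0):
        estimate = synthesis_seconds(duration)
        print(f"{duration:5.1f} s of audio -> ~{estimate:.2f} s to generate")
    print(f"Streaming playback can start after ~{TIME_TO_FIRST_AUDIO_S * 1000:.0f} ms")
    print(f"Generation outpaces playback: {keeps_up_with_playback()}")
```

Under that assumption, a 30-second clip would take roughly 3.1 seconds to generate in full, while a streaming client could begin playback after about 100 ms and would not need to wait for the rest.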
Opening excerpt (first ~120 words)
# Introduction

Voice-enabled applications are everywhere, from virtual assistants to customer service chatbots. But for developers, building natural-sounding speech into apps has often meant relying on expensive cloud APIs or dealing with robotic, unnatural voices. Mistral AI aims to change that with Voxtral TTS. It is a powerful, open-weight text-to-speech (TTS) model that you can run on your own hardware. Released on March 26, 2026, this 4-billion-parameter model generates human-like speech in nine languages and adapts to a new voice from as little as three seconds of reference audio.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at KDnuggets.