Voxtral

Text-to-Speech

Speaking of Voxtral

Today we’re releasing Voxtral TTS, our first text-to-speech model with state-of-the-art performance in multilingual voice generation. The model is 3x faster than the industry standard, making Voxtral-powered agents natural, reliable, and cost-effective at scale.

Highlights.

  1. SOTA performance 

  2. Expressiveness

  3. Multilingual 

  4. Real-time (or near-real-time)

  5. Adaptable 

Natural voice generation hinges on the model’s ability not only to recite a text but to interpret it accurately. Contextual understanding of tone (neutral, happy, sarcastic, and so on) determines whether the listener perceives the generation as natural or robotic. Our model also excels at speaker modeling: capturing how a specific person naturally speaks. Our voice adaptation feature goes beyond traditional read speech by capturing a speaker’s personality, including their natural pauses, rhythm, intonation, and emotional range.

Audio is the new UX. Create new interactions for collaboration and understanding that are only possible in speech. Begin now in AI Studio with our Mistral Voices in American, British, and French dialects.

Listen and decide: can you tell the difference?

Our team speaks dozens of languages in multiple dialects. We understand the importance of cultural nuance and built a model that reflects us. Speech generation builds trust through natural rhythm, emotion, and even humor. That’s why, with voice emulation, we focused on authenticity and emotional expressiveness.

Voice emulation

Original voice
Margaret

Model Behavior Architect

American

Emulation
Prompt

Boy oh boy! I'm so excited for the summer. It's going to be so warm here, can't wait for swimming in the Lido and making cherry pie.

Provider 1
Provider 2

State-of-the-art performance.

Automated metrics for multilingual text-to-speech systems, such as word error rate and audio quality scores, cannot measure the naturalness of speech. What makes speech natural is extremely nuanced and requires a deep understanding of cultural differences and typical speaking patterns. Hence, comparative human evaluations performed by native speakers are crucial.

For voice agents, latency and quality are in constant tension. Human evaluations show that Voxtral TTS achieves superior naturalness compared to ElevenLabs Flash v2.5 while maintaining similar Time-to-First-Audio (TTFA). Voxtral also performs at parity with the quality of ElevenLabs v3, successfully supporting emotion-steering for more lifelike interactions.

Gemini TTS (gemini-2.5-flash-preview-tts) currently supports only non-streaming mode (generate_content), so its reported “TTFA” is effectively full-response latency (TTFA = Total), whereas Voxtral’s ≈0.7s TTFA reflects true streaming first-audio delivery.
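To make the streaming distinction concrete, here is a minimal sketch of how TTFA and total latency could be measured for any chunked audio response. The two generators and the `FakeClock` are hypothetical stand-ins that simulate timing; they are not the Voxtral or Gemini APIs, and the 0.1s-per-chunk figure is an arbitrary toy value.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_latency(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttfa, total) in seconds for an iterable of audio chunks."""
    start = clock()
    ttfa = None
    for chunk in chunks:
        # TTFA is the delay until the first non-empty audio chunk arrives.
        if ttfa is None and chunk:
            ttfa = clock() - start
    total = clock() - start
    # If no audio arrived at all, fall back to the total time.
    return (total if ttfa is None else ttfa), total


class FakeClock:
    """Deterministic clock so the example does not depend on real timing."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def streaming_tts(clock: FakeClock, n_chunks: int = 5, per_chunk: float = 0.1) -> Iterator[bytes]:
    """Hypothetical streaming endpoint: audio arrives chunk by chunk."""
    for _ in range(n_chunks):
        clock.advance(per_chunk)
        yield b"\x00" * 320


def non_streaming_tts(clock: FakeClock, n_chunks: int = 5, per_chunk: float = 0.1) -> Iterator[bytes]:
    """Hypothetical non-streaming endpoint: all audio arrives at the end."""
    clock.advance(n_chunks * per_chunk)
    yield b"\x00" * 320 * n_chunks


if __name__ == "__main__":
    for name, endpoint in [("streaming", streaming_tts), ("non-streaming", non_streaming_tts)]:
        clock = FakeClock()
        ttfa, total = measure_latency(endpoint(clock), clock)
        print(f"{name}: TTFA={ttfa:.2f}s total={total:.2f}s")
```

In the non-streaming case the first audio byte only arrives with the full response, so the measured TTFA collapses onto the total latency.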

We conducted a comparative human evaluation of Voxtral TTS and ElevenLabs Flash v2.5 in a zero-shot custom voice context. Using two recognizable voices in their native dialects per supported language, annotators performed a side-by-side preference test on naturalness, accent adherence, and acoustic similarity to the original reference. Voxtral TTS widens the quality gap over Flash v2.5 in this zero-shot multilingual custom voice setting, highlighting the instant customizability of Voxtral TTS to any voice.
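As an illustration of how a side-by-side preference test is typically summarized, the sketch below computes a win rate from annotator votes. The vote labels (`"A"`, `"B"`, `"tie"`) and the choice to exclude ties from the denominator are assumptions made for this example, not a description of the evaluation protocol used here, and the votes shown are toy values.

```python
from collections import Counter
from typing import Iterable


def win_rate(votes: Iterable[str], system: str = "A", other: str = "B") -> float:
    """Share of decisive votes won by `system`; ties are ignored (an assumption)."""
    counts = Counter(votes)
    decided = counts[system] + counts[other]
    if decided == 0:
        raise ValueError("no decisive votes to compute a win rate from")
    return counts[system] / decided


if __name__ == "__main__":
    # Toy votes for illustration only; these are not evaluation results.
    toy_votes = ["A", "A", "B", "tie"]
    print(f"win rate for A: {win_rate(toy_votes):.2f}")
```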

Spoken natively.

Trained on a large speech dataset, Voxtral TTS is built for global application. It delivers state-of-the-art performance in 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

The model also demonstrates zero-shot cross-lingual voice adaptation, even though it was not explicitly trained for it. For example, the model can generate English speech from a French voice prompt and English text. The resulting speech sounds natural while adopting the accent of the provided voice prompt (in this example, the generated speech is natural French-accented English). This makes the model useful for building cascaded speech-to-speech translation systems.

Cascaded speech-to-speech translation

Before we begin, I'll need to verify a few details. Can you confirm your full name and date of birth?

Voxtral TTS
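A cascaded speech-to-speech translation system chains three stages: transcription, text translation, and speech synthesis. The sketch below shows that wiring under stated assumptions: the three callables are hypothetical interfaces (not a real SDK), and the toy stand-ins at the bottom only exist so the example runs end to end.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures, chosen for illustration only.
Transcribe = Callable[[bytes], str]          # source audio -> source text
Translate = Callable[[str, str], str]        # (text, target language) -> translated text
Synthesize = Callable[[str, bytes], bytes]   # (text, voice prompt audio) -> target audio


@dataclass
class CascadedTranslator:
    transcribe: Transcribe
    translate: Translate
    synthesize: Synthesize

    def __call__(self, source_audio: bytes, target_lang: str) -> bytes:
        text = self.transcribe(source_audio)
        translated = self.translate(text, target_lang)
        # Reuse the speaker's own audio as the voice prompt so the output
        # keeps their voice and accent in the target language.
        return self.synthesize(translated, source_audio)


# Toy stand-ins so the sketch is runnable; real stages would call models.
def toy_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def toy_synthesize(text: str, voice_prompt: bytes) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    pipeline = CascadedTranslator(toy_transcribe, toy_translate, toy_synthesize)
    print(pipeline(b"bonjour", "en"))
```

Passing the source audio as the voice prompt is what lets the cross-lingual adaptation described above carry the speaker's identity into the translated speech.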

Built for low-latency streaming.

Latency is critical for voice agent applications. Voxtral TTS achieves a TTFA of 90ms for a typical input of a 10-second voice sample and 500 characters of text, with a real-time factor (RTF) of ≈6x. The model natively generates up to two minutes of audio, and our API handles arbitrarily long generations with smart interleaving.
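To put the real-time factor in perspective, the short calculation below assumes that "≈6x" means audio is generated about six times faster than it plays back. That reading is an assumption; some sources define RTF the other way around, as processing time divided by audio duration, which is why the helper exposes both views.

```python
def synthesis_seconds(audio_seconds: float, speedup: float = 6.0) -> float:
    """Wall-clock time to generate `audio_seconds` of audio at a given speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def processing_ratio(speedup: float = 6.0) -> float:
    """The same figure expressed as processing time per second of audio."""
    return 1.0 / speedup


if __name__ == "__main__":
    two_minutes = 120.0  # maximum native generation length mentioned in the text
    print(f"time to generate 2 min of audio: {synthesis_seconds(two_minutes):.1f}s")
    print(f"processing time per audio second: {processing_ratio():.3f}s")
```

Under this reading, any speed-up above 1x means a streaming client that starts playback at first audio should not run out of buffered audio, network effects aside.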

Voxtral TTS architecture.

The model is a transformer-based, autoregressive, flow-matching model, built on Ministral 3B. It consists of the following components:

  • 3.4B-parameter transformer decoder backbone

  • 390M flow-matching acoustic transformer

  • 300M neural audio codec (symmetric encoder-decoder)

The model takes a voice prompt (3 to 25 seconds) and a text prompt in any of the 9 supported languages. For each audio frame, the transformer backbone predicts a semantic token, then the flow-matching transformer runs 16 function evaluations (NFE = 16) to produce the acoustic latent.
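The per-frame loop described above can be sketched as follows. This is a toy illustration, not the model's implementation: the fixed-step Euler solver, the straight-line velocity field, and the functions `toy_backbone`, `toy_target`, and `toy_velocity` are all assumptions introduced for the example. Only the 16 function evaluations per frame and the 36-dimensional acoustic latent come from the text, and the real conditioning between frames is likely richer than the token history used here.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]
VelocityFn = Callable[[Vector, float, int], Vector]
BackboneFn = Callable[[List[int]], int]

NUM_NFE = 16      # function evaluations per frame, as stated in the text
LATENT_DIM = 36   # acoustic latent dimensionality, as stated in the text


def euler_flow_sample(velocity: VelocityFn, token: int, rng: random.Random,
                      steps: int = NUM_NFE) -> Vector:
    """Integrate dx/dt = velocity(x, t, token) from noise (t=0) to t=1 (assumed Euler solver)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt, token)  # one function evaluation
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate(backbone: BackboneFn, velocity: VelocityFn, num_frames: int,
             seed: int = 0) -> List[Tuple[int, Vector]]:
    """Autoregressive loop: one semantic token, then one acoustic latent, per frame."""
    rng = random.Random(seed)
    history: List[int] = []
    frames: List[Tuple[int, Vector]] = []
    for _ in range(num_frames):
        token = backbone(history)
        latent = euler_flow_sample(velocity, token, rng)
        history.append(token)
        frames.append((token, latent))
    return frames


# Toy stand-ins (hypothetical) so the sketch runs without any model weights.
calls: Dict[str, int] = {"n": 0}


def toy_backbone(history: List[int]) -> int:
    return (len(history) * 7) % 8192


def toy_target(token: int) -> Vector:
    return [((token + d) % 21) / 10.0 - 1.0 for d in range(LATENT_DIM)]


def toy_velocity(x: Vector, t: float, token: int) -> Vector:
    calls["n"] += 1
    # Straight-line flow toward the target: v = (target - x) / (1 - t).
    return [(g - xi) / (1.0 - t) for g, xi in zip(toy_target(token), x)]


if __name__ == "__main__":
    frames = generate(toy_backbone, toy_velocity, num_frames=3)
    print(len(frames), "frames,", calls["n"], "velocity evaluations")
```

With a straight-line velocity field, the final Euler step lands on the target, which makes the toy easy to check; a learned velocity network would replace `toy_velocity` in practice.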

We developed an in-house codec that processes audio causally. It represents audio with a semantic VQ latent (vocabulary of 8192) and an acoustic FSQ latent (36 dimensions, 21 levels), both produced at a 12.5Hz frame rate.
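The codec figures above imply a rough information budget per frame. The calculation below is a back-of-the-envelope sketch that assumes one semantic token per frame and 21 levels on each of the 36 FSQ dimensions; both are readings of the description rather than confirmed details, and the result is an upper bound on information content, not a measured bitrate.

```python
import math

FRAME_RATE_HZ = 12.5     # frames per second, from the text
SEMANTIC_VOCAB = 8192    # semantic VQ vocabulary size, from the text
FSQ_DIMS = 36            # acoustic FSQ dimensions, from the text
FSQ_LEVELS = 21          # levels per dimension (assumed to apply to every dimension)


def bits_per_frame() -> float:
    """Upper bound on bits per frame: one VQ index plus one FSQ code per dimension."""
    semantic_bits = math.log2(SEMANTIC_VOCAB)
    acoustic_bits = FSQ_DIMS * math.log2(FSQ_LEVELS)
    return semantic_bits + acoustic_bits


def bitrate_bps() -> float:
    """Bits per second at the stated frame rate."""
    return bits_per_frame() * FRAME_RATE_HZ


def frames_for(seconds: float) -> int:
    """Number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * FRAME_RATE_HZ)


if __name__ == "__main__":
    print(f"bits per frame: {bits_per_frame():.2f}")
    print(f"bitrate: {bitrate_bps():.0f} bit/s")
    print(f"frames in two minutes: {frames_for(120)}")
```

Under these assumptions the budget works out to roughly 171 bits per frame, or about 2.1 kbit/s, and a two-minute generation spans 1500 frames.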

Powering voice workflows.

Voxtral TTS closes the loop on audio intelligence, giving enterprise voice pipelines an output layer that passes the human test. It works alongside Voxtral Transcribe for full speech-to-speech, or integrates into any existing speech-to-text and LLM stack.

[HERO VIDEO - PIZZA ORDER]

Customer support.

Voice agents that route and resolve queries across channels with natural, brand-appropriate speech.

Add Voxtral TTS to existing customer support call systems for automated spoken responses, with output that integrates into existing workflows.

Sales and marketing.

Consistent brand voice for personalized messaging, with cross-lingual emulation that lets a single voice work across markets. Useful for voiceovers, narration, dubbed content, podcasts, and newsletters for internal and client-facing content.

Real-time translation.

Voice-to-voice translation across 9 languages for cross-border operations, events, and internal content, preserving speaker accent and identity in the target language.

Fictional characters.

Characters for interactive storytelling and game design, with emotion-steering to control tone and personality across contexts.

Personal assistant.

Create a voice agent that speaks in natural language and triggers actions in conversation.

Audio playground.

Test Voxtral TTS directly in Mistral Studio. Select one of the Mistral voices or record your own.

[Mistral Studio demo]

Get started with Voxtral TTS.

Voxtral TTS is available now via API at XXX per minute.

Try it now in Mistral Studio or in Le Chat.

The model is available as open weights on Hugging Face under the Apache 2.0 license.

Explore the model’s documentation.

We’re hiring!

We are building the voice layer for AI. If this is the kind of problem you want to work on, we'd love to hear from you.