Mistral AI Unveils Voxtral Transcribe 2: Revolutionizing Multilingual Speech-to-Text for Scaled Production

Mistral AI has released the Voxtral Transcribe 2 family of automatic speech recognition (ASR) models. The release targets the growing demand for robust speech-to-text in AI products, from interactive voice agents to meeting analysis tools. The family consists of two models, each optimized for a specific use case, with an emphasis on efficiency, responsiveness, and deployment flexibility.

Dual Models for Diverse ASR Needs

The Voxtral Transcribe 2 family comprises two specialized models:

Voxtral Mini Transcribe V2: Tailored for batch transcription tasks, this model excels in delivering high-accuracy output combined with speaker diarization.
Voxtral Realtime (Voxtral Mini 4B Realtime 2602): Engineered for low-latency streaming transcription, this model is also made available as open weights.

Both models support 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.

Voxtral Realtime: High-Performance Streaming with Tunable Latency

Voxtral Mini 4B Realtime 2602 is a 4-billion-parameter multilingual real-time speech transcription model. Among open-weights models, it stands out for reaching accuracy comparable to offline systems at a delay under 500 milliseconds. Its architecture combines a 3.4-billion-parameter language model with a 0.6-billion-parameter audio encoder. Both use sliding-window attention, which bounds how far back each step attends and so supports continuous streaming.

A key feature of Voxtral Realtime is its configurable transcription delay, adjustable from 80 ms up to 2.4 seconds. This allows developers to fine-tune the balance between speed and accuracy:

Delays between 80–200 ms are ideal for highly interactive applications.
Approximately 480 ms provides an optimal balance, matching leading offline and real-time systems on benchmarks like FLEURS.
A 2.4-second delay maximizes accuracy, aligning with the batch model's performance for tasks such as subtitling.
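
The delay bands above can be captured in a small lookup. This is a minimal sketch: the preset names and the helper functions `pick_delay_ms` and `clamp_delay_ms` are hypothetical and not part of any Mistral or vLLM API. Only the numeric values (80 ms, 200 ms, 480 ms, 2.4 s) come from the article.

```python
# Hypothetical presets derived from the delay bands described in the article.
# None of these names belong to a real Mistral or vLLM interface.
MIN_DELAY_MS = 80      # lowest configurable delay cited in the article
MAX_DELAY_MS = 2400    # highest configurable delay cited in the article

PRESETS_MS = {
    "interactive": 200,     # upper end of the 80-200 ms band for voice agents
    "balanced": 480,        # the ~480 ms setting cited as a good trade-off
    "max_accuracy": 2400,   # the 2.4 s setting suggested for subtitling
}


def pick_delay_ms(use_case: str) -> int:
    """Return a transcription delay in milliseconds for a named use case."""
    try:
        return PRESETS_MS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None


def clamp_delay_ms(requested_ms: int) -> int:
    """Clamp an arbitrary request into the supported 80 ms to 2.4 s range."""
    return max(MIN_DELAY_MS, min(MAX_DELAY_MS, requested_ms))


if __name__ == "__main__":
    print(pick_delay_ms("balanced"))
    print(clamp_delay_ms(10))
```

How the chosen delay is actually passed to the model depends on the serving stack, which the article does not specify.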

The weights are released in BF16 format. The model is designed for efficient on-device or edge deployment and can run on a single GPU with at least 16 GB of memory.
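
A rough calculation shows why 16 GB is a plausible floor. The sketch below assumes 2 bytes per BF16 parameter and decimal gigabytes, and counts only the weights. Activations, attention caches, and framework overhead are not modeled, so the leftover figure is an upper bound on headroom rather than a measured value.

```python
BYTES_PER_BF16_PARAM = 2  # BF16 stores each parameter in 16 bits


def weight_memory_gb(num_params: float,
                     bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as stated in the article.
language_model_params = 3.4e9
audio_encoder_params = 0.6e9
total_params = language_model_params + audio_encoder_params

weights_gb = weight_memory_gb(total_params)
headroom_gb = 16 - weights_gb  # what remains on a 16 GB GPU for everything else

if __name__ == "__main__":
    print(f"weights: {weights_gb:.1f} GB, headroom on 16 GB: {headroom_gb:.1f} GB")
```

Under these assumptions the weights take about half of a 16 GB card, and the remainder has to cover runtime state.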

Voxtral Mini Transcribe V2: Batch Excellence with Enterprise Features

Voxtral Mini Transcribe V2 is a closed-weights model exposed through the API as voxtral-mini-2602 and optimized for transcription quality and speaker diarization. It is priced at $0.003 per minute and achieves a Word Error Rate (WER) of approximately 4% on the FLEURS benchmark, which places it ahead of several competitor APIs on both accuracy and cost.

This model offers a suite of enterprise-grade features:

Speaker Diarization: Accurately labels speakers with precise timestamps, making it invaluable for meetings, interviews, and multi-party calls.
Context Biasing: Allows input of up to 100 words or phrases to improve accuracy for specific terminology or names.
Word-level Timestamps: Provides detailed timing for each word, crucial for subtitle generation and searchable audio content.
Noise Robustness: Maintains high accuracy even in challenging acoustic environments.
Extended Audio Support: Processes up to three hours of audio in a single request.
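
Two of the limits above (100 context-biasing entries and three hours of audio per request) are easy to check on the client before uploading. The helper below is a hypothetical pre-flight check, not part of any Mistral SDK. It assumes the limits are exactly as stated in the article and that each word or phrase counts as one entry.

```python
from typing import List, Sequence

MAX_CONTEXT_PHRASES = 100          # "up to 100 words or phrases"
MAX_AUDIO_SECONDS = 3 * 60 * 60    # "up to three hours of audio"


def check_batch_request(audio_seconds: float,
                        context_phrases: Sequence[str]) -> List[str]:
    """Return human-readable problems; an empty list means within limits."""
    problems: List[str] = []
    if audio_seconds <= 0:
        problems.append("audio duration must be positive")
    if audio_seconds > MAX_AUDIO_SECONDS:
        problems.append(
            f"audio is {audio_seconds:.0f} s, limit is {MAX_AUDIO_SECONDS} s"
        )
    if len(context_phrases) > MAX_CONTEXT_PHRASES:
        problems.append(
            f"{len(context_phrases)} context phrases, limit is {MAX_CONTEXT_PHRASES}"
        )
    return problems


if __name__ == "__main__":
    # A one-hour recording with two biasing terms passes both checks.
    print(check_batch_request(3600, ["Voxtral", "FLEURS"]))
```

Audio that exceeds the duration limit would need to be split into several requests; how to split it without breaking diarization across chunks is outside the scope of this sketch.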

Deployment and Accessibility

Both models are available through hosted endpoints. Voxtral Mini Transcribe V2 is accessible through Mistral's audio transcription API and the Mistral Studio audio playground. Voxtral Realtime is available via the Mistral API at $0.006 per minute, and its weights are also published on Hugging Face under the Apache 2.0 license, with official vLLM Realtime support.
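
The two per-minute prices make cost comparisons straightforward. The sketch below assumes billing is linear in audio minutes with no rounding, minimums, or volume discounts, none of which the article confirms. The function name `estimate_cost_usd` and the mode labels are illustrative only.

```python
from decimal import Decimal

# Per-minute prices as quoted in the article, kept as Decimal to avoid
# binary floating-point rounding in currency arithmetic.
PRICE_PER_MINUTE_USD = {
    "batch": Decimal("0.003"),     # Voxtral Mini Transcribe V2
    "realtime": Decimal("0.006"),  # Voxtral Realtime via the Mistral API
}


def estimate_cost_usd(minutes: float, mode: str) -> Decimal:
    """Estimate API cost, assuming simple linear per-minute billing."""
    if mode not in PRICE_PER_MINUTE_USD:
        raise ValueError(f"unknown mode: {mode!r}")
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return PRICE_PER_MINUTE_USD[mode] * Decimal(str(minutes))


if __name__ == "__main__":
    for mode in ("batch", "realtime"):
        print(mode, estimate_cost_usd(180, mode))
```

Under these assumptions, a three-hour recording costs $0.54 through the batch model and $1.08 through the hosted real-time model.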

Together, the two models give developers and enterprises a hosted batch option and an open-weights streaming option for integrating speech-to-text into their applications.
