FlashLabs researchers have announced the release of Chroma 1.0, an advanced artificial intelligence model designed to facilitate real-time, natural speech dialogue. This system stands out by processing spoken input and generating spoken responses, consistently preserving the speaker's unique voice identity throughout multi-turn conversations. It is presented as the first open-source, end-to-end spoken dialogue solution that integrates low-latency interaction with high-fidelity personalized voice cloning, requiring only a few seconds of reference audio.
Unlike many conventional systems that rely on text transcripts, Chroma 1.0 operates directly on discrete speech representations. It targets use cases similar to those handled by commercial real-time agents, but with a streamlined 4-billion-parameter dialogue core. A key design principle behind Chroma is its prioritization of speaker similarity, treating it as a core objective rather than an auxiliary feature. The model reportedly achieves a 10.96% relative improvement in speaker similarity over a human baseline and a Real Time Factor (RTF) of 0.43, meaning it generates speech more than twice as fast as real-time playback.
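As a quick sanity check on those headline numbers, the sketch below reproduces the arithmetic behind them; the 0.81 and 0.73 speaker-similarity scores come from the evaluation discussed later in this article.

```python
# Illustrative arithmetic behind the reported figures (not FlashLabs code).

audio_duration_s = 10.0                     # e.g. a 10-second response
rtf = 0.43                                  # reported Real Time Factor
generation_time_s = rtf * audio_duration_s  # ~4.3 s to synthesize 10 s of audio
speedup_vs_playback = 1 / rtf               # ~2.3x faster than playback

chroma_sim, human_sim = 0.81, 0.73          # speaker-similarity scores cited below
relative_gain = (chroma_sim - human_sim) / human_sim  # ~0.1096 -> 10.96%

print(f"gen time: {generation_time_s:.1f}s, "
      f"speedup: {speedup_vs_playback:.2f}x, "
      f"relative gain: {relative_gain:.2%}")
```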
Revolutionizing Dialogue Architecture
Most existing production assistants utilize a three-stage pipeline: automatic speech recognition (ASR) converts audio to text, a large language model (LLM) handles reasoning, and text-to-speech (TTS) synthesizes the response. While flexible, this approach introduces latency and loses crucial paralinguistic details such as timbre, emotional tone, speaking rate, and prosody once the audio is collapsed into text. For real-time dialogue, this loss of acoustic richness directly impacts speaker fidelity and naturalness.
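To see where the acoustic information disappears, here is a minimal sketch of the conventional cascade; the three helpers are hypothetical stubs standing in for real ASR, LLM, and TTS services, not any particular product's API.

```python
# Minimal sketch of the three-stage cascade; all functions are illustrative stubs.

def transcribe(audio: bytes) -> str:
    """ASR stub: collapses audio into plain text, discarding timbre and prosody."""
    return "what's the weather like?"

def chat(prompt: str) -> str:
    """LLM stub: reasons over text only; it never sees how the user sounded."""
    return "It looks sunny this afternoon."

def synthesize(text: str) -> bytes:
    """TTS stub: renders the reply in a fixed, generic voice."""
    return b"\x00" * 16000  # placeholder PCM bytes

def cascaded_turn(user_audio: bytes) -> bytes:
    text_in = transcribe(user_audio)   # acoustic detail is lost at this boundary
    text_out = chat(text_in)
    return synthesize(text_out)

reply = cascaded_turn(b"\x01" * 16000)  # the reply voice knows nothing of the caller's
```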
Chroma, however, belongs to a newer class of speech-to-speech systems that map directly between sequences of codec tokens. A speech tokenizer and neural codec first produce quantized acoustic codes; a language model then processes and responds to a sequence that interleaves text tokens with these audio codes, bypassing any explicit intermediate transcript. This keeps the model conditioned on both prosody and speaker identity throughout the entire processing chain.
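A rough sketch of what such an interleaved sequence can look like is below. It assumes text tokens and codec codes share one vocabulary via an offset; the IDs, vocabulary size, and offset scheme are illustrative assumptions, not Chroma's actual tokenizer.

```python
# Illustrative only: text tokens and audio codec codes sharing one sequence
# so a single language model can attend over both modalities.

TEXT_VOCAB = 32_000           # assumed text vocabulary size
AUDIO_OFFSET = TEXT_VOCAB     # codec codes shifted past the text vocabulary

def interleave(text_tokens: list[int], audio_codes: list[int]) -> list[int]:
    """Alternate text tokens and (offset) audio codes into one token stream."""
    seq = []
    for t, a in zip(text_tokens, audio_codes):
        seq.append(t)                 # linguistic content
        seq.append(AUDIO_OFFSET + a)  # acoustic content in the same context
    return seq

# Prosody and speaker identity stay in context instead of being dropped
# at a transcript boundary.
print(interleave([11, 57, 902], [3, 840, 129]))
```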
Under the Hood: Chroma's Architecture
The Chroma 1.0 system comprises two primary components: the Chroma Reasoner and the speech stack. The Reasoner is responsible for multimodal understanding and text generation. The speech stack, consisting of the Chroma Backbone, Chroma Decoder, and Chroma Codec Decoder, then transforms this semantic output into personalized spoken audio responses.
- Chroma Reasoner: Built on the Thinker module from the Qwen-Omni series, it uses the Qwen2-Audio encoding pipeline. It processes combined text and audio inputs through shared front ends, integrates them with cross-modal attention, and aligns them over time using Time-aligned Multimodal Rotary Position Embedding (TM-RoPE). The output is a sequence of hidden states that captures both linguistic content and acoustic cues.
- Chroma Backbone: A 1-billion-parameter, LLaMA-style model (based on Llama 3) that follows CSM-1B and conditions on the target voice by encoding a brief reference audio clip and its transcript into embedding prompts. During inference, token embeddings and hidden states from the Reasoner provide a unified context, so the Backbone generates acoustic codes while staying aware of the dialogue's semantic state.
- Streaming Support: To enable streaming, the system uses a fixed 1-to-2 interleaving schedule: for every text token produced by the Reasoner, the Backbone generates two audio code tokens (a simplified sketch of this flow follows the list). Speech generation can therefore begin as soon as text generation does, with no wait for full sentences, which contributes to a low Time to First Token (TTFT).
- Chroma Decoder: A lightweight LLaMA variant with approximately 100 million parameters. While the Backbone predicts only the initial Residual Vector Quantization (RVQ) codebook per frame (a coarse representation), the Decoder then takes the Backbone's hidden state and this initial code to autoregressively predict the remaining RVQ levels within the same frame. This factorization keeps long-context temporal structure in the Backbone and limits the Decoder to frame-local refinement, optimizing compute resources and enhancing detailed prosody.
- Chroma Codec Decoder: This component concatenates the coarse and refined codes, mapping them into waveform samples. Following the design of the Mimi vocoder, it employs a causal convolutional neural network, ensuring each output sample depends only on past context, a necessity for streaming. The system utilizes eight codebooks, reducing the number of autoregressive refinement steps for the Decoder while preserving sufficient detail for voice cloning.
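To make the division of labor concrete, here is a heavily simplified sketch of one streaming step through the speech stack. Every function is a placeholder; the names, shapes, frame size, and sampling logic are assumptions drawn from the description above, not the released implementation.

```python
import numpy as np

NUM_CODEBOOKS = 8          # Mimi-style codec with eight RVQ levels
CODES_PER_TEXT_TOKEN = 2   # fixed 1-to-2 text-to-audio interleaving schedule

def backbone_step(reasoner_hidden, voice_prompt):
    """Backbone stub: long-context model, predicts the coarse (level-0) code per frame."""
    coarse_code = np.random.randint(0, 1024)   # placeholder prediction
    backbone_hidden = np.random.randn(1024)    # state handed to the Decoder
    return coarse_code, backbone_hidden

def decoder_refine(backbone_hidden, coarse_code):
    """Decoder stub: frame-local model, fills in RVQ levels 1..7 for the same frame."""
    return [np.random.randint(0, 1024) for _ in range(NUM_CODEBOOKS - 1)]

def codec_decode(frame_codes):
    """Causal codec-decoder stub: maps the 8 codes of one frame to a waveform chunk."""
    return np.zeros(1920, dtype=np.float32)    # e.g. 80 ms at 24 kHz (assumed)

def emit_audio_for_text_token(reasoner_hidden, voice_prompt):
    """For each text token from the Reasoner, stream out two audio frames."""
    chunks = []
    for _ in range(CODES_PER_TEXT_TOKEN):
        coarse, hidden = backbone_step(reasoner_hidden, voice_prompt)
        fine = decoder_refine(hidden, coarse)
        chunks.append(codec_decode([coarse] + fine))
    return np.concatenate(chunks)
```

The point of the factorization is visible in the sketch: only the Backbone carries long-range dialogue context, while the small Decoder does cheap, frame-local refinement before the codec decoder turns codes into samples.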
Performance Benchmarks and Unique Advantages
The model's training relies on a synthetic speech-to-speech (S2S) pipeline to address the scarcity of high-quality speech dialogue data that carries strong reasoning signals. An LLM first generates textual answers to user questions; a Text-to-Speech (TTS) system then synthesizes target speech for those answers in the timbre of the reference audio. These synthetic pairs train the Backbone and Decoder for acoustic modeling and voice cloning.
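A hedged sketch of that data pipeline is below; `generate_answer` and `tts_with_reference` are hypothetical stubs standing in for the LLM and voice-cloning TTS described in the report, not FlashLabs tooling.

```python
# Illustrative sketch of the synthetic S2S training-data pipeline.

def generate_answer(question_text: str) -> str:
    """LLM stub: produces the textual answer that carries the reasoning signal."""
    return "Water boils at roughly 100 degrees Celsius at sea level."

def tts_with_reference(text: str, reference_audio: bytes) -> bytes:
    """TTS stub: renders `text` in the timbre of `reference_audio`."""
    return b"\x00" * 48000  # placeholder waveform bytes

def make_training_pair(question_text: str, question_audio: bytes,
                       reference_audio: bytes) -> dict:
    """Builds one (user speech, target speech) example for Backbone/Decoder training."""
    answer_text = generate_answer(question_text)
    answer_audio = tts_with_reference(answer_text, reference_audio)
    return {
        "input_audio": question_audio,   # user turn
        "target_text": answer_text,      # supervises semantic content
        "target_audio": answer_audio,    # supervises acoustic modeling / voice cloning
    }
```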
Objective evaluation using the SEED-TTS-EVAL protocol on English CommonVoice speakers shows Chroma achieving a Speaker Similarity score of 0.81 at 24 kHz, surpassing the human baseline of 0.73 and other TTS systems like CosyVoice-3 (0.72). Subjective naturalness comparisons with ElevenLabs eleven_multilingual_v2 indicated a preference for ElevenLabs, though speaker similarity scores were very close. The overall Time to First Token (TTFT) is approximately 147 ms, well under one second, making it highly suitable for interactive dialogue.
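For context on what that 0.81 score measures: speaker similarity in protocols like SEED-TTS-EVAL is typically the cosine similarity between speaker embeddings of the generated and reference audio, extracted by a pretrained speaker-verification model. The extractor below is a stub, so its outputs are purely illustrative.

```python
import numpy as np

def embed_speaker(waveform: np.ndarray) -> np.ndarray:
    """Stub for a speaker-verification embedding model (real protocols use a
    pretrained network); returns a placeholder 256-d speaker embedding."""
    rng = np.random.default_rng(abs(int(waveform.sum())) % 2**32)
    return rng.standard_normal(256)

def speaker_similarity(generated: np.ndarray, reference: np.ndarray) -> float:
    """Cosine similarity between the two speaker embeddings."""
    a, b = embed_speaker(generated), embed_speaker(reference)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gen, ref = np.random.randn(24000), np.random.randn(24000)
print(round(speaker_similarity(gen, ref), 2))  # scores such as 0.81 come from real embeddings
```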
On the URO-Bench basic track for spoken dialogue and reasoning, Chroma, despite its 4B parameters, achieved an overall task accomplishment score of 57.44%, ranking second only to the 9B-parameter GLM-4-Voice and outperforming several other models. Critically, Chroma is the only model in this comparison that supports personalized voice cloning, pairing competitive cognitive capabilities with high-fidelity, real-time voice personalization.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost