Inworld AI has launched TTS-1.5, an upgrade to its TTS-1 series of text-to-speech (TTS) models designed specifically for real-time voice agents. The release targets three requirements: low latency, high audio fidelity, and low cost. Artificial Analysis recognizes the system as a top-tier text-to-speech solution, and it offers greater expressiveness and stability than previous versions, making it suitable for large-scale consumer applications.
Optimized for Real-Time Responsiveness
A primary focus of TTS-1.5 is P90 time-to-first-audio latency, the delay within which 90 percent of requests return their first audio and a critical indicator of user-perceived speed. The TTS-1.5 Max model achieves a P90 latency under 250 milliseconds, while TTS-1.5 Mini stays below 130 milliseconds. These figures are roughly four times faster than Inworld's previous TTS generation.
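To make the P90 metric concrete, the sketch below computes a 90th percentile from a list of latency samples using the nearest-rank method. The sample values and the function name are hypothetical and purely illustrative; they are not Inworld measurements, and Inworld's own benchmarking method may differ.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical time-to-first-audio measurements in milliseconds (illustrative only).
ttfa_ms = [112, 98, 105, 121, 240, 101, 117, 109, 95, 126]

p90 = percentile_nearest_rank(ttfa_ms, 90)
print(f"P90 time to first audio: {p90} ms")
```

With these samples, the single 240 ms outlier falls outside the P90 value. This is why a percentile target describes what most users experience rather than the worst case.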
The architecture supports streaming over WebSocket, so playback can begin as soon as the first audio chunk is generated. This keeps overall interaction latency in line with typical real-time language model responses, which is crucial for integrated agent pipelines. Inworld recommends TTS-1.5 Max for most uses, as it combines latency of around 200 ms with superior stability and audio fidelity. TTS-1.5 Mini is tailored for extremely latency-sensitive scenarios, such as interactive gaming or highly responsive conversational AI, where every millisecond matters.
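The streaming pattern described above can be sketched without any real service. In the example below, `fake_tts_stream` is a hypothetical stand-in for a streaming TTS connection (it is not Inworld's API), and `play_stream` records how long the first chunk takes to arrive while consuming the rest as it is produced.

```python
import asyncio
import time


async def fake_tts_stream(text, chunk_delay_s=0.01, chunks=3):
    """Hypothetical stand-in for a streaming TTS connection.

    Yields placeholder audio chunks over time instead of calling a real service.
    """
    for i in range(chunks):
        await asyncio.sleep(chunk_delay_s)  # simulate synthesis and network delay
        yield f"chunk-{i}".encode()


async def play_stream(stream):
    """Consume chunks as they arrive; return (time_to_first_audio_s, chunks)."""
    start = time.perf_counter()
    first_audio_s = None
    received = []
    async for chunk in stream:
        if first_audio_s is None:
            # Playback could begin here, before the remaining audio exists.
            first_audio_s = time.perf_counter() - start
        received.append(chunk)  # a real client would hand this to an audio player
    return first_audio_s, received


first_audio_s, received = asyncio.run(play_stream(fake_tts_stream("Hello there")))
print(f"first chunk after {first_audio_s * 1000:.1f} ms, {len(received)} chunks total")
```

The point of the design is that time to first audio depends only on the first chunk, not on the length of the full utterance.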
Enhanced Expression and System Stability
Building upon its predecessor, TTS-1.5 delivers approximately a 30 percent increase in expressive range and around 40 percent better stability. Expressiveness covers prosody, emphasis, and emotional nuance, allowing for more natural and engaging conversations. Stability metrics, including word error rate and consistent output across varied and long prompts, have also improved substantially. A lower word error rate reduces common issues such as truncated sentences, unintended word substitutions, and audio artifacts, which is particularly beneficial when the TTS input comes directly from generated language model text.
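Word error rate is conventionally defined as the word-level edit distance between a reference text and a hypothesis, divided by the number of reference words. For TTS, one common approach is to transcribe the synthesized audio and compare the transcript with the input text. The sketch below assumes that setup; the transcript string is hypothetical, and the article does not describe how Inworld measures the metric.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Text sent to the TTS system, and a hypothetical transcript of the audio it produced.
reference = "please confirm your order before checkout"
transcript = "please confirm you order before"

wer = word_error_rate(reference, transcript)
print(f"WER: {wer:.3f}")
```

Here one substituted word ("you" for "your") and one truncated word ("checkout") give two errors over six reference words.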
Cost-Effective for Mass Deployment
TTS-1.5 pricing targets consumer-scale applications and comes in two configurations. TTS-1.5 Mini is priced at $5 per one million characters, equating to roughly $0.005 per minute of spoken audio. TTS-1.5 Max costs $10 per one million characters, or approximately $0.01 per minute. The pricing is intended to keep text-to-speech economically viable for high-usage products, such as AI companions, educational platforms, or customer support lines, rather than turning it into a prohibitive operating expense.
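The per-character and per-minute prices can be reconciled with simple arithmetic: $0.005 per minute at $5 per million characters implies about 1,000 characters per minute of speech. The sketch below uses the article's prices; the dictionary keys, the function name, and the monthly usage figure are hypothetical, and the 1,000 characters per minute rate is an inference from the quoted figures rather than a number stated by Inworld.

```python
# Prices per one million characters, as quoted in the article.
PRICE_PER_MILLION_CHARS_USD = {"mini": 5.0, "max": 10.0}

# Speaking rate implied by the article's per-minute prices (an inference, not a quoted figure).
IMPLIED_CHARS_PER_MINUTE = 1_000


def tts_cost_usd(characters, model):
    """Estimated cost in USD for synthesizing the given number of characters."""
    return characters / 1_000_000 * PRICE_PER_MILLION_CHARS_USD[model]


# Hypothetical workload: 10,000 minutes of generated speech per month.
minutes_per_month = 10_000
characters = minutes_per_month * IMPLIED_CHARS_PER_MINUTE

mini_cost = tts_cost_usd(characters, "mini")
max_cost = tts_cost_usd(characters, "max")
print(f"Mini: ${mini_cost:.2f}/month, Max: ${max_cost:.2f}/month")
```

Under these assumptions, the hypothetical workload costs $50 per month on Mini and $100 per month on Max.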
Extensive Language Support and Voice Cloning
TTS-1.5 supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. This allows a single TTS pipeline to serve diverse global markets.
The system also features both instant and professional voice cloning. Instant voice cloning can generate a custom voice from merely 15 seconds of audio, accessible directly through Inworld’s portal and API. For branded voices or less common accents, professional voice cloning requires a minimum of 30 minutes of clean audio, with 20 minutes or more recommended for optimal results.
Flexible Deployment and Integration
TTS-1.5 is available as a cloud API and as an on-premise solution. The on-premise option runs the full model within a customer's own infrastructure, addressing data sovereignty and compliance requirements. Both deployment options offer the same quality profile. The models are also designed to integrate with partner platforms such as LiveKit, Pipecat, and Vapi to support end-to-end voice agent stacks.
Source: MarkTechPost