Meta introduces Voicebox, a first for generative AI speech models

Meta AI researchers have taken a step forward in generative AI for speech with the development of Voicebox. Unlike previous models, Voicebox can generalize to speech-generation tasks it was not specifically trained on, while still delivering state-of-the-art performance.

Voicebox is a versatile generative system for speech that can produce high-quality audio clips in a wide variety of styles. It can create outputs from scratch or modify existing samples. The model supports speech synthesis in six languages, as well as noise removal, content editing, style conversion, and diverse sample generation.

Traditionally, generative AI models for speech required specific training for each task using carefully prepared training data. Voicebox is instead built on a newer approach called Flow Matching, which has been shown to outperform diffusion models. On English text-to-speech, Voicebox outperforms the previous state-of-the-art model, VALL-E, on both word error rate (1.9% vs. 5.9% for VALL-E; lower is better) and audio similarity (0.681 vs. 0.580 for VALL-E; higher is better), while also being up to 20 times faster. In cross-lingual style transfer, Voicebox surpasses YourTTS, reducing the word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481.
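
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate (WER) is commonly computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are hypothetical, and the article does not describe Meta's actual evaluation pipeline, so this illustrates the metric rather than reproducing their method.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                 # deletion of a reference word
                curr[j - 1] + 1,             # insertion of a hypothesis word
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # Relative WER reductions implied by the figures quoted above.
    print(round(relative_reduction(5.9, 1.9), 3))   # VALL-E -> Voicebox
    print(round(relative_reduction(10.9, 5.2), 3))  # YourTTS -> Voicebox
```

Applied to the figures quoted above, the drop from 5.9% to 1.9% is a relative WER reduction of roughly 68%, and the drop from 10.9% to 5.2% is roughly 52%.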
