Microsoft Unveils VALL-E, A Voice DALL-E

VALL-E can generate varied outputs from the same input text while preserving the speaker’s emotion and the acoustic environment of the acoustic prompt. In the zero-shot scenario, it can synthesise natural speech with high speaker similarity through prompting alone. According to the evaluation results, VALL-E significantly outperforms the most advanced zero-shot TTS system on LibriSpeech and VCTK, setting a new state of the art for zero-shot TTS on both datasets.

Notably, people who have lost their voice could ‘talk’ again through this text-to-speech method, provided they have earlier recordings of their own voice. Two years ago, Stanford University professor Maneesh Agrawala told AIM that his team was working on something similar: they planned to record a patient’s voice before surgery and then use that recording to convert the patient’s electrolarynx voice back into their pre-surgery voice.
