CLSP Fall Seminar Series: Wei-Ning Hsu
Description
Wei-Ning Hsu, a research scientist at Meta's Fundamental AI Research (FAIR) lab, will give a talk titled "Large Scale Universal Speech Generative Models" for the Center for Language and Speech Processing (CLSP).
Abstract:
Large-scale generative models such as ChatGPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high-fidelity text or image outputs but also demonstrate impressive domain and task generalization capabilities. In contrast, audio generative models remain relatively primitive in both scale and generalization.
In this talk, I will start with a brief introduction to conventional neural speech generative models and discuss why they are ill-suited to scaling to Internet-scale data. Next, by reviewing the latest large-scale generative models for text and image, I will outline a few promising approaches for building scalable speech models. Finally, I will present Voicebox, Meta's latest work to advance this area. Voicebox is the most versatile generative model for speech to date. It is trained with a simple task, text-conditioned speech infilling, on over 50,000 hours of multilingual speech with a powerful flow-matching objective. Through in-context learning, Voicebox can perform monolingual/cross-lingual zero-shot text-to-speech, holistic style conversion, transient noise removal, content editing, and diverse sample generation. Moreover, Voicebox achieves state-of-the-art performance and excellent run-time efficiency.
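For readers unfamiliar with the flow-matching objective the abstract mentions, the sketch below illustrates the generic conditional flow-matching training target (an optimal-transport path between Gaussian noise and a data sample, as in Lipman et al.'s flow-matching work). This is a simplified illustration, not Meta's Voicebox implementation; the shapes and the zero-predictor "model" are placeholders.

```python
import numpy as np

def flow_matching_targets(x1, sigma_min=1e-4, rng=None):
    """Sample one conditional flow-matching training pair.

    Given a clean data sample x1, draw a Gaussian noise sample x0 and a
    time t, and return the interpolated point x_t together with the
    target velocity u_t that the network is trained to regress.
    (Generic sketch of the objective, not the Voicebox codebase.)
    """
    rng = rng or np.random.default_rng()
    x0 = rng.standard_normal(x1.shape)   # sample from the Gaussian prior
    t = rng.uniform()                    # training time, uniform in [0, 1]
    # Optimal-transport conditional path: x_t moves linearly from x0 to x1.
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Target velocity field for this path (constant in t).
    ut = x1 - (1.0 - sigma_min) * x0
    return t, xt, ut

# Toy usage: x1 stands in for a batch of speech features (e.g. mel frames).
# A real model would predict u_t from (x_t, t, text); training minimizes
# the mean squared error between that prediction and ut.
x1 = np.ones((2, 4))                     # placeholder "spectrogram" batch
t, xt, ut = flow_matching_targets(x1, rng=np.random.default_rng(0))
loss = np.mean((np.zeros_like(ut) - ut) ** 2)  # loss for a zero predictor
```

At sampling time, one would integrate the learned velocity field from t = 0 (noise) to t = 1 (speech) with an ODE solver, which is where the run-time efficiency the abstract mentions comes from.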
Who can attend?
- Faculty
- Staff
- Students