Computer Science Seminar: Sang Michael Xie
Description
Sang Michael Xie, a computer science doctoral student studying machine learning at Stanford University, will give a talk titled "Data-Distribution-Centric Machine Learning for Generalizable Language Models" for the Department of Computer Science.
Abstract:
High-quality datasets are crucial for improving the capabilities and training efficiency of large language models. However, current datasets are typically prepared in an ad hoc, heuristic way. In this talk, Sang Michael Xie will present principled approaches to improving and understanding language models, centered on the pre-training data distribution. First, he will describe how to improve the efficiency of training multipurpose language models by optimizing the mixture of data sources with robust optimization. Second, he will discuss an efficient importance resampling method for selecting relevant data from trillion-token-scale web datasets to train a specialized model. Finally, he will present a first theoretical analysis of in-context learning, the ability of language models to learn from examples given in a textual prompt, which traces this capability back to the modeling of coherence structure in the pre-training data.
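For readers unfamiliar with importance resampling for data selection, the following is a minimal, illustrative sketch of the general idea, not the speaker's actual method or implementation. It assumes a toy setup where documents are token lists, fits add-one-smoothed unigram models to a target corpus and a raw corpus, weights each raw document by the ratio p_target(x) / p_raw(x), and draws a subset with Gumbel-top-k sampling (all names here are hypothetical):

```python
import math
import random
from collections import Counter

def unigram_logprob(tokens, counts, total, vocab_size):
    # Add-one smoothed log-probability of a token sequence under a unigram model.
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

def importance_resample(raw_docs, target_docs, k, seed=0):
    """Select k documents from raw_docs whose statistics resemble target_docs."""
    vocab = {t for d in raw_docs + target_docs for t in d}
    tgt_counts = Counter(t for d in target_docs for t in d)
    raw_counts = Counter(t for d in raw_docs for t in d)
    tgt_total, raw_total = sum(tgt_counts.values()), sum(raw_counts.values())
    V = len(vocab)
    # Importance weight of each raw document, computed in log space:
    # log w(x) = log p_target(x) - log p_raw(x).
    log_w = [
        unigram_logprob(d, tgt_counts, tgt_total, V)
        - unigram_logprob(d, raw_counts, raw_total, V)
        for d in raw_docs
    ]
    # Gumbel-top-k: adding Gumbel noise to log-weights and taking the top k
    # samples k documents without replacement, proportionally to their weights.
    rng = random.Random(seed)
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in raw_docs]
    keys = [lw + g for lw, g in zip(log_w, gumbels)]
    return sorted(range(len(raw_docs)), key=lambda i: keys[i], reverse=True)[:k]
```

At real web scale, a per-document ratio of two cheap generative models (rather than anything requiring a pass over the full target distribution) is what makes this kind of selection tractable; the toy unigram models above stand in for whatever featurization a production pipeline would use.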
Who can attend?
- Faculty
- Staff
- Students