Computer Science Seminar: Sang Michael Xie

March 25, 2024
12 - 1:15pm EDT
This event is free

Who can attend?

  • Faculty
  • Staff
  • Students

Contact

Toni DeTallo
410-516-8775

Description

Sang Michael Xie, a computer science doctoral student studying machine learning at Stanford University, will give a talk titled "Data-Distribution-Centric Machine Learning for Generalizable Language Models" for the Department of Computer Science.

Abstract:

High-quality datasets are crucial for improving the capabilities and training efficiency of large language models. However, current datasets are typically prepared in an ad hoc, heuristic way. In this talk, Sang Michael Xie will present principled approaches to improving and understanding language models centered on the pre-training data distribution. First, he will describe how to improve the efficiency of training multipurpose language models by optimizing the mixture of data sources with robust optimization. Second, he will discuss an efficient importance resampling method for selecting relevant data from trillion-token-scale web datasets for training a specialized model. Finally, he will introduce a first theoretical analysis of in-context learning, a key capability of language models to learn from examples in a textual prompt, that traces the capability back to modeling coherence structure in the pre-training data.
