Talk title: Unstructured Data Management at Scale for Large Language Models
Speaker: Dong Deng（邓栋）, assistant professor in the Computer Science Department at Rutgers University (USA)
A clear trend in machine learning is that models are becoming larger and larger and are trained on more and more data. For example, both the parameter counts and the training corpora of large language models (LLMs) have grown roughly 1000-fold over the past few years. As a result, the latest LLMs are trained on terabytes of data, which poses significant challenges for data management: even a simple operation on the training data entails a huge amount of computation. Recent studies find that LLMs memorize part of their training data, which raises significant privacy risks. In this talk, we discuss how to quantitatively evaluate LLM memorization behavior. For this purpose, we develop an efficient and scalable near-duplicate sequence search algorithm. Given a query sequence, it finds (almost) all near-duplicate sequences in a TB-scale training corpus. Note that a sequence is a contiguous snippet of a text, so the number of sequences in a text is quadratic in the text length. In addition, we briefly discuss how to mitigate LLM memorization through efficient training data deduplication.
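To make two ideas from the abstract concrete, here is a minimal Python sketch (not the speaker's algorithm, whose details are not given here): first, a text of n tokens contains n(n+1)/2 contiguous sequences, i.e. quadratically many candidates to search; second, a standard way to decide whether two sequences are near-duplicates is Jaccard similarity over their token n-gram sets, a common building block in deduplication pipelines. The tokenization, the n-gram size, and the similarity threshold below are illustrative assumptions.

```python
def count_sequences(n: int) -> int:
    """Number of contiguous sequences (snippets) in a text of n tokens:
    one per (start, end) pair, i.e. n*(n+1)/2 -- quadratic in n."""
    return n * (n + 1) // 2

def ngrams(tokens, n=3):
    """Set of token n-grams of a sequence (n=3 is an arbitrary choice)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity of two token lists over their n-gram sets."""
    x, y = ngrams(a, n), ngrams(b, n)
    return len(x & y) / len(x | y) if x | y else 1.0

# Two sequences differing in a single token: near-duplicates intuitively.
tokens_a = "the quick brown fox jumps over the lazy dog".split()
tokens_b = "the quick brown fox leaps over the lazy dog".split()

print(count_sequences(len(tokens_a)))  # 9 tokens -> 45 sequences
print(jaccard(tokens_a, tokens_b))     # 0.4 (4 shared 3-grams of 10 total)
```

A scalable system cannot afford pairwise Jaccard over quadratically many sequences; this is precisely the gap that an indexed near-duplicate sequence search algorithm, such as the one presented in the talk, is designed to close.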
Dong Deng（邓栋） is an assistant professor in the Computer Science Department at Rutgers University. His research interests include large-scale data management, data science, database systems, and data curation. Before joining Rutgers, he was a postdoc in the Database Group at MIT, where he worked with Mike Stonebraker and Sam Madden on data curation systems. He received his Ph.D. with honors from Tsinghua University. He has published over 30 research papers at top database venues, mainly SIGMOD, VLDB, and ICDE. According to Google Scholar, his publications have received over 2000 citations.