Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01c247dw48x
Title: Selecting Language Model Training Data Using Deep Textual Properties
Authors: Gupta, Aatmik
Advisors: Chen, Danqi
Department: Computer Science
Class Year: 2024
Publisher: Princeton, NJ : Princeton University
Abstract: The efficacy of language models depends on the quality of their pre-training data, yet many existing selection methodologies are combinations of rudimentary heuristics. We present a novel method for data selection that focuses on deep textual properties that align with human perceptual judgments. We focus on four such properties and design a method to identify them in text data at scale, using a state-of-the-art large language model (GPT-3.5 Turbo) as an annotator. We then use the annotated data to train a relatively lightweight 1.3B-parameter model that assigns quality ratings to documents. This allows us to cheaply and reliably rate a large corpus of text along these four dimensions, which encapsulate aspects of text quality that require a deep understanding of the text and are difficult to capture with simpler methods. We use these ratings to select a subset of documents for training language models and show that models trained on this subset outperform models trained on randomly selected data. We also extensively analyze the quality ratings and discuss how they are distributed across different domains, topics, languages, identities, geographies, and social roles.
URI: http://arks.princeton.edu/ark:/88435/dsp01c247dw48x
Type of Material: Academic dissertations (M.S.E.)
Language: en
Appears in Collections: Computer Science, 2023
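The abstract above outlines a rate-then-select pipeline: an LLM annotates documents along four quality dimensions, a lightweight 1.3B-parameter model is trained on those annotations to rate documents cheaply, and the highest-rated documents are kept for pre-training. Below is a minimal Python sketch of the selection step only, assuming a hypothetical rater with a `score(text, dimension)` interface; the dimension names, the `DummyRater` stand-in, and the keep fraction are all placeholders, not the thesis's actual models, prompts, or thresholds.

```python
# Minimal sketch of the rate-then-select step described in the abstract.
# Everything here is illustrative: dimension names, the DummyRater
# stand-in, and keep_fraction are assumptions, not the thesis's code.

from typing import Dict, List

# Placeholder labels; the thesis rates four deep textual properties,
# but their names are not given in this record.
QUALITY_DIMENSIONS = ["property_1", "property_2", "property_3", "property_4"]


class DummyRater:
    """Stand-in for the lightweight 1.3B-parameter rater model."""

    def score(self, text: str, dimension: str) -> float:
        # Trivial length-based heuristic, used only so this sketch runs.
        return min(len(text) / 1000.0, 1.0)


def rate_document(rater: DummyRater, text: str) -> Dict[str, float]:
    """Assign a quality rating per dimension to one document."""
    return {dim: rater.score(text, dim) for dim in QUALITY_DIMENSIONS}


def select_subset(
    rater: DummyRater, corpus: List[str], keep_fraction: float = 0.5
) -> List[str]:
    """Keep the top-rated fraction of the corpus, ranked by mean rating."""
    ranked = sorted(
        corpus,
        key=lambda doc: sum(rate_document(rater, doc).values())
        / len(QUALITY_DIMENSIONS),
        reverse=True,
    )
    return ranked[: max(1, int(len(ranked) * keep_fraction))]


if __name__ == "__main__":
    docs = ["short snippet", "a much longer, more substantive document " * 20]
    print(select_subset(DummyRater(), docs, keep_fraction=0.5))
```

In the actual pipeline, `DummyRater` would be replaced by the trained 1.3B-parameter rater, which is what makes scoring a web-scale corpus affordable relative to querying GPT-3.5 Turbo on every document.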
Files in This Item:
File | Description | Size | Format
---|---|---|---
Gupta_princeton_0181G_15029.pdf | | 7.94 MB | Adobe PDF