Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01kk91fp94z
Title: Towards Optimizing In-Context Learning and Ensuring Data Transparency in Large Language Models
Authors: Ajith, Anirudh
Advisors: Narasimhan, Karthik
Department: Computer Science
Class Year: 2024
Publisher: Princeton, NJ : Princeton University
Abstract: Contemporary large language models (LLMs) are trained on enormous text corpora sourced from books and the internet. Although in-context learning (ICL) enables them to show strong performance on a variety of tasks, these models remain highly sensitive to the precise details of their prompts. Instructions, a key component of these prompts, have remained understudied in the literature thus far. In the first part of this thesis, we develop InstructEval, an evaluation suite for the systematic evaluation of instruction selection methods, and use it to comprehensively evaluate 7 popular instruction selection methods. Our experiments reveal that curated manually-written instructions, or even simple instructions without any task-specific descriptions, often elicit better overall ICL performance than automatic instruction-induction methods, pointing to a lack of generalizability among the latter. Since leakage of ICL benchmark test data into pretraining corpora can compromise benchmark evaluations, the second part of this thesis studies the pretraining data detection problem: given a piece of text and black-box access to an LLM, can we determine whether the model was pretrained on that text? We introduce the WikiMIA benchmark to facilitate this study and propose a new detection method, Min-K% Prob, which, unlike prior work, requires no reference models, no additional training, and no knowledge of a model's pretraining data distribution. In addition to showing that Min-K% Prob outperforms baselines on WikiMIA, we also demonstrate its utility for detecting the leakage of benchmark data and copyrighted content. (An illustrative sketch of the Min-K% Prob scoring step appears after the item metadata below.)
URI: http://arks.princeton.edu/ark:/88435/dsp01kk91fp94z
Type of Material: Academic dissertations (M.S.E.)
Language: en
Appears in Collections: Computer Science, 2023
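
Illustrative note (not part of the thesis record): the abstract names Min-K% Prob only at a high level. As a rough, non-authoritative sketch of the idea, assuming the standard formulation in which a text is scored by the average log-probability of its k% least-likely tokens under the model, the Python snippet below uses a small Hugging Face causal LM. The model name, k value, and example passage are placeholders rather than details taken from the thesis.

# Minimal sketch of a Min-K% Prob-style membership score.
# Assumptions: a Hugging Face causal LM; "gpt2", k=0.2, and the example
# passage are placeholders, not settings from the thesis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob_score(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-probability of the k% least-likely tokens in `text`."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits          # (1, seq_len, vocab)
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the k% lowest-probability tokens and average them.
    n_keep = max(1, int(len(token_log_probs) * k))
    lowest = torch.topk(token_log_probs, n_keep, largest=False).values
    return lowest.mean().item()

if __name__ == "__main__":
    name = "gpt2"  # placeholder model; the thesis evaluates larger LLMs
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name).eval()
    score = min_k_prob_score("Example passage to test for pretraining membership.", lm, tok)
    print(f"Min-K% Prob score: {score:.3f}")

Under this scoring, a passage seen during pretraining tends to contain fewer surprisingly low-probability tokens, so a higher (less negative) score suggests membership; in practice a decision threshold would be calibrated on a benchmark such as WikiMIA.
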

Files in This Item:
File: Ajith_princeton_0181G_15049.pdf    Size: 1.04 MB    Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.