Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01fq977z13z
Title: Retrieving Diverse Data With Large Language Models Using The Vendi Score
Authors: Crnkovic-Rubsamen, Ava
Advisors: Dieng, Adji Bousso
Department: Computer Science
Certificate Program: Robotics & Intelligent Systems Program
Center for Statistics and Machine Learning
Class Year: 2024
Abstract: In building a search engine, we aim to return top-ranked results that satisfy as many users as possible. Similarly, when building a chatbot, we want it to answer every question correctly, even when the question is ambiguous or vague. Doing so requires an information retrieval system that collects a set of documents relevant to the query. A large language model then uses these documents to produce an answer for the chatbot user. A traditional information retrieval system may return several top-ranked documents that cover the same piece of information. It thus fails both to address query ambiguity and to meet diverse users' needs, and it risks bias and unfairness. To overcome these challenges, diverse information retrieval systems re-rank the returned documents, prioritizing the diversity of results. In this thesis, I propose applying the Vendi Score diversity metric to embedding-based information retrieval. First, I use a large language model to embed a corpus of documents. Then, I embed a query and retrieve an initial collection of documents via nearest-neighbor search in the embedding vector space. From this set, I use the Vendi Score to iteratively choose top-ranked results, building a final document list that balances query relevance with information diversity. I test my proposed method on the BEIR benchmark datasets, which cover a variety of information retrieval applications including fact checking, citation prediction, and question answering. I compare my method, Vendi Ranking, to baseline similarity search and to the Maximal Marginal Relevance (MMR) method. Preliminary experiments show promise for this application of the Vendi Score metric.
Type of Material: Princeton University Senior Theses
Language: en
Appears in Collections: Computer Science, 1987-2024
Robotics and Intelligent Systems Program
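
The re-ranking pipeline outlined in the abstract can be illustrated with a minimal sketch. The Vendi Score itself is computed as the exponential of the Shannon entropy of the eigenvalues of the similarity matrix divided by the number of items. Everything else here is an assumption made for illustration and is not taken from the thesis: the cosine-similarity kernel, the greedy selection loop, the hypothetical function names `vendi_score` and `vendi_rerank`, and the `weight` parameter that trades relevance against diversity.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def vendi_score(embeddings):
    """Vendi Score of unit-norm row vectors under a cosine-similarity kernel.

    Ranges from 1 (all items identical) to n (all items mutually orthogonal).
    """
    x = np.asarray(embeddings, dtype=float)
    n = x.shape[0]
    kernel = x @ x.T                        # K[i, j] = cosine similarity, K[i, i] = 1
    eigvals = np.linalg.eigvalsh(kernel / n)
    eigvals = eigvals[eigvals > 1e-12]      # treat 0 * log(0) as 0
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


def vendi_rerank(query, docs, k, n_candidates=None, weight=0.5):
    """Greedy re-ranking sketch (assumed scheme, not the thesis's exact method).

    weight = 0 reproduces plain similarity ranking; larger values favor diversity.
    """
    q = normalize(query)
    d = normalize(docs)
    relevance = d @ q                       # cosine similarity to the query
    limit = len(d) if n_candidates is None else min(n_candidates, len(d))
    # Initial candidate pool: nearest neighbors of the query, most similar first.
    candidates = list(np.argsort(-relevance)[:limit])
    selected = [candidates.pop(0)]          # always start with the closest document

    while candidates and len(selected) < k:
        def gain(i):
            # Vendi Score of the would-be selection, divided by its size so the
            # diversity term lies in (0, 1], the same scale as cosine relevance.
            diversity = vendi_score(d[selected + [i]]) / (len(selected) + 1)
            return (1 - weight) * relevance[i] + weight * diversity

        best = max(candidates, key=gain)
        candidates.remove(best)
        selected.append(best)
    return [int(i) for i in selected]


if __name__ == "__main__":
    # Toy corpus: documents 0 and 1 are near-duplicates, document 2 differs.
    docs = np.array([[1.0, 0.0], [1.0, 0.01], [0.6, 0.8]])
    query = np.array([1.0, 0.3])
    print(vendi_rerank(query, docs, k=2, weight=0.0))
    print(vendi_rerank(query, docs, k=2, weight=0.5))
```

In this toy example, setting `weight=0.0` returns the two near-duplicate documents, while `weight=0.5` replaces the second one with the dissimilar document. Recomputing the eigendecomposition for every candidate is costly, so a sketch like this only suits a small candidate pool.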

Files in This Item:
File: CRNKOVIC-RUBSAMEN-AVA-THESIS.pdf
Size: 453.25 kB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.