Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01fq977z13z
Title: | Retrieving Diverse Data With Large Language Models Using The Vendi Score |
Authors: | Crnkovic-Rubsamen, Ava |
Advisors: | Dieng, Adji Bousso |
Department: | Computer Science |
Certificate Program: | Robotics & Intelligent Systems Program; Center for Statistics and Machine Learning |
Class Year: | 2024 |
Abstract: | In building a search engine, we aim to return top-ranked results that satisfy as many users as possible. Similarly, when building a chatbot, we want it to answer every question correctly, even when the question is ambiguous or vague. This requires an information retrieval system that collects a set of documents relevant to the query. A large language model then uses these documents to produce an answer for the chatbot user. A traditional information retrieval system may return several top-ranked documents that cover the same piece of information. Such a system fails to address query ambiguity, fails to meet the needs of diverse users, and risks bias and unfairness. To overcome these challenges, diverse information retrieval systems re-rank the returned documents to prioritize the diversity of results. In this thesis, I propose applying the Vendi Score diversity metric to embedding-based information retrieval. First, I use a large language model to embed a corpus of documents. Then, I embed a query and return an initial collection of documents via nearest neighbor search in the embedding vector space. From this set, I use the Vendi Score to iteratively choose top-ranked results, building a final document list that balances query relevance with information diversity. I test my proposed method, Vendi Ranking, on the BEIR datasets, which cover a variety of information retrieval applications including fact-checking, citation prediction, and question answering. I compare Vendi Ranking to the baseline similarity search and to the Maximal Marginal Relevance (MMR) method. Preliminary experiments show promise for this application of the Vendi Score metric. |
URI: | http://arks.princeton.edu/ark:/88435/dsp01fq977z13z |
Type of Material: | Princeton University Senior Theses |
Language: | en |
Appears in Collections: | Computer Science, 1987-2024; Robotics and Intelligent Systems Program |
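
The retrieve-then-re-rank procedure described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `vendi_score` and `greedy_diverse_rerank`, the `trade_off` parameter, the cosine-similarity kernel, and the way relevance and diversity are combined are all assumptions made for illustration. The sketch also assumes the query and document embeddings have already been computed and that the candidate set comes from a prior nearest neighbor search. The only part taken from the published definition is the Vendi Score itself, the exponential of the Shannon entropy of the eigenvalues of the similarity matrix divided by the number of items.

```python
import numpy as np


def vendi_score(K):
    """Vendi Score of n items given their n x n similarity matrix K.

    K is assumed symmetric positive semidefinite with ones on the diagonal.
    The score ranges from 1 (all items identical) to n (all items dissimilar).
    """
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop zeros, since 0 * log(0) is taken as 0
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


def greedy_diverse_rerank(query_vec, doc_vecs, k, trade_off=0.5):
    """Hypothetical greedy re-ranker (an assumption, not the thesis's algorithm).

    At each step, add the candidate that maximizes
        trade_off * cosine(query, doc)
        + (1 - trade_off) * vendi_score(selected + doc) / set_size.
    Returns the indices of the chosen documents in selection order.
    """
    # Normalize rows so that dot products are cosine similarities.
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    relevance = docs @ query

    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        best, best_val = None, -np.inf
        for i in remaining:
            cand = docs[selected + [i]]
            K = cand @ cand.T  # cosine-similarity kernel of the trial set
            diversity = vendi_score(K) / len(cand)  # scaled into (0, 1]
            val = trade_off * relevance[i] + (1 - trade_off) * diversity
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `trade_off=1.0` the sketch reduces to plain similarity ranking, and lowering `trade_off` increasingly favors candidates that raise the Vendi Score of the selected set. This weighting loosely mirrors the relevance and diversity trade-off used by MMR and is chosen here only to make the comparison concrete.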
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
CRNKOVIC-RUBSAMEN-AVA-THESIS.pdf | | 453.25 kB | Adobe PDF | Request a copy |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.