Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01z603r175h
Title: | Understanding the Role of Data in Model Decisions |
Authors: | Gupta, Arushi |
Advisors: | Arora, Sanjeev |
Contributors: | Computer Science Department |
Subjects: | Artificial intelligence |
Issue Date: | 2024 |
Publisher: | Princeton, NJ : Princeton University |
Abstract: | As neural networks are increasingly employed in high-stakes applications such as criminal justice and medicine [1], it becomes increasingly important to understand why these models make the decisions they do. For example, it is important to develop tools to analyze whether models are perpetuating, in their future decision making, harmful demographic inequalities they have found in their training data [2]. However, neural networks typically require large training sets, have “black-box” decision making, and have costly retraining protocols, increasing the difficulty of this problem. This work considers three questions. Q1) What is the relationship between the elements of an input and the model’s decision? Q2) What is the relationship between the individual training points and the model’s decision? And finally, Q3) To what extent do there exist (efficient) approximations that would allow practitioners to predict how model performance would change given different training data or a different training protocol? Part I addresses Q1 for masking saliency methods. These methods implicitly assume that grey pixels in an image are “uninformative.” We find experimentally that this assumption may not always be true, and define “soundness,” which measures a desirable property of a saliency map. Part II addresses Q2 and Q3 in the context of influence functions, which aim to approximate the effect of removing a training point on the model’s decision. We use harmonic analysis to examine a particular type of influence method, namely datamodels, and find that there is a relationship between the coefficients of the datamodel and the Fourier coefficients of the target function. Finally, Part III addresses Q3 in the context of test data. First, we assess whether held-out test data is necessary to approximate the outer loop of meta-learning, or whether recycling training data constitutes a sufficient approximation. We find that held-out test data is important, as using it leads to learned representations that are low rank. Then, inspired by the PGDL competition [3], we investigate whether GAN-generated data, despite well-known limitations, can be used to approximate generalization performance when no test or validation set is available, and find that it can. |
URI: | http://arks.princeton.edu/ark:/88435/dsp01z603r175h |
Type of Material: | Academic dissertations (Ph.D.) |
Language: | en |
Appears in Collections: | Computer Science |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Gupta_princeton_0181D_14897.pdf | | 7.08 MB | Adobe PDF | View/Download |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.