Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01r494vp477
Title: | Algorithmic Detection of Label Errors in ImageNet via Clustering of Loss Curves |
Authors: | El-Habr, Mason |
Advisors: | Jha, Niraj |
Department: | Electrical and Computer Engineering |
Certificate Program: | Center for Statistics and Machine Learning |
Class Year: | 2023 |
Abstract: | The ever-continuing growth of machine learning models has necessitated the expansion of labeled machine learning datasets, but completely accurate human labeling is costly. Those seeking to produce very large datasets have opted for more scalable labeling procedures, such as crowdsourcing or algorithmic evaluation, but these procedures are known to produce a small but non-negligible number of errors. In response, many have developed algorithmic approaches to detect label errors in machine learning datasets. Clustering Training Losses (CTRL) is one such approach: it records the loss trajectory of each sample during training and then clusters those trajectories into “noisy” and “clean” categories. When CTRL was used to prune out bad samples, the resulting models achieved state-of-the-art accuracy on CIFAR10, CIFAR100, and selected tabular datasets. While CTRL has proven quite effective when tested against simulated noise in small datasets, it has not yet been applied to datasets with real-world noise or to very large datasets. In this study, we applied the CTRL technique to the ILSVRC2012 computer vision dataset (ImageNet). After evaluating 198 dataset masks, we selected a best mask for re-training and achieved top-1 validation accuracy improvements of 0.316 percentage points on ResNet50 and 0.408 percentage points on ResNet101. We also used the mask to evaluate the impact of label noise on model architecture. Finally, we created a new set of labels for the flagged images, which achieves a top-1 validation accuracy improvement of 0.062 percentage points. |
URI: | http://arks.princeton.edu/ark:/88435/dsp01r494vp477 |
Type of Material: | Princeton University Senior Theses |
Language: | en |
Appears in Collections: | Electrical and Computer Engineering, 1932-2023 |
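The core idea in the abstract, recording each sample's loss trajectory and splitting the trajectories into "clean" and "noisy" groups, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say which clustering algorithm or loss window CTRL uses, so the two-cluster k-means below, the function name `flag_noisy_samples`, and the toy loss values are all assumptions made for illustration.

```python
from typing import List, Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flag_noisy_samples(trajectories: List[List[float]], iters: int = 20) -> List[bool]:
    """Two-cluster k-means over per-sample loss trajectories (hypothetical helper).

    Returns one bool per sample: True if it falls in the higher-loss cluster.
    """
    # Deterministic init: the lowest- and highest-mean-loss trajectories.
    ordered = sorted(trajectories, key=_mean)
    centroids = [list(ordered[0]), list(ordered[-1])]

    assignment = [0] * len(trajectories)
    for _ in range(iters):
        # Assignment step: nearest centroid in squared Euclidean distance.
        new_assignment = [
            0 if _sq_dist(t, centroids[0]) <= _sq_dist(t, centroids[1]) else 1
            for t in trajectories
        ]
        # Update step: each centroid becomes the per-epoch mean of its members.
        for k in (0, 1):
            members = [t for t, a in zip(trajectories, new_assignment) if a == k]
            if members:
                centroids[k] = [_mean(col) for col in zip(*members)]
        if new_assignment == assignment:
            break
        assignment = new_assignment

    # Treat the cluster whose centroid has the higher mean loss as "noisy".
    noisy = 0 if _mean(centroids[0]) > _mean(centroids[1]) else 1
    return [a == noisy for a in assignment]


# Toy trajectories (made-up values): one row per sample, one column per epoch.
losses = [
    [2.3, 1.1, 0.4, 0.1],  # loss falls quickly
    [2.2, 1.0, 0.5, 0.2],
    [2.4, 2.3, 2.1, 2.0],  # loss stays high
    [2.3, 1.2, 0.3, 0.1],
    [2.5, 2.4, 2.2, 1.9],
]
flags = flag_noisy_samples(losses)
print(flags)
```

In this toy setting the flags play the role of a dataset mask: samples marked `True` would be pruned or relabeled before re-training. Samples whose loss stays high throughout training are the ones grouped into the flagged cluster here.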
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
EL-HABR-MASON-THESIS.pdf | | 1 MB | Adobe PDF | Request a copy |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.