Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01r494vp477
Title: Algorithmic Detection of Label Errors in ImageNet via Clustering of Loss Curves
Authors: El-Habr, Mason
Advisors: Jha, Niraj
Department: Electrical and Computer Engineering
Certificate Program: Center for Statistics and Machine Learning
Class Year: 2023
Abstract: The ever-continuing growth of machine learning models has necessitated the expansion of labeled machine learning datasets, but completely accurate human labeling is costly. Those seeking to produce very large datasets have opted for more scalable labeling procedures, such as crowdsourcing or algorithmic evaluation, but these procedures are known to produce a small but non-negligible number of errors. In response, many have developed algorithmic approaches to detect label errors in machine learning datasets. Clustering Training Losses (CTRL) is one such approach: it records the loss trajectory of each sample during training, and then clusters those trajectories into “noisy” and “clean” categories. When CTRL was used to prune bad samples, the resulting models achieved state-of-the-art accuracy on CIFAR10, CIFAR100, and selected tabular datasets. While CTRL has proven quite effective when tested against simulated noise in small datasets, it has not yet been applied to datasets with real-world noise, or to very large datasets. In this study, we applied the CTRL technique to the ILSVRC2012 computer vision dataset (ImageNet). After evaluating 198 dataset masks, we selected a best mask for re-training and achieved top-1 validation accuracy improvements of 0.316 percentage points on ResNet50 and 0.408 percentage points on ResNet101. We also utilized the mask to evaluate the impact of label noise on model architecture. Finally, we created a new set of labels for the flagged images, which achieves a top-1 validation accuracy improvement of 0.062 percentage points.
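The core idea in the abstract, recording each sample's loss over training and clustering the trajectories into "clean" and "noisy" groups, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `two_means` and `flag_noisy`, the use of plain two-cluster k-means, the rule that the cluster with the higher average loss is the "noisy" one, and the synthetic trajectories are all assumptions made for illustration.

```python
import random


def two_means(trajectories, iters=50):
    """Split loss trajectories into two clusters with plain k-means (k=2)."""
    # Deterministic init: the trajectories with the lowest and highest mean loss.
    ordered = sorted(trajectories, key=lambda t: sum(t) / len(t))
    centers = [list(ordered[0]), list(ordered[-1])]
    labels = [0] * len(trajectories)
    for _ in range(iters):
        # Assign each trajectory to the nearest center (squared Euclidean distance).
        new_labels = [
            min((0, 1), key=lambda c: sum((x - m) ** 2 for x, m in zip(t, centers[c])))
            for t in trajectories
        ]
        # Recompute each center as the per-epoch mean of its members.
        for c in (0, 1):
            members = [t for t, lab in zip(trajectories, new_labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


def flag_noisy(trajectories):
    """Return a boolean mask: True for samples in the higher-loss cluster."""
    labels, centers = two_means(trajectories)
    # Assumption: the cluster whose center has the higher average loss is "noisy".
    noisy = max((0, 1), key=lambda c: sum(centers[c]) / len(centers[c]))
    return [lab == noisy for lab in labels]


# Synthetic data (hypothetical): clean samples are fit quickly, so their loss
# decays fast; mislabeled samples keep a high loss for most of training.
random.seed(0)
epochs = 10
clean = [[2.0 * 0.6 ** e + random.uniform(0, 0.05) for e in range(epochs)] for _ in range(8)]
noisy = [[2.0 * 0.95 ** e + random.uniform(0, 0.05) for e in range(epochs)] for _ in range(2)]
mask = flag_noisy(clean + noisy)
print(mask)
```

In a real training run the trajectories would come from logging the per-sample loss at every epoch, and the resulting mask would be used to drop or relabel the flagged samples before re-training.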
URI: http://arks.princeton.edu/ark:/88435/dsp01r494vp477
Type of Material: Princeton University Senior Theses
Language: en
Appears in Collections: Electrical and Computer Engineering, 1932-2023

Files in This Item:
File: EL-HABR-MASON-THESIS.pdf
Size: 1 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.