ML-Based Misinformation Detection in Podcasts

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01vd66w3092

Title:	ML-Based Misinformation Detection in Podcasts
Authors:	Han, Cindy
Advisors:	Mayer, Jonathan
Department:	Computer Science
Class Year:	2022
Abstract:	In recent years, the ubiquity of misinformation has spurred a lot of research around using machine learning to classify text-based misinformation in news articles and social media posts. However, there has been little prior work on classifying audio-based misinformation such as podcasts, despite the large quantity of misinformation they facilitate. Using the Spotify Podcast Dataset, I compile a new dataset (PAWcast) containing transcript snippets, their misinformation labels, and other metadata. Using this PAWcast dataset, in addition to five other textual misinformation datasets, I train two state-of-the-art classifiers, one based on LIWC features and one using the BERT model. I then design a new ML classifier (TIGER) that finds a balance between training on the combined datasets and training on a single dataset. The TIGER model achieves a 74% F1 score on the PAWcast dataset (both with and without podcast-specific features). It also achieves a 75% average F1 score across all the datasets, which matches or exceeds the existing state-of-the-art models on each dataset.
URI:	http://arks.princeton.edu/ark:/88435/dsp01vd66w3092
Type of Material:	Princeton University Senior Theses
Language:	en
Appears in Collections:	Computer Science, 1987-2024

Files in This Item:

File	Size	Format
HAN-CINDY-THESIS.pdf	1.3 MB	Adobe PDF
