Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01rx913t091
Title: Interpreting and Expanding Censorship Data using Latent Dirichlet Allocation (LDA)
Authors: Goenka, Manan
Advisors: Mittal, Prateek
Department: Computer Science
Class Year: 2022
Abstract: Governments around the world are increasingly employing censorship mechanisms to manipulate their populations and to crush political dissent. Although prior works have proposed methods of using network features to identify such censorship measures, they have not used modern NLP techniques to analyze the types of content being blocked. In this work, we use a state-of-the-art topic modeling technique called LDA to identify the key subjects being censored in five different countries: China, Russia, Iran, Turkey and India. Using country-specific lists of censored websites from Citizen Lab, we discover topics relating to the violation of human rights, the COVID-19 pandemic, online gambling, etc., and analyze them in the context of each country's political and legal climate. We also present the design of a dashboard to showcase the topics being censored by various governments in real time. In addition, we introduce a method of generating topic-based features for individual web pages and propose the design of an ML system that could use such features in conjunction with network features to detect censorship. These methods of identifying censorship using both network and topic-based features rely on lists of censored web pages provided by various third-party services. Such lists are often curated manually by experts and are static in nature. Researchers have proposed automated methods of expanding these lists using phrases generated by an NLP technique called TF-IDF. We build upon these works by illustrating how the topics produced by our LDA model could be used to drive the generation of new sets of censored websites. We also put forward a new metric that evaluates the ability of such list expansion methods to predict future censorship and show that our approach based on topic modeling outperforms the TF-IDF method on both existing metrics and our proposed mode of evaluation.
URI: http://arks.princeton.edu/ark:/88435/dsp01rx913t091
Type of Material: Princeton University Senior Theses
Language: en
Appears in Collections: Computer Science, 1987-2023

Files in This Item:
File: GOENKA-MANAN-THESIS.pdf
Size: 1.09 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.