Authors: Li, Shuyang
Advisors: Li, Xiaoyan
Department: Operations Research and Financial Engineering
Class Year: 2016
Abstract: Understanding sentiment is an important task in natural language processing. In this paper we investigate the use of rich features to extend the bag-of-words model for sentiment analysis using machine learning in the movie review domain. We focus on the areas of subjectivity analysis, negation handling, and aggregate document features, and we investigate three ensemble methods and four individual classifiers. Our experimental results show that AdaBoost performs best among all classifiers on the simple unigram feature set, while the Maximum Entropy classifier provides the best performance on our enhanced feature sets. Stochastic Gradient Descent is nearly as accurate as AdaBoost and significantly faster. We also examine 128 commonly misclassified reviews and identify additional challenges to NLP in the movie review domain. We have been able to increase classifier performance through the addition of aggregate document polarity and purity features and summary sentence features based on manual subjectivity and summary sentence extraction. From this, we see potential to improve classification accuracy through improved automatic subjectivity analysis methods and summarization. Additional gains may be made by using a domain-specific polarity lexicon to generate aggregate features. We created a manually labeled set of subjective and summary sentences for each review in our corpus. This may serve as a useful benchmark dataset for future work in subjectivity analysis. Using the manually labeled corpus solely to restrict the feature space reduces classifier performance, while using it as a base to generate aggregate features improves accuracy. We also see that using manual subjectivity analysis for both feature restriction and aggregate feature generation further improves classification performance. This suggests that subjectivity analysis is useful for generating rich features as well as for feature space restriction.
Extent: 86 pages
Type of Material: Princeton University Senior Theses
Language: en_US
Appears in Collections: Operations Research and Financial Engineering, 2000-2016

Files in This Item:
File: Li_Shuyang_final_thesis.pdf
Size: 2.37 MB
Format: Adobe PDF

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.