Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01kp78gk577
Title: | Topic Modeling and Reading Age Prediction on Children’s Book Descriptions |
Authors: | Momataz, Khandaker |
Advisors: | Fellbaum, Christiane |
Department: | Computer Science |
Class Year: | 2022 |
Abstract: | There is a growing demand for digital content and information across various fields [1]. The proliferation of digital data may come with challenges that can be remedied using applications of machine learning and artificial intelligence, such as text mining. For example, the data may come in an unstructured format, requiring human intervention to analyze and classify it. However, this can be an exhausting and expensive task. Machine learning may allow us to complete these tasks with efficiency and speed, at a lower cost, while also allowing us to learn more about fields such as children’s books. It may also allow us to answer questions such as: What are children’s books about? Do the themes and topics in these books change over different age groups? What can they imply about society and how children are taught to think and explore the world around them? This paper explores children’s book descriptions using two approaches. We utilize Latent Dirichlet Allocation for topic modeling on a set of 1,779 children’s book descriptions scraped from the internet to uncover hidden topics. Further, we hypothesize that books for certain age groups will contain semantic and thematic similarity that will be evident in the types and tokens present in descriptions. This motivates us to build accurate classifiers that can classify books by age based on the features in the descriptions alone. We found that all of our classifiers performed well, yielding accuracies over 80% on average. Multinomial Naive Bayes on uni-grams achieved the highest average accuracy (84%) during cross-validation. Moreover, we created topic models for the overall data set, a sub-data set of the books with reading ages ranging from 0 to 7, and a sub-data set of the books with reading ages ranging from 8 to 15, with coherence values of 0.4190, 0.4203, and 0.4504, respectively. 
This project and its techniques can benefit parents, children, libraries, authors, editors, and publishers by informing applications in sorting, classifying, and recommending books. |
URI: | http://arks.princeton.edu/ark:/88435/dsp01kp78gk577 |
Type of Material: | Princeton University Senior Theses |
Language: | en |
Appears in Collections: | Computer Science, 1987-2024 |
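The abstract reports that Multinomial Naive Bayes on uni-grams, evaluated with cross-validation, gave the best average accuracy for reading-age classification. The sketch below illustrates that general technique in plain Python; it is not the thesis code. The toy descriptions, the two age-band labels, the whitespace tokenizer, the add-one (Laplace) smoothing, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: invented descriptions, not the thesis corpus.
DOCS = [
    "a bedtime picture book about a sleepy bunny",
    "count the colorful animals on the farm",
    "a picture book of animals and colors",
    "a young wizard battles a dark mystery at school",
    "a teen detective solves a mystery in the city",
    "friends face a dark adventure at summer school",
]
LABELS = ["0-7", "0-7", "0-7", "8-15", "8-15", "8-15"]


def tokenize(text):
    """Split a description into lowercase uni-grams (assumed tokenizer)."""
    return text.lower().split()


def train_mnb(docs, labels, alpha=1.0):
    """Collect per-class token counts for a multinomial Naive Bayes model."""
    token_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": Counter(labels),
        "token_counts": token_counts,
        "vocab": vocab,
        "alpha": alpha,  # additive smoothing strength (assumed value)
        "n_docs": len(docs),
    }


def predict_mnb(model, doc):
    """Return the class with the highest smoothed log-posterior."""
    alpha = model["alpha"]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_class in model["class_counts"].items():
        counts = model["token_counts"][label]
        total = sum(counts.values())
        score = math.log(n_class / model["n_docs"])  # log prior
        for tok in tokenize(doc):
            if tok not in model["vocab"]:
                continue  # ignore tokens never seen in training
            score += math.log((counts[tok] + alpha) / (total + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=3):
    """Average accuracy over k folds, assigning every k-th item to a fold."""
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_mnb(train_docs, train_labels)
        correct = sum(predict_mnb(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


if __name__ == "__main__":
    model = train_mnb(DOCS, LABELS)
    print(predict_mnb(model, "a picture book about farm animals"))
    print(cross_validate(DOCS, LABELS, k=3))
```

On a corpus the size reported in the abstract, a library implementation with a proper tokenizer and stratified folds would likely be preferable; the sketch only shows how uni-gram counts, smoothing, and fold-wise evaluation fit together.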
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
MOMATAZ-KHANDAKER-THESIS.pdf | | 504.29 kB | Adobe PDF | Request a copy |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.