Authors: Velasco II, Alfredo
Advisors: Singh, Mona
Department: Computer Science
Class Year: 2023
Publisher: Princeton, NJ : Princeton University
Abstract: Cancer is a disease caused by somatic alterations that result in increased cellular growth rates. Many of these alterations are mutations that occur within protein-coding regions of the genome. Cancerous cells typically also contain numerous mutations that do not contribute to cancer initiation or progression. Thus, there is a need to develop computational methods that can differentiate between mutations that are relevant for cancer (called driver mutations) and mutations that are not (called passenger mutations). This thesis considers different machine learning (ML) methods as well as different feature sets with the goal of predicting cancer driver mutations. One approach to feature generation uses protein language models, which produce high-dimensional representations of amino acids that capture the context in which they appear within protein sequences. Another, more familiar approach is to use the physicochemical properties of amino acids, such as their hydrophobicity, together with whether they are evolutionarily conserved. We also look at a variety of ML models, including random forests, gradient boosting, logistic regression, Gaussian naive Bayes, and decision trees, to determine which model gives the best predictions. The ultimate goal of this project is to determine which combination of feature set and ML model gives the best predictive performance in distinguishing driver from passenger mutations.
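As a rough illustration of the physicochemical feature approach mentioned in the abstract, the sketch below encodes a single missense mutation as a small feature dictionary. It is not the thesis's actual pipeline: the function name `mutation_features`, the choice of the Kyte-Doolittle hydropathy scale, the toy sequence, and the per-position conservation scores are all assumptions made for this example.

```python
# Kyte-Doolittle hydropathy index, keyed by one-letter amino acid code.
# Using this particular scale is an assumption; the abstract only says
# "hydrophobic properties".
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mutation_features(sequence, position, mutant, conservation=None):
    """Build a feature dict for one missense mutation (hypothetical helper).

    sequence     -- protein sequence in one-letter codes
    position     -- 1-based index of the mutated residue
    mutant       -- one-letter code of the substituted amino acid
    conservation -- optional per-position scores, same length as sequence
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    wild_type = sequence[position - 1]
    if wild_type not in KYTE_DOOLITTLE or mutant not in KYTE_DOOLITTLE:
        raise ValueError("unknown amino acid code")

    features = {
        "wt_hydropathy": KYTE_DOOLITTLE[wild_type],
        "mut_hydropathy": KYTE_DOOLITTLE[mutant],
        # Signed change in hydropathy caused by the substitution.
        "delta_hydropathy": KYTE_DOOLITTLE[mutant] - KYTE_DOOLITTLE[wild_type],
    }
    if conservation is not None:
        features["conservation"] = conservation[position - 1]
    return features


# Toy input: a made-up 5-residue sequence with made-up conservation scores.
example = mutation_features(
    "MGLVK", 3, "R", conservation=[0.9, 0.2, 0.8, 0.5, 0.1]
)
print(example)
```

Feature dictionaries like this one could be stacked into a matrix, optionally alongside protein language model embeddings, and passed to any of the classifiers listed in the abstract.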
Language: en
Appears in Collections: Computer Science, 2023

Files in This Item:
File: VelascoII_princeton_0181G_14611.pdf
Size: 1.42 MB
Format: Adobe PDF

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.