Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01bv73c365f
Title: | Predicting Cancer Driver Mutations Using Deep Learning Protein Language Models |
Authors: | Du, Judy |
Advisors: | Singh, Mona |
Contributors: | Quantitative Computational Biology Department |
Keywords: | BERT; bi-directional encoder representations of transformers; cancer; machine learning; NLP; targeted sequencing panels |
Subjects: | Computer science; Medicine |
Issue Date: | 2022 |
Publisher: | Princeton, NJ : Princeton University |
Abstract: | Understanding the effect of mutations on tumorigenesis is crucial to uncovering the underlying cause of a patient’s cancer. In recent years, researchers have developed targeted sequencing panels to identify a patient’s unique set of mutations within actionable cancer genes. Crucial to precision medicine, these panels allow clinicians to tailor a personalized treatment plan to a patient’s individual cancer genome. However, predicting the success of a treatment plan remains difficult, as many mutations observed within known cancer-driver genes are still classified as variants of unknown significance (VUS): mutations whose effect on cancer progression is unclear. Here, we leverage commonalities across mutations already known to drive cancer to guide our understanding of these variants of unknown significance. We build machine learning models to differentiate so-called cancer drivers (i.e., mutations that “drive” cancer) from passenger mutations (i.e., those that do not have a significant impact on cellular growth) using features generated by deep learning language models. We show that characterizing mutations using a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is useful, performing on par with state-of-the-art driver prediction models. Moreover, we show that there is a synergistic effect on the performance of classification models built by combining language model representations with more traditional ways of annotating protein sequences (e.g., conservation and amino acid physicochemistry). All in all, we demonstrate the utility of current deep learning language models in characterizing putative cancer driver mutations, supplementing the traditional ways biologists characterize the proteome. |
URI: | http://arks.princeton.edu/ark:/88435/dsp01bv73c365f |
Type of Material: | Academic dissertations (Ph.D.) |
Language: | en |
Appears in Collections: | Quantitative Computational Biology |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Du_princeton_0181D_14345.pdf | | 5.35 MB | Adobe PDF | View/Download |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.