Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01bv73c365f
Title: Predicting Cancer Driver Mutations Using Deep Learning Protein Language Models
Authors: Du, Judy
Advisors: Singh, Mona
Contributors: Quantitative Computational Biology Department
Keywords: BERT
bi-directional encoder representations of transformers
cancer
machine learning
NLP
targeted sequencing panels
Subjects: Computer science
Medicine
Issue Date: 2022
Publisher: Princeton, NJ : Princeton University
Abstract: Understanding the effect of mutations on tumorigenesis is crucial to uncovering a patient’s underlying cause of cancer. In recent years, researchers have developed targeted sequencing panels to identify a patient’s unique set of mutations within actionable cancer genes. Crucial to precision medicine, these panels allow clinicians to tailor a personalized treatment plan to a patient’s individual cancer genome. However, predicting the success of a treatment plan remains a difficult task, as many mutations observed within known cancer-driver genes are still classified as variants of unknown significance (VUS): mutants whose effect on cancer progression is unclear. Here, we leverage commonalities across mutations already known to drive cancer to guide our understanding of these variants of unknown significance. We build machine learning models to differentiate so-called cancer drivers (i.e., those that “drive” cancer) from passenger mutations (i.e., those that do not have a significant impact on cellular growth) using features generated by deep learning language models. We show that characterizing mutations using a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is useful, yielding performance on par with state-of-the-art driver prediction models. Moreover, we show that classification models built by combining language model representations with more traditional ways of annotating protein sequences (e.g., conservation and amino acid physicochemistry) benefit from a synergistic effect on performance. All in all, we demonstrate the utility of current deep learning language models in characterizing putative cancer driver mutations, showing that they can supplement the traditional ways biologists characterize the proteome.
URI: http://arks.princeton.edu/ark:/88435/dsp01bv73c365f
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Quantitative Computational Biology

Files in This Item:
This content is embargoed until 2023-11-22. For questions about theses and dissertations, please contact the Mudd Manuscript Library. For questions about research datasets, as well as other inquiries, please contact the DataSpace curators.


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.