Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp010k225f41g
Title: | Deep Learning for Sequence-Based Gene Expression Prediction |
Authors: | Sokolova, Ksenia |
Advisors: | Troyanskaya, Olga G |
Contributors: | Computer Science Department |
Keywords: | deep learning; deep learning in genomics; gene expression prediction; sequence models |
Subjects: | Computer science; Bioinformatics |
Issue Date: | 2024 |
Publisher: | Princeton, NJ : Princeton University |
Abstract: | Human biology is rooted in highly specialized cell types programmed by a common genome, 98% of which is outside of genes. While genetic variation in the enormous noncoding space is linked to the majority of disease risk, the impact of this variation is poorly understood. Recent advances in sequencing technology have made it possible to perform whole-genome sequencing of large cohorts, uncovering many variants per individual. A crucial challenge is to understand the collective impact of these variants on gene expression across varied human cell types and their subsequent roles in disease progression. This dissertation begins by tackling the challenge of associating noncoding genetic variants with changes in gene expression in primary human cell types. We introduce ExPectoSC, an atlas of modular deep-learning-based models for predicting cell-type-specific gene expression directly from sequence. With models spanning 105 primary human cell types across seven organ systems, it offers detailed insight into the effect of variation. The resulting atlas of sequence-based gene expression and variant effects is publicly available in a user-friendly interface and readily extensible to any primary cell types. We follow this work with an example application of ExPectoSC to the study of glomerular diseases, a major cause of end-stage renal disease in the US. Despite having similar clinical presentations, these diseases are known for their heterogeneity and variable patient outcomes. By integrating whole-genome sequencing data with ExPectoSC's predictions, we construct comprehensive gene expression disruption profiles for patients. Finally, we develop a new method for genomic-centered contrastive pre-training, called cGen, to improve the training of models from sequence alone in limited-data contexts. After pre-training with sequence augmentations, cGen generates unsupervised embeddings that highlight functional clusters and are informative of gene expression in the absence of any labeled information. Together, these contributions highlight the power of computational approaches to decode the noncoding genome, offering new avenues for the diagnosis, prognosis, and treatment of human diseases. |
URI: | http://arks.princeton.edu/ark:/88435/dsp010k225f41g |
Type of Material: | Academic dissertations (Ph.D.) |
Language: | en |
Appears in Collections: | Computer Science |
Files in This Item:
This content is embargoed until 2025-06-06. For questions about theses and dissertations, please contact the Mudd Manuscript Library. For questions about research datasets, as well as other inquiries, please contact the DataSpace curators.
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.