Targeted analyses of very large genome-wide data collections

Lee, Young-suk

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp018k71nk490

Title:	Targeted analyses of very large genome-wide data collections
Authors:	Lee, Young-suk
Advisors:	Troyanskaya, Olga
Contributors:	Computer Science Department
Keywords:	data integration; functional network; genome-wide data; human diseases; machine learning; ontology
Subjects:	Computer science; Bioinformatics; Molecular biology
Issue Date:	2016
Publisher:	Princeton, NJ : Princeton University
Abstract:	Genome-scale experiments provide an overwhelming amount of molecular information for biologists. New computational methods are needed for specific analysis and interpretation of such high-dimensional data. Here we take advantage of the massive public repositories to quantify the tissue-specific signals in gene expression profiles, characterize distinctive molecular features of human diseases, deconvolve the latent cell-type-specific factors in mixed clinical samples, and automatically integrate heterogeneous data sources in the context of a specific genome-wide dataset. First, we describe URSA (Unveiling RNA Sample Annotation), which incorporates the known tissue/cell-type relationships to better estimate the specific signal in any given gene expression profile. Our ontology-aware method combines independent discriminative classifiers in a Bayesian framework, outperforming other machine learning methods. We provide a molecular interpretation for the tissue and cell-type models learned by URSA, enabling a data-driven view of molecular processes specific to particular tissues and cell types. Then, we extend this work to human diseases. We use thousands of clinical disease-specific expression profiles in public repositories to quantify distinctive functional and anatomical characteristics of human diseases. Through our data-driven analysis, we explore the complexity of the human disease landscape and propose exploratory hypotheses for drug repurposing, even for rare diseases with no prior genetic knowledge. Lastly, we describe YETI (Your Evidence Tailored Integration) for targeted integration of heterogeneous genome-wide data sources. Biomedical researchers generate genome-wide datasets for data-driven exploration of specific questions, but such analyses are disconnected from big public data collections. YETI is the first automatic integration method that effectively constructs functional networks specific to a genome-scale dataset. We show that the resulting integrations reflect the biological context of the user-provided dataset while providing accurate predictions of functional interactions.
URI:	http://arks.princeton.edu/ark:/88435/dsp018k71nk490
Alternate format:	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu/
Type of Material:	Academic dissertations (Ph.D.)
Language:	en
Appears in Collections:	Computer Science

Files in This Item:

File	Description	Size	Format
Lee_princeton_0181D_11669.pdf		7.66 MB	Adobe PDF