Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp018k71nk490
Title: Targeted analyses of very large genome-wide data collections
Authors: Lee, Young-suk
Advisors: Troyanskaya, Olga
Contributors: Computer Science Department
Keywords: data integration; functional network; genome-wide data; human diseases; machine learning; ontology
Subjects: Computer science; Bioinformatics; Molecular biology
Issue Date: 2016
Publisher: Princeton, NJ : Princeton University
Abstract: Genome-scale experiments provide an overwhelming amount of molecular information for biologists. New computational methods are needed for specific analysis and interpretation of such high-dimensional data. Here we take advantage of massive public repositories to quantify the tissue-specific signals in gene expression profiles, characterize distinctive molecular features of human diseases, deconvolve the latent cell-type-specific factors in mixed clinical samples, and automatically integrate heterogeneous data sources in the context of a specific genome-wide dataset. First, we describe URSA (Unveiling RNA Sample Annotation), which incorporates known tissue/cell-type relationships to better estimate the specific signal in any given gene expression profile. Our ontology-aware method combines independent discriminative classifiers in a Bayesian framework, outperforming other machine learning methods. We provide a molecular interpretation for the tissue and cell-type models learned by URSA, enabling a data-driven view of molecular processes specific to particular tissues and cell types. Then, we extend this work to human diseases. We use thousands of clinical disease-specific expression profiles in public repositories to quantify distinctive functional and anatomical characteristics of human diseases. Through our data-driven analysis, we explore the complexity of the human disease landscape and propose exploratory hypotheses for drug repurposing, even for rare diseases with no prior genetic knowledge. Lastly, we describe YETI (Your Evidence Tailored Integration) for targeted integration of heterogeneous genome-wide data sources. Biomedical researchers generate genome-wide datasets for data-driven exploration of specific questions, but such analyses are disconnected from big public data collections. YETI is the first automatic integration method that effectively constructs functional networks specific to a genome-scale dataset. We show that the resulting integrations reflect the biological context of the user-provided dataset while providing accurate predictions of functional interactions.
URI: http://arks.princeton.edu/ark:/88435/dsp018k71nk490
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu/
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Computer Science

Files in This Item:
File: Lee_princeton_0181D_11669.pdf
Size: 7.66 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.