Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01rn301425t
Title: Expanding the computational biologist’s toolkit: Experimental design and multi-modality in genomics
Authors: Dumitrascu, Bianca
Advisors: Engelhardt, Barbara E
Contributors: Quantitative Computational Biology Department
Keywords: experimental design; genomics; single cell sequencing; transfer learning
Subjects: Biostatistics; Computer science
Issue Date: 2019
Publisher: Princeton, NJ : Princeton University
Abstract: The traditional biological research pipeline consists of three steps: hypothesis generation, data collection, and data analysis. Data analysis is sometimes followed by a readjustment in hypothesis assessment, allowing for an iterative approach to scientific inquiry. With the decreasing costs of data collection in high-throughput genomics, and with the increasing number of groups pursuing interconnected questions, several experimental design challenges emerge. In this work, we address three experimental challenges motivated by advances in single-cell RNA-seq (scRNA-seq) technologies: budget allocation, marker selection, and multi-modal data aggregation. First, we develop a novel heuristic for contextual bandit problems with logistic rewards and we show a new, bandit-inspired application to iterative experimental design in multi-tissue scRNA-seq data. We present two algorithms, a Good-Toulmin-like estimator via Thompson sampling and a Pitman-Yor prior-based approach with near-optimal performance. Given a budget, and modeling cell type information across tissues, both algorithms estimate how many cells should be sampled from each tissue, with the goal of maximizing cell type discovery across samples from multiple iterations. Second, we consider the problem of marker selection in the context of multi-modal data collection. Single-cell data analysis allows for the clustering of cells according to their genomic functionality as represented by their gene expression profiles. Such clustering can be achieved using a variety of methods and an active collaboration between experimentalists and computational groups. However, gene expression provides only one facet in depicting cell identity. Motivated by emerging imaging technologies, we present methods for selecting subsets of marker genes that preserve clusters and cluster hierarchies and that can optimize the imaging of populations of cells.
Finally, we employ tools from transfer learning to propose a generative model that aggregates information across multiple biological modalities: gene expression and histological slides. The model is a novel take on deep probabilistic canonical correlation analysis that allows for the joint mapping from gene space to morphology and from morphology to gene space, along with an interpretable latent space structure, which we further evaluate through quantitative trait loci (QTL) analysis.
URI: http://arks.princeton.edu/ark:/88435/dsp01rn301425t
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Quantitative Computational Biology

Files in This Item:
File: Dumitrascu_princeton_0181D_12932.pdf
Size: 29.72 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.