Statistical Inference of Variables Driving Systematic Variation in High-Dimensional Biological Data

Chung, Neo Christopher Honghoon

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01rv042w30x

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Storey, John D	en_US
dc.contributor.author	Chung, Neo Christopher Honghoon	en_US
dc.contributor.other	Quantitative Computational Biology Department	en_US
dc.date.accessioned	2014-09-25T22:38:52Z	-
dc.date.available	2014-09-25T22:38:52Z	-
dc.date.issued	2014	en_US
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp01rv042w30x	-
dc.description.abstract	Modern genomic technologies collect an ever-increasing amount of information (e.g., gene expression and genotypes) about model organisms and humans. Systematic patterns of variation in such large-scale biological studies reflect the underlying molecular signatures of disease status, environment, and others, and can be quantified using principal component analysis (PCA) and related methods. For example, histological examination of tumor cells has long provided clinical classifications of cancer which are indirect, imprecise, and low-resolution. In contrast, we can infer different types of cancer directly from gene expression profiles of cancerous tumor samples. An unsolved problem in this context is how to systematically identify the observed variables that are drivers of systematic variation captured by PCA. My dissertation introduces a statistical framework to rigorously utilize a quantitative characterization of systematic variation. The key challenge in utilizing latent variable estimates -- such as principal components (PCs) -- is how to prevent overfitting. It is well established that conventional statistical tests for association using quantities estimated from the data itself will artificially inflate statistical significance, because the data is used twice. We introduce a general resampling approach, called the jackstraw, to calculate statistical significance of association between the observed variables and their latent variables, while automatically adjusting for how much PCA overfits the particular dataset. Furthermore, based on weights derived from the jackstraw, we developed significance-based shrinkage methods for the loadings of PCs and high-dimensional covariance matrices, called the jackstraw weighted shrinkage. Incorporating this set of proposed methods, we investigated genetic differentiation due to the global human population structure. Overall, the proposed statistical framework makes minimal assumptions and offers flexibility in exploring and analyzing the data, while providing a safeguard against an anti-conservative bias due to overfitting.	en_US
dc.language.iso	en	en_US
dc.publisher	Princeton, NJ : Princeton University	en_US
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog (http://catalog.princeton.edu).	en_US
dc.subject	data	en_US
dc.subject	jackstraw	en_US
dc.subject	latent variable model	en_US
dc.subject	principal component analysis	en_US
dc.subject	resampling	en_US
dc.subject	sparse pca	en_US
dc.subject.classification	Biostatistics	en_US
dc.subject.classification	Bioinformatics	en_US
dc.subject.classification	Statistics	en_US
dc.title	Statistical Inference of Variables Driving Systematic Variation in High-Dimensional Biological Data	en_US
dc.type	Academic dissertations (Ph.D.)	en_US
pu.projectgrantnumber	690-2143	en_US
Appears in Collections:	Quantitative Computational Biology

Files in This Item:

File	Description	Size	Format
Chung_princeton_0181D_11068.pdf		6.24 MB	Adobe PDF
