Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01rv042w30x
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Storey, John D | en_US
dc.contributor.author | Chung, Neo Christopher Honghoon | en_US
dc.contributor.other | Quantitative Computational Biology Department | en_US
dc.date.accessioned | 2014-09-25T22:38:52Z | -
dc.date.available | 2014-09-25T22:38:52Z | -
dc.date.issued | 2014 | en_US
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/dsp01rv042w30x | -
dc.description.abstract | Modern genomic technologies collect an ever-increasing amount of information (e.g., gene expression and genotypes) about model organisms and humans. Systematic patterns of variation in such large-scale biological studies reflect the underlying molecular signatures of disease status, environment, and others, and can be quantified using principal component analysis (PCA) and related methods. For example, histological examination of tumor cells has long provided clinical classifications of cancer which are indirect, imprecise, and low-resolution. In contrast, we can infer different types of cancer directly from gene expression profiles of cancerous tumor samples. An unsolved problem in this context is how to systematically identify the observed variables that are drivers of systematic variation captured by PCA. My dissertation introduces a statistical framework to rigorously utilize a quantitative characterization of systematic variation. The key challenge in utilizing latent variable estimates -- such as principal components (PCs) -- is how to prevent overfitting. It is well established that conventional statistical tests for association using quantities estimated from the data itself will artificially inflate statistical significance, because the data is used twice. We introduce a general resampling approach, called the jackstraw, to calculate statistical significance of association between the observed variables and their latent variables, while automatically adjusting for how much PCA overfits the particular dataset. Furthermore, based on weights derived from the jackstraw, we developed significance-based shrinkage methods for the loadings of PCs and high-dimensional covariance matrices, called the jackstraw weighted shrinkage. Incorporating this set of proposed methods, we investigated genetic differentiation due to the global human population structure. Overall, the proposed statistical framework makes minimal assumptions and offers flexibility in exploring and analyzing the data, while providing a safeguard against an anti-conservative bias due to overfitting. | en_US
dc.language.iso | en | en_US
dc.publisher | Princeton, NJ : Princeton University | en_US
dc.relation.isformatof | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog (http://catalog.princeton.edu). | en_US
dc.subject | data | en_US
dc.subject | jackstraw | en_US
dc.subject | latent variable model | en_US
dc.subject | principal component analysis | en_US
dc.subject | resampling | en_US
dc.subject | sparse pca | en_US
dc.subject.classification | Biostatistics | en_US
dc.subject.classification | Bioinformatics | en_US
dc.subject.classification | Statistics | en_US
dc.title | Statistical Inference of Variables Driving Systematic Variation in High-Dimensional Biological Data | en_US
dc.type | Academic dissertations (Ph.D.) | en_US
pu.projectgrantnumber | 690-2143 | en_US
Appears in Collections: Quantitative Computational Biology

Files in This Item:
File | Description | Size | Format
Chung_princeton_0181D_11068.pdf | | 6.24 MB | Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.