Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01nv935533t
Title: High-dimensional Covariance Learning
Authors: Wang, Weichen
Advisors: Fan, Jianqing
Contributors: Operations Research and Financial Engineering Department
Keywords: Empirical Eigen-structure
Matrix Quadratic Functionals
Principal Component Analysis
Robust Covariance Estimation
Semiparametric Factor Models
Subjects: Statistics
Issue Date: 2016
Publisher: Princeton, NJ : Princeton University
Abstract: Massive data analyses and statistical learning in many real applications require a careful understanding of the high-dimensional covariance structure. A large covariance matrix typically plays a role through either its quadratic and spectral functionals or a structure of low-rank plus sparse components. Learning the large covariance and taking advantage of its structure are important because it (i) is directly applicable to high-dimensional regression, (ii) is featured in classification, Hotelling's test, false discovery control, etc., (iii) provides tools to extract the latent factors, (iv) is closely related to graphical models, and (v) measures risks in portfolio allocation. Motivated by the computation of critical values of the high-dimensional tests, we investigate the difficulty of estimation of the quadratic functionals of sparse correlation matrices. Specifically, we show that simple plug-in procedures based on thresholded estimators of correlation matrices are sparsity-adaptive and minimax optimal over a large class of correlation matrices. Akin to previous results on functional estimation, the minimax rates exhibit an elbow phenomenon. To better understand the spectral functionals of covariance matrices, we derive the asymptotic distributions of the empirical eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the spike magnitude of leading eigenvalues, sample size, and dimensionality. The results reveal the biases in estimating leading eigenvalues and eigenvectors by principal component analysis, and lead to a new covariance estimator for the approximate factor model, called shrinkage principal orthogonal complement thresholding (S-POET). Our results are successfully applied to outstanding problems in estimation of risks of large portfolios and false discovery proportions for dependent test statistics.
We consider extending the approximate factor models in two ways: semiparametrization and heavy-tailedness. The semiparametric modeling leverages extra information of covariates while the robust modeling allows heavy-tailed data. For the first extension, we introduce a Projected Principal Component Analysis (Projected-PCA), which applies principal component analysis to the data matrix projected onto a given linear space spanned by covariates. We propose a flexible semiparametric factor model, which decomposes the factor loading matrix into the component that can be explained by subject-specific covariates and the orthogonal residual component. By using the newly proposed Projected-PCA, the rates of convergence of the factor and loading matrices are obtained, which are much faster than those of the conventional factor analysis. The convergence is achieved even when the sample size is finite and is particularly appealing in the high-dimension-low-sample-size situation. For the second extension, we propose a general Principal Orthogonal complEment Thresholding (POET) framework for large-scale covariance matrix estimation based on the approximate factor model. A set of high-level sufficient conditions for the procedure to achieve optimal rates of convergence under different matrix norms is provided to clarify how POET works. Such a framework allows us to recover existing results for sub-Gaussian data in a more transparent way that only depends on the concentration properties of the sample covariance matrix. As a new theoretical contribution, for the first time, such a framework allows us to exploit the conditional sparsity covariance structure for heavy-tailed data. In particular, for the elliptical distribution, we propose a robust estimator based on the marginal and spatial Kendall's tau to satisfy these conditions. In addition, we study the conditional graphical model under the same framework.
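Illustrative note (not part of the dissertation): the abstract mentions a robust estimator built on the marginal Kendall's tau for elliptical distributions. The sketch below shows only the standard building block, under the assumption that the data are continuous and elliptically distributed, where the pairwise correlation r and Kendall's tau satisfy r = sin(pi/2 * tau). The function names `kendall_tau` and `tau_correlation_matrix` are hypothetical, and the spatial Kendall's tau and the POET thresholding steps are not shown.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
        score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


def tau_correlation_matrix(columns):
    """Rank-based correlation matrix via the sine transform of Kendall's tau.

    `columns` is a list of equal-length sequences, one per variable.
    Assumes an elliptical model, under which r = sin(pi/2 * tau).
    """
    p = len(columns)
    corr = [[1.0] * p for _ in range(p)]  # diagonal stays 1.0
    for a in range(p):
        for b in range(a + 1, p):
            tau = kendall_tau(columns[a], columns[b])
            corr[a][b] = corr[b][a] = math.sin(math.pi / 2 * tau)
    return corr
```

Because the estimate depends only on the signs of pairwise differences, a single extreme observation cannot move it far, which is the usual motivation for rank-based estimators with heavy-tailed data.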
URI: http://arks.princeton.edu/ark:/88435/dsp01nv935533t
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering

Files in This Item:
File: Wang_princeton_0181D_11826.pdf
Size: 1.13 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.