Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp017p88cj77k
Title: Inference on large-scale structures
Authors: Ke, Zheng
Advisors: Fan, Jianqing; Jin, Jiashun
Contributors: Operations Research and Financial Engineering Department
Keywords: clustering; Covariate-Assisted Screening; homogeneity; phase diagram; sparsity; variable selection
Subjects: Statistics
Issue Date: 2014
Publisher: Princeton, NJ : Princeton University
Abstract: 'Big Data' has driven a new branch of statistics, 'Large-Scale Inference' (LSI). In many LSI problems, due to proximity in geography, time, etc., the data may contain graphical structures, low-rank structures, clustering structures, and so on. Carefully exploiting such structures enables us to significantly improve the inference. In a linear regression model, we explore two types of structures: sparse graphical structures and homogeneous clustering structures. The first part is largely motivated by the study of DNA copy number variation (CNV) and long-memory financial data: we assume the effects are 'rare and weak', such that only a small fraction of the regression coefficients are nonzero and the nonzero coefficients are individually small; the main interest is to identify these nonzero coefficients, i.e., variable selection. We consider the very challenging case where the columns of the design matrix are heavily correlated, but we recognize that in many situations there is an underlying sparse graphical structure on the variables. We propose a method, 'Covariate Assisted Screening and Estimation' (CASE), at the heart of which is a graph-assisted multivariate screening procedure. We show that in a broad context, CASE achieves the minimax Hamming selection errors. CASE was successfully applied to a change-point problem and long-memory time series. CASE has advantages over the more well-known L0/L1-penalization methods and marginal screening. The second part is largely motivated by the study of gene regulatory networks (GRN) and housing prices.
We assume the regression coefficients are homogeneous, such that they cluster into a few groups and each group shares a common value; the main interest is to take advantage of homogeneity to estimate these coefficients. We propose a method, 'Clustering Algorithm in Regression via Data-driven Segmentation' (CARDS). We show that under mild conditions, CARDS successfully recovers the grouping and achieves the oracle estimation errors. The study provides additional insights on how to exploit low-dimensional structures in high-dimensional data. CARDS was successfully applied to predicting polyadenylation signals and S&P 500 stock returns. CARDS has advantages over the more well-known methods of ordinary least squares (OLS) and the fused lasso.
URI: http://arks.princeton.edu/ark:/88435/dsp017p88cj77k
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog.
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering