Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp017p88cj77k
Title: Inference on large-scale structures
Authors: Ke, Zheng
Advisors: Fan, Jianqing
Jin, Jiashun
Contributors: Operations Research and Financial Engineering Department
Keywords: clustering
Covariate-Assisted Screening
homogeneity
phase diagram
sparsity
variable selection
Subjects: Statistics
Issue Date: 2014
Publisher: Princeton, NJ : Princeton University
Abstract: `Big Data' has driven a new branch of statistics, `Large-Scale Inference' (LSI). In many LSI problems, owing to proximity in geography, time, etc., the data may contain graphical structures, low-rank structures, or clustering structures. Carefully exploiting such structures lets us significantly improve the inference. In a linear regression model, we explore two types of structure: sparse graphical structures and homogeneous clustering structures.
The first part is largely motivated by the study of DNA copy number variation (CNV) and long-memory financial data. We assume the effects are `rare and weak': only a small fraction of the regression coefficients are nonzero, and the nonzero coefficients are individually small; the main interest is to identify these nonzeros, i.e., variable selection. We consider the very challenging case where the columns of the design matrix are heavily correlated, but we recognize that in many situations there is an underlying sparse graphical structure on the variables. We propose a method, `Covariate-Assisted Screening and Estimation' (CASE), at whose heart is a graph-assisted multivariate screening procedure. We show that in a broad context CASE achieves the minimax Hamming selection error. CASE was successfully applied to a change-point problem and to long-memory time series, and it has advantages over the better-known L0/L1-penalization methods and marginal screening.
The second part is largely motivated by the study of gene regulatory networks (GRN) and housing prices. We assume the regression coefficients are homogeneous: they cluster into a few groups, and each group shares a common value; the main interest is to take advantage of this homogeneity when estimating the coefficients. We propose a method, `Clustering Algorithm in Regression via Data-driven Segmentation' (CARDS). We show that under mild conditions CARDS successfully recovers the grouping and achieves the oracle estimation error. CARDS was successfully applied to predicting polyadenylation signals and the S&P 500 stock returns, and it has advantages over the better-known methods of ordinary least squares (OLS) and the fused lasso.
The study provides additional insights on how to exploit low-dimensional structures in high-dimensional data.
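The rare-and-weak variable-selection setting described in the abstract can be illustrated with a small simulation. This is not the author's CASE procedure; it is a minimal sketch of the signal model (a few individually small nonzero coefficients) together with the marginal-screening baseline the abstract compares against. All parameter values (n, p, s, tau, the noise level, and the threshold rule) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 500, 10   # samples, variables, number of nonzero coefficients
tau = 0.6                # individually weak signal strength (assumed)

# Rare-and-weak coefficients: only s of p entries are nonzero, each small.
beta = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta[support] = tau * rng.choice([-1.0, 1.0], size=s)

# Design with roughly unit-norm columns, plus Gaussian noise.
X = rng.standard_normal((n, p)) / np.sqrt(n)
y = X @ beta + 0.2 * rng.standard_normal(n)

# Marginal screening baseline: rank variables by |X_j' y| and keep the top ones.
scores = np.abs(X.T @ y)
selected = np.flatnonzero(scores > np.quantile(scores, 1 - 2 * s / p))

# Hamming selection error: number of mismatches with the true support.
est = np.zeros(p, dtype=bool)
est[selected] = True
truth = np.zeros(p, dtype=bool)
truth[support] = True
hamming = int(np.sum(est != truth))
```

When the design columns are heavily correlated, the marginal scores `X_j' y` pick up contributions from neighboring variables, which is exactly the regime where the abstract argues a graph-assisted multivariate screening step pays off.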
URI: http://arks.princeton.edu/ark:/88435/dsp017p88cj77k
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog.
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering

Files in This Item:
File: Ke_princeton_0181D_11025.pdf (1.55 MB, Adobe PDF)


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.