Title: Statistical Inference for Big Data
Authors: Zhao, Tianqi
Advisors: Liu, Han
Contributors: Operations Research and Financial Engineering Department
Keywords: Big Data; Statistical Inference
Subjects: Statistics
Issue Date: 2017
Publisher: Princeton, NJ : Princeton University
Abstract: This dissertation develops novel inferential methods and theory for assessing the uncertainty of modern statistical procedures unique to big data analysis. We focus on four challenging aspects of big data: massive sample size, high dimensionality, heterogeneity, and complexity.

To begin with, we consider a partially linear framework for modeling massive heterogeneous data. The major goal is to extract common features across all sub-populations while exploring the heterogeneity of each sub-population. In particular, we propose an aggregation-type estimator for the commonality parameter that attains the (non-asymptotic) minimax optimal bound and the same asymptotic distribution as if there were no heterogeneity. This oracle result holds as long as the number of sub-populations does not grow too fast.

The next problem addresses the challenge of high dimensionality. We propose a robust inferential procedure for assessing the uncertainty of parameter estimates in high dimensional linear models, where the dimension p can grow exponentially fast with the sample size n. We develop a new de-biasing framework tailored to nonsmooth loss functions, which enables us to exploit the composite quantile loss to construct a de-biased composite quantile regression (CQR) estimator. This estimator is robust and preserves efficiency in the sense that the worst-case efficiency loss is less than 30% relative to square-loss-based procedures; in many cases it is close to or better than the latter.

Next, we consider high dimensional semiparametric generalized linear models. We propose a new inferential framework that addresses a variety of challenging problems in high dimensional data analysis, including incomplete data, selection bias, and heterogeneity. First, we develop a regularized statistical chromatography approach to infer the parameter of interest under the proposed semiparametric generalized linear model without needing to estimate the unknown base measure function. We then propose a new likelihood-ratio-based framework to construct post-regularization confidence regions and tests for the low dimensional components of high dimensional parameters. We demonstrate the consequences of the general theory through examples of inference with missing data and with multiple datasets.

Lastly, we study the rank likelihood as a powerful inferential tool in multivariate analysis. Computing the full rank likelihood is often intractable for large-scale datasets, so we resort to lower-order rank approximations and propose a new family of local rank likelihood functions. In particular, we show that the maximizer of the second-order local rank likelihood coincides with Kendall's tau correlation matrix for the transelliptical distribution family. Motivated by this new interpretation of Kendall's tau, we then investigate the third-order local rank likelihood, whose maximizer defines a new estimator that can be viewed as the third-order counterpart of Kendall's tau correlation matrix. We establish its asymptotic normality and calculate its limiting variance under the Gaussian copula model, which enables the construction of confidence intervals based on this new estimator.
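The connection between rank statistics and latent correlation mentioned in the abstract can be made concrete. Under a Gaussian copula, the latent correlation ρ relates to the pairwise Kendall's tau τ through the classical sine transform ρ = sin(πτ/2). The sketch below is not from the dissertation; it is a minimal illustration of this standard rank-based plug-in estimator, assuming data whose margins are arbitrary monotone transforms of a latent Gaussian vector.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_corr_matrix(X):
    """Pairwise Kendall's tau matrix mapped to a correlation matrix
    via the sine transform R[j, k] = sin(pi/2 * tau[j, k]), which is
    consistent for the latent correlation under a Gaussian copula."""
    n, d = X.shape
    R = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            tau, _ = kendalltau(X[:, j], X[:, k])
            R[j, k] = R[k, j] = np.sin(np.pi * tau / 2.0)
    return R

# Illustration: latent Gaussian data with correlation 0.5, pushed
# through a monotone marginal transform (exp). Rank-based estimation
# is invariant to such transforms, so R_hat still targets 0.5.
rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=2000)
X = np.exp(Z)
R_hat = kendall_corr_matrix(X)
```

With 2000 samples the off-diagonal entry of `R_hat` lands close to the true latent correlation 0.5, even though the Pearson correlation of the transformed data would be biased away from it.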
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog.
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering

Files in This Item:
File: Zhao_princeton_0181D_12188.pdf (3.43 MB, Adobe PDF)

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.