Title: Distributed and Robust Statistical Learning
Authors: Zhu, Ziwei
Advisors: Fan, Jianqing
Contributors: Operations Research and Financial Engineering Department
Keywords: distributed learning; high-dimensional statistics; low-rank matrix recovery; principal component analysis; robust statistics
Subjects: Statistics; Operations research
Issue Date: 2018
Publisher: Princeton, NJ : Princeton University
Abstract: Decentralized and corrupted data are now ubiquitous and pose fundamental challenges for modern statistical analysis. Illustrative examples include the massive, decentralized data produced by the distributed data collection systems of giant IT companies, corrupted measurements in genetic microarray analysis, and heavy-tailed stock returns. These notorious features of modern data often contradict conventional theoretical assumptions in statistics research and invalidate standard statistical procedures. My dissertation addresses these problems by proposing new methodologies with strong statistical guarantees. When data are distributed across different locations with a limited communication budget, we propose performing local statistical analysis first and then aggregating the local results, rather than the data themselves, to produce a final result. We apply this approach to low-dimensional regression, high-dimensional sparse regression, and principal component analysis. When the data are not over-scattered, our distributed approach is proved to achieve the same statistical performance as the full-sample oracle, i.e., the standard procedure based on all the data. To handle heavy-tailed corruption, we propose a generic principle of data shrinkage for robust estimation and inference. To illustrate this principle, we apply it to estimate regression coefficients in the trace regression model and the generalized linear model with heavy-tailed noise and design. The proposed method achieves nearly the same statistical error rate as the standard procedure while requiring only bounded moment conditions on the data. This widens the scope of high-dimensional techniques by relaxing the distributional requirement from sub-exponential or sub-Gaussian tails to merely a bounded second or fourth moment.
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog.
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering

Files in This Item:
File: Zhu_princeton_0181D_12532.pdf
Size: 3.37 MB
Format: Adobe PDF

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.