Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp0112579w47z
Title: Computationally and Statistically-efficient Methods in Data Science
Authors: Guo, Yongyi
Advisors: Fan, Jianqing
Contributors: Operations Research and Financial Engineering Department
Subjects: Statistics
Issue Date: 2022
Publisher: Princeton, NJ : Princeton University

Abstract: In many modern statistical learning tasks, datasets not only come at larger scales but also possess complicated structures. On the one hand, the data collection processes are often complicated, e.g. involving decentralized or streaming structures. On the other hand, highly dependent covariates and heavy-tailed noise impose additional challenges for statistical analysis. My thesis addresses several typical problems by proposing new methodologies with strong statistical guarantees.

In the first part of this thesis, we study high-dimensional regression. We propose several computationally efficient algorithms that provably handle dependent covariates and heavy-tailed noise under different settings and goals, including variable selection, estimation, and prediction. In particular, we study the robust performance of best subset selection against highly dependent covariates, and how such robustness can carry over to any near-best subset obtained through more computationally efficient algorithms. In another line of work, we also study how time dependency between covariates affects regression under potentially heavy-tailed noise.

The second part of this thesis focuses on statistical learning with distributed data. We provide two multi-step statistical estimators that are communication-efficient, statistically accurate, and robust to bad initializations and insufficient local sample sizes. This is achieved by optimizing a 'gradient enhanced loss' at each step on each local server. We prove that, under very mild conditions, only a few steps suffice to obtain a statistical estimator as efficient as the centralized estimator.
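The distributed-estimation idea above can be sketched in a few lines: at each communication round, each server minimizes its local loss shifted by the difference between the local and global gradients at the current iterate. The sketch below is illustrative only, assuming a least-squares loss (where the surrogate minimizer has a closed form) and simulated data; the function names and the specific update are not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate distributed linear regression data across m "servers".
m, n_local, d = 10, 200, 5
theta_star = rng.normal(size=d)
X = [rng.normal(size=(n_local, d)) for _ in range(m)]
y = [X[k] @ theta_star + 0.1 * rng.normal(size=n_local) for k in range(m)]

def local_grad(k, theta):
    # Gradient of the local least-squares loss (1/2n) * ||y_k - X_k theta||^2.
    return X[k].T @ (X[k] @ theta - y[k]) / n_local

def gradient_enhanced_step(theta_t, k=0):
    """One communication round: server k minimizes its local loss plus a
    linear correction <global grad - local grad, theta> at theta_t
    (a surrogate-loss step in the spirit of a 'gradient enhanced loss')."""
    g_global = np.mean([local_grad(j, theta_t) for j in range(m)], axis=0)
    shift = local_grad(k, theta_t) - g_global
    # For least squares the shifted loss has a closed-form minimizer:
    #   argmin_theta (1/2n)||y_k - X_k theta||^2 - shift . theta
    A = X[k].T @ X[k] / n_local
    b = X[k].T @ y[k] / n_local + shift
    return np.linalg.solve(A, b)

theta = np.zeros(d)      # deliberately poor initialization
for _ in range(3):       # a few rounds already get close to theta_star
    theta = gradient_enhanced_step(theta)
```

Each round is effectively a quasi-Newton step that uses the local Hessian as a preconditioner for the global gradient, which is why only a few rounds of communication are needed when the local sample size is moderate.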
The final part of this thesis concerns how efficient learning and optimal decision-making can be combined in the online setting. We study a specific application in feature-based dynamic pricing, where the true underlying market value of a product is linear in its observed features plus some market noise with an unknown distribution. We propose an algorithm that combines semi-parametric statistics and online decision-making. We prove that, under mild conditions, the algorithm achieves near-optimal regret, close to the lower bound for the case where the market noise distribution belongs to a parametric class.

URI: http://arks.princeton.edu/ark:/88435/dsp0112579w47z
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering
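The pricing model described in the final part of the abstract can be illustrated with a small oracle computation: if the market value is v = x . theta + z and a sale occurs when the posted price p does not exceed v, the expected revenue is p * (1 - F(p - x . theta)), where F is the noise CDF. The sketch below assumes Gaussian noise and a known theta purely for illustration; the thesis's algorithm must instead estimate both theta and F online from binary sale feedback.

```python
import numpy as np
from math import erf

rng = np.random.default_rng(1)

d = 3
theta = rng.normal(size=d)   # true coefficients (unknown to the seller)

def expected_revenue(p, x, F):
    # Revenue at posted price p: p * P(sale) = p * P(v >= p)
    #                          = p * (1 - F(p - x . theta)).
    return p * (1.0 - F(p - x @ theta))

# Oracle case: noise CDF known (Gaussian with scale 0.5).
F = lambda z: 0.5 * (1.0 + erf(z / (0.5 * np.sqrt(2.0))))

# Optimal posted price for one feature vector, via grid search
# over the single-peaked revenue curve.
x = rng.normal(size=d)
grid = np.linspace(x @ theta - 2.0, x @ theta + 2.0, 400)
p_opt = grid[np.argmax([expected_revenue(p, x, F) for p in grid])]
```

The regret of an online pricing policy is then measured against this oracle price, which uses the true theta and F that the learner does not observe.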

Files in This Item:
File: Guo_princeton_0181D_14220.pdf (5.8 MB, Adobe PDF)

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.