Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01cc08hj86h
Title: Survival Analysis in Distributed and High-Dimensional Environments and Theory of Cross-Validation
Authors: Bayle, Pierre
Advisors: Fan, Jianqing
Contributors: Operations Research and Financial Engineering Department
Keywords: Cox's proportional hazards model
Cross-validation
Distributed inference
High-dimensional
Machine learning
Model selection
Subjects: Statistics
Issue Date: 2023
Publisher: Princeton, NJ : Princeton University
Abstract: This thesis is devoted to developing efficient algorithms with theoretical guarantees for various statistical problems that have arisen in the era of big data. First, we design communication-efficient iterative distributed algorithms for estimation and inference in Cox's model under the sparse high-dimensional regime. These are crucial, as massive datasets are often split across machines for various reasons, e.g., privacy, storage, or computation. We prove that our methods for parameter estimation, confidence intervals for linear functionals, and hypothesis testing for coefficients all yield the same asymptotic statistical performance as the ideal, yet infeasible, procedures that would have access to all the data at once. Then, we propose the Factor-Augmented Regularized Model for Hazard Regression (FarmHazard), which extends Cox's model and can efficiently handle correlated covariates with factor structure. We prove consistency of both model selection and estimation. Model selection permits the reduction of high-dimensional data to the true set of relevant predictors and is a challenge in the analysis of big data, especially when variables are correlated. To overcome this issue, our model builds upon the latent factors driving the covariate dependence. We also propose a factor-augmented variable screening procedure in ultra-high-dimensional settings. Lastly, we study another fundamental problem with applications in both the statistics and machine learning communities, no longer focusing on a specific model. We tackle the issue of comparing and evaluating algorithms in a statistically sound way, a need that becomes pressing as the pace of algorithm development increases. We prove central limit theorems for the cross-validation error of any learning algorithm that satisfies very mild stability conditions, and we design consistent variance estimators. We then construct practical, asymptotically exact confidence intervals for test error evaluation, as well as valid and powerful hypothesis tests for algorithm comparison. Our results hold for any number of folds, including the leave-one-out case.
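The abstract's last contribution can be illustrated informally: under the CLT described above, the mean of the per-observation cross-validation losses is approximately normal, so a confidence interval for the test error follows from the sample mean and a variance estimate. The sketch below is a toy illustration of that idea, not the thesis's estimators; the helper names (`kfold_cv_errors`, `cv_confidence_interval`) and the naive i.i.d.-style variance estimate are assumptions for illustration only.

```python
# Toy sketch: normal-approximation confidence interval for K-fold
# cross-validation error. Hypothetical helpers; not the thesis's method.
import random
import statistics


def kfold_cv_errors(data, fit, loss, k=5, seed=0):
    """Return per-observation losses from K-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # K disjoint test folds
    losses = []
    for fold in folds:
        test = set(fold)
        train = [data[i] for i in idx if i not in test]
        model = fit(train)  # train on the K-1 remaining folds
        losses.extend(loss(model, data[i]) for i in fold)
    return losses


def cv_confidence_interval(losses, z=1.96):
    """95% normal-approximation CI for the mean CV error."""
    n = len(losses)
    mean = statistics.fmean(losses)
    se = statistics.stdev(losses) / n**0.5  # naive variance estimate
    return mean - z * se, mean + z * se


# Usage on synthetic data: a constant predictor with squared-error loss.
random.seed(1)
data = [(x, 2 * x + random.gauss(0, 0.5)) for x in range(100)]
fit = lambda train: statistics.fmean(y for _, y in train)
loss = lambda model, point: (point[1] - model) ** 2
losses = kfold_cv_errors(data, fit, loss, k=5)
lo, hi = cv_confidence_interval(losses)
print(f"CV error CI: ({lo:.2f}, {hi:.2f})")
```

The thesis's actual intervals rest on CLTs under mild algorithmic-stability conditions and on consistent variance estimators that account for the dependence between folds, which this naive sketch ignores.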
URI: http://arks.princeton.edu/ark:/88435/dsp01cc08hj86h
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering

Files in This Item:
File: Bayle_princeton_0181D_14455.pdf
Size: 2.42 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.