Authors: Tian, Peter Mingru
Advisors: Klusowski, Jason
Contributors: Operations Research and Financial Engineering Department
Subjects: Statistics
Issue Date: 2023
Publisher: Princeton, NJ : Princeton University
Abstract: With the rise of data-driven technologies in high-stakes environments, decision trees have become ubiquitous because they are fast, interpretable, and can easily handle messy data. Despite their widespread empirical success, pinning down their theoretical properties has been challenging, and they remain largely unexplored. This thesis conducts a rigorous study of the unique benefits and limitations of tree-based algorithms in three major settings. In the first setting, we derive finite-sample performance guarantees for variable selection in nonparametric regression models using a single-level CART decision tree (a decision stump). We show that the marginal signal strength of each variable can be considerably weaker, and the ambient dimensionality considerably higher, than state-of-the-art nonparametric variable selection methods allow. Thus, surprisingly, even though decision stumps are highly inaccurate for estimation purposes, they can still be used to perform consistent model selection. In the second setting, we show that decision trees constructed with the Classification and Regression Trees (CART) and C4.5 methodologies are consistent for regression and classification tasks, respectively. The ambient dimensionality may even grow sub-exponentially with the sample size. Furthermore, the theory applies to a wide range of models, including additive regression models with Borel measurable component functions and arbitrary joint distributions of the predictors. In the third setting, we study tree-based algorithms for pointwise inference, with implications for heterogeneous causal effect estimation. We call into question the use of decision trees for such purposes by demonstrating that they can fail to achieve polynomial rates of convergence in the uniform norm, even with pruning; the convergence may instead be logarithmic or, in some cases, fail completely. We show that random forests significantly improve performance via subsampling and random feature selection, both of which contribute to nearly optimal performance.
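The decision-stump variable selection studied in the first setting can be illustrated with a small sketch: fit a depth-1 CART split on each variable marginally and keep the variables whose best split yields a large impurity (squared-error) decrease. The function names, the threshold value, and the toy data below are hypothetical illustrations, not the thesis's actual procedure or guarantees.

```python
import numpy as np

def stump_score(x, y):
    """Largest reduction in sum of squared errors achievable by a
    single CART-style split (a decision stump) on one variable."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    total_sse = np.sum((ys - ys.mean()) ** 2)
    # Prefix sums let us evaluate every candidate split point in O(n).
    csum = np.cumsum(ys)
    csq = np.cumsum(ys ** 2)
    best = 0.0
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between tied x-values
        left_sse = csq[i - 1] - csum[i - 1] ** 2 / i
        right_sum = csum[-1] - csum[i - 1]
        right_sse = (csq[-1] - csq[i - 1]) - right_sum ** 2 / (n - i)
        best = max(best, total_sse - left_sse - right_sse)
    return best

def stump_select(X, y, threshold):
    """Keep variables whose marginal stump impurity decrease exceeds
    a (hypothetical) threshold."""
    return [j for j in range(X.shape[1]) if stump_score(X[:, j], y) > threshold]

# Toy example: only the first of 10 coordinates carries signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = 2.0 * (X[:, 0] > 0) + 0.1 * rng.standard_normal(500)
selected = stump_select(X, y, threshold=50.0)
```

Even though a single stump is a poor estimator of the regression function, the marginal impurity decrease still separates signal variables from noise variables, which is the model-selection phenomenon the abstract describes.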
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections:Operations Research and Financial Engineering

Files in This Item:
File: Tian_princeton_0181D_14390.pdf (1.18 MB, Adobe PDF)

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.