Please use this identifier to cite or link to this item:
Title: Inference in Regressions with Many Controls
Authors: Li, Chenchuan
Advisors: Mueller, Ulrich K
Contributors: Economics Department
Keywords: Bound
Subjects: Economics
Issue Date: 2016
Publisher: Princeton, NJ : Princeton University
Abstract: In this thesis, we consider inference on a scalar coefficient of interest in a linear regression model with many potential control variables. Without any constraint on the control coefficients in a canonical model with Gaussian, homoskedastic errors, one cannot improve upon the standard t-test in the regression that includes all controls. In the following three chapters, we investigate (i) the potential for inference to improve under constraints on the control coefficients and (ii) how to implement these improvements under general error structures. In the first chapter, we impose a bound on the weighted sum of squared control coefficients, which amounts to a bound on the R^{2} of controls. We develop a simple testing procedure to exploit this constraint, and we show that our procedure is, under asymptotics where the number of controls is a fraction of the sample size, (i) of correct size under potential heteroskedasticity and clustered error structures and (ii) weighted-average-power maximizing under a sequence of bounds that shrinks to zero in the canonical model. We apply the new test to an empirical study of the relationship between crime and abortion by Donohue III and Levitt (2001), where we determine the marginal value of the R^{2} bound that induces a significant result. In the second chapter, we study how a sparsity assumption, which restricts the number of nonzero control coefficients, can admit improvements over the standard t-test in the canonical model. When the design satisfies a symmetry criterion, we derive the infimum power bound against a point alternative over all valid tests under sparsity. For designs outside this family, we construct an algorithm to derive non-infimum power bounds over valid tests. For various sample sizes and designs, we find that when the R^{2} of the regressor of interest on the controls does not exceed 0.9, the power gained from assuming that no more than 10% of control coefficients are nonzero is roughly equivalent to increasing the number of observations by less than fourfold and applying the standard t-test. In the third chapter, we study the estimation of the asymptotic variance of linear statistics in the presence of many regressors. The natural and most popular estimator of variance under potential heteroskedasticity and clustering in a model with finitely many regressors is an observation-weighted average of squared residuals, where consistency holds because the sampling uncertainty of coefficient estimates becomes asymptotically negligible. In a high-dimensional model where the number of regressors is of the same order as the number of observations, the regression coefficients can no longer be estimated with sufficient precision for this reasoning to apply. We construct a cluster-robust variance estimator which is (i) conditionally unbiased in finite samples, (ii) consistent under some regularity assumptions on the sequence of regressor values, and (iii) invariant to the true data-generating coefficients of the linear model. Our estimator is an extension to the cluster case of a procedure by Cattaneo, Jansson, and Newey (2015) for potentially heteroskedastic, independent errors.
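For orientation, the baseline that the abstract compares against (the t-test in the regression that includes all controls, with a variance built from weighted squared residuals) can be sketched as follows. This is only an illustration of the textbook heteroskedasticity-robust (HC0) estimator, not the thesis's procedure; the function name `robust_t_stat` and the simulated data are assumptions introduced here.

```python
import numpy as np


def robust_t_stat(y, x, controls):
    """Return (beta, se, t) for the coefficient on x in the regression of y
    on x, an intercept, and all controls, using the textbook HC0 variance."""
    n = len(y)
    # Intercept plus every control: the "long" regression.
    W = np.column_stack([np.ones(n), controls])
    # Frisch-Waugh step: residualize x and y on the controls.
    x_res = x - W @ np.linalg.lstsq(W, x, rcond=None)[0]
    y_res = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
    beta = (x_res @ y_res) / (x_res @ x_res)
    resid = y_res - beta * x_res
    # HC0 sandwich: squared residuals weighted by squared residualized x.
    var = np.sum(x_res**2 * resid**2) / (x_res @ x_res) ** 2
    se = np.sqrt(var)
    return beta, se, beta / se


# Hypothetical simulated data: 500 observations, 20 controls.
rng = np.random.default_rng(0)
n, p = 500, 20
controls = rng.normal(size=(n, p))
x = controls[:, 0] + rng.normal(size=n)
y = 2.0 * x + controls @ rng.normal(size=p) + rng.normal(size=n)

beta, se, t = robust_t_stat(y, x, controls)
print(f"beta = {beta:.3f}, se = {se:.3f}, t = {t:.2f}")
```

When the number of controls is a sizable fraction of the sample size, this plug-in variance is the kind of estimator the third chapter describes as no longer reliable, which is the motivation for the alternative estimator proposed there.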
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog:
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Economics

Files in This Item:
File: Li_princeton_0181D_11941.pdf
Size: 1.69 MB
Format: Adobe PDF

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.