Skip navigation
Please use this identifier to cite or link to this item:
Authors: Li, Yuanzhi
Advisors: Arora, Sanjeev
Contributors: Computer Science Department
Subjects: Computer science
Issue Date: 2018
Publisher: Princeton, NJ : Princeton University
Abstract: Neural networks stand in the center of the recent breakthrough in Machine Learning. In the past few years, neural networks have found enormous of applications in various areas. However, due to the intrinsic non-convexity in the networks, theoretical understandings of them remain rather limited. The following major questions are largely unsolved: 1. Why can we train a neural network by minimizing a non-convex objective using local search algorithms? 2. Why does the training result generalizes instead of memorizing? On the theory side, the answer to these questions were rather pessimistic: It is known that even training a 2 layer neural network is NP-hard, and modern neural networks have so large capacity that allow them to fit arbitrary labels, so overfitting seems to be unavoidable. However, on the practice side, the answers to these questions are completely different: In practice, gradient descent (and its variants SGD, Momentum, Adagrad, Adam, RMSProp and AdamDelta) are quite effective in minimizing the training loss, moreover, the solutions found by these algorithms also often generalize well, even on neural networks with trillions of parameters. This thesis aims to build new theorems to address these questions in a way that matches the practical observations. We first consider neural networks with “identity mapping’ (Resnet). We show that by adding this mapping to the network, the loss function becomes “locally convex” in a neighborhood of the optimal. We then consider neural networks with “quadratic activations”. We show that even if we over-parametrize the network arbitrarily, the set of solutions given by gradient descent will remain on a manifold with small intrinsic dimensions, thus they will generalize. iii Next we consider neural networks with multiple outputs. We show that as long as the inputs and the labels are actually “structured” and the network is trained using gradient descent, the result will still generalize. In the end we extend our result to non-negative matrix factorization. We show that if we initialize the weights using “pure samples”, and train it using an analog of gradient descent, then we are able to recover the hidden structures in the data almost optimally.
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog:
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections:Computer Science

Files in This Item:
File Description SizeFormat 
Li_princeton_0181D_12714.pdf1.49 MBAdobe PDFView/Download

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.