Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp014j03d2429
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Arora, Sanjeev | -
dc.contributor.author | Li, Yuanzhi | -
dc.contributor.other | Computer Science Department | -
dc.date.accessioned | 2018-10-09T21:11:42Z | -
dc.date.available | 2018-10-09T21:11:42Z | -
dc.date.issued | 2018 | -
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/dsp014j03d2429 | -
dc.description.abstract | Neural networks stand at the center of the recent breakthroughs in machine learning. In the past few years, neural networks have found an enormous number of applications in various areas. However, due to the intrinsic non-convexity of these networks, our theoretical understanding of them remains rather limited. The following major questions are largely unsolved: 1. Why can we train a neural network by minimizing a non-convex objective using local search algorithms? 2. Why does the training result generalize instead of memorizing? On the theory side, the answers to these questions have been rather pessimistic: it is known that even training a 2-layer neural network is NP-hard, and modern neural networks have such large capacity that they can fit arbitrary labels, so overfitting seems unavoidable. In practice, however, the answers are completely different: gradient descent (and its variants SGD, Momentum, Adagrad, Adam, RMSProp, and AdaDelta) is quite effective at minimizing the training loss, and moreover, the solutions found by these algorithms often generalize well, even on neural networks with trillions of parameters. This thesis aims to build new theorems that address these questions in a way that matches the practical observations. We first consider neural networks with "identity mappings" (ResNet). We show that adding this mapping to the network makes the loss function "locally convex" in a neighborhood of the optimum. We then consider neural networks with "quadratic activations". We show that even if we over-parametrize the network arbitrarily, the set of solutions reached by gradient descent remains on a manifold of small intrinsic dimension, and thus generalizes. Next, we consider neural networks with multiple outputs. We show that as long as the inputs and the labels are "structured" and the network is trained using gradient descent, the result still generalizes. Finally, we extend our results to non-negative matrix factorization. We show that if we initialize the weights using "pure samples" and train them using an analog of gradient descent, we can recover the hidden structure in the data almost optimally. | -
dc.language.iso | en | -
dc.publisher | Princeton, NJ : Princeton University | -
dc.relation.isformatof | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu | -
dc.subject.classification | Computer science | -
dc.title | ON THE ABILITY OF GRADIENT DESCENT TO LEARN NEURAL NETWORKS | -
dc.type | Academic dissertations (Ph.D.) | -
pu.projectgrantnumber | 690-2143 | -
Appears in Collections: Computer Science

Files in This Item:
File | Description | Size | Format
Li_princeton_0181D_12714.pdf |  | 1.49 MB | Adobe PDF (View/Download)


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.