Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp015q47rs12x
Title: | Methods for Efficient and Scalable Deep Learning |
Authors: | Xia, Wenhan |
Advisors: | Hazan, Elad |
Contributors: | Electrical and Computer Engineering Department |
Subjects: | Artificial intelligence; Computer science |
Issue Date: | 2024 |
Publisher: | Princeton, NJ : Princeton University |
Abstract: | The past few years have seen significant breakthroughs in artificial intelligence, from generating realistic images to holding meaningful conversations with chatbots. State-of-the-art deep learning models are rapidly expanding in size as researchers find that larger models tend to generalize better and perform better on downstream tasks. However, the massive computational cost, memory requirements, and energy expenditure associated with the ever-growing size of these models pose critical challenges in both the training and inference phases, especially in resource-constrained scenarios. This thesis tackles challenges along three dimensions of deep learning training and deployment:

(1) Fine-tuning: Large language models (LLMs) are powerful tools for diverse applications. To tailor these models to downstream tasks, adaptation is usually performed by fine-tuning all of the model's parameters. Despite its effectiveness, this full-parameter fine-tuning paradigm demands extensive computational resources and runs into memory limitations.

(2) Training: Hyperparameters used during the training phase, often referred to as the training recipe, have a significant influence on the performance of deep learning models. Practitioners typically resort to grid search, a computationally demanding brute-force exploration. As both model complexity and dataset size increase, this strategy becomes impractical and computationally expensive.

(3) Inference: Performing inference with modern deep learning models requires a massive number of floating-point operations (FLOPs), especially as model sizes grow from millions to billions of parameters. The high FLOP count at inference time poses challenges for model deployment in resource-constrained and time-sensitive scenarios, such as edge-side inference and self-driving cars.

To address these challenges, this thesis proposes techniques and algorithms that improve the efficiency of deep learning training, fine-tuning, and inference.

We first introduce COLA, an iterative optimization framework that efficiently fine-tunes large language models with significantly fewer parameters while maintaining high task accuracy (a minimal sketch follows the abstract). COLA is inspired by the Frank-Wolfe algorithm and employs a residual learning procedure on top of low-rank adaptation methods. We present theoretical convergence guarantees as well as empirical results on various tasks to showcase the effectiveness of the algorithm. For example, COLA brings a relative 6.47% gain in test accuracy over LoRA when fine-tuning OPT-1.3B, and Llama2-7B experiments show up to a 4.36% relative improvement in test score.

We then develop SAMUEL, an adaptive gradient method that automates the learning rate schedule, thereby reducing the computational cost of hyperparameter selection for training recipes (see the second sketch below). The method is built on the multiplicative weights framework and has provable adaptive regret guarantees against the best local preconditioner. Empirically, we demonstrate its robustness in automatically selecting the optimal learning rate to form a learning rate schedule on both vision and language benchmark tasks.

Finally, the thesis introduces an approach for efficient inference of deep learning models with reduced FLOPs. We propose a fully dynamic paradigm that learns to tailor and execute only salient sub-graphs of a deep neural network for each input instance (see the third sketch below). Two compact auxiliary networks are added to the backbone model to predict, on a per-instance basis, which layers or filters/channels are redundant and should therefore be skipped. These auxiliary networks also learn attention scores that rescale the outputs of the retained computations to maximize task accuracy. On CIFAR-10, our method achieves up to 11.9X fewer FLOPs and up to 3.3% higher accuracy compared to related methods. On ImageNet, the proposed method reduces FLOPs by up to 1.4X and brings up to 4.6% higher top-1 accuracy than the other methods. |
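The first sketch below illustrates the residual-learning idea behind COLA in PyTorch: train a low-rank (LoRA) update on top of frozen weights, fold it into the backbone, re-initialize the low-rank factors, and repeat. It is a hedged illustration, not the thesis implementation: the class names (LoRALinear, chain_of_lora), rank, scaling, chain length, and toy data are assumptions, and the Frank-Wolfe connection and convergence analysis appear only in the thesis.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank residual (standard LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.scaling = alpha / rank
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge_and_reset(self):
        """Fold the learned low-rank update into the frozen weight, then
        re-initialize A and B so a fresh residual can be learned on top."""
        self.base.weight += self.scaling * (self.B @ self.A)
        self.A.normal_(std=0.01)
        self.B.zero_()

def chain_of_lora(layer, make_batch, chains=3, steps=200, lr=1e-3):
    """Repeat: train a low-rank residual, merge it, then start a new one."""
    loss_fn = nn.MSELoss()
    for _ in range(chains):
        opt = torch.optim.AdamW([layer.A, layer.B], lr=lr)
        for _ in range(steps):
            x, y = make_batch()              # draw a mini-batch
            opt.zero_grad()
            loss_fn(layer(x), y).backward()
            opt.step()
        layer.merge_and_reset()              # accumulate the residual

# Toy usage on synthetic regression data (illustrative only).
target = torch.randn(32, 32)
make_batch = lambda: (lambda x: (x, x @ target.T))(torch.randn(64, 32))
layer = LoRALinear(nn.Linear(32, 32))
chain_of_lora(layer, make_batch)
```

The design point is that each merge step keeps only full-rank backbone weights plus one active low-rank module, so the accumulated adaptation can exceed the expressiveness of a single fixed-rank update without increasing the number of trainable parameters at any one time.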
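The second sketch illustrates the multiplicative-weights idea that the abstract attributes to SAMUEL: maintain a grid of candidate learning rates, down-weight each candidate according to the loss it produced over an evaluation interval, and sample the next interval's learning rate from the resulting distribution. The candidate grid, the exponential update, the interval-based feedback, and all names here are illustrative assumptions; they are not the thesis's algorithm or its adaptive-regret analysis.

```python
import math
import random

class MWLearningRateSelector:
    """Multiplicative-weights chooser over a fixed grid of candidate learning rates."""
    def __init__(self, candidates=(1e-1, 3e-2, 1e-2, 3e-3, 1e-3), eta=1.0):
        self.candidates = list(candidates)
        self.weights = [1.0] * len(self.candidates)   # one expert per candidate
        self.eta = eta

    def sample(self):
        """Sample a learning rate in proportion to the current weights."""
        total = sum(self.weights)
        r, acc = random.random() * total, 0.0
        for lr, w in zip(self.candidates, self.weights):
            acc += w
            if r <= acc:
                return lr
        return self.candidates[-1]

    def update(self, losses):
        """losses[i]: loss observed for candidate i over the last interval;
        smaller loss means a smaller multiplicative penalty."""
        for i, loss in enumerate(losses):
            self.weights[i] *= math.exp(-self.eta * loss)
        z = sum(self.weights)                 # renormalize to avoid underflow
        self.weights = [w / z for w in self.weights]

# Illustrative use: pick a rate for the next interval, then re-weight candidates.
selector = MWLearningRateSelector()
lr = selector.sample()
selector.update(losses=[0.9, 0.7, 0.5, 0.6, 0.8])   # hypothetical per-candidate losses
```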
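The third sketch gives a rough picture of per-instance dynamic execution: a compact auxiliary network inspects pooled input features and predicts, for each instance, a skip decision and an attention score for the main block. The module names, gate parameterization, and hard thresholding are assumptions for illustration; the thesis's actual auxiliary networks, training objective, and channel-level gating differ, and real FLOP savings require actually bypassing the skipped computation rather than masking it as done here for clarity.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Backbone block wrapped with a small auxiliary gate that decides, per
    input instance, whether to keep the block and how to rescale its output."""
    def __init__(self, block: nn.Module, channels: int, hidden: int = 16):
        super().__init__()
        self.block = block
        self.aux = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),            # [skip logit, attention score]
        )

    def forward(self, x):
        g = self.aux(x)                                   # (batch, 2)
        keep = (torch.sigmoid(g[:, :1]) > 0.5).float()    # hard keep/skip decision
        scale = torch.sigmoid(g[:, 1:])                   # attention rescaling
        out = self.block(x)                  # still evaluated here for simplicity;
                                             # skipping it is what saves FLOPs in practice
        keep = keep.view(-1, 1, 1, 1)
        scale = scale.view(-1, 1, 1, 1)
        # skipped instances fall back to the identity (residual) path
        return keep * scale * out + x

# Illustrative use with a simple residual convolution as the backbone block.
block = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
gated = GatedBlock(block, channels=16)
y = gated(torch.randn(4, 16, 32, 32))        # per-instance keep/skip plus rescaling
```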
URI: | http://arks.princeton.edu/ark:/88435/dsp015q47rs12x |
Type of Material: | Academic dissertations (Ph.D.) |
Language: | en |
Appears in Collections: | Electrical Engineering |
Files in This Item:
File | Description | Size | Format
---|---|---|---
Xia_princeton_0181D_15132.pdf | | 3.83 MB | Adobe PDF