Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp017d278w882
Title: Utility Scheduling for Multi-Tenant Clusters
Authors: Stafman, Logan Lee
Advisors: Freedman, Michael J
Contributors: Computer Science Department
Keywords: Approximate Computing; Distributed Systems; Scheduling; Utility-Aware
Subjects: Computer science
Issue Date: 2019
Publisher: Princeton, NJ : Princeton University
Abstract: The rapid increase in data size along with the complex patterns of data usage amongst data scientists presents new challenges for large-scale data analytics systems. Modern distributed computing frameworks must support complex applications that range from answering database queries to training machine learning models. As data centers have grown, managing their resources has become an increasingly important task. New applications have become popular that make traditional scheduling systems inadequate. In this thesis, we present distributed scheduling systems aimed at increasing cluster resource utilization by taking advantage of specific characteristics of data processing applications. First, we identify a set of applications whose characteristics make them prime targets for utility-based scheduling. We then focus on two specific types of these applications in the following systems: (i) SLAQ: a cluster scheduling system for machine learning (ML) training jobs that aims to maximize the qualities of all models trained. In exploratory model training, models can be improved more quickly by redirecting resources to jobs with the highest potential for improvement. SLAQ reduces latency and maximizes the quality of models being trained by a distributed ML cluster. (ii) ReLAQS: a cluster scheduling system for incremental approximate query processing (AQP) systems that aims to minimize the error of all approximate results. In AQP, queries compute approximate results by sampling data. In AQP, error can be reduced more quickly by allocating resources to queries with higher error. ReLAQS reduces the latency required to reach a query result with a given level of error in a shared AQP environment. These works demonstrate a novel set of methods that can be used in fine-grained scheduling to build responsive, efficient distributed systems.
We have evaluated these systems on standard benchmark workloads and datasets, as well as popular ML algorithms, and show both reduced latency and increased accuracy of intermediary results.
URI: http://arks.princeton.edu/ark:/88435/dsp017d278w882
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Computer Science

Files in This Item:
File: Stafman_princeton_0181D_13000.pdf
Size: 7.3 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.