Skip navigation
Please use this identifier to cite or link to this item:
Authors: Ngo, Quang Minh Khiem
Advisors: Lloyd, Wyatt
Contributors: Computer Science Department
Subjects: Computer science
Issue Date: 2021
Publisher: Princeton, NJ : Princeton University
Abstract: Replicated state machines (RSMs) are linearizable, fault-tolerant groups of replicas that are coordinated using a consensus algorithm. RSMs are used throughout large-scale systems, such as distributed databases, cloud storage, and service managers. At such scale, it is common for some machines to be slow. In this dissertation, we address the problem of developing slowdown-tolerant RSMs that continue to operate as fast RSMs despite the presence of slow replicas. We define s-slowdown-tolerance and identify why existing consensus protocols are not slowdown-tolerant. As the first and pragmatic step toward slowdown-tolerant RSMs, we design, implement, and evaluate two 1-slowdown-tolerant consensus protocols, Copilot and Latent Copilot.Copilot replication is the first 1-slowdown-tolerant consensus protocol: it delivers normal latency despite the slowdown of any 1 replica. Copilot uses two distinguished replicas—the pilot and copilot—to proactively add redundancy to all stages of processing a client’s command. Copilot uses dependencies and deduplication to resolve potentially differing orderings proposed by the pilots. To avoid dependencies leading to either pilot being able to slow down the group, Copilot uses fast takeovers that allow a fast pilot to complete the ongoing work of a slow pilot. Copilot includes two optimizations—pingpong batching and null dependency elimination—that improve its performance when there are 0 and 1 slow pilots respectively. Our evaluation of Copilot shows its performance is lower but competitive with Multi-Paxos and EPaxos when no replicas are slow. When a replica is slow, Copilot is the only protocol that avoids high latencies. Latent Copilot, a variant of Copilot, is another design and implementation of a 1-slowdown-tolerant consensus protocol. Latent Copilot operates with one active pilot, which actively proposes commands, and one latent pilot, which proposes commands only when necessary. In this way, Latent Copilot achieves an intermediate tradeoff between Multi-Paxos and Copilot in terms of throughput and slowdown tolerance. Our evaluation of Latent Copilot shows that it achieves 1-slowdown tolerance and that its performance with no slowdowns is comparable to Multi-Paxos. Finally, we generalize the design of Copilot replication to handle more than one slowdown. We provide the design of an s-slowdown-tolerant protocol that achieves slowdown tolerance despite s slow replicas.
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog:
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections:Computer Science

Files in This Item:
File Description SizeFormat 
Ngo_princeton_0181D_13869.pdf1.34 MBAdobe PDFView/Download

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.