Skip navigation
Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01mc87pq32q
Title: New Systems and Algorithms for Scalable Fault Tolerance
Authors: Sen, Siddhartha
Advisors: Freedman, Michael J.
Tarjan, Robert E.
Contributors: Computer Science Department
Keywords: balanced trees
Byzantine fault tolerance
database access methods
expander graphs
join-leave attacks
partial broadcast
Subjects: Computer science
Applied mathematics
Issue Date: 2013
Publisher: Princeton, NJ : Princeton University
Abstract: Users of Internet services are increasingly intolerant of delays and outages, while demanding a consistent online experience. A website that is down or misbehaving is reported within seconds, often with an embarrassing screenshot that spreads through the news like wildfire. Among these failures, the most notorious are the ones that manifest arbitrary behavior, such as returning the wrong content to users or accidentally deleting their data. Unfortunately, protecting against such failures---whether due to misconfigurations, bugs, or even malice---is prohibitively expensive, because most existing solutions do not scale beyond a single server's performance. As a result, these solutions are not used for customer-facing services, where scalability is required to cope with large user populations. This thesis describes new systems and algorithms for tolerating arbitrary failures in Internet services, inspired by real-world debacles. Unlike prior work, our solutions are highly scalable. Our approach integrates theoretical innovations into the later stages of system design, giving robust guarantees that are also practical. We begin with a real failure that occurred in the indexing technique used by a certain database provider, and explain theoretically why the technique failed. We remedy the technique by introducing a new class of tree data structures, called relaxed trees, with provably good properties. Our analysis of relaxed trees makes use of exponential potential functions. Then, we describe a general system for tolerating arbitrary failures, called Prophecy, that delivers scalable performance on read-mostly workloads. With a modest trust assumption, Prophecy is practical for modern Internet services, as our evaluation confirms. Finally, we devise two techniques to scale this fault tolerance to very large-scale systems and general workloads. The first is an algorithm for securely composing many small replica groups, subject to an adversary that can coordinate faulty nodes across the groups dynamically. The second is a technique for improving the fault tolerance within each replica group, by adding small, trusted broadcast channels that mitigate the impact of faulty nodes.
URI: http://arks.princeton.edu/ark:/88435/dsp01mc87pq32q
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections:Computer Science

Files in This Item:
File Description SizeFormat 
Sen_princeton_0181D_10607.pdf1.33 MBAdobe PDFView/Download


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.