New Systems and Algorithms for Scalable Fault Tolerance

Sen, Siddhartha

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01mc87pq32q

Title:	New Systems and Algorithms for Scalable Fault Tolerance
Authors:	Sen, Siddhartha
Advisors:	Freedman, Michael J.; Tarjan, Robert E.
Contributors:	Computer Science Department
Keywords:	balanced trees; Byzantine fault tolerance; database access methods; expander graphs; join-leave attacks; partial broadcast
Subjects:	Computer science; Applied mathematics
Issue Date:	2013
Publisher:	Princeton, NJ : Princeton University
Abstract:	Users of Internet services are increasingly intolerant of delays and outages, while demanding a consistent online experience. A website that is down or misbehaving is reported within seconds, often with an embarrassing screenshot that spreads through the news like wildfire. Among these failures, the most notorious are the ones that manifest arbitrary behavior, such as returning the wrong content to users or accidentally deleting their data. Unfortunately, protecting against such failures, whether due to misconfigurations, bugs, or even malice, is prohibitively expensive, because most existing solutions do not scale beyond a single server's performance. As a result, these solutions are not used for customer-facing services, where scalability is required to cope with large user populations. This thesis describes new systems and algorithms for tolerating arbitrary failures in Internet services, inspired by real-world debacles. Unlike prior work, our solutions are highly scalable. Our approach integrates theoretical innovations into the later stages of system design, giving robust guarantees that are also practical. We begin with a real failure that occurred in the indexing technique used by a certain database provider, and explain theoretically why the technique failed. We remedy the technique by introducing a new class of tree data structures, called relaxed trees, with provably good properties. Our analysis of relaxed trees makes use of exponential potential functions. Then, we describe a general system for tolerating arbitrary failures, called Prophecy, that delivers scalable performance on read-mostly workloads. With a modest trust assumption, Prophecy is practical for modern Internet services, as our evaluation confirms. Finally, we devise two techniques to scale this fault tolerance to very large-scale systems and general workloads. The first is an algorithm for securely composing many small replica groups, subject to an adversary that can coordinate faulty nodes across the groups dynamically. The second is a technique for improving the fault tolerance within each replica group, by adding small, trusted broadcast channels that mitigate the impact of faulty nodes.
URI:	http://arks.princeton.edu/ark:/88435/dsp01mc87pq32q
Alternate format:	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog.
Type of Material:	Academic dissertations (Ph.D.)
Language:	en
Appears in Collections:	Computer Science

Files in This Item:

File	Description	Size	Format
Sen_princeton_0181D_10607.pdf		1.33 MB	Adobe PDF
