Please use this identifier to cite or link to this item:
Title: Data Processing Across Continents
Authors: Arye, Matvey
Advisors: Freedman, Michael J
Contributors: Computer Science Department
Keywords: Data Analytics; Video Streaming
Subjects: Computer science
Issue Date: 2016
Publisher: Princeton, NJ : Princeton University
Abstract: An increasing number of data sources create data across the globe. These include everything from server logs owned by Internet-scale companies to military intelligence systems. This thesis addresses the question of how to enable near-real-time analytical queries on such data. Existing systems tend to centralize such data into a single datacenter before analyzing it. However, in light of low and asymmetric bandwidth provisioning in and between certain geographic regions, centralizing all data can be both slow and costly. This thesis addresses this problem with three complementary research directions. First, we describe a system that queries the data in a distributed manner and centralizes only the data that is needed to fulfill the query. Our system incorporates edge storage and customizable degradation operators in its programming model. These elements allow the system to adjust the data volumes transferred to match available bandwidth. Second, we explore some challenges due to the interaction between an application-level dynamic quality adaptation control loop and TCP (which has its own control loop). These two control loops can create negative feedback effects that reduce system throughput below what the network can sustain. These insights are translated into the domain of Internet video streaming, and several solutions are proposed. Our solutions enable video flows to achieve more than 90% of their fair share of throughput, while industry players often achieve less than 50% of their fair share. Finally, we present a case study of how to optimize queries for wide-area analysis. We optimize the top-k query, which addresses questions of popularity and is thus ubiquitous in modern computer systems. Our algorithms reduce both the bandwidth usage and the number of rounds used by such queries. In particular, we propose the first exact two-round top-k algorithm (which still transfers 19% fewer bytes than the best previously known exact three-round algorithm).
Our two-or-three-round exact algorithm transfers 31% fewer bytes than the best previously known approximate algorithm. In addition, our approximate algorithm uses 40% less bandwidth than previous algorithms with stronger guarantees.
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog:
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Computer Science

Files in This Item:
File: Arye_princeton_0181D_11808.pdf
Size: 1.32 MB
Format: Adobe PDF

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.