Data Processing Across Continents

Arye, Matvey

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01gh93h192h

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Freedman, Michael J	-
dc.contributor.author	Arye, Matvey	-
dc.contributor.other	Computer Science Department	-
dc.date.accessioned	2016-06-09T15:00:57Z	-
dc.date.available	2016-09-01T05:23:12Z	-
dc.date.issued	2016	-
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp01gh93h192h	-
dc.description.abstract	An increasing number of data sources create data across the globe. These include everything from server logs owned by Internet-scale companies to military intelligence systems. This thesis addresses the question of how to enable near-real-time analytical queries on such data. Existing systems tend to centralize such data into a single datacenter before analyzing it. However, in light of low and asymmetric bandwidth provisioning in and between certain geographic regions, centralizing all data can be both slow and costly. This thesis addresses this problem through three complementary research directions. First, we describe a system that queries the data in a distributed manner and centralizes only the data that is needed to fulfill the query. Our system incorporates edge storage and customizable degradation operators in its programming model. These elements allow the system to adjust the data volumes transferred to match available bandwidth. Second, we explore some challenges that arise from the interaction between an application-level dynamic quality adaptation control loop and TCP (which has its own control loop). These two control loops can create negative feedback effects that reduce system throughput below what the network can sustain. We translate these insights to the domain of Internet video streaming and propose several solutions. Our solutions enable video flows to achieve more than 90% of their fair share of throughput, while industry players often achieve less than 50% of their fair share. Finally, we present a case study of how to optimize queries for wide-area analysis. We optimize the top-k query, which addresses questions of popularity and is thus ubiquitous in modern computer systems. Our algorithms reduce both the bandwidth usage and the number of rounds used by such queries.
In particular, we propose the first exact two-round top-k algorithm (which still transfers 19% fewer bytes than the best previously known exact three-round algorithm). Our two-or-three-round exact algorithm transfers 31% fewer bytes than the best previously known approximate algorithm. Finally, our approximate algorithm uses 40% less bandwidth than previous algorithms with stronger guarantees.	-
dc.language.iso	en	-
dc.publisher	Princeton, NJ : Princeton University	-
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu/	-
dc.subject	Data Analytics	-
dc.subject	Networking	-
dc.subject	Top-K	-
dc.subject	Video Streaming	-
dc.subject	Wide-Area	-
dc.subject.classification	Computer science	-
dc.title	Data Processing Across Continents	-
dc.type	Academic dissertations (Ph.D.)	-
pu.projectgrantnumber	690-2143	-
pu.embargo.terms	2016-09-01	-
Appears in Collections:	Computer Science

Files in This Item:

File	Description	Size	Format
Arye_princeton_0181D_11808.pdf		1.32 MB	Adobe PDF
