Authors: Soviani, Adrian M.
Advisors: Singh, Jaswinder Pal
Li, Kai
Contributors: Computer Science Department
Keywords: Coarse Grain Dataflow
High Performance Computing
Parallel Computing
Subjects: Computer science
Issue Date: 2014
Publisher: Princeton, NJ : Princeton University
Abstract: The design of parallel programming models that achieve a good trade-off between productivity and efficiency, while maintaining performance portability and cost transparency, remains a challenging task. Similarly, parallel runtime cost modeling is essential for application and architecture design, as well as for performance optimization; however, cost accuracy remains limited when modeling the effect of bandwidth bottlenecks on globally unbalanced communication. This dissertation proposes a hybrid dataflow model (CGD) that leverages the simplicity and elegance of dataflows and the good performance scalability of Single Program Multiple Data (SPMD) computations. Benchmark analysis shows that the CGD model increases productivity while maintaining or exceeding the performance of the MPI and pthreads models. The thesis also presents a hierarchical bandwidth machine model (αDBSP) that can estimate the execution time of CGD collective communication by naturally extending and improving the Decomposable Bulk Synchronous Parallel (DBSP) model. The CGD model is a dataflow graph with SPMD computation nodes and data nodes representing data-structure decompositions; it exploits dataflow semantics to express data and task parallelism at a high level, and relies on imperative languages to express efficient sequential computations. Data and computation partitioning and assignment are explicit, while communication, synchronization, and machine-specific optimizations are handled automatically. This dissertation introduces a coordination language with dataflow semantics that implements the CGD model, and presents several applications and their optimizations implemented in this language. The CGD runtime supports MPI, SHMEM, and pthreads, running on both shared-memory and cluster machines. Results from a 128-processor SGI Altix 4700 system show that the optimized CGD FT outperforms NPB2.3 MPI by 27%, the optimized CGD stencil is 41% faster than
handwritten MPI, and the CGD Barnes-Hut particle simulation improves on SPLASH2 by 14%. The αDBSP model extends DBSP by associating a bandwidth growth factor α with message patterns, improves on DBSP's execution-time estimates, and aids machine bandwidth budgeting by estimating application hierarchical bandwidth. Consequently, for some globally unbalanced problems the αDBSP analysis is more accurate, and sometimes simpler. For example, the single-element nearest-neighbor message exchange running on a pruned butterfly requires O(log^3(p)) on αDBSP vs. O(√p) on DBSP, while optimally modeling the one-to-all broadcast requires a single communication step on αDBSP vs. O(log(p)) steps on DBSP. We present three scientific computing kernels that illustrate the differences between αDBSP and DBSP analysis.
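For context, DBSP and αDBSP both descend from Valiant's BSP cost model, in which the time of one superstep on p processors is bounded by local work plus routing plus synchronization. A standard statement of that base cost is sketched below; note that the submachine decomposition of DBSP and the bandwidth growth factor α of αDBSP refine the routing term in ways defined in the thesis itself, not in this abstract:

```latex
% One BSP superstep: w = maximum local work on any processor,
% h = maximum messages sent or received by any processor (an h-relation),
% g = bandwidth gap (cost per message), l = barrier synchronization latency
T_{\text{superstep}} \;=\; w + g \cdot h + l
```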
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog.
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Computer Science

Files in This Item:
File: Soviani_princeton_0181D_10945.pdf
Size: 1.71 MB
Format: Adobe PDF

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.