Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01wm117r22g
Full metadata record
dc.contributor.advisor: Martonosi, Margaret R (en_US)
dc.contributor.advisor: Shaw, Kelly A (en_US)
dc.contributor.author: Jia, Wenhao (en_US)
dc.contributor.other: Electrical Engineering Department (en_US)
dc.date.accessioned: 2014-11-21T19:33:37Z
dc.date.available: 2014-11-21T19:33:37Z
dc.date.issued: 2014 (en_US)
dc.identifier.uri: http://arks.princeton.edu/ark:/88435/dsp01wm117r22g
dc.description.abstract (en_US):

In response to the ever-growing demand for computing power, heterogeneous parallelism has emerged as a widespread computing paradigm in the past decade or so. In particular, massively parallel processors such as graphics processing units (GPUs) have become the prevalent throughput computing elements in heterogeneous systems, offering high performance and power efficiency for general-purpose workloads.

However, GPUs are difficult to program and design for several reasons. First, GPUs are relatively new and still undergo frequent design changes, making it challenging for GPU programmers and designers to determine which architectural resources have the highest performance or power impact. Second, the lack of virtualization in GPUs often causes strong and unexpected resource interactions, and it forces software developers to program for specific hardware details such as thread counts and scratchpad sizes, imposing programmability and portability hurdles. Third, although some GPU components, such as general-purpose caches, have been introduced to improve performance and programmability, they are not well tailored to GPU characteristics such as favoring throughput over latency. As a result, these conventionally designed components suffer from resource contention caused by high thread parallelism and do not reach their full performance and programmability potential.

To overcome these challenges, this thesis proposes statistical analysis techniques as well as software and hardware optimizations that improve the performance, power efficiency, and programmability of GPUs. These proposals make it easier for programmers and designers to produce optimized GPU software and hardware designs.

The first part of the thesis describes how statistical analysis can help users explore a GPU software or hardware design space with performance or power as the metric of interest. In particular, two fully automated tools, Stargazer and Starchart, are developed and presented. Stargazer is based on linear regression: it identifies globally important GPU design parameters and their interactions, revealing which factors have the highest performance or power impact (a sketch follows the abstract). Starchart improves on Stargazer by using recursive partitioning to identify not only globally but also locally influential design parameters; more importantly, Starchart can be used to solve design problems formulated as a series of design decisions (also sketched below). These tools ease design tuning while cutting design exploration time by 300-3000 times compared to exhaustive approaches.

Then, inspired by two Starchart case studies, the second part of the thesis focuses on two key GPU software design decisions: cache configuration and thread block size selection. Compile-time algorithms are proposed to make these decisions automatically, improve program performance, and ease GPU programming. The first algorithm analyzes a program's memory access patterns and turns caching on or off accordingly for each instruction, improving the performance benefit of caching from 5.8% to 18% (sketched below). The second algorithm estimates the number of threads sufficient to saturate either memory bandwidth or compute throughput; running programs with the estimated thread counts, instead of the hardware maximum, reduces GPU core resource usage by 27-62% while improving performance by 5-10% (sketched below).
Finally, to show how well-designed hardware can transparently improve GPU performance and programmability, the third part of the thesis proposes and evaluates the memory request prioritization buffer (MRPB). MRPB automates GPU cache management, reduces cache contention, and increases cache throughput. It does so by using request reordering to reduce cache thrashing and by using cache bypassing to reduce resource stalls (a toy model follows below). In addition to improving performance by 1.3-2.7 times and easing GPU programming, MRPB highlights the value of tailoring conventionally designed GPU hardware components to the massively parallel nature of GPU workloads.

In summary, using GPUs as an example, the high-level statistical tools and the more focused software and hardware studies presented in this thesis demonstrate how automation techniques can effectively improve the performance, power efficiency, and programmability of emerging heterogeneous computing platforms.
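
To make the Stargazer idea concrete, here is a minimal sketch of regression-based design-space analysis: fit a linear model with pairwise interaction terms over sampled design points, then rank terms by coefficient magnitude to see which parameters and interactions matter most. The parameter names, sample data, and coefficients below are hypothetical, not drawn from the thesis.

    # Stargazer-style sketch: linear regression with pairwise interactions
    # over sparse design-space samples; |coefficient| approximates impact.
    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)

    # Hypothetical design parameters, normalized to [0, 1], and a synthetic
    # response standing in for measured runtime or power.
    params = ["core_count", "core_freq", "mem_bw", "cache_size"]
    X = rng.random((300, len(params)))
    runtime = (5.0 - 2.0 * X[:, 0] - 1.5 * X[:, 2]
               + 0.8 * X[:, 0] * X[:, 2] + 0.1 * rng.standard_normal(300))

    # Augment the design matrix with pairwise interaction columns.
    cols, feats = list(params), [X]
    for i, j in combinations(range(len(params)), 2):
        feats.append((X[:, i] * X[:, j])[:, None])
        cols.append(params[i] + "*" + params[j])
    A = np.hstack([np.ones((300, 1))] + feats)

    # Least-squares fit; sort terms by effect size.
    beta, *_ = np.linalg.lstsq(A, runtime, rcond=None)
    for name, b in sorted(zip(cols, beta[1:]), key=lambda t: -abs(t[1])):
        print(f"{name:22s} {b:+.3f}")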
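
Starchart's recursive partitioning can be sketched in the same spirit: greedily split the sampled design space on whichever parameter and threshold most reduce variance in the response, exposing parameters that matter only in some subregions. This is a toy regression tree over made-up data, not the tool's actual implementation.

    # Starchart-style sketch: recursive partitioning of a sampled design
    # space; each split names the locally most influential parameter.
    import numpy as np

    def partition(X, y, names, depth=0, max_depth=3, min_leaf=20):
        if depth == max_depth or len(y) < 2 * min_leaf:
            print("  " * depth + f"leaf: mean={y.mean():.2f}, n={len(y)}")
            return
        best = None  # (variance reduction, feature index, threshold)
        for f in range(X.shape[1]):
            for t in np.quantile(X[:, f], [0.25, 0.5, 0.75]):
                lo, hi = y[X[:, f] <= t], y[X[:, f] > t]
                if len(lo) < min_leaf or len(hi) < min_leaf:
                    continue
                gain = y.var() - (len(lo) * lo.var() + len(hi) * hi.var()) / len(y)
                if best is None or gain > best[0]:
                    best = (gain, f, t)
        if best is None:
            print("  " * depth + f"leaf: mean={y.mean():.2f}, n={len(y)}")
            return
        _, f, t = best
        print("  " * depth + f"split on {names[f]} <= {t:.2f}")
        mask = X[:, f] <= t
        partition(X[mask], y[mask], names, depth + 1, max_depth, min_leaf)
        partition(X[~mask], y[~mask], names, depth + 1, max_depth, min_leaf)

    # Synthetic samples: thread count dominates globally, while cache size
    # and unrolling each matter only within one branch.
    rng = np.random.default_rng(1)
    names = ["thread_count", "cache_kb", "unroll_factor"]
    X = rng.random((400, len(names)))
    y = (np.where(X[:, 0] > 0.5, 2.0 + X[:, 1], 5.0 - 3.0 * X[:, 2])
         + 0.1 * rng.standard_normal(400))
    partition(X, y, names)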
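
The per-instruction caching decision can be pictured as a small compile-time classifier over static access patterns: cache loads whose reuse fits within the cache, and bypass streaming or thrashing loads. The pattern model and thresholds here are deliberately simplified assumptions for illustration, not the algorithm from the thesis.

    # Hypothetical per-load caching heuristic. Stride and reuse distance
    # would come from compile-time memory access analysis.
    CACHE_LINE = 128   # bytes per cache line on the assumed GPU
    ELEM = 4           # bytes per array element

    def cache_decision(stride_elems, reuse_distance_lines, cache_lines=256):
        """Return True to cache a load, False to bypass the L1."""
        # Streaming loads with large strides touch each line once: bypass.
        if reuse_distance_lines is None and stride_elems * ELEM >= CACHE_LINE:
            return False
        # Reuse beyond the cache's reach would only thrash: bypass.
        if reuse_distance_lines is not None and reuse_distance_lines > cache_lines:
            return False
        return True  # dense or short-reuse accesses benefit from caching

    # Invented loads: (name, stride in elements, reuse distance in lines).
    for name, stride, reuse in [("dense row read", 1, 8),
                                ("strided column read", 1024, None),
                                ("large-window stencil", 1, 4096)]:
        verdict = "cache" if cache_decision(stride, reuse) else "bypass"
        print(f"{name:22s} -> {verdict}")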
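
The thread-count estimate is, at heart, a Little's-law calculation on both the memory and compute paths: launch enough threads to keep the saturating resource fully occupied, and no more. Every hardware number below is an illustrative assumption, not a figure from the thesis.

    # Little's law: requests in flight = throughput x latency. Dividing by
    # the work each thread keeps outstanding gives a thread-count estimate.
    MEM_BW      = 200e9    # bytes/s   (assumed memory bandwidth)
    MEM_LAT     = 400e-9   # s         (assumed average memory latency)
    ALU_RATE    = 1.5e12   # ops/s     (assumed peak instruction throughput)
    ALU_LAT     = 20e-9    # s         (assumed arithmetic pipeline latency)
    BYTES_PER_T = 32       # bytes each thread keeps in flight (kernel-specific)
    OPS_PER_T   = 2        # independent ALU ops in flight per thread

    mem_threads = int(MEM_BW * MEM_LAT / BYTES_PER_T)   # memory-side estimate
    alu_threads = int(ALU_RATE * ALU_LAT / OPS_PER_T)   # compute-side estimate

    # A memory-bound kernel saturates at the first figure, a compute-bound
    # one at the second; threads beyond that point only add register and
    # scratchpad pressure without raising throughput.
    print(f"memory saturation:  ~{mem_threads} threads")
    print(f"compute saturation: ~{alu_threads} threads")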
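
Finally, a toy software model of the MRPB mechanism: buffer incoming requests in per-line queues so that same-line requests drain back-to-back (reordering away thrashing), and send requests straight to memory instead of stalling when the cache is saturated (bypassing). The queue count, cache capacity, and demo access stream are all invented for illustration; the actual MRPB is a hardware structure.

    # Toy MRPB: hash requests into queues by cache line, drain one queue
    # at a time, and bypass the cache rather than stall when it is full.
    from collections import deque

    class MRPB:
        def __init__(self, num_queues=8):
            self.queues = [deque() for _ in range(num_queues)]

        def enqueue(self, addr, line_bytes=128):
            # Same-line requests land in the same queue, so they drain
            # back-to-back instead of interleaving with other warps.
            line = addr // line_bytes
            self.queues[line % len(self.queues)].append(line)

        def drain(self, cache_capacity=4):
            cache, hits, misses, bypassed = set(), 0, 0, 0
            for q in self.queues:
                while q:
                    line = q.popleft()
                    if line in cache:
                        hits += 1
                    elif len(cache) < cache_capacity:
                        cache.add(line)
                        misses += 1
                    else:
                        bypassed += 1   # straight to memory, no stall
            return hits, misses, bypassed

    # Eight warps each re-read their own line, interleaved three times;
    # reordering turns the interleaved stream into per-line bursts.
    buf = MRPB()
    for _ in range(3):
        for warp in range(8):
            buf.enqueue(warp * 128)
    hits, misses, bypassed = buf.drain()
    print(f"hits={hits} misses={misses} bypassed={bypassed}")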
dc.language.iso: en (en_US)
dc.publisher: Princeton, NJ : Princeton University (en_US)
dc.relation.isformatof: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu (en_US)
dc.subject: Cache (en_US)
dc.subject: Compiler (en_US)
dc.subject: Design space exploration (en_US)
dc.subject: GPGPU (en_US)
dc.subject: Graphics processing unit (en_US)
dc.subject: High-performance computing (en_US)
dc.subject.classification: Computer engineering (en_US)
dc.subject.classification: Computer science (en_US)
dc.subject.classification: Electrical engineering (en_US)
dc.title: Analysis and Optimization Techniques for Massively Parallel Processors (en_US)
dc.type: Academic dissertations (Ph.D.) (en_US)
pu.projectgrantnumber: 690-2143 (en_US)
Appears in Collections: Electrical Engineering

Files in This Item:
File: Jia_princeton_0181D_11168.pdf
Size: 3.02 MB
Format: Adobe PDF


Items in DataSpace are protected by copyright, with all rights reserved, unless otherwise indicated.