Title: Architectural Support for Large-scale Shared Memory Systems
Authors: Fu, Yaosheng
Advisors: Wentzlaff, David
Contributors: Electrical Engineering Department
Keywords: Cache Coherence
Distributed Systems
Fault Tolerance
Parallel Simulator
Shared Memory
Subjects: Computer engineering
Electrical engineering
Computer science
Issue Date: 2017
Publisher: Princeton, NJ : Princeton University
Abstract: Modern CPUs, GPUs, and data centers are being built with ever more cores, and many popular workloads will demand even more hardware parallelism in the future. Shared memory is a popular parallel programming model with many advantages, but it has historically been difficult to scale to large numbers of cores and nodes. This thesis investigates hardware and software techniques that enable shared memory systems to scale. Specifically, this work addresses two key challenges of large-scale shared memory systems: scalability and fault tolerance. The primary scalability challenge is the need to maintain cache coherence across all cores and nodes, which is difficult at scale. The fault-tolerance challenge arises mainly in distributed shared memory (DSM) systems, which are usually tightly integrated and therefore provide poor fault isolation between nodes. To address these challenges, this thesis first develops a parallel simulator named PriME to simulate shared memory systems at scale. PriME is a parallel and distributed simulator that supports both multi-threaded and multi-programmed workloads. For scalability, this thesis introduces Coherence Domain Restriction (CDR), a cache coherence framework that sidesteps traditional scalability limits and enables systems to scale to thousands of cores within a manycore chip or millions of cores across an entire data center. The complete CDR framework has been implemented on the 25-core Princeton Piton processor. For fault tolerance, this thesis develops both a software-centric solution, resilient memory operations (REMO), and a hardware-centric solution, a fault-tolerant cache coherence framework (FTCC). REMO is a set of load and store instructions that report faults to software, letting programmers choose which faults to handle; it provides fault isolation in DSM systems, enabling them to scale without sacrificing resilience.
FTCC, in contrast, extends DSM systems with native fault tolerance in hardware without sacrificing their performance advantages. In sum, this thesis demonstrates that shared memory systems can achieve scalability and fault tolerance comparable to current cluster-based designs while retaining other benefits such as ease of programming and efficient memory access.
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog:
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections:Electrical Engineering

Files in This Item:
There are no files associated with this item.

Items in DataSpace are protected by copyright, with all rights reserved, unless otherwise indicated.