Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01zs25x849r
Title: Fault Tolerant Architectures for On-Chip Networks
Authors: Aisopos, Konstantinos
Advisors: Peh, Li-Shiuan
Contributors: Electrical Engineering Department
Keywords: architecture
coherence protocol
fault tolerant
network on chip
reliable
resilience
Subjects: Computer engineering
Computer science
Electrical engineering
Issue Date: 2012
Publisher: Princeton, NJ : Princeton University
Abstract: Technology scaling has reached miniaturization levels where multiple processor cores can be integrated onto the same die. During the last four decades, this scaling has been the primary driver behind improving system performance, at the expense of higher temperatures and power densities. However, when scaling down to deep submicron technologies, a new evil rises: unreliable silicon. The reason behind the increasing concerns for transistor reliability is that the effects of process variation, transistor aging, electrical noise, and high temperatures grow stronger as transistor dimensions shrink. Consequently, industry projects that future chips will be exposed to large numbers of failures and is researching fault-tolerant designs. At the same time, the number of processor cores in a single chip is increasing steadily, and an efficient on-chip communication medium between them is necessary. Packet-switched on-chip networks have been gaining increased importance in this area, due to their modularity and scalable bandwidth. However, due to extreme transistor scaling, these interconnection networks are expected to experience permanent defects and runtime failures in future technology generations. On top of this, a single failure in the network may cascade across several routers and ultimately cause interruption of network service. Hence, resilient on-chip networks, which can tolerate both permanent and runtime failures transparently to upper layers, are emerging. In this dissertation, we present a characterization study of network faults, and a full-system solution to tackle them. Our characterization is conducted with an accurate circuit-level tool, which we developed to explore the impact of faults in architecture. Specifically, we present a case study where we pinpoint the common fault types in the network, their probabilities, and their architectural outcome.
This way, we diagnose the vulnerable components of the interconnection network that need protection, and identify the fault types that resilient network architectures must address. We then propose a resilient architecture that can tolerate both permanent and transient faults in the interconnection network. To address permanent network faults, which disable communication links and network routers, we suggest a network architecture that can reconfigure at runtime and utilize its surviving network resources to enable continued chip operation. Our solution, Ariadne, explores the surviving topology upon each permanent failure, and discovers resilient routes to connect functional nodes. We also address transient network faults, which result in corrupted or lost coherence messages. We do so by developing a systematic methodology to incorporate resilience into the coherence protocol, so that lost and corrupted messages are resent and the corresponding transaction is replayed after a timeout. Overall, this dissertation argues that designing chips that never experience network failures will not be economically feasible in the future, because this would result in enormous performance degradation, as well as financial losses for chip vendors, since a large number of chips would not meet the required specifications during testing. Instead, we propose to continue exploiting transistor scaling to maintain the current rate of performance improvement, but tolerate failures, so that a chip can gracefully degrade its performance over time only after actual faults occur.
URI: http://arks.princeton.edu/ark:/88435/dsp01zs25x849r
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Electrical Engineering

Files in This Item:
File: Aisopos_princeton_0181D_10160.pdf
Size: 3.38 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.