Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01ns064940k
Title: Utilizing Subword Serialization and Parallelism to Design Efficient High-Performance Processors
Authors: Jackson, Paul
Advisors: Wentzlaff, David
Contributors: Electrical and Computer Engineering Department
Keywords: Bit-level Parallelism; GPUs; Microarchitecture; Parallelism
Subjects: Computer engineering
Issue Date: 2024
Publisher: Princeton, NJ : Princeton University
Abstract: Since the first digital computers, architects have exploited transistor scaling to drive innovation. Decreasing transistor sizes, and the resulting increase in transistor budgets, have naturally led to increasingly complex processor designs. Ultimately, advancements in processor architecture aim to improve performance on the computational workloads that enable computers to run high-level applications. For decades, transistors have scaled exponentially, but they are approaching fundamental physical limits. The power wall, identified at the end of Dennard scaling, prompted a revolution in architectural design. Because the power density of transistors no longer stayed constant as transistors scaled, architects developed methods to extract better performance without sacrificing power. A similar wall looms over the architecture community: the end of Moore's law. When transistors stop scaling, architects must find ways to improve performance without using more transistors. This transitional point, which marks a shift in priorities between different design parameters, provides the opportunity to direct research in many directions. This thesis presents a forward-looking perspective where silicon area is a fixed limitation and must be considered along with performance and power.
This thesis identifies bit-level parallelism as a particularly impactful form of parallelism, the nature and implications of which have not been thoroughly studied in prior art. The effects of bit-level parallelism are studied using Nibbler, a parameterized subword-serial SIMD architecture. Subword-serial architectures like Nibbler perform traditional word-wide computation over multiple cycles, operating on one subword-sized portion of the inputs per cycle. This work is the first to isolate and directly study the impact of bit-level parallelism in processor microarchitecture.
The analysis presents multiple design points ranging from a fully bit-serial processor to one with a full word-wide datapath of 32 bits. Nibbler is evaluated and characterized for its area, timing, energy, and throughput performance. These characterizations, in combination with additional sensitivity studies, shed light on the effectiveness of serialization as a design technique. This thesis shows that a subword-serial SIMD processor can achieve simultaneous improvements in all four metrics when compared to a non-serial, word-wide SIMD processor. The results of this study are further analyzed to discuss the impacts of serialization on processor microarchitecture, identifying opportunities for potential future improvements and highlighting key considerations when designing processors that use serialized execution.
One core idea presented throughout this thesis is that architectural concepts find strength when grounded in reality. While high-level models prove effective in estimating the limits of a technology or concept, physical prototypes demonstrate the feasibility of ideas and provide a minimum bound on performance. Following this ideology, this thesis contains detailed characterizations of two manycore academic chips: Piton and the DECADES test chip. These data, when used in conjunction with their open-sourced RTL and EDA infrastructure, provide anchoring data points that others can use as a reference in future architecture evaluation studies.
Piton is a 25-core homogeneous tiled processor taped out in the IBM 32nm SOI process technology. Two characterization studies are performed on Piton. The first breaks down the power and energy consumption of the chip on SPECint 2006 benchmarks. The second compares the effectiveness of two parallelization techniques, multicore execution and fine-grained multithreading, to provide insight into which techniques work better when optimizing for power, energy, or area.
DECADES is a 108-tile heterogeneous tiled processor taped out in the IBM 12nm process technology. The design efforts behind the development of Nibbler culminate in its contribution to the DECADES chip. DECADES contains 23 instances of a 64-lane, 8-bit-wide Nibbler processor. This thesis details the considerations that must be weighed when reconciling a theoretically optimal design point with the feasibility of creating a performant chip. This thesis presents an end-to-end analysis of serialization as a design parameter as well as related parallelization techniques. The analyses use a forward-looking perspective emphasizing feasibility and practicality, resulting in a complete picture ranging from theoretical reasoning to realization in silicon.
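Illustration (not part of the thesis record): the abstract describes subword-serial execution as performing a word-wide computation over multiple cycles, one subword-sized portion of the inputs per cycle. The Python sketch below models that idea for unsigned addition only. It is a minimal software analogy under stated assumptions, not Nibbler's actual datapath; the function name `subword_serial_add` and its parameters are hypothetical and chosen here for illustration.

```python
def subword_serial_add(a, b, subword_bits=8, word_bits=32):
    """Add two unsigned word_bits-wide integers one subword per modeled cycle.

    Hypothetical illustration only: each loop iteration stands in for one
    cycle of a subword-serial datapath, with the carry held between cycles.
    Returns (sum modulo 2**word_bits, number of cycles used).
    """
    if word_bits % subword_bits != 0:
        raise ValueError("subword_bits must evenly divide word_bits")

    mask = (1 << subword_bits) - 1        # selects one subword
    cycles = word_bits // subword_bits    # e.g. 32 / 8 = 4 cycles
    carry = 0
    result = 0

    for cycle in range(cycles):
        shift = cycle * subword_bits
        a_sub = (a >> shift) & mask       # this cycle's slice of operand a
        b_sub = (b >> shift) & mask       # this cycle's slice of operand b
        partial = a_sub + b_sub + carry   # narrow adder plus carry-in
        result |= (partial & mask) << shift
        carry = partial >> subword_bits   # carry-out feeds the next cycle

    return result, cycles


if __name__ == "__main__":
    # Sweep from fully bit-serial (1 bit) to a full 32-bit word-wide datapath.
    for width in (1, 4, 8, 32):
        total, cycles = subword_serial_add(0xFFFFFFFF, 1, subword_bits=width)
        print(f"{width:2d}-bit subwords: result={total:#010x}, cycles={cycles}")
```

Narrower subwords need a smaller adder but more cycles per word-wide operation, which is the trade-off the abstract's design points (bit-serial through 32-bit) explore.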
URI: http://arks.princeton.edu/ark:/88435/dsp01ns064940k
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Electrical Engineering

Files in This Item:
File: Jackson_princeton_0181D_15011.pdf
Size: 3.98 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.