Thermal-Aware Design Space Exploration of 3D Systolic ML Accelerators

,


I. INTRODUCTION
M ACHINE Learning (ML) algorithms are composed of both computationally and memory-intensive matrix multiplication operations. Systolic array architectures [1] achieve high throughput with modest bandwidth for matrix multiplication operations and hence make a good choice for ML acceleration. Systolic array-based ML accelerators have seen deployment in data-centers [2] [3] as well as in mobile platforms [4] [5]. As the ML application space continues to expand with big data and as the Neural Network (NN) models continue to grow bigger to achieve higher accuracy, the accelerators must scale to meet the increasing demands of computation and energy-efficiency.
At the same time, the typical gains in energy-efficiency that dimensional scaling has brought over the past several decades are slowing down [6] [7] [8]. 2D enhanced architectures [9] place dies side-by-side and interconnect them through mediums such as a silicon interposer [10], or embedded bridge [11] [12] to achieve higher interconnect densities compared to mainstream packages. 3D architectures like hybrid wafer bonding [13] [14] directly stack two or more dies on top of each other without using the agency of the package, This manuscript was submitted for review on 04/28/2021. R. Mathur further reducing distances and increasing interconnect densities between dies. 3D architectures may offer complementary gains to traditional dimensional scaling for achieving high performance, low power, high bandwidth, fast time-to-market, all in a small footprint. Larger 2D dies can be replaced by a few smaller ones with potentially higher manufacturing yields [15] [16]. Besides, 3D allows heterogeneous integration of parts from different technologies instead of having to redesign every component for a specific process [17]. As 3D technologies evolve, increasingly finer pitches of 3D connections become viable [18] [19]. This opens interesting possibilities for designers to partition and fold designs onto multiple tiers [20] [21]. Deep Neural Network (DNN) processing is heavy in computation and data movement [22]. 3D makes it possible to pack more compute or memory in the same 2D footprint while reducing interconnect delay and power by bringing the blocks closer. Hence, 3D provides an opportunity to build powerful and energy-efficient accelerators.
Traditional 2D systolic array design involves careful partitioning of the silicon real estate between the PE units and SRAM buffers to balance the throughput and external memory transfer bandwidth. 3D accelerator design additionally involves the optimal distribution of the increased silicon real estate available in the same 2D footprint between the PE units and memory. Further, the power density of systolic accelerators is high due to their desired high computing capability and closely packed PEs. This is exacerbated in 3D due to higher logic integration density which may lead to worse thermal characteristics [23]. Hence, the designer must take into account the thermal implications when partitioning the accelerator components among 3D tiers. A systematic methodology for navigating the 3D systolic accelerator design space accounting for the thermal issues is necessary. This paper makes the following contributions to address this issue: ‚ Provide a systematic framework to navigate the design space of 3D systolic array-based ML accelerators under different workload conditions. ‚ Perform system-level analysis to evaluate and compare different 3D-partitioned accelerator approaches for performance, power, and thermal characteristics. ‚ Provide insights and takeaways for system designers to perform thermal-aware design of such 3D accelerators.
The remainder of the paper is organized as follows: Section II provides background and prior work on 3D integration technologies and systolic architectures. Section III describes the 2D baseline design and different 3D partitioned configurations. Section IV delineates the simulation framework used to perform performance, power, and thermal analysis. Section V describes the experimental setup. Section VI presents results from a comparative analysis of different 3D accelerator configurations. Section VII provides concluding remarks.

II. BACKGROUND AND PRIOR WORK
This section provides a brief overview on various 3D IC technologies. A refresher is also provided on the basic principles of a systolic array-based DNN accelerators.

A. Overview of 3D Integration Technologies
Traditionally, two or more dies are flip-chip attached to an organic package substrate and interconnected with the agency of the package. Certain 2D enhanced (also referred to as 2.5D integration) utilize an interposer made of silicon, glass or, ceramic for high-density communication between separate dies mounted side by side ( figure 1a). The interposer may contain Through Silicon Vias (TSVs) [24] which are essentially holes etched out in the silicon wafer and then filled with a conductive metal like copper.
3D stacked ICs involve a die containing TSVs attached to the package substrate using conventional flip-chip technology and a second die, fabricated separately and bonded to the first die using micro bumps [25] or hybrid wafer bonds [13]. This leads to a back-to-face (B2F) configuration, as the back of the first die is bonded to the face of the second die ( figure  1b). Similarly, other configurations like back-to-back (B2B) and face-to-face (F2F) are possible, especially when multiple dies are stacked in this manner. Compared to 2.5D (lateral) integration, 3D stacking worsen thermals due to increased power density with die overlap, and heat dissipation from tiers away from the heat sink is a challenge [26].
Monolithic 3D ICs consist of multiple device layers fabricated sequentially on the same die and connecting using Monolithic Inter-tier Vias (MIVs) which are essentially the same size as intra-tier vias [27]. MIVs offer better parasitics and a higher integration density compared to TSVs due to their smaller size [28]. Since monolithic 3D enables the finest pitch of 3D connection, it holds the most promise. However, more breakthroughs in low-temperature processing to fabricate transistors in the upper layers while preserving the transistors and Back end of line (BEOL) of the lower layer are desired [29]. Monolithic 3D suffers from limited lateral thermal conductivity due to the absence of substrate on upper layers. Besides, high device integration density and thin layers lead to strong tier-to-tier thermal coupling [30].
This work uses 3D stacked ICs using hybrid wafer bonding technology to model the design of 3D ML accelerators.  Nonetheless, some of the ideas discussed in this paper around efficient partitioning of a ML accelerator design into 3D tiers can be helpful to design for other 3D IC technologies as well. Next, we will discuss the basic principles of operation of a systolic array-based ML accelerator system.

B. Systolic ML Accelerators
A systolic array consists of a simple and regular grid of PEs wired together using the nearest-neighbor interconnect [31] [32]. Data from banked scratchpad memory made of SRAMs is injected from the edges of the array in a rhythmic pipelined manner (similar to a systolic beat). The PEs perform the same operation on their inputs, typically multiply and accumulate, and pass the intermediate results or the original inputs to adjacent PEs. The key idea is to exploit data re-use so that fewer data transfers from memory are needed. Further, purely local data movement (neighbor to neighbor) means simpler interconnect and control. PEs operating in parallel achieve high computational concurrency. Moreover, systolic architectures are modular making them easy to floorplan and scale. Figure 2 shows a high-level diagram of a typical systolic system with an array of PEs and scratchpad memory for storing input, filter, and output.
DNN computation is a highly parallel workload of dense matrix multiplication operations between the input matrix (or the output of the previous layer) and the filter matrix. Systolic array architectures can effectively leverage the abundant data reuse opportunity in DNNs by using their local data shifting movement and keeping the PEs busy to provide high throughput. Each PE performs a simple multiply-andaccumulate (MAC) operation, while data is streamed through the array in a pre-defined synchronized dataflow. An example of dataflow is weight stationary where weights of the filter matrices corresponding to each DNN layer are pre-loaded from the filter memory into the systolic array before any matrix multiplication operation is performed. Input data is then streamed in from the input memory and the array elements perform matrix multiplication with the weights already stored in them. The output data is continuously accumulated, passed through activation and/or quantization functions before eventually being stored in the output memory. The cost of fetching data from memory is amortized over several compute cycles leading to high energy-efficiency. The systolic array has been utilized as the underlying fabric to achieve orders of magnitude gains in performance and energy-efficiency over traditional CPUs, and GPUs for DNN acceleration [2] [3] [5].
III. 2D AND 3D SYSTOLIC ARRAY ACCELERATORS Traditional 2D systolic array design involves selecting an appropriate size and dimension of the PE array as well as the size of memory, which would store the NN input feature maps (IFMAP SRAM), filters (FILTER SRAM), and output feature maps (OFMAP SRAM). In theory, a designer can choose an arbitrary number of PEs. One would expect that a large number of PEs improves the local data reuse, especially for computelimited (or large) networks. This may lead to an increase in the throughput of operations, thereby reducing the number of total cycles (latency) needed to process the network. However, for applications targeting small networks, a large PE array can increase the latency of NN computation as inputs have to traverse the entire length and height of the array before the output is ready. Regarding buffer sizing, a larger SRAM would minimize expensive data transfers to main memory (DRAM). But again, over-provisioned SRAMs can lead to area and cost inefficiencies. In summary, designers must consider the aforementioned trade-offs for both the PE array and SRAM buffer sizes, keeping in mind the target application workload to achieve an optimal design. For this study, the baseline 2D accelerator was selected to have a 32x32 PE array and 128KB of Filter, IFMAP, and OFMAP SRAM each, which is representative of common DNN inference use-case [4].
3D systolic accelerator design further involves distributing the additional silicon real estate available within the same 2D footprint between PE elements or SRAM buffers to balance network throughput and external memory transfers. Moreover, the partitioning method of the PE array and SRAM buffers among the vertical tiers may have thermal implications. In order to evaluate and compare 3D accelerators with different partitioning styles, design points described in Table I are   selected. The 3D configurations considered were limited to 4 stacks of PE array or SRAM. Increasing SRAM stacks has diminishing returns in energy reduction, and increasing PE stacks leads to worsening thermals, as explained in section VI. It must be noted that while a 4x larger 2D design with increased compute or memory resources is possible, a 4-tier 3D system packs equivalent resources in the same footprint as the baseline 2D accelerator. The physical design of a 3D system will incur lower 2D interconnect delay and power due to shorter distances and fewer buffers compared to a 4x larger 2D system but may incur an additional 3D interconnects delay and power. 3D configurations selected for further analysis include multiple PE array tiers (configuration 2) or multiple SRAM tiers (configuration 3-5), a scaled-up version (configuration 6), and a scaled-out version (configuration 7) of the 2D baseline accelerator. The floorplans for all design points are shown in Figure 3. It should be noted that configurations 3, 4, and 5 have the same amount of overall compute and memory resources but differ in the method of how these resources are partitioned among vertical tiers. Scaling-up simply means a larger system folded into multiple tiers, while scaling-out means multiple smaller systems in separate tiers [33]. In contrast to configuration 6, the different tiers in configuration 7 do not share the same SRAM and only share an offchip DRAM. As shown in figure 4(c), the scale-out version does not require any connections in the vertical direction between PE elements in different tiers as the four systolic arrays operate independently in this configuration. Vertical connections would still be needed to transfer the data from DRAM to SRAM in different tiers and for power and ground lines.

IV. SIMULATION FRAMEWORK
The simulation framework developed and used in this work is depicted in Figure 5. It comprises two flows which are explained in this section.

A. Power and performance analysis flow
An open-source simulator SCALE-Sim [33] is chosen for the power and performance analysis. Accelerator design parameters such as PE array dimensions, SRAM buffer sizes, and dataflow can be selected and mapped to a list of configuration files. Simulation benchmarks are translated to topology files having a layer-wise description of the network. The simulator runs a stall-free DNN inference and after processing the entire network, reports the latency in cycles, array utilization, SRAM accesses, DRAM accesses, and DRAM bandwidth requirements.
The power of different configurations is computed from the layer-wise average utilization of the PEs and average bandwidth for SRAM and DRAM reads/writes provided by SCALE-Sim in conjunction with the technology data from [34] (Table II). DRAM accesses can contribute a major part of the total energy [35]. For 3D accelerators, the DRAM transfers may incur an additional energy overhead in transferring data to accelerator components in different tiers. The energy per bit overhead for F2F is reported as 0.013 pJ at nominal voltage [14]. The energy overhead of F2B over F2F is reported as 12X [36]. Hence, to incorporate an average case impact of vertical interconnect energy on the overall DRAM access energy of a 4-tier system, 1.35 pJ per byte (one F2F, one F2B) is added to all DRAM transfers of 3D accelerator configurations.
Power consumed in the PE array is calculated using the following equation: where, n is the number of layers in the network, util(i) is the average utilization of the PE array for computing layer i (between 0-100), cyc(i) is the number of cycles taken for computing layer i, arr h and arr w are the PE array height and width respectively, freq is the frequency of operation, (e mac) is the energy consumed per 8-bit multiply-accumulate (MAC) operation. The e mac of 0.3 pJ (Table II) is per cycle energy consumed in the PE at 1 GHz, based on a place-and-routed design of an 8-bit precision MAC in 16nm process node [34].
SRAM power is calculated using the following equation: where, n is the number of layers in the network, srd bw(i), swt bw(i) are the average SRAM read and write bandwidth in bytes per cycle for the execution of layer i, cyc(i) are the number of cycles taken for computing layer i, (e srd) and write (e swt) are the SRAM energy consumed in access of bytewide data. The e srd of 1.1 pJ and e swt of 1.5 pJ (Table II) is based on 32KB SRAM macros generated from an industrystandard memory compiler at 16nm and takes both dynamic and static energy into account [34]. DRAM power is calculated using the following equation: where, n is the number of layers in the network, d if(i), d filt(i), d of(i) are the average bandwidth to access input feature map, filter, and store output feature map in DRAM for the layer i respectively, and (e mem) is the DRAM energy consumed per byte access. The e mem of 120 pJ (Table II) is based on off-chip DRAM accesses energy per byte assuming an LPDDR3 interface [35]. Performance in terms of latency of different 2D and 3D configurations is computed from the layer-wise cycle count provided by SCALE-Sim. For configurations 1-6, the total number of cycles to complete the entire benchmark is computed by summing the cycles taken to complete each network layer. Since the computation in the vertical tiers is parallel in configuration 7, the sum of cycles per network layer can be directly computed by simulating a single tier. Performance in terms of throughput can be calculated in Tera Operations Per Second (TOPS) using the following equation: T OP S " util˚arr h˚arr w˚2 where, util is the average utilization of the PE array for computing the NN (between 0-100), arr h and arr w are the PE array height and width, respectively, and freq is the frequency of operation. The delay overhead of 3D F2F vertical interconnect can be "5 ps at nominal voltage [14]. The energy overhead of a F2B connection (through TSVs) over F2F is 3.2X [36]. Hence, to incorporate a worst-case impact of the vertical interconnect delay on the frequency of a 4-tier system, 42 ps (two F2F, two F2B) is added to the cycle time (1/freq) of 3D accelerator configurations.

B. Thermal analysis flow
To the first order, the temperature rise in 3DIC is primarily proportional to the effective power density in the 2D footprint [23]. Floor plan dimensions of different 2D and 3D configurations are calculated based on the PE and SRAM area at 14/16 nm technology node from table II. A spatial tile-based power map is created for each tier by using the power data computed for PE and SRAM regions in conjunction with the respective floor plan dimensions. Figure 6 depicts a typical tile-based power map which is essentially a division of the entire tier into equal-sized tiles. The power of each tile is the sum of the power associated with the blocks within the tile. The power map contains the metal density and thermal conductivity properties of all the layers in the BEOL stack. Abstracting the power consumed by the PEs and SRAM in terms of per-tier power maps allows us to mix and match different tiers and build and analyze thermal characteristics for different 3D configurations with relative ease.
Cadence Celsius Thermal Solver [37] is used to run static thermal simulations. The tool uses the power map file along with a complete physical description of the package stackup, bumps, molding compound, lid, thermal-interface material (TIM), and a detailed description of the vertical stack, i.e., devices, interconnects, and dielectrics along with their thermal conductivity properties. The package comprises 10 build-up layers with overall dimensions of 10x10 mm 2 with an 11x11 mm 2 copper lid on top. TSVs of diameter 5 µm are modeled at every 50 µm in the die stack-up. Thermal simulations are run for different benchmarks with the same package and die size assumptions maintained for all the configurations for a fair comparison. However, a significant change in package thermal design power (TDP) (for instance, configurations 2-7 vs. configuration 1), the heat spreader dimensions may need to be redesigned, and boundary conditions may have to be re-calibrated. Setting up realistic boundary conditions for the tool is critical for getting accurate results. Thermal boundary conditions calibrated with actual hardware measurement data using on-die temperature sensors are sourced from [38]. The tool generates thermal heat maps and maximum temperature data of different dies in each configuration.
V. EXPERIMENTAL SETUP SCALE-Sim is configured with micro-architecture features like PE array dimension, aspect ratio, memory buffer sizes for different 2D and 3D accelerator configurations listed in Table I. The simulator, by default, only supports a 2D systolic configuration. 3D design points of configurations 2-6 can be mapped to SCALE-Sim using their respective PE and SRAM sizes as specified in table I. Configuration 7 is equivalent to four separate systolic systems and can be mapped to SCALE-Sim with PE and SRAM size of configuration 1 with the benchmarks split 4-way along their output channels. The dataflow is set to weight stationary. Although this limits the design space explored, it still enables for a like to like comparison between different 3D accelerator configurations. The topology files having a layer-wise description of the network like input and filter dimensions, input channels, number of filters, and strides are setup for SCALE-Sim for some common NN benchmarks like AlexNet [39], AlphaGo Zero [40], Deep Speech 2 [41], Faster R-CNN [42], GoogLeNet [43], Neural Collaborative Filtering (NCF) [44], ResNet-50 [45], Sentiment Seq-CNN [46], and Transformer [47]. The geometric mean of results from all benchmarks is included to illustrate the overall difference between configurations across all benchmarks. The metric for performance is the number of cycles required to process the benchmark (measure of latency) and TOPS (measure of throughput). The metric for energy-efficiency is TOPS/W. The metric for thermal is the maximum temperature increase in°C relative to the coolest point of the 2D baseline.

VI. RESULTS
This section presents the simulation results comparing different 3D accelerator configurations. Insights are drawn for optimal partitioning strategy for energy, performance and thermal for different network workloads.

A. Energy
Intuitively, it can be said that stacking multiple SRAM tiers would lower the DRAM transfers bringing down the total energy (Figure 7), especially for memory-limited networks.    Table I across different benchmarks and also illustrates the breakdown of energy between computations, SRAM, and DRAM transfers. As expected, configurations 3-6 which contain 4-stack SRAM reduce the total energy to process the network compared to configuration 1 (2D baseline). However, the energy reduction factor varies widely between benchmarks from 1.0x for NCF to 3.8x for Deep Speech 2. NCF being relatively small already fits within a single SRAM stack and additional SRAM stacks in 3D bring no benefit. Configuration 6 (scale-up) achieves the lowest energy since it also contains 4 stacks of PEs along with 4 stacks of SRAMs increasing the local data reuse within the PEs hence minimizing both SRAM and DRAM transfers. Configuration 7 (scale-out) operating on partitioned output channel requires input feature maps to be duplicated in the SRAMs, causing multiple DRAM accesses to fetch the same input data leading to high total energy.

B. Performance
The number of cycles taken to complete a benchmark should decrease with the increase in the number of PEs, especially for compute-limited (large) networks. As expected, figure 9 shows that configurations 2, 6, and 7 which contain 4-stack PE arrays take fewer cycles to process the network compared to configuration 1 (2D baseline). However, the speedup varies widely between benchmarks from 1.1x for NCF to 3.9x for AlexNet. NCF has much smaller layer features like IFMAP dimensions compared to AlexNet and is unable to utilize the additional PE tiers to achieve any more compute parallelism. Configuration 7 (4x scale-out of 2D) shows slightly better   performance than configuration 6 (4x scale-up of 2D) for some benchmarks such as AlexNet, AlphaGo Zero, and ResNet-50. This is due to fewer cycles for filling up the smaller independent PE arrays of configuration 7 compared to a single larger folded PE array of configuration 6 which suffers from this overhead at the start of computation of each layer. For other benchmarks such as Deep Speech 2, which contain a small number of output channels and large input feature maps, configuration 7 loses its advantage and suffers from low PE utilization. The power-performance in TOPS and TOPS/W (including the delay and energy overheads of the vertical interconnects for 3D configurations) is presented in table III.
C. Thermal Figure 10 shows the steady-state heat maps of configuration 2 (4-stack PE array) and configuration 3 (4-stack SRAM) to highlight the difference in thermal characteristics of logicover-logic and memory-over-memory. Both configurations are running the ResNet-50 benchmark. The temperature values are relative to the coolest point on configuration 1 (2D baseline). The heat maps clearly emphasize that the PE array part of the die runs hotter by around 5°C. It can be further observed that the maximum temperature of configuration 2 is about 13°C higher than configuration 3. This is because the average power density of the 3D stack of PE array is higher compared to the SRAM stack. suffer from a temperature rise of up to 24.8°C relative to the coolest point on configuration 1.
The increase in temperature can have an impact on the overall energy of the accelerator. For example, assuming the coolest point on the 2D baseline to be 75°C, an increase in temperature by 25°C has a marginal effect on transistor onstate current but increases the off-state current by 1.9X ( Figure  11). Configuration 7 partially avoids overlapping hotspots by staggering the PE array and SRAM between tiers but fares only slightly better. Configuration 3 and 4 which stacks mul- Fig. 11: Effect of temperature rise on the on-state current (black line) @SS/V NOM -10% and off-state current (red line) @FF/V NOM +10% of a transistor with standard V TH option at 14/16nm (0.8V V NOM ) tiple tiers of SRAM are only up to 7.2°C hotter. Furthermore, changing the ordering and stacking the PE array on top of the SRAM stack as in the case configuration 5 (logic-overmemory) limits the temperature rise to only up to 5.6°C making it the best choice from a thermal standpoint. The reason behind this is that the tier containing PE array is significantly hotter than ones containing SRAM and placing it on top reduces its relative proximity to the heat sink.
In summary, 3D stacking of PE arrays (configurations 2, 6, and 7) can reduce the latency of the network computation, but the speedup depends on the network size. Further, these configurations suffer from the worst thermal characteristics due to logic-over-logic stacking. On the other hand, stacking multiple SRAM tiers (configurations 3, 4, 5, and 6) lowers the DRAM transfers making them a good choice where energyefficiency is important. Furthermore, stacking PE array on top of the SRAM stack (configuration 5) in a logic-over-memory fashion can not only achieve low energy but also mitigate the thermal impact of 3D.
VII. CONCLUSION Systolic accelerators have been deployed for training and inference, on edge devices as well as on the cloud for a wide variety of workloads. These use cases may constrain accelerator requirements for latency, energy, and area differently. 3D integration packs more compute or memory in the same 2D footprint allowing more powerful and energyefficient accelerators. However, it also presents more options to the designer for partitioning the PE array and memory among 3D tiers. Since different choices may have different performance, power, and thermal implications, it becomes imperative for designers to understand the trade-offs under different application workload conditions. In this work, a systematic methodology for navigating the 3D systolic accelerator design space is presented. Using this framework, 3D configurations with different partitioning styles are evaluated and compared providing several insights and takeaways for designers. This work can pave the pathway for future thermal aware 3D systolic accelerator designs.