Device Quantization Policy and Power-Performance-Area Co-Optimization Strategy in Variation-Aware In-memory Computing Design

Device quantization for in-memory computing (IMC), considering the non-negligible variation and finite dynamic range of practical memory technologies, is investigated with the aim of quantitatively co-optimizing system accuracy, power, and area. Architecture- and algorithm-level solutions are taken into consideration. Weight-separate mapping, VGG-like algorithms, multiple cells per weight, and fine-tuning of the classifier layer are effective for suppressing the inference accuracy loss due to variation and allow the lowest possible weight precision, improving area and energy efficiency. Higher priority should be given to developing low-conductance and low-variability memory devices, which are essential for energy- and area-efficient IMC, whereas low bit precision (<3 b) and a small memory window (<10) are of less concern.


Introduction
Deep neural networks (DNNs) have achieved numerous remarkable breakthroughs in applications such as pattern recognition, speech recognition, object detection, etc. However, traditional processor-centric von Neumann architectures are limited in energy efficiency when computing contemporary DNNs with rapidly increasing data, model size, and computational load. Data-centric in-memory computing (IMC) is regarded as a strong contender among post-von Neumann architectures because it reduces data movement between the computing unit and the memories when accelerating DNNs 1,2 . Furthermore, quantized neural networks (QNNs), which truncate the weights and activations of DNNs, have been proposed to further improve hardware efficiency 3 . Compared with generic DNNs using floating-point weights and activations, QNNs demonstrate not only substantial speedup but also a tremendous reduction in chip area and power 4 . These are accomplished with no or only minor accuracy degradation in inference tasks on the complex CIFAR-10 or ImageNet data 3 . While the emergence of QNNs opens up the opportunity of implementing IMC using emerging non-volatile memory (NVM), practical implementation is largely impeded by imperfect memory characteristics; in particular, only a limited number of quantized memory (weight) states is available in the presence of intrinsic device variation. The intrinsic device variation of emerging NVMs such as PCM 5 , RRAM 6 , and MRAM 7 leads to a significant degradation in inference accuracy.
Although a number of studies have investigated this critical issue focusing on the impact of the DNN inference accuracy [7][8][9] , a comprehensive and quantitative study that links the critical device-level specs, namely quantization policy, memory dynamic range, and variability with those system-level specs, namely power, performance (accuracy), and area (PPA), is still lacking. This paper intends to provide a useful guide on NVM technology choice for quantized weights considering the variation-aware IMC PPA co-optimization. Our evaluation takes into account practical IMC design options in the architecture level, such as DNN-to-IMC mapping schemes, types of DNN algorithms, and using multiple cells for representing a higher-precision weight, as well as the circuit-level constraints, such as limited current summing capability and peripheral circuit overhead.

Background
In the IMC architecture, the weight matrix W (N×M) is represented by cell conductances in an orthogonal memory array G (N×M).
The vector-matrix multiplication (VMM) is performed by applying the voltage input vector V (N×1) to the array and measuring the current output vector I (1×M) by summing the currents flowing through all cells in every column. Each memory cell can be regarded as a multiply-accumulate (MAC) unit; thus, the high-density array allows extremely high parallelism in computing. The IMC-based VMM accelerates general matrix multiply (GEMM), which accounts for over 70% of the DNN computational load 10 , by keeping stationary weights in the memory array.
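The column-wise current summing above can be sketched in a few lines; this is a minimal behavioral model with illustrative voltages and conductances, not values from the reference design.

```python
# Sketch of the IMC vector-matrix multiplication (VMM): each memory cell
# with conductance G[i][j] acts as one multiply-accumulate (MAC) unit, and
# the column current is I_j = sum_i V_i * G_ij (Ohm's law plus Kirchhoff's
# current law). All numbers below are illustrative.

def imc_vmm(voltages, conductance):
    """Column currents I_j = sum_i V_i * G_ij for an N x M array."""
    n = len(voltages)
    m = len(conductance[0])
    return [sum(voltages[i] * conductance[i][j] for i in range(n))
            for j in range(m)]

# Example: 3 inputs (volts), 2 columns, conductances in siemens.
V = [0.2, 0.0, 0.2]
G = [[1e-6, 2e-6],
     [3e-6, 4e-6],
     [5e-6, 6e-6]]
I = imc_vmm(V, G)   # column currents in amperes
```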

DNN-to-IMC mapping
The weights in DNNs algorithms are signed values. It is important to allow negative weights for capturing the inhibitory effects of features 11 . To implement signed weights using only positive conductance of memory devices, an appropriate mapping scheme is required. Depending on the choice of the activation function, the activation values and also the input values in neural networks are either with negative values (e.g., hard tanh) or without negative values (e.g., ReLU). This also affects the choice of DNN-to-IMC mapping schemes.
In this work, we consider a one-transistor one-resistor (1T1R) memory array for illustrating the various DNN-to-IMC mapping schemes. Each memory unit cell in the 1T1R array consists of a selection transistor and a two-terminal memory device with changeable resistance. One terminal of the memory device is connected to the drain of the transistor through a back-end-of-line via. The word line (WL), bit line (BL), and source line (SL) are connected to the transistor gate, the other terminal of the memory device, and the transistor source, respectively. The WLs and SLs are arranged orthogonally to the BLs.
Three commonly used mapping schemes are considered in this work. The naïve IMC (N-IMC) scheme (Figure 1a) uses a single memory unit cell and a single WL to represent a positive/negative weight (+/−w) and a positive input (+IN), respectively 2,12-14 . A constant voltage bias is clamped between the BLs and SLs. When the input is zero, the WL is inactivated and no current is generated from the cells on the selected WL. When the input is high, the WL is activated and the summed current flowing from the cells on the same BL is sensed. To represent both the sign and value of a weight using a single cell, an additional reference current is required, which is compared with the BL current via a sense amplifier (SA) or an analog-to-digital converter (ADC) to obtain the final MAC result. The weight-separate IMC (WS-IMC) scheme instead represents each signed weight by the conductance difference of two cells, while the complementary IMC (C-IMC) scheme uses two cells in the same column driven by complementary inputs (Figure 1). In QNNs based on all three schemes, the quantized inputs can be encoded using multi-cycle binary pulses applied to the WL (transistor gate) without using high-precision digital-to-analog converters (DACs). An analog current adder is used to combine the MAC results of multiple cycles to obtain the final activation values through ADCs 20 . Note that 1-bit input/activation using a simple SA is assumed first in our later discussion to avoid the high energy and area overheads of ADCs. In the Variation-aware PPA co-optimization section, we further discuss the impact of high-precision input/activation on the IMC design.
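The multi-cycle binary input encoding can be sketched as a bit-serial MAC: the input is applied one bit per cycle on the WL, and the per-cycle partial sums are recombined with power-of-two weighting, emulating the analog current adder. This is a behavioral sketch with hypothetical values, not the circuit implementation.

```python
# Bit-serial input encoding sketch: an N-bit input is applied over N cycles
# as 1-bit WL pulses; each cycle's 1-bit MAC result is scaled by 2^k and
# accumulated, as the analog current adder does with the partial sums.

def bit_serial_mac(inputs, weights, bits=3):
    acc = 0.0
    for k in range(bits):                        # one cycle per input bit, LSB first
        pulses = [(x >> k) & 1 for x in inputs]  # 1-bit WL activations
        partial = sum(p * w for p, w in zip(pulses, weights))
        acc += (2 ** k) * partial                # power-of-two recombination
    return acc

# 3-bit inputs with signed weights (already mapped to conductance differences):
assert bit_serial_mac([5, 3], [1.0, -2.0]) == 5 * 1.0 + 3 * -2.0
```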

Quantized weight
To implement quantized weights in QNNs, multi-level-cell (MLC) memory technology that provides sufficient precision is the most straightforward choice, which we refer to as straightforward MLC (S-MLC) 12,[15][16][17]19 . Alternatively, multiple memory cells, each with a lower precision, can be combined to implement a weight of higher precision. This allows even binary (1-bit) memory technology to realize versatile QNNs at the expense of area. Two such schemes, which we refer to as digital MLC (D-MLC) 13,18 and analog MLC (A-MLC) 14 , are possible (Figure 2a-b). The former sums the BL currents from the most-significant-bit (MSB) cell to the least-significant-bit (LSB) cell using power-of-two weighting, while the latter uses unit weighting. For example, the numbers of cells per weight are N and 2^N − 1, respectively, for an N-bit weight in the N-IMC mapping using 1-bit memory cells (Table 1).
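The two cell-combining schemes can be sketched as follows; the functions are illustrative helpers (not from the paper) that show how the same weight value is composed from N power-of-two cells (D-MLC) versus 2^N − 1 unit cells (A-MLC).

```python
# Sketch of composing an N-bit weight from 1-bit cells (illustrative).
# D-MLC: N cells combined with power-of-two weighting (MSB..LSB).
# A-MLC: 2^N - 1 identical cells combined with unit weighting.

def d_mlc_value(cell_bits):
    """cell_bits: list of 0/1, MSB first; N cells for an N-bit weight."""
    return sum(b << (len(cell_bits) - 1 - i) for i, b in enumerate(cell_bits))

def a_mlc_value(cell_bits):
    """cell_bits: list of 0/1; 2^N - 1 unit cells for an N-bit weight."""
    return sum(cell_bits)

# A 3-bit weight of value 5 needs 3 cells in D-MLC but 7 in A-MLC:
assert d_mlc_value([1, 0, 1]) == 5
assert a_mlc_value([1, 1, 1, 1, 1, 0, 0]) == 5
```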

Finite quantized memory state
Although rich literature has discussed various process innovations 21 or closed/open-loop programming schemes 22 to increase the number of quantized memory states, the ultimate number of quantization levels in a memory device is determined by the dynamic range, e.g., the conductance ratio (GH/GL) in a resistance-based memory, and the device-to-device (DtD) variation. The DtD variation limits how accurately weights can be placed. We found that the standard deviation (σ) of the log-normal conductance distribution does not change significantly with the conductance value in the same device. Figure 3 shows the statistical histograms for binary MRAM, ferroelectric tunnel junction (FTJ) 23 , MLC PCM 5 , and RRAM 6 , respectively. This G-independent σ is used as the device variation model in the following discussion. Because σ is constant in the logarithmic domain, the GH states representing +w appear broader than the GL states representing −w on a linear scale in the N-IMC scheme (Figure 4a) 6 . While the weight distribution is asymmetric in N-IMC, it is symmetric for +/−w in WS-IMC (Figure 4b), because the same conductance difference of two adjacent cells is used to represent the value of the signed weights. Although C-IMC utilizes two cells in the same column to represent one weight, only one of the two is accessed at a time because of the complementary inputs applied to the transistor gate terminals of the 1T1R cells. Therefore, the weights of both the C-IMC and N-IMC schemes are based on the difference between the conductance of one cell and the reference, and the weight distribution of C-IMC is identical to that of N-IMC.
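The G-independent σ model can be sketched directly: conductances are drawn log-normally with a constant σ in ln G, so the absolute (linear-scale) spread grows with the mean conductance. The σ and GL/GH values below are assumed for illustration only.

```python
# Sketch of the G-independent variation model: every state shares the same
# sigma in ln(G), so the linear-scale spread of the GH state is roughly
# GH/GL times wider than that of the GL state.
import math
import random
import statistics

def sample_state(g_target, sigma, n=20000, rng=random.Random(0)):
    """Draw conductances around g_target with log-domain std sigma."""
    return [g_target * math.exp(rng.gauss(0.0, sigma)) for _ in range(n)]

sigma = 0.1                 # assumed std of ln(G), not a measured value
gl, gh = 1e-6, 10e-6        # hypothetical GL/GH with GH/GL = 10
spread_l = statistics.stdev(sample_state(gl, sigma))
spread_h = statistics.stdev(sample_state(gh, sigma))
# linear-scale spread scales with the mean: spread_h is ~10x spread_l
```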

Quantization policy for accurate inference
All three schemes discussed in the DNN-to-IMC mapping section achieve comparable accuracy after appropriate training when the device variation is negligible. However, their immunity against device variation differs substantially. Figure 5 shows the inference accuracy of the VGG-9 DNN for CIFAR-10 classification under different levels of variability. The weight placement, considering the log-normal conductance distribution with G-independent σ, was evaluated by Monte Carlo simulation with at least 200 runs, and the distribution of these 200 data points is plotted in Figure 5. As σ increases, the inference accuracy degrades. N-IMC is the worst, mainly due to the error accumulation from the +w weights, whose GH states have broader distributions than the GL states representing −w, as apparent in Figure 4a. C-IMC shows improved inference accuracy compared with N-IMC because of the error-cancellation effect originating from the complementary inputs, although generating the complementary inputs requires additional hardware cost. WS-IMC is the most robust against variation among the three because of the error cancellation from the symmetric and tighter +/−w distribution (Figure 4b), which is constituted by two cells rather than one, and it requires no complementary input. A more detailed comparison of these three schemes with different GH/GL can be found in Figure S1 (Supporting Information). For the rest of this paper, only the median inference accuracy of the Monte Carlo simulations and the WS-IMC mapping scheme are discussed for simplicity.
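The symmetry argument for WS-IMC can be checked numerically: a signed weight is the difference of two cell conductances, so +w and −w use the same pair of states in opposite order and their linear-scale spreads match. The variation model and state values below are illustrative assumptions.

```python
# Sketch of WS-IMC weight symmetry: w = G(+) - G(-), so +w and -w are
# mirror images with essentially identical spreads, unlike the single-cell
# N-IMC weight whose GH side is broader.
import math
import random
import statistics

rng = random.Random(1)
sigma, gl, gh = 0.1, 1e-6, 10e-6   # assumed variation model and states

def cell(g):
    """One log-normal conductance sample with G-independent sigma."""
    return g * math.exp(rng.gauss(0.0, sigma))

pos = [cell(gh) - cell(gl) for _ in range(20000)]   # +w samples
neg = [cell(gl) - cell(gh) for _ in range(20000)]   # -w samples
# means are mirror images; spreads are essentially equal
```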

Network Choice
Variation immunity is known to be sensitive to the choice of DNN algorithm 8 . VGG-16 and ResNet-18 are compared using the more complex Tiny ImageNet dataset, as shown in Figure S2 (Supporting Information). In addition, logarithmic (Log-Q) and linear (Lin-Q) weight quantization policies are compared. Our simulation shows that, after appropriate training, both Log-Q and Lin-Q achieve comparable accuracy in the ideal quantization case without variation. However, Lin-Q shows more robust immunity against variation than Log-Q, as shown in Figure 6. This is explained by their different weight distributions: in Log-Q, more weights are located at the +/−1 states, which have the widest distributions. Therefore, the larger sensing margin between levels in Log-Q does not necessarily guarantee better immunity against variation. Only Lin-Q is considered further in this study.
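The two policies can be contrasted with illustrative level sets (the exact levels are assumptions for a 3-bit signed weight, not taken from the paper): Lin-Q spaces levels uniformly, while Log-Q spaces them by powers of two, enlarging the margin near zero but concentrating weights at the widest +/−1 states.

```python
# Illustrative Lin-Q vs Log-Q level sets for a 3-bit signed weight
# (hypothetical level definitions; the paper does not list them).

def lin_q_levels(bits=3):
    """Uniformly spaced signed levels, e.g. +/-0.25 .. +/-1 for 3 bits."""
    n = 2 ** (bits - 1)
    return sorted({s * k / n for s in (1, -1) for k in range(1, n + 1)})

def log_q_levels(bits=3):
    """Power-of-two spaced signed levels, e.g. +/-1, 1/2, 1/4, 1/8."""
    n = 2 ** (bits - 1)
    return sorted({s * 2.0 ** -k for s in (1, -1) for k in range(n)})
```

Note how the largest Log-Q levels (+/−1) sit at the high-conductance end, where the G-independent σ model predicts the widest linear-scale spread.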

Weight quantization precision and dynamic range
The immunity to variation is further investigated in models with weight precision from one to three bits in Figure 7. The focus on low weight precision reflects the scope of inference (rather than training) applications, as well as the reality of realizing MLC with existing memory technology. Here we also take into account the influence of the conductance dynamic range GH/GL. The major conclusions are: (1) Although high weight precision improves the baseline accuracy in the ideal case, it is more susceptible to variation, and the accuracy can be even worse than that of low weight precision if the variation is substantial.
To first order, this effect can be explained as follows: for a higher weight precision, a larger number of weight states are placed within a given dynamic range, so the margin between adjacent states is smaller than in the lower-precision case. The same degree of variation (same σ) therefore distorts the pre-trained model more significantly and results in more severe accuracy degradation. (2) Enlarging the dynamic range is beneficial to the variation immunity for a given σ. However, at the same normalized σ (i.e., σ/ln(GH/GL)), a smaller dynamic range with smaller device variation is preferable to a larger dynamic range with larger device variation, as shown in Figure 8. This result suggests that a low absolute value of σ remains critical for model accuracy; higher priority should be given to suppressing variation rather than enlarging the dynamic range. (3) A more complicated dataset (Tiny ImageNet vs. CIFAR-10) is more susceptible to variation since the model itself also becomes more complicated, but it does not change the general trends mentioned above.
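The margin argument can be made concrete with a short sketch; the GL/GH and σ values, and the assumption that states are spaced evenly across the range, are illustrative.

```python
# Sketch of the margin argument: for a fixed dynamic range [GL, GH], higher
# weight precision packs more states into the same range, so the margin
# between adjacent states shrinks and the same device sigma causes more
# state overlap. Uniform (Lin-Q-like) spacing is assumed.
import math

def state_margin(bits, gl, gh):
    """Conductance spacing of adjacent states spaced evenly in [GL, GH]."""
    n_states = 2 ** bits
    return (gh - gl) / (n_states - 1)

gl, gh = 0.5e-6, 5e-6          # hypothetical GL and GH with GH/GL = 10
margins = {b: state_margin(b, gl, gh) for b in (1, 2, 3)}

# Normalized sigma as defined in the text, with an assumed sigma = 0.1:
normalized_sigma = 0.1 / math.log(gh / gl)
```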

Variation-aware accurate DNN
Two approaches are further evaluated to improve the immunity against variation. First, the D-MLC and A-MLC weights, as introduced in the Quantized weight section, are more robust against variation than the S-MLC weight. Figure 9 shows an example of the weight distribution of 3-bit linear quantization in the WS-IMC mapping scheme using the D-MLC and A-MLC weights, respectively. The D-MLC and A-MLC weights consist of three and seven binary (1-b) memory cells, respectively, with the same GH/GL and σ as those in Figure 4b for the S-MLC weight. Because more cells are used to represent a weight in D-MLC and A-MLC, the "effective" σ for a given quantized weight precision is reduced by the averaging effect from the law of large numbers. Second, the inference accuracy degradation can be partially recovered by fine-tuning the last fully-connected classifier layer of the network 7 . The last classifier layer is a full-precision layer that can be easily implemented using conventional digital circuits. After placing the weights in all IMC layers, the weights in the digital classifier layer are retrained with all weights in the IMC layers fixed. The computing effort for retraining only the classifier layer is relatively small, and the retraining is fast because it requires only a subset of the data instead of a complete training epoch 7 . Tables 2 and 3 summarize the maximum tolerable variation for CIFAR-10 and Tiny ImageNet, respectively, using different quantization policies, including the quantization precision, dynamic range, and weight implementation scheme. The pre-defined target accuracies for CIFAR-10 using VGG-9 and Tiny ImageNet using VGG-16 are 88% and 48%, respectively. To achieve these targets, higher weight precision (2/3 b vs. 1 b) is beneficial because it increases the baseline accuracy, thus allowing more variation tolerance. Enlarging GH/GL is also beneficial.
Among the three weight implementation schemes, A-MLC shows the best variation tolerance due to its smallest "effective" σ obtained from multiple devices. Furthermore, the fine-tuning technique is extremely useful for boosting variation tolerance, so it should be applied whenever device-level solutions for reducing σ are not available.
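The averaging effect behind the reduced "effective" σ can be verified with a short Monte Carlo sketch: summing k independent unit cells shrinks the relative spread of the composed weight roughly by sqrt(k). The σ and conductance values are assumed for illustration.

```python
# Sketch of the law-of-large-numbers argument for D-MLC/A-MLC robustness:
# the relative spread of a k-cell composed weight falls roughly as 1/sqrt(k).
import math
import random
import statistics

rng = random.Random(2)
sigma, g = 0.1, 1e-6          # assumed log-domain sigma and unit conductance

def composed_weight(k):
    """Sum of k independent log-normal unit cells."""
    return sum(g * math.exp(rng.gauss(0.0, sigma)) for _ in range(k))

def rel_spread(k, n=20000):
    samples = [composed_weight(k) for _ in range(n)]
    return statistics.stdev(samples) / statistics.mean(samples)

# rel_spread(7) should be close to rel_spread(1) / sqrt(7), matching the
# seven-cell A-MLC weight's smaller "effective" sigma.
```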

Variation-aware PPA co-optimization
Some of the strategies for improving IMC variation immunity carry penalties in power and area. A larger GH/GL implies that the GH cell is forced to operate in a higher-current regime; here we assume the minimum GL is finite and limited by leakage in a given memory technology. Previous studies have shown that a high BL current creates a substantial voltage drop across the parasitic line resistance and results in inaccurate MAC results, so partitioning a large array with high BL currents into smaller ones is necessary to guarantee the model accuracy 23 . A higher GH thus restricts the attainable maximum sub-array size because of the excessively large accumulated current on the BLs. The increased BL current with higher GH deteriorates energy efficiency, while the smaller sub-arrays deteriorate area efficiency due to higher peripheral circuit overhead. D-MLC and A-MLC, which use more memory cells, also increase the area and energy consumption of IMC. Therefore, the variation tolerance should be carefully traded off against efficient hardware design. To fairly evaluate the PPA of IMC with different device specifications, we completed a reference design based on a foundry 40-nm CMOS technology with a 256×256 1T1R RRAM array macro. The major circuit blocks in the macro are similar to the illustration in Figure 1. We assume a hypothetical memory with a fixed low-conductance state (GL = 0.5 µS) and GH/GL = 10, 1-bit input/activation, and the WS-IMC/S-MLC mapping. The IMC sub-array size is limited by a maximum allowed BL current of 300 µA through current-mode sensing. Figure 10 shows the simulated power and area breakdown of the IMC macro, which includes bias-clamping and current-scaling circuits, current-mode SAs, analog adders to accumulate the partial sums from different sub-arrays, and driver circuits for the WL/BL/SL.
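The BL-current constraint on sub-array size can be sketched as a worst-case bound; GL = 0.5 µS, GH/GL = 10, and the 300 µA limit follow the reference design above, while the 0.2 V read bias is a hypothetical value.

```python
# Sketch of the BL-current limit on sub-array size: in the worst case every
# cell on a column is at GH and selected, so the row count is bounded by
# I_max / (GH * V_read). The read bias value is an assumption.
import math

def max_rows(i_max, g_cell, v_read):
    """Worst-case number of rows before the BL current limit is exceeded."""
    return math.floor(i_max / (g_cell * v_read))

gh = 10 * 0.5e-6                   # 5 uS high-conductance state (GH/GL = 10)
rows = max_rows(300e-6, gh, 0.2)   # sub-array rows under the 300 uA limit
# Doubling GH halves the attainable sub-array, raising peripheral overhead.
```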
Other IMC designs using different GH/GL ratios (assuming GL is fixed), D-MLC/A-MLC weights, multi-cycle inputs, and multi-bit ADCs are then extrapolated from the reference design.
The area and energy of feasible designs for VGG-9 that satisfy the pre-defined accuracy target (e.g., Table 2) are compared in Figure 11; the trends for VGG-16 are similar and not shown here. The lowest weight precision is used whenever possible to relax device requirements and system overhead. The energy is estimated as the total energy consumption of 10,000 CIFAR-10 inferences. We summarize the strategies for PPA co-optimization as follows: (1) For a low-variation device (small σ), a binary cell with low GH/GL allows the highest area and energy efficiency. (2) For a moderate-variation device (moderate σ), S-MLC with a moderate GH/GL (<10) achieves better efficiency. (3) For a high-variation device (large σ), using S-MLC becomes challenging unless fine-tuning is considered; D-MLC/A-MLC with moderate GH/GL are practical alternatives that maintain accuracy at a reasonable cost of energy and area. Other variation-aware strategies that affect the PPA of IMC include using a higher (3-bit) input/activation precision and more channels (2× more) in a wider VGG network. The complete area and energy estimations of these variation-aware strategies are shown in Figure S3 (Supporting Information); only the most efficient schemes, using the lowest possible bit precision that satisfies the target accuracy, are plotted in Figure 12 for each dynamic range.
Our evaluations show that the substantial penalties in area and energy make these strategies competitive only in specific conditions, especially when σ is large.

Conclusion
In this paper, we provided an end-to-end discussion of the impact of intrinsic device variation on system PPA co-optimization. We considered critical device-level constraints, such as limited quantization precision and memory dynamic range; circuit-level constraints, such as limited current-summing capability and peripheral circuit overhead; and architecture-/algorithm-level options, such as DNN-to-IMC mapping schemes, types of DNN algorithms, and the use of multiple cells to represent a higher-precision weight.
The WS-IMC mapping scheme, VGG-like algorithm, and linear quantization show more robust immunity against variation.
Although higher weight precision of S-MLC improves the baseline accuracy, it is also more susceptible to variation when the variation is high and the dynamic range is low. Multiple cells per weight and fine-tuning are two effective approaches to suppress inference accuracy loss when device-level solutions for reducing variation are not available. As for the PPA co-optimization, we found that memory devices with a large number of analog states spanning a wide dynamic range do not necessarily lead to a better IMC design. Low-bit MLC or even binary memory technology with GH/GL < 10 and low variability, e.g., binary MRAM 25 and FTJ 23 with low conductance, deserves more attention.

Network structure
The VGG-9 network for CIFAR-10 classification consists of 6 convolutional layers and 3 fully connected classifier layers.
Images are processed through a stack of convolutional layers using 3×3 filters with a stride of one. Max-pooling after every two convolutional layers is performed using a 2×2 window. Batch normalization and hard tanh as the activation function are applied to the output of each convolutional layer. The channel width of the convolutional layers starts from 128 in the first layer and increases by a factor of two after each max-pooling layer. For input data in N-IMC and WS-IMC with only positive values, the output of the hard tanh activation function is scaled and normalized to between 0 and 1.
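The VGG-9 channel schedule described above can be sketched as a small helper; the function simply applies the stated rule (start at 128, double after each max-pool, with a pool after every two convolutional layers) and is not the training code used in the paper.

```python
# Sketch of the VGG-9 channel schedule: 6 convolutional layers starting at
# 128 channels, with the width doubling after each max-pool (placed after
# every two convolutional layers).

def vgg9_channels(n_conv=6, start=128, convs_per_pool=2):
    widths, w = [], start
    for i in range(n_conv):
        widths.append(w)
        if (i + 1) % convs_per_pool == 0:   # max-pool here, then double width
            w *= 2
    return widths

assert vgg9_channels() == [128, 128, 256, 256, 512, 512]
```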
Similarly, the VGG-16 network for Tiny ImageNet classification consists of 13 convolutional layers and 3 fully connected layers.
Max-pooling after every 2 or 3 convolutional layers is performed using a 2×2 window. The channel width of the convolutional layers starts from 64 in the first layer and increases by a factor of two after each max-pooling layer.

Figure 11. Area and energy estimation of feasible IMC designs that guarantee CIFAR-10 inference (VGG-9) with at least 88% accuracy (see Table 2). Designs considering different standard deviations of the conductance distribution, GH/GL ratios, and the S-MLC/D-MLC/A-MLC schemes are compared, and the lowest possible weight precision is used to simplify the hardware implementation. 1-bit activation is assumed. Dark and light colors indicate the estimation without and with fine-tuning, respectively. The fine-tuning results are shown only when fine-tuning helps to reduce the required weight precision; the lowest required weight precision is also indicated.

Figure 12. Area and energy estimation of IMC designs using the same criteria as Figure 11 but with either a wider channel or 3-bit activation. The improvements exist only in specific conditions with high σ.