Combined HW/SW Drift and Variability Mitigation for PCM-Based Analog In-Memory Computing for Neural Network Applications

Matrix-Vector Multiplications (MVMs) represent a heavy workload for both training and inference in Deep Neural Network (DNN) applications. Analog In-Memory Computing (AIMC) systems based on Phase Change Memory (PCM) have been shown to be a valid candidate to enhance the energy efficiency of DNN accelerators. Although DNNs are quite resilient to computation inaccuracies, PCM non-idealities can strongly affect the precision of MVM operations, and thus the accuracy of DNNs. In this paper, a combined hardware and software solution to mitigate the impact of PCM non-idealities is presented. The drift of the PCM cell conductance is compensated at the circuit level through the introduction of a conductance ratio at the core of the MVM computation. A model of the behaviour of PCM cells is employed to develop a device-aware training for DNNs, and the accuracy is estimated on a CIFAR-10 classification task. This work is supported by a PCM-based AIMC prototype, designed in a 90-nm STMicroelectronics technology, and conceived to perform Multiply-and-Accumulate (MAC) computations, which are the kernel of MVMs. Results show that the MAC computation accuracy is around 95% even under the effect of cell drift. The use of a device-aware DNN training makes the networks less sensitive to weight variability, with a 15% increase in classification accuracy over a conventionally-trained Lenet-5 DNN, and a 36% gain when drift compensation is applied.


I. INTRODUCTION
IN THE era of big data, a plethora of applications require low-power computations involving large amounts of information [1], [2]. In this scenario, the performance of conventional digital computers is limited by the intrinsic communication bottleneck of the Von Neumann architecture, which needs data to be moved back and forth between the memory and the processing unit [3], [4]. In recent years, novel computational approaches have been investigated to overcome this limitation. Among those, Analog In-Memory Computing (AIMC) based on resistive memory devices has proven to be a promising non-Von Neumann solution for the fast and energy-efficient execution of Matrix-Vector Multiplication (MVM) [5]. As an example, MVMs represent a heavy workload for both training and inference in deep learning applications, and being able to perform them at O(1) time complexity through AIMC solutions would lead to huge benefits.
The goal of AIMC is to perform computations within the memory unit, typically leveraging the physical properties of the memory devices themselves, and taking advantage of Ohm's and Kirchhoff's laws [4], [6], [7]. Among the technologies that have been considered for AIMC, Phase-Change Memory (PCM) is one of the most promising due to its long-term storage of multi-bit quantities and its compatibility with the traditional CMOS fabrication processes [8], [9]. However, due to the peculiarities of the technology, only limited computation precision can be achieved. This, in turn, affects the accuracy of the analog accelerators and thus the performance at the application level.
The main issues associated with PCM devices are: i) the nonlinearity of the I-V characteristic; ii) the random conductance drift over time; and iii) the variability of the programmed cell states [10]. The I-V nonlinearity is typically addressed by encoding input values in time as the width of a constant-voltage signal applied across each cell of the array [11], [12], [13], [14], [15], [16]. Conductance drift over time is usually compensated by post-processing techniques [13], [17] or by employing HW-aware training solutions [18]. Finally, state-of-the-art iterative program-and-verify algorithms [19] can be used to reduce, but not completely remove, the variability in the state of identically programmed cells. Therefore, device-to-device variability, due to both drift and programming state, is still a widely investigated subject that limits the accuracy of PCM-based systems.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A deeper look at solutions applied to limit the effects of conductance drift in AIMC applications highlights the possibilities available at the technological, circuit and software level.
In [20] an investigation of two back-end fabrication processes of Ge-rich GeSbTe (Ge-GST) PCM cells is presented, along with their influence on the electrical characteristics of the devices, whose empirical model is discussed in [21].
Circuit solutions are analyzed in [13] and [22]. In the former, each analog weight matrix is extended, as time progresses, by the introduction of additional columns (i.e., neurons) to account for the lower dynamic range of the MVM output as conductances become progressively smaller. In the latter, conversely, it is observed that the typical implementation of negative weights with positive-only conductances, i.e., having a second analog array whose output is subtracted from the first, already leads to some measure of drift compensation. Again, the dynamic range of the output shrinks, thus requiring a renormalization to preserve performance. The renormalization proposed therein requires an additional array of PCM cells to estimate the drift of the SET state conductance (for binary-level applications, i.e., only using cells in the SET and RESET state).
Finally, solutions can be applied at the software level or in any case in the digital section of the processing chain. Authors in [23] define an ad-hoc regularization function applied during the NN training to limit the variability observed at the neuron level resulting from perturbations of the individual conductances. In [24] drift is addressed by renormalizing the drifted MVM output by modeling the median conductance decay and rescaling the argument of the nonlinear activation function following each layer to ensure that the entire nonlinearity domain is excited as expected for non-drifting weights. In [18] a periodic calibration procedure is used to update the parameters of the batch normalization layers, so that even when weights start to drift, those layers can still remap their outputs to zero-mean, unit-variance distributions.
Each technique comes with its own set of drawbacks, i.e., requiring a different fabrication process technology [25], a considerable area overhead associated with the AIMC unit [13], [22], reliance on accurate device models [24] or the periodic recalibration of the system [18]. By applying multiple techniques simultaneously, the requirements on each of them can be relaxed, with a potential reduction of the incurred cost.
This paper analyzes a circuit-level solution to mitigate the impact of the conductance drift on the inference task of Deep Neural Networks (DNNs), employing PCM devices as the core synaptic model. The technique used in this work is potentially compatible with many of the methodologies previously described and could lead to even better performance when applied together with them. This work is supported by empirical results obtained with an AIMC unit testchip [26] designed to implement Multiply-and-Accumulate (MAC) operations, i.e., the core of MVMs. The prototype, implemented in a 90-nm STMicroelectronics CMOS technology, includes the AIMC unit and a PCM IP, and it is conceived to extend the features of the memory IP without modifying its internal structure. The hardware architecture reduces the effects of PCM cell conductance drift on MAC operations at the circuit level with negligible area overhead, so that no post-processing conditioning is required; moreover, non-linearities in the device I-V characteristics are not excited. The conductance variability is then addressed in the context of a DNN-based classification task, by implementing the DNN training phase in a device-aware fashion, with measurement-based models. The benefits of the proposed solutions are evaluated on the CIFAR-10 dataset, employing two neural networks having significantly different complexities, i.e., Lenet-5 and VGG-8.
The paper is organized as follows. In Section II the architecture and the implementation of the employed testchip are recalled. In Section III the experimental results of hardware drift-compensation and accuracy are expanded and discussed. In Section IV numerical models are fit against programming variability and drift measurements. The combination of hardware compensation and device-aware training procedures is validated in Section V on the CIFAR-10 classification task. Finally, the conclusions are drawn.

II. AIMC PROTOTYPE
This section briefly describes the architecture of the AIMC prototype proposed in [26]. The system is used to execute one-step Multiply-and-Accumulate (MAC) operations with both signed inputs and coefficients. The prototype, depicted in Fig. 1(a), contains a peripheral AIMC unit interfaced with an embedded PCM (ePCM) IP [27]; the whole testchip is manufactured in a 90-nm STMicroelectronics CMOS technology, which features a specifically optimized Ge-rich GeSbTe (GST) alloy for PCM cells. The AIMC unit is directly connected to the Main BitLines (MBL) of the ePCM IP (Fig. 1(b)), and extends its functionalities by adding MAC execution features, while preserving its use as a standard binary ePCM memory array.

A. Computing Architecture
To perform MAC tasks, the AIMC unit sets the read voltage of each MBL and reads the current of the cells belonging to the addressed WordLine (WL). Therefore, the computation of an MVM requires multiple sequential activations of different WLs. The proposed implementation of the AIMC unit architecture has been conceived to mitigate part of the aforementioned PCM cell non-idealities. In particular, the adoption of time-coded inputs, along with cells being read at a fixed voltage, addresses the issue of a nonlinear device I-V characteristic. In addition, as explained in detail in Section III-C, PCM cell drift compensation is achieved by a hardware technique that makes the MAC output dependent on a conductance ratio, instead of an absolute conductance value.
Let us first consider the mathematical description of an MVM, i.e., $z = Wx$, with $z \in \mathbb{R}^m$, $x \in \mathbb{R}^n$ and $W \in \mathbb{R}^{m \times n}$. The $j$-th element of $z$ is computed through the MAC operation:
$$z_j = W_j x = \sum_{i=1}^{n} w_{j,i}\, x_i \tag{1}$$
where $W_j$ is the $j$-th row of $W$. Each element $w_{j,i}$ of $W_j = [w_{j,1}, \ldots, w_{j,n}]$ is expressed through a conductance $g_{j,i}$ for its magnitude, and through $g_{S,j,i}$ for its sign, each stored in a single PCM cell of the $j$-th wordline; thus, from a functional point of view, the implementation of each weight $w_{j,i}$ is:
$$w_{j,i} = \pm \frac{g_{j,i}}{g_{REF}} \tag{2}$$
where the sign is selected by comparing $g_{S,j,i}$ with $g_{th}$, the conductance threshold to encode a positive or negative weight sign, and $g_{REF}$ is a reference conductance. The actual details of the device-level implementation can be found in [26].  [26], which are interfaced with each MBL of the ePCM IP. The weighted summation is then obtained by integrating over a fixed time the sum of the currents $I_{j,i}$, each having a duration proportional to $x_i$. The expression of the differential output $V_{OUT_j}$, associated with the $j$-th wordline, is:
$$V_{OUT_j} = k \sum_{i=1}^{n} \pm \frac{g_{j,i}}{g_{REF}}\, V_i \tag{3}$$
where $k$ is a dimensionless constant value accounting for the effects of circuit parameters. The sign of the elementary product, which corresponds to the direction of the $I_{j,i}$ current being integrated, is managed by the Readout Circuit according to the sign of the original product $w_{j,i} x_i$ [26]. Considering $V_i$ and $g_{j,i}/g_{REF}$ as the absolute values of $x_i$ and $w_{j,i}$, respectively, $V_{OUT_j} = k \sum_i w_{j,i} x_i = k z_j$ is therefore proportional to the ideal signed MAC result $z_j$ already expressed in (1). This result, obtained with PCM devices being read at a constant voltage, does not suffer from memory cell nonlinearity. The adoption of ramp-driven time-coded inputs is indeed a common solution to overcome the nonlinear I-V characteristic of PCM devices [11], [12], [13], [14], [15]. However, the key feature of the proposed architecture lies in the introduction of a conductance ratio in implementing the MAC weights, which is beneficial to mitigate the effects of cell drift on the MAC results.
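As an illustrative sketch of the sign-magnitude weight encoding and of the time-coded MAC described above, the following Python snippet stores each weight as a magnitude conductance and a sign conductance, then accumulates the signed elementary products. The function names, the conductance values and the convention that a sign cell above `g_th` marks a positive weight are assumptions made for illustration only; they are not taken from [26].

```python
def encode_weight(w, g_ref, g_sign_low, g_sign_high):
    """Return (g, g_s): magnitude and sign conductances for weight w.

    Assumption: |w| = g / g_ref, and a sign cell programmed above the
    threshold marks a positive weight (the opposite polarity would work
    equally well).
    """
    g = abs(w) * g_ref
    g_s = g_sign_high if w >= 0 else g_sign_low
    return g, g_s


def mac_output(cells, inputs, g_ref, g_th, k=1.0):
    """Evaluate k * sum_i (+/-) (g_i / g_ref) * V_i for one wordline.

    cells  : list of (g, g_s) pairs, one per weight
    inputs : signed inputs x_i; V_i = |x_i| is the time-coded magnitude
    """
    total = 0.0
    for (g, g_s), x in zip(cells, inputs):
        weight_sign = 1.0 if g_s > g_th else -1.0
        input_sign = 1.0 if x >= 0 else -1.0
        # The sign of the product sets the direction of the integrated current.
        total += weight_sign * input_sign * (g / g_ref) * abs(x)
    return k * total


# Hypothetical values, in siemens.
G_REF, G_TH = 20e-6, 10e-6
weights = [0.5, -0.25, 1.0]
x = [1.0, 2.0, -0.5]
cells = [encode_weight(w, G_REF, 2e-6, 18e-6) for w in weights]
v_out = mac_output(cells, x, G_REF, G_TH)
ideal = sum(w * xi for w, xi in zip(weights, x))
print(v_out, ideal)
```

With `k = 1` the emulated output coincides with the ideal signed MAC, since only the ratio between each weight conductance and the reference enters the result.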
It has been shown [28] that the drift of a generic cell conductance $g(t)$ follows the power law $g(t) = g_0 (t/t_0)^{-\alpha}$, where $g_0$ is the conductance at an arbitrary initial time $t_0$, and $\alpha$ is the drift coefficient, which is positive and cell-to-cell variable. Since the MAC result is proportional to the conductance ratio $g_{j,i}(t)/g_{REF}(t)$, plugging the drift model of $g(t)$ into (3), the MAC operation evaluated at time $t_0 + t$ becomes:
$$V_{OUT_j}(t_0 + t) = k \sum_{i=1}^{n} \pm \frac{g_{0,j,i}}{g_{0,REF}} \left(\frac{t_0 + t}{t_0}\right)^{-(\alpha_{j,i} - \alpha_{REF})} V_i \tag{4}$$
where $g_{0,j,i}$ and $g_{0,REF}$ are the conductances of the weight and reference cells at $t_0$, respectively, and $\alpha_{j,i}$ and $\alpha_{REF}$ are the corresponding drift coefficients. Here the time $t$ is to be understood as significantly longer than an actual integration window, within which conductances are assumed to be constant. The notable aspect of (4) is the effective reduction of the drift coefficient to $\alpha_{j,i} - \alpha_{REF}$, so that drift is partially compensated. At the circuit level, the compensation mechanism acts on the slope of the shared ramp signal, which decreases as $g_{REF}$ experiences drift. This leads to an increase in the integration time and compensates for the drift-induced drop of the weight cell currents. To implement this, the architecture exploits only a single PCM cell to generate a reference ramp signal [26]. In case no compensation is adopted, the reference ramp still must be generated through a reference conductance. Therefore, the overhead due to the drift compensation consists only of the use of one PCM cell per WL, which consequently cannot be used to store MAC weights. Having the MAC output depend on the ratio of two correlated quantities, i.e., PCM conductances, leads to a general resilience of the system towards variabilities.
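The effect of the conductance ratio on drift can be checked numerically. The sketch below applies the power-law drift model to one weight cell and one reference cell and compares the relative loss of the weight value with and without a drifting reference. All conductances, drift coefficients and times are hypothetical values chosen for illustration, not measurements from the testchip.

```python
def drifted(g0, alpha, t, t0=1.0):
    """Power-law drift g(t) = g0 * (t / t0) ** (-alpha), for t >= t0."""
    return g0 * (t / t0) ** (-alpha)


# Hypothetical cell parameters (siemens, dimensionless drift coefficients).
G0_W, ALPHA_W = 10e-6, 0.05      # weight cell
G0_REF, ALPHA_REF = 20e-6, 0.04  # reference cell
T0, T = 1.0, 1.0e4               # initial and evaluation times, arbitrary units

ideal_ratio = G0_W / G0_REF
g_w = drifted(G0_W, ALPHA_W, T, T0)

# Compensated: the reference is a PCM cell and drifts together with the weight.
compensated = g_w / drifted(G0_REF, ALPHA_REF, T, T0)
# Uncompensated: the reference is constant in time (e.g., a resistor).
uncompensated = g_w / G0_REF

compensated_loss = 1.0 - compensated / ideal_ratio
uncompensated_loss = 1.0 - uncompensated / ideal_ratio
print(compensated_loss, uncompensated_loss)
```

In this toy setting the compensated ratio decays with the residual exponent `ALPHA_W - ALPHA_REF`, so the closer the two drift coefficients are, the smaller the residual error.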

B. Testchip Implementation
The testchip prototype [26] allows for $n = 12$, i.e., twelve inputs and coefficients. Its circuits have been designed for a $V_{DD} = 1.2$ V power supply and a read voltage $V_{REF} = 0.3$ V, leading to PCM cell currents ranging from hundreds of nA to tens of µA. The result of each MAC operation was obtained by measuring the voltage $V_{OUT_j}$ on an output pin. Experimental data were converted using a 16-bit ADC available on a dedicated evaluation board.
The design and implementation of this prototype are mainly focused on demonstrating the validity of its hardware drift compensation feature, and no specific strategy has been adopted to reduce the computing power consumption. Although the energy required to implement in-memory inference tasks depends upon the interconnection scheme of the network layers [2], [29], a theoretical estimate of the energy efficiency of the architecture can still be provided. In this scenario, a MAC operation of size $n = 12$ is estimated to be implemented with an energy efficiency of $\sim 0.34$ TOPS W$^{-1}$, which is approximately two orders of magnitude lower than the state of the art [12], [18], [30], [31], [32], but can still be improved in future designs based on this first prototype. In particular, the power consumption is dominated first of all by the PCM cell currents. As in more competitive solutions [12], [18], cells should be read at lower BL voltages, as well as programmed at lower conductances. A considerable portion of the power consumption is also ascribed to the comparators, which have been designed for high switching speed, so that the MAC computation is more precise. A different strategy to encode the inputs may be considered (e.g., with fully-digital solutions [33]). Finally, the high power consumption associated with the output integrator is due to the large amount of cell current being integrated on its feedback capacitance [26]. Alternative solutions to the current integrator, for example a direct current-to-digital conversion [12], should be taken into consideration.

III. EXPERIMENTAL EVALUATION OF THE AIMC UNIT
In this section, the performance of the AIMC unit is observed in terms of peripheral-circuit accuracy and of drift-compensation capability, providing a more detailed characterization with respect to [26]. The accuracy of the peripheral circuit is estimated by first replacing the PCM devices with programmable integrated resistors, having a 32-level resolution, so as to exclude the effects of the memory devices. In order to understand the variability in the programmed conductance of the cells, the iterative programming algorithm is detailed below as well. Drift compensation is then tested employing PCM cells as both MAC coefficients and reference conductance.

A. Accuracy of the AIMC Unit
To evaluate the accuracy of the peripheral circuitry, without the effects of the PCM cell non-idealities, the AIMC prototype was initially tested by performing $m = 10000$ random MAC operations $z = [z_1, \ldots, z_m]$, with $z_j = \sum_{i=1}^{12} w_{j,i} x_i$. In this test mode, the ePCM array has been replaced with an integrated test unit, where integrated resistors act both as weight coefficients and as compensation reference conductance. Each conductance can be selected among 32 available levels. Measurements have been performed on four different testchips, which showed no significant differences among them. Results related to a sample testchip are reported in Fig. 2(a), where the yellow curve refers to MAC operations with negative weights, whereas the other employs positive weights. The experimental data $z$ are distributed around the red lines representing the ideal MAC output $z_{id}$, obtained by evaluating (1) with the nominal values of the integrated resistors used as $g_{j,i}$ and $g_{REF}$.
The distribution of the MAC error $\varepsilon = z_{id} - z$ for the two cases is reported in Fig. 2(b). The accuracy of the circuit, defined as $(1 - \sigma_\varepsilon)$ [12], where $\sigma_\varepsilon$ is the standard deviation of $\varepsilon$, is then equal to 98.9% for the positive-weight MACs and to 98.4% for the negative ones.
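The accuracy figure of merit used above can be computed in a few lines. The sketch below assumes that both the ideal and the measured MAC values are normalized to the full-scale MAC range, so that the standard deviation of the error is a fraction of full scale, and it uses the population standard deviation; both choices are assumptions, as is the toy data.

```python
import statistics


def mac_accuracy(z_ideal, z_measured):
    """Return 1 - sigma_eps, with eps = z_ideal - z_measured.

    Assumption: both sequences are normalized to the full-scale MAC range.
    The population standard deviation is used here.
    """
    errors = [zi - zm for zi, zm in zip(z_ideal, z_measured)]
    return 1.0 - statistics.pstdev(errors)


# Toy data: errors alternate between -0.01 and +0.01 of full scale.
z_id = [-0.50, -0.25, 0.00, 0.25, 0.50, 0.75]
z = [-0.49, -0.26, 0.01, 0.24, 0.51, 0.74]
accuracy = mac_accuracy(z_id, z)
print(accuracy)
```

Since the metric depends only on the spread of the error, a constant offset between ideal and measured values does not lower it.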

B. PCM Cells Programming Algorithm
Various works have shown the possibility of reaching multiple conductance levels in PCM devices through the application of suitable SET and RESET current pulses [34]. These pulses cause the heating of a relevant portion of the cell, modifying its internal structure by an electro-thermal process. The current pulses are characterized by their maximum amplitude, time duration and falling-edge slope:
• A SET pulse is a trapezoidal current pulse, consisting of a melting phase, followed by a slow crystallization phase, leading the cell resistance to decrease.
• A RESET pulse is a single, high-amplitude, rectangular current pulse that melts the central portion of the cell, which will rearrange itself randomly, bringing it into a high-resistance state.
The experimental environment employed in this work is the one presented in [35]. The programming algorithm, based on a SET Stair-Case (SSC) approach [36], is outlined in Fig. 3. Once the target conductance level $g$ and its absolute tolerance $\delta g$ have been defined, a power RESET pulse of amplitude $A_R$ is applied to the cell. Then a sequence of partial SET pulses is applied to the cell, starting with an initial amplitude $A_{S0}$. The cell conductance value is measured after the application of each pulse. If the conductance level falls within the target interval, the programming algorithm terminates. If the target interval is exceeded, the algorithm starts again from the power RESET; otherwise, if the reached conductance is below the target conductance range, a new SET pulse is applied, whose amplitude is increased by $A_S$. To prevent the algorithm from running indefinitely, a maximum number of programming attempts is enforced. Whenever the targeted conductance value is not reached within the maximum number of iterations, the cell is excluded from further calculations. This operation corresponds to the Iter check box in Fig. 3.
The minimum amplitude $A_{S0}$ of the initial SET pulse is selected taking into account the target conductance of each cell, in order to reduce the number of SET pulses to be applied. In particular, $A_{S0}$ is gradually increased from a minimum of $A_{MIN}$ to a maximum of $3A_{MIN}$ as a function of the target conductance range, and the SET-amplitude step $A_S$ was chosen equal to $A_{MIN}/5$.
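The program-and-verify loop described above can be sketched as follows. The `ToyCell` class is a purely hypothetical stand-in for a PCM device, in which the conductance reached after a partial SET pulse simply tracks the pulse amplitude plus Gaussian noise, with all quantities normalized; it is not a physical model. Counting every SET pulse as one iteration against the limit is also an assumption of this sketch.

```python
import random


class ToyCell:
    """Hypothetical cell: conductance after a SET pulse tracks its amplitude."""

    def __init__(self, rng, noise=0.01):
        self.rng = rng
        self.noise = noise
        self.g = 0.0

    def reset(self):
        self.g = 0.0  # power RESET: back to the high-resistance state

    def set_pulse(self, amplitude):
        self.g = max(0.0, amplitude + self.rng.gauss(0.0, self.noise))

    def read(self):
        return self.g


def program_cell(cell, g_target, dg, a_s0, a_step, max_iter=250):
    """Staircase program-and-verify; returns (success, pulses_used)."""
    pulses = 0
    while pulses < max_iter:
        cell.reset()
        amplitude = a_s0
        while pulses < max_iter:
            cell.set_pulse(amplitude)
            pulses += 1
            g = cell.read()
            if abs(g - g_target) <= dg:
                return True, pulses   # within the target interval
            if g > g_target + dg:
                break                 # overshoot: restart from RESET
            amplitude += a_step       # undershoot: stronger SET pulse
    return False, pulses              # cell would be excluded


A_MIN = 0.1  # hypothetical normalized minimum SET amplitude
cell = ToyCell(random.Random(0))
ok, pulses = program_cell(cell, g_target=0.5, dg=0.025,
                          a_s0=A_MIN, a_step=A_MIN / 5)
print(ok, pulses, cell.read())
```

Starting the staircase closer to the target, as done in the text by scaling the initial amplitude with the target level, would reduce the number of pulses needed for high-conductance targets.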
In this work, 32 conductance levels have been chosen to program 6400 PCM devices. The targets are equally spaced below a maximum target $g_{MAX}$, with an absolute tolerance $\delta g = \pm 0.025\, g_{MAX}$. The result of the programming procedure at $t = t_0$ is shown in Fig. 4, where the cumulative distribution functions (CDFs) of the cells for each programming target are reported. All the 6400 cells were programmed successfully within a maximum number of iterations equal to 250. The conductance level distributions overlap, so as to allow the feasible programming of all 32 levels within the maximum reachable conductance $g_{MAX}$.

C. Drift Compensation
Drift compensation is the main target of the described AIMC unit, and its key element consists in the use of a reference PCM cell $g_{REF}$ for the ramp generation. Its level can be chosen: i) to maximize the $V_{OUT}$ output swing, and ii) to compensate the drift effects on MAC operations.
The output voltage $V_{OUT}$ can vary between $\pm V_{OUT}^{MAX}$, a limit determined by the design of the output integrator. The maximum output swing, $V_{OUT}^{MAX}$, enforces an upper bound on the maximum MAC operation, i.e., from (3), $k \sum_{i=1}^{n} \frac{g_{j,i}}{g_{REF}} V_i \le V_{OUT}^{MAX}$. Thus, one can obtain a minimum value for $g_{REF}$:
$$g_{REF} \ge k\, n\, g_{MAX}\, \frac{V_{IN}^{MAX}}{V_{OUT}^{MAX}} \tag{5}$$
where $V_{IN}^{MAX}$ is the analog value corresponding to the maximum input $x_{MAX}$. Condition (5) represents the worst-case constraint on $g_{REF}$, as it assumes the maximum programmable conductance $g_{MAX}$ for each stored weight $g_{j,i}$ [35]. However, in practical implementations, where the $w_{j,i}$ values and consequently the $g_{j,i}$ of the whole array are known, the previous condition can be relaxed considering the maximum amount of conductance per WL:
$$g_{REF} \ge k\, \frac{V_{IN}^{MAX}}{V_{OUT}^{MAX}}\, \max_j \sum_{i=1}^{n} g_{j,i} \tag{6}$$
If the inequality in (6) is satisfied, all possible MAC operations are mapped within the available output swing (as shown by the black line in Fig. 5); otherwise, the output voltage may saturate (as represented by the purple line). Thanks to the compensation technique, the first condition is maintained over time, whereas without compensation the drift-induced random drop of the MAC weights would translate into an appreciable reduction of the output swing, as depicted by the yellow curve of Fig. 5, with consequent issues in any further processing of the output.
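The sizing of the reference conductance can be illustrated with a short sketch. It assumes that the bound follows from requiring that the largest possible output, $k \sum_i (g_{j,i}/g_{REF}) V_{IN}^{MAX}$, stays within $V_{OUT}^{MAX}$; the function names and all numerical values are hypothetical, and a small array with three cells per wordline is used instead of the twelve of the testchip.

```python
def g_ref_min_worst_case(k, n, g_max, v_in_max, v_out_max):
    """Lower bound on g_ref when every one of the n weights equals g_max."""
    return k * n * g_max * v_in_max / v_out_max


def g_ref_min_for_array(k, conductances, v_in_max, v_out_max):
    """Relaxed bound using the largest per-wordline sum of conductances."""
    worst_row_sum = max(sum(row) for row in conductances)
    return k * worst_row_sum * v_in_max / v_out_max


def output_saturates(k, row, g_ref, v_in_max, v_out_max):
    """True if a wordline driven by maximum inputs exceeds the output swing."""
    v_out = k * sum(g / g_ref for g in row) * v_in_max
    return v_out > v_out_max


# Hypothetical values: conductances in siemens, voltages in volts.
K, G_MAX, V_IN_MAX, V_OUT_MAX = 0.1, 20e-6, 0.5, 0.4
array = [
    [5e-6, 10e-6, 20e-6],
    [20e-6, 20e-6, 15e-6],
]
worst_bound = g_ref_min_worst_case(K, 3, G_MAX, V_IN_MAX, V_OUT_MAX)
array_bound = g_ref_min_for_array(K, array, V_IN_MAX, V_OUT_MAX)
print(worst_bound, array_bound)
```

The array-specific bound is never larger than the worst-case one, which leaves room to pick a smaller reference and hence a larger output swing when the stored weights are known.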
In the proposed AIMC architecture, the value of the reference cell conductance is crucial for the effectiveness of the drift compensation, as PCM devices tend to exhibit drift coefficients with cell-to-cell variability and a correlation to their initial conductance [35], [37], [38], both effects leading to an imperfect compensation of the drift exponents in (4). The optimal value of $g_{REF}$, which satisfies (6), has been found by simulating 10000 random MAC operations $z$ (the exact same set of inputs and weights used in Section III-A). $V_{OUT}$ has been computed according to the model (3) at time $t_0$ with the target values of the cell programming (i.e., without drift), obtaining the target MAC values $z$; then, the effects of drift have been simulated in $z(t)$ with (4), where the drift coefficient values have been taken from a previous work [38]. Fig. 6 depicts the MAC accuracy, already defined as $(1 - \sigma_\varepsilon)$, as a function of the ratio $g_{REF}/g_{MAX}$. The different curves refer to the three considered time intervals, i.e., 2 hours and 18 hours at room temperature, and after a 24-hour 90 °C bake. To simulate the condition where no compensation is adopted, the reference conductance has been kept constant in accordance with [26]. The measured $z(t)$ are plotted as a function of the ideal MAC $z_{id}$, in the three considered time instants. The results are also compared with the same operations performed with no compensation. In this case, the reference conductance $g_{REF}$ is implemented with an integrated resistance; thus, since the ramp reference current $I_{REF}$ is constant in time, no drift compensation is applied. The distribution of the MAC error $\varepsilon = z_{id} - z(t)$ is also reported in the same figure both with and without drift compensation.
MAC accuracy remains nearly constant over time when compensation is adopted (97.7% after 2 hours and 96.8% after 14 hours at room temperature), even after a drift-inducing 24-hour 90 °C bake (94.8%) [17]; otherwise, the standard deviation $\sigma_\varepsilon$ of the error tends to increase, with a consequent decrease of MAC accuracy over time (92.2%, 90.3% and 81.9%, respectively). It is evident that when no compensation is adopted, the output swing is reduced, as previously discussed.

IV. MODELING THE CONDUCTANCE VARIABILITY
Validating the device and circuit performance in a (simulated) application requires a numerical model for the device properties, namely the variability of the programmed conductance under the effect of the iterative programming, and the conductance drift, both with and without hardware compensation. To this end, PCM cells were characterized by executing a MAC operation for each $g_{j,i}$. To isolate a single cell, among the 12 that determine a MAC operation, one external input $V_k$ has been applied at a time, setting the others to 0. The AIMC output, as expressed in (3), then reads:
$$V_{OUT_j} = \pm k\, \frac{g_{j,k}}{g_{REF}}\, V_k \tag{7}$$
and depends on the behavior of the single cell $g_{j,k}$ only. The only nonzero input, $V_k$, was chosen equal to its maximum value $V_{IN}^{MAX}$ for increased accuracy. Each individual level $l$ can be reasonably approximated by a normal probability density $\mathcal{N}(\mu_p^{(l)}, \sigma_p^{(l)})$, whose standard deviations are depicted as crosses in Fig. 8 against the mean normalized conductance. From the data, a model for conductances affected by programming variability can be defined as $g_0 + g_p(g_0)$, where $g_0$ is the nominal conductance and $g_p(g_0)$ a zero-mean Gaussian perturbation. To obtain the standard deviation of the $g_p$ term, a continuous function $\sigma_p(g)$, given by model (8), has been fit to the data. Its parameters $\sigma_0$, $\sigma_1$ and $\gamma_0$ have been found by a nonlinear least squares fit using the Levenberg-Marquardt algorithm. As negative weights are implemented by mapping their magnitude and sign onto different devices, and assuming that only errors on the former can be observed at the output, the model is extended towards negative $g$ values by setting $\sigma_p(g) = \sigma_p(-g)$. This decision was grounded in the observation of a similar behavior for positive and negative weights in Fig. 2.
The standard deviation of the cell spread at the end of the programming phase (time $t_0$) is reported in Fig. 8(a) as a function of the mean programmed conductance of each target level. As expected, all levels have been programmed with a spread standard deviation under the programming tolerance $\delta g = \pm 0.025\, g_{MAX}$. This tolerance is arbitrarily tunable, and directly impacts the energy required for the programming phase.
Conductance drift has then been observed, both for compensated and uncompensated cells, in the same settings described in Section III-C, i.e., after 2 hours, 18 hours and after a 24-hour bake at 90 °C. The mean $\mu_d^{(l)}$ and standard deviation $\sigma_d^{(l)}$ of the conductance variation $g_d = g(t_1) - g(t_0)$ observed in each programmed level are shown in Fig. 8(b) and (c). Note how the hardware compensation scheme reduces the mean component of the drift by up to one order of magnitude (for the cells which underwent a bake), while the spread of the level is (slightly) increased. This is caused by the reference cell conductance $g_{REF}$ being affected by its own variability, thus introducing an additional perturbation in the PCM-implemented levels. The standard deviation data have been fitted by model (8). Conversely, a polynomial of order 3 has been used for the error in the mean value of the programmed level, with a saturation applied so that it does not become positive for sufficiently low conductance values. The resulting functions $\mu_d(g)$ and $\sigma_d(g)$ are the solid lines in Fig. 8. Though measurements are scarce in the region around 0.1, the trend predicted by the model is in line with what is observed in the device characterizations performed in other works [18].
The curve describing the standard deviation of an uncompensated drift in Fig. 8(b) after the 24-hour bake results in an almost straight line. The intuitive explanation is that conductances observed after the bake are more densely packed in the lower half of the conductance domain, as shown in Fig. 9. They end up in the linear-growth region for the 2-hour and 18-hour models. Moreover, as larger conductances experience a more significant drift, they also spread out more, leading to large standard deviations. When such large deviations are mapped to the initial target, the small post-drift domain occupied by the conductances is expanded to fill the entire horizontal domain, hence the yellow curve in the bottom plot of Fig. 8(b) surpassing the other two and having a linear trend.
As a final note, the models derived for the drift are not continuous over time, i.e., their description only refers to the specified test conditions. Furthermore, the models are extended towards negative weights by assuming the standard deviation to be an even function, i.e., $\sigma_d(g) = \sigma_d(-g)$, and the mean to be an odd one, i.e., $\mu_d(g) = -\mu_d(-g)$. This choice ensures that in any case drift makes the cells more resistive as time goes by.

V. PCM-AWARE DNN TRAINING AND EVALUATION
To evaluate the performance of the proposed variability mitigation strategies on an actual application, a classification task on the well-known CIFAR-10 dataset has been selected as a testbench [39]. Two popular neural networks have been used, the Lenet-5 [40] and the VGG-8 [13], having significantly different complexities, with $\sim 8 \times 10^5$ and $\sim 4 \times 10^7$ trainable parameters, respectively. Their implementation has been suitably modified so that each synapse would emulate a PCM device, with the possibility of enabling conductance programming variability and drift at will.
With reference to a typical dense layer, the description of the $j$-th neuron output is $h_j = f(b_j + \sum_i w_{j,i} x_i)$, with inputs $x_i$, weights $w_{j,i}$, bias terms $b_j$ and nonlinear activation $f(\cdot)$. A PCM-based layer driven by time-encoded inputs would instead be represented by:
$$h_j = f\left(b_j + k \sum_{i} \pm \frac{g_{j,i}}{g_{REF}}\, V_i\right) \tag{9}$$
where equation (3) has replaced the MAC in the original formulation. This same reasoning can be trivially extended to convolutional layers and allows the definition of a fully PCM-based DNN. If programming noise and drift are being introduced, the elementary synapse conductance becomes
$$g_{j,i} = g_0 + g_p(g_0) + g_d(g_0, t) \tag{10}$$
where $g_p(g_0)$ is the programming-induced variability, having distribution $\mathcal{N}(0, \sigma_p(g_0))$, and $g_d(g_0, t)$ models the drift by drawing from a $\mathcal{N}(\mu_d(g_0, t), \sigma_d(g_0, t))$ distribution, using the models depicted in Fig. 8. Both neural networks have been trained with the Adam optimizer [41], using the following parameters: exponential decay rates for the 1st and 2nd moments equal to 0.9 and 0.99, and learning rate equal to $10^{-2}$ for the Lenet-5 network and $10^{-3}$ for the VGG-8 one. During training, the learning rates have been halved whenever the process reached a plateau for a predefined number of epochs.
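A perturbed synapse of this kind can be emulated by drawing the programming and drift terms from normal distributions whose parameters depend on the nominal conductance. In the sketch below the three model functions are placeholders with arbitrary coefficients, written only to make the example runnable; they are not the curves fitted in Fig. 8, and conductances are assumed to be normalized to the maximum programmable value.

```python
import random


def perturbed_conductance(g0, sigma_p, mu_d, sigma_d, rng):
    """Sample g0 + g_p + g_d for one synapse.

    sigma_p, mu_d, sigma_d are callables of the nominal conductance g0.
    """
    g_p = rng.gauss(0.0, sigma_p(g0))        # programming variability
    g_d = rng.gauss(mu_d(g0), sigma_d(g0))   # drift at a fixed test condition
    return g0 + g_p + g_d


# Placeholder models (NOT the fitted curves of the paper).
def sigma_p(g0):
    return 0.01 + 0.01 * abs(g0)   # even in g0


def mu_d(g0):
    return -0.05 * g0              # odd in g0: cells become more resistive


def sigma_d(g0):
    return 0.005 + 0.02 * abs(g0)  # even in g0


rng = random.Random(0)
samples = [perturbed_conductance(0.8, sigma_p, mu_d, sigma_d, rng)
           for _ in range(10000)]
mean = sum(samples) / len(samples)
print(mean)
```

Because the drift models are only defined at the specified test conditions, the time dependence is folded into the choice of `mu_d` and `sigma_d` rather than passed as an argument.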
Let us first observe how the two DNNs, trained without any weight variability, perform when the $g_p$ term is introduced only at inference time. To widen the scope of the analysis, the injected perturbation is scaled by a multiplying factor. One reason to do so could be to relax the tolerance $\delta g = \pm 0.025\, g_{MAX}$ of the programming algorithm described in Section III-B, allowing it to converge in a lower number of iterations, speeding up the initial setup of the memory or a possible refresh of its values. The dotted curves in Fig. 10 highlight the sudden loss of performance as soon as noise is injected in the Lenet-5 DNN. The larger VGG-8 network, other than having a higher accuracy, is also more resilient towards the injected perturbation. This is thought to be the effect of the additional redundancy introduced by the larger number of weights, as in [42]. The datapoint corresponding to a spread multiplier (SM) of 1 has been highlighted, as it corresponds to the performance observed under the current programming parameters.
To make the network aware of the programming spread affecting its weights, a training methodology inspired by the fake quantization procedure [43] has been employed. This is a known methodology for the construction of NNs robust against synapse variability, and has been used extensively in the literature [18], [44]. At training time, it requires the addition of a perturbation before the weights are actually applied to the inputs. This affects the network result, and hence the starting point of the backpropagation algorithm [45]. The weight-update process then computes the derivative with respect to the original, nominal weights. Empirical evidence shows that this makes the network more resilient to weight variations. The original technique was devised to make the network robust to weight quantization. In that case, the properties of the injected variability depend on the number of allowed levels. For the PCM-based layers, instead, the injected perturbation models the programming-induced variability, i.e., the $g_p$ term in (10). The results plotted as solid lines in Fig. 10 refer to DNNs trained and evaluated with an identical spread multiplier. The performance gain is much more pronounced for the smaller Lenet-5 than for the larger VGG-8, so much so that the former becomes implementable on the currently available technology. At a multiplier of 1, the Lenet-5 shows a 2.2% drop in accuracy (69.4% down to 67.2%) compared to the ideal, unperturbed setup and a 15% increase (52.2% to 67.2%) with respect to the conventionally-trained DNN with weight perturbation injected at evaluation time. This result, in conjunction with recent observations on the issues with the IR drop in large PCM arrays [46], highlights the value of the device-aware training technique to construct small and robust DNNs.
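The perturbation-injection principle can be sketched on a single linear layer. The example below is a minimal illustration and not the training code used for the reported results: it uses plain gradient descent instead of Adam, a mean-squared-error loss on synthetic data, and a constant noise level `sigma`. All names and values are hypothetical.

```python
import numpy as np


def noisy_training_step(w, b, x, y, sigma, lr, rng):
    """One gradient step with weight noise injected in the forward pass only."""
    # A fresh perturbation is drawn at every step and is never stored.
    w_noisy = w + rng.normal(0.0, sigma, size=w.shape)
    pred = x @ w_noisy.T + b          # forward pass sees the perturbed weights
    err = pred - y
    loss = np.mean(err ** 2)
    # The noise is additive, so dL/dw equals dL/dw_noisy:
    # the gradient is evaluated at w_noisy and applied to the nominal w.
    grad_pred = 2.0 * err / err.size
    grad_w = grad_pred.T @ x
    grad_b = grad_pred.sum(axis=0)
    return w - lr * grad_w, b - lr * grad_b, loss


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))           # synthetic inputs
w_true = rng.normal(size=(3, 4))       # synthetic target weights
y = x @ w_true.T
w, b = np.zeros((3, 4)), np.zeros(3)
initial_loss = np.mean((x @ w.T + b - y) ** 2)
for _ in range(500):
    w, b, _ = noisy_training_step(w, b, x, y, sigma=0.05, lr=0.1, rng=rng)
final_loss = np.mean((x @ w.T + b - y) ** 2)  # evaluated without noise
```

In a deep-learning framework the same idea amounts to adding the sampled perturbation to each layer's weights inside the forward pass, while the optimizer keeps updating the unperturbed parameters.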
Meaningful observations follow from the behavior described by the dashed lines in Fig. 10. They represent the DNNs accuracy as a function of the increasing spread multiplier applied in the inference phase, while the device-aware training is performed with a constant multiplier of 1. Performance is improved with respect to the conventional training, allowing the network to tolerate higher spreads on the network coefficients. A direct consequence is the possibility of relaxing the requirements of the programming technique, without retraining the network, and with benefits in terms of programming speed and energy efficiency. At the same time, additional nonidealities, e.g., quantization of pre- and post-activation signals or the presence of parasitic elements along the conductive paths, to mention but a few, could already be managed, up to a certain level, by a network not trained specifically to address those issues.

Fig. 11. Classification accuracy when quantizing the signals applied to and read from every layer, for NNs trained to exclusively address PCM programming spread using a multiplier (SM) of 1.
Indeed, it has been observed that device-aware training techniques do not need to accurately describe the variability of interest, because the training inherently leads to networks that are robust against effects different from the perturbations used in training [18], [44]. As an example, Fig. 11 shows the classification accuracy of the two networks trained at SM=1 to address only PCM programming error, but evaluated with the introduction of quantized activations between each layer. Results show that both networks can tolerate quantization down to 6 bits with a performance degradation below 1%, while 5 bits introduce a loss of around 5 percentage points. More severe perturbations should be explicitly addressed during the training procedure [47]. In any case, the same perturbation-injection principle used in this work could be used to address signal quantization (the original purpose of the technique) or even the presence of parasitic elements in the analog array [44].
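For reference, quantizing an activation signal to a given number of bits can be sketched as follows. The uniform, unsigned quantizer and the full-scale value `x_max` are assumptions made for illustration; the quantizer actually used for the evaluation in Fig. 11 is not specified in this section.

```python
import numpy as np


def quantize_activations(x, n_bits, x_max):
    """Uniformly quantize x to 2**n_bits levels over the range [0, x_max].

    Values outside the range are clipped; the unsigned range is an assumption.
    """
    levels = 2 ** n_bits - 1                 # number of quantization steps
    x_clipped = np.clip(x, 0.0, x_max)
    return np.round(x_clipped / x_max * levels) / levels * x_max


x = np.linspace(-0.5, 1.5, 1001)             # synthetic activation values
q6 = quantize_activations(x, n_bits=6, x_max=1.0)
q5 = quantize_activations(x, n_bits=5, x_max=1.0)
```

Applying such a function to the signals entering and leaving each layer, for decreasing `n_bits`, reproduces the kind of sweep described above.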
Having a network that can tolerate programming variability, the final step is to observe its robustness against weight drift. Both networks, trained with a spread multiplier of 1 (i.e., with programming tolerance $\delta_g = \pm 0.025\, g_{MAX}$), have been re-evaluated by introducing the drift component of the conductance $g_d$ at inference time. Fig. 12 shows that the hardware compensation allows the accuracy to be retained over time. The accuracy gain after the 24-hour 90 °C bake is 36% for the Lenet-5 (even though the corresponding point for the uncompensated evaluation falls outside the range of the plot) and 22% for the VGG-8 DNN, while the drop with respect to the no-drift condition is 3% and 0.2%, respectively. The benefit is thus larger for the smaller network. However, even the VGG-8, which would lose significant accuracy after the 24-hour bake, is able to preserve its original performance with the introduction of the compensation technique.

VI. CONCLUSION
In this paper, a combined hardware and software technique is employed to enhance Deep Neural Networks (DNNs) performance in inference tasks using measurements from an Analog In-memory Computing (AIMC) prototype based on a Phase Change Memory (PCM) IP. Empirical results show that the Multiply-and-Accumulate (MAC) accuracy of the employed test chip is kept at around 95% over time, thus validating the proposed hardware compensation scheme. The spread and retention of the programmed conductance states have been characterized and modeled, including the effects of the hardware drift-compensation technique. The results have been used in a classification task on the CIFAR-10 dataset, where a device-aware training procedure was employed to make the DNNs resilient to weight variability. The tests show that the proposed combined techniques allow a 15% increase in accuracy for the Lenet-5 network compared to the conventionally trained one, with a marginal drop with respect to the ideal reference setup. Drift compensation enables the networks to retain accuracy over time and is especially beneficial for smaller DNNs, recovering up to 36% in accuracy compared to the uncompensated drift.