Training of Quantized Deep Neural Networks using a Magnetic Tunnel Junction-Based Synapse

Quantized neural networks (QNNs) are being actively researched as a solution for the computational complexity and memory intensity of deep neural networks. This has sparked efforts to develop algorithms that support both inference and training with quantized weight and activation values, without sacrificing accuracy. A recent example is the GXNOR framework for stochastic training of ternary (TNN) and binary (BNN) neural networks. In this paper, we show how magnetic tunnel junction (MTJ) devices can be used to support QNN training. We introduce a novel hardware synapse circuit that uses the MTJ stochastic behavior to support the quantized update. The proposed circuit enables processing near memory (PNM) of QNN training, which subsequently reduces data movement. We simulated MTJ-based stochastic training of a TNN over the MNIST, SVHN, and CIFAR10 datasets and achieved an accuracy of 98.61%, 93.99% and 82.71%, respectively (less than 1% degradation compared to the GXNOR algorithm). We evaluated the synapse array performance potential and showed that the proposed synapse circuit can train ternary networks in situ, with 18.3TOPs/W for feedforward and 3TOPs/W for weight update.


Introduction
Deep neural networks (DNNs) are the state-of-the-art solution for a wide range of applications such as computer vision and natural language processing. The classic DNN approach requires frequent memory accesses and is compute-intensive, requiring numerous multiply and accumulate (MAC) operations. For example, the ResNet50 network requires 3.9 billion MAC operations, while storing and accessing 25.5MB of weights [1]. As such, DNN performance is limited by computing resources and power budget. Therefore, efforts have been made to design dedicated hardware for DNNs [2,3,4]. These solutions support training with high resolution, such as 32-bit floating point. Still, DNN models are power-hungry and tend not to be suitable for running on low-power devices.
Ternary neural networks (TNNs) and binary neural networks (BNNs) are being explored as a way to reduce the computational complexity and memory footprint of DNNs. By reducing the weight resolution and activation function precision to quantized binary {−1, 1} or ternary {−1, 0, 1} values, the MAC operations are replaced by much less demanding logic operations, and the number of required memory accesses is significantly reduced. Such networks are also known as quantized neural networks (QNNs) [5]. The potential efficiency of QNNs has motivated research efforts to design novel algorithms that can support BNNs and/or TNNs without sacrificing solution performance (usually measured by prediction accuracy). These efforts include data quantization during training. In this work, we focus on the GXNOR training algorithm [6]. This algorithm uses a stochastic update function to facilitate the training phase. Unlike other algorithms [7,8,5], GXNOR does not require storing the full value (e.g., in a floating point format) of the weights and activations. Hence, GXNOR enables further reduction of the memory capacity during the training phase.
Emerging memory technologies, such as spin-transfer torque magnetic tunnel junction (STT-MTJ), can be used to design dedicated hardware to support in-situ DNN training, with parallel and energy-efficient operations. The near-memory computation enabled by these technologies also reduces overall data movement. The MTJ is a binary device, with two stable resistance states. Switching the MTJ device between resistance states is a stochastic physical process. While typically, stochastic switching is not a desirable property for memory cells to have, we exploit this feature to support QNN training.
Previous works used the stochastic behavior of the STT-MTJ, or other memristive technologies such as resistive RAM (RRAM), to implement hardware accelerators for BNNs [9,10,11,12]. In [9], the research focus was on the architecture level of BNN accelerators, without supporting training. Other works implemented hardware for bio-inspired artificial neural networks (ANNs), using the spike-timing-dependent plasticity (STDP) training rule [10,11]. Although STDP is widely used for bio-inspired ANNs, common DNNs are trained with gradient-based optimization such as stochastic gradient descent (SGD) and adaptive moment estimation (ADAM) [13]. A recently proposed MTJ-based binary synapse comprising a single transistor and a single MTJ device (1T1R) [12] supports training QNNs with binary weights and real-valued activations. The design in [12] exploits analog computation to support processing near memory (PNM). That design, however, requires two update operations to execute the SGD updates. Moreover, using real-valued activations requires high-resolution data converters, thereby increasing the area and power consumption of the solution.
In this paper, we explore the stochastic behavior of the MTJ and leverage it to support fully quantized training (GXNOR). Our solution reduces the overall weight and read operations and the cost of the update phase. We propose a four-transistor, two-MTJ (4T2R) circuit for a ternary stochastic synapse and a two-transistor, single-MTJ (2T1R) circuit for a binary stochastic synapse, where the intrinsic stochastic switching behavior of the MTJ is used to perform a stochastic update function. Such a design enables highly parallel, energy-efficient, and accurate in-situ computation. Our designed synapse can support various DNN optimization algorithms, such as SGD and ADAM, which are used regularly in practical applications.
We evaluated TNN and BNN training using the proposed MTJ-based synapse with PyTorch over the MNIST [14], SVHN [15], and CIFAR10 [16] datasets, where the circuit parameters were extracted from SPICE simulations using a GlobalFoundries 28nm FD-SOI process. Our results show that using the MTJ-based synapse for training yielded results similar to those of the ideal GXNOR algorithm, with a small accuracy loss of 0.7% for the TNN and 2.4% for the BNN.
This paper makes the following contributions. It:
• Exploits the MTJ stochastic properties to support QNN stochastic training.
• Demonstrates MTJ applicability within the GXNOR framework. We show that PNM of stochastic QNN training is enabled using the MTJ-based synapse, with only a small accuracy reduction.
• Offers MTJ-based ternary and binary synapse circuits. These circuits:
  - Exploit the stochastic switching of the MTJ device to support a stochastic weight update algorithm,
  - Support in-situ weight update of standard optimization algorithms such as SGD and ADAM, without reading the weight data out of the synapse array,
  - Support near-memory processing of the feedforward and backpropagation computations, enabling high parallelism.
The rest of the paper is organized as follows. In Section 2, background on DNN and QNN training and on the MTJ is given. Section 3 addresses the motivation for MTJ-based training. Section 4 describes the proposed MTJ-based ternary synapse. In Section 5, the ability of the proposed circuits to support TNN training is evaluated, as well as their energy efficiency. In Section 6, a comparison to previous works is given. A conclusion is provided in Section 7. In the supplementary material, we explain how to modify the proposed circuits to support BNNs.

Deep Neural Networks
DNNs are machine learning models that use connected layers, composed of neurons, to learn a desired functionality F. The different neuron layers are connected through weighted connections called synapses. For simplicity, this section focuses on a fully connected (FC) layer; however, a similar computation is done for layers with other weight connections, such as convolution (CONV) layers [2,7,8,17]. In an FC layer, the output is given by the matrix-vector multiplication
$$o = W x, \tag{1}$$
where the elements of the matrix $W$ are the synapse weights, $x$ is the input neuron vector, and $o$ is the output. Hence, each element $o_m$ of the output vector is the weighted sum of the inputs,
$$o_m = \sum_{n=1}^{N} w_{mn} x_n, \tag{2}$$
where $N$, $o_m$, $w_{mn}$, and $x_n$ are, respectively, the number of input neurons, the output $m$, the synapse weight between neuron $m$ and neuron $n$, and the value of input neuron $n$. The following neuron layer is computed by passing $o$ through a non-linear function, called an activation function $\sigma(\cdot)$, and is, therefore, given by
$$x^{l+1} = \sigma(o^{l}), \tag{3}$$
where $l$ is the layer index.
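The FC-layer computation described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `fc_forward`, the choice of `tanh` as the activation, and the numeric values are assumptions made for the example, not part of the original work.

```python
import math

def fc_forward(W, x, activation=math.tanh):
    """Compute one fully connected layer.

    W is a list of rows (one row per output neuron), x is the input vector.
    Returns the pre-activation vector o and the next-layer input sigma(o).
    """
    # o_m is the weighted sum of the inputs for output neuron m.
    o = [sum(w_mn * x_n for w_mn, x_n in zip(row, x)) for row in W]
    # The following layer is obtained by applying the activation element-wise.
    x_next = [activation(o_m) for o_m in o]
    return o, x_next

# Illustrative values (hypothetical): 2 output neurons, 3 input neurons.
W = [[1.0, -2.0, 0.5],
     [0.0, 1.0, 1.0]]
x = [1.0, 0.5, 2.0]
o, x_next = fc_forward(W, x)
```

With these values, the first output is 1.0 - 1.0 + 1.0 and the second is 0.5 + 2.0; each output neuron costs N multiply-accumulate operations, which is the cost that quantization aims to reduce.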

Training a DNN
In supervised learning, the network is trained to find the set of parameters, i.e., synapse weights, which approximates the desired functionality. The network is trained using a dataset $D = \{x_i, d_i\}$, where $x_i$ is the input vector of the network and $d_i$ is the desired output. During training, the network parameters $W$ are calibrated to find the desired relation $d = F(x^{(0)}, W)$. To this end, a measure of quality is defined: the cost function $C(d, O)$, where $O$ is the output of the network. The goal of the learning algorithm is to find $W$ that minimizes the value of $C(d, O)$ with respect to the dataset. Hence, optimization algorithms are used to find the minimum of $C$. Different optimization algorithms, such as SGD and ADAM, are used during DNN training [13]. During training, first the output and the cost function are computed in a stage called feedforward. After the cost function is known, the error of each layer, $y^{l}$, is computed in a stage called backpropagation. The error is computed using the chain rule and is given by
$$y^{l} = \left( (W^{l+1})^{T} y^{l+1} \right) \odot \sigma'(o^{l}), \tag{4}$$
where $W^{T}$ is the transpose of the weight matrix $W$, and $\sigma'$ is the derivative of $\sigma$ with respect to $o$. Taking the computed error of the layer, the weight gradients are computed and used to update the weights. Usually, the weight update rule is given by
$$W^{l} \leftarrow W^{l} + \Delta W^{l}, \quad \Delta W^{l} = f_{opt}\!\left( \frac{\partial C}{\partial W^{l}} \right), \tag{5}$$
where $f_{opt}$ is defined by the optimization algorithm that is used. General DNNs do not limit the value of the weights, which can be any real value. Typically, the unconstrained parameters are represented with a precision much higher than 1 or 2 bits, such as 32-bit floating point. The following section describes a framework to train QNNs.
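As a rough sketch of the backpropagation and update stages, the snippet below propagates an error vector through one layer and applies a plain SGD step. It assumes the common formulation in which the layer error is the transposed next-layer weight matrix times the next-layer error, multiplied element-wise by the activation derivative; the helper names, the learning rate `lr`, and the numeric values are hypothetical.

```python
def backprop_error(W_next, y_next, o, dsigma):
    """Propagate the error one layer back.

    Assumed form: y = (W_next^T y_next) * sigma'(o), element-wise.
    W_next has one row per neuron of the next layer.
    """
    n_inputs = len(W_next[0])
    # (W_next^T y_next)_n = sum over m of W_next[m][n] * y_next[m]
    wt_y = [sum(W_next[m][n] * y_next[m] for m in range(len(W_next)))
            for n in range(n_inputs)]
    return [v * dsigma(o_n) for v, o_n in zip(wt_y, o)]

def sgd_update(W, y, x, lr):
    """Plain SGD step: W[m][n] <- W[m][n] - lr * y[m] * x[n]."""
    return [[w - lr * y_m * x_n for w, x_n in zip(row, x)]
            for row, y_m in zip(W, y)]

# Illustrative values (hypothetical), with an identity-slope activation.
err = backprop_error([[1.0, 2.0], [3.0, 4.0]], [1.0, -1.0],
                     [0.0, 0.0], lambda o_n: 1.0)
W_new = sgd_update([[0.0, 0.0]], [1.0], [2.0, 3.0], lr=0.5)
```

Other optimizers such as ADAM would replace `sgd_update` with their own rule while the error propagation stays the same.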

Training Quantized Neural Networks with a Stochastic Update Rule
In recent years, efforts have been made to make DNN models more hardware-compatible. Quantization methods have been explored, where the DNN weights and activation functions are constrained to discrete values such as binary {−1, 1} or ternary {−1, 0, 1} values. For BNNs and TNNs, MAC operations are replaced with the simpler XNOR or Gated-XNOR (GXNOR) logic operations, respectively. The memory footprint of the quantized network is dramatically reduced (for example, for ResNet50 with ternary weights and activations, the memory capacity is halved during training and reduced by a factor of 16 during inference).
This section describes the GXNOR framework [6] that constrains the weights and activations to the quantized space while training the QNN. We focus on the differences between the GXNOR training algorithm and regular DNN training.

Quantized Weights and Activations
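The quantized space $Z_N$ is defined by
$$Z_N = \left\{ z_N^n \;\middle|\; z_N^n = \frac{n}{2^{N-1}} - 1,\; n = 0, 1, \ldots, 2^N \right\}, \tag{6}$$
where $N$ is a non-negative integer that defines the space values. For example, the binary space is given for $N = 0$ and the ternary space for $N = 1$. The quantized space resolution, i.e., the distance between two adjacent states, is given by
$$\Delta z_N = \frac{1}{2^{N-1}}. \tag{7}$$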
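To make the definition concrete, the following sketch enumerates the quantized space for a given N. It assumes the GXNOR-style definition in which the states are evenly spaced in [−1, 1] with resolution 1/2^(N−1); the function name `quantized_space` is hypothetical.

```python
from fractions import Fraction

def quantized_space(N):
    """Return (states, resolution) of the quantized space Z_N.

    Assumes states z = n * dz - 1 for n = 0..2^N, with dz = 1 / 2^(N-1).
    Fractions keep the arithmetic exact before converting to float.
    """
    dz = Fraction(2, 2 ** N)  # equals 1 / 2^(N-1)
    states = [float(n * dz - 1) for n in range(2 ** N + 1)]
    return states, float(dz)

binary_states, binary_dz = quantized_space(0)    # binary space
ternary_states, ternary_dz = quantized_space(1)  # ternary space
```

Under this assumption, N = 0 gives the binary space with resolution 2 and N = 1 gives the ternary space with resolution 1, matching the values quoted in the text.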

Feedforward and Backpropagation
In QNNs, the quantized activation function is a step function, where the number of steps is defined by the space. To support backpropagation through the quantized activations (ϕr), the derivative of the activation function is approximated. In this work, the ideal derivative is approximated by a sum of window functions. Each window function is parameterized by two positive hyperparameters, r and a, which define the sparsity of the neurons (i.e., the quantization range) and the window function width, respectively. Using the approximated derivative, the backpropagation of the GXNOR training algorithm is computed with no further changes compared to regular DNNs.

Weight Update
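To support training with weights constrained to the discrete weight space (DWS), the GXNOR algorithm uses a stochastic gradient-based method to update the weights. First, the update value is computed by an optimization algorithm (e.g., SGD, ADAM, RMSprop). Then, a boundary function is defined to guarantee that the updated value will not exceed the [−1, 1] range. The boundary function is
$$\varrho(\Delta W^{l}_{ij}(k)) = \begin{cases} \min\left(1 - W^{l}_{ij}(k),\; \Delta W^{l}_{ij}(k)\right), & \Delta W^{l}_{ij}(k) \geq 0 \\ \max\left(-1 - W^{l}_{ij}(k),\; \Delta W^{l}_{ij}(k)\right), & \text{otherwise,} \end{cases} \tag{9}$$
where $W^{l}_{ij} \in Z_N$ is the synaptic weight between neuron $j$ and neuron $i$ of the following layer $(l + 1)$, $\Delta W^{l}_{ij} \in \mathbb{R}$ is the gradient-based update value, and $k$ is the update iteration. Then, the update step is given by
$$W^{l}_{ij}(k+1) = W^{l}_{ij}(k) + \Delta w^{l}_{ij}(k), \tag{10}$$
where $\Delta w^{l}_{ij}(k) = P(\varrho(\Delta W^{l}_{ij}(k))) \in \mathbb{Z}$ is the discrete update value, obtained by projecting $\varrho(\Delta W^{l}_{ij}(k))$ to a quantized weight space. $P(\cdot)$ is a probabilistic projection function defined by
$$P(\varrho) = \begin{cases} \kappa_{ij}\Delta z_N + \mathrm{sign}(\varrho)\,\Delta z_N, & \text{w.p. } \eta(\nu_{ij}) \\ \kappa_{ij}\Delta z_N, & \text{w.p. } 1 - \eta(\nu_{ij}), \end{cases} \tag{11}$$
where $\kappa_{ij}$ and $\nu_{ij}$ are, respectively, the quotient and remainder values of $\varrho(\Delta W^{l}_{ij}(k))$ divided by $\Delta z_N$, and
$$\eta(\nu) = \tanh\left( m \cdot \frac{|\nu|}{\Delta z_N} \right), \tag{12}$$
where $m$ is a positive hyperparameter. Hence,
$$\Delta w^{l}_{ij} = \kappa_{ij}\Delta z_N + \mathrm{sign}(\nu_{ij})\,\mathrm{Bern}(\eta(\nu_{ij}))\,\Delta z_N, \tag{13}$$
where $\mathrm{Bern}(\eta(\nu_{ij}))$ is a Bernoulli random variable with parameter $\eta(\nu_{ij})$. In this paper, which focuses on TNN, the ternary weight space (TWS) is given by $N = 1$ and $\Delta z_1 = 1$. Figure 1 illustrates examples of TNN weight updates for $W = −1$ and $W = 0$. Further discussion of the BNN implementation is found in the supplementary material.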
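The stochastic projection can be sketched in software as follows. This is a minimal sketch under stated assumptions: the boundary function clips the update so the weight stays in [−1, 1], the quotient κ is taken with truncation toward zero so that κ and ν carry the sign of the update, and the transition probability is η(ν) = tanh(m·|ν|/Δz). The function names and the value m = 3 are hypothetical.

```python
import math
import random

def boundary(W, dW):
    """Clip the real-valued update so that W + update stays within [-1, 1]."""
    return min(1.0 - W, dW) if dW >= 0 else max(-1.0 - W, dW)

def eta(nu, dz, m):
    """Assumed transition probability: tanh(m * |nu| / dz)."""
    return math.tanh(m * abs(nu) / dz)

def gxnor_update(W, dW, dz=1.0, m=3.0, rng=random):
    """Return the new discrete weight after one stochastic update."""
    rho = boundary(W, dW)
    kappa = math.trunc(rho / dz)      # signed quotient (assumption)
    nu = rho - kappa * dz             # signed remainder
    sign_nu = (nu > 0) - (nu < 0)
    # Bernoulli draw with parameter eta(nu); eta(0) = 0 never fires.
    bern = 1 if rng.random() < eta(nu, dz, m) else 0
    return W + (kappa + sign_nu * bern) * dz

# Ternary example: W = -1 with update 1.5 moves to 0 or to 1.
rng = random.Random(0)
samples = [gxnor_update(-1.0, 1.5, rng=rng) for _ in range(1000)]
```

For the ternary case (Δz = 1), an update of 1.5 applied to W = −1 always advances by κ = 1 and advances a second step with probability η(0.5), so the weight never leaves the ternary space.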
Dedicated hardware for TNN and BNN can fully exploit the potential of these networks. In this paper, we propose to use emerging memory technology, i.e., STT-MTJ, to support PNM of TNNs and BNNs.

Magnetic Tunnel Junction
An MTJ device comprises two ferromagnetic layers, a fixed magnetization layer and a free magnetization layer, separated by an insulator layer, as illustrated in Figure 2. The resistance of the device is defined by the relative magnetization of the free layer as compared to the fixed layer. A parallel magnetization state (P) leads to low resistance ($R_{on}$) and an anti-parallel state (AP) leads to high resistance ($R_{off}$). The device resistance can be switched by the current flow through the device. When the current flows from the free layer to the fixed layer, the resistance may switch to $R_{on}$. Likewise, when the current flows from the fixed layer to the free layer, the resistance may switch to $R_{off}$. The switching probability of the MTJ device depends on the current's magnitude, and three operating regimes are defined: 1) low current, low switching probability, 2) intermediate current, and 3) high current, high switching probability [11]. As we are interested in fast switching time, this work focuses on the high-current regime. Therefore, the current $I$ is substantially higher than the critical current $I_{c0}$, which is given by
$$I_{c0} = \frac{2|e|}{\hbar} \, \frac{\alpha V (1 \pm P)}{P} \, \frac{\mu_0 M_s M_{eff}}{2}, \tag{14}$$
where $\alpha$, $M_s$, $V$, $P$, and $M_{eff}$ are, respectively, the Gilbert damping, the saturation magnetization, the free layer volume, the spin polarization of the current, and the effective magnetization [18]. The switching time is, therefore,
$$\tau = \frac{2}{\alpha \gamma \mu_0 M_s} \, \frac{I_{c0}}{I - I_{c0}} \, \log\left( \frac{\pi}{2\theta} \right), \tag{15}$$
where $\gamma$ is the gyromagnetic ratio, and $\theta$ is the initial magnetization angle [18], given by a normal distribution $\theta \sim N(0, \theta_0)$, $\theta_0 = \sqrt{k_B T / (\mu_0 H_k M_s V)}$, where $H_k$, $k_B$, and $T$ are the shape anisotropy field, the Boltzmann constant, and the temperature, respectively. In this work, we use current pulses with varying time intervals to control the MTJ switching probability and to support the stochastic weight update given by (13).

Stochastic In-Situ Training
The training scheme suggested in [6] reduces the memory footprint of the training phase. Still, every update iteration (Eq. (10)) requires reading the weights, computing the stochastic update step, and writing the new weight value. Computing the stochastic update requires a random number generator (RNG). Adding an RNG increases the complexity of the design: the random numbers must be transferred to all the weights, and the RNG circuit adds area overhead and power consumption. For example, assuming 128 weights are read in 100ns [17], generating 128 8-bit random numbers at this rate requires 64 PRNG circuits [19]. We suggest replacing the RNG functionality with the stochastic write operation of the MTJ device. Our approach replaces the read-PRNG-write loop with a single stochastic-write operation. Moreover, working with the MTJ in a stochastic write regime allows us to work with shorter write intervals. Other emerging memory technologies also have stochastic write models that might be a good fit to the expression in (12). Nevertheless, training requires numerous write operations; for example, one network training with 1000 training epochs requires $5 \cdot 10^{7}$ and $10^{8}$ writes per device for the CIFAR-10 and ImageNet datasets, respectively. Thus, an STT-MTJ device, which has high reported endurance, is a leading candidate for stochastic in-situ training [20].

MTJ-Based Ternary Synapses
We now describe the proposed ternary synapse circuit that supports stochastic GXNOR training. In the supplementary, we explain how the proposed synapse can support binary weights as well.

Training TNN Using an MTJ-Based Synapse
First, we describe how we leverage the stochastic switching behavior of the MTJ device to perform the stochastic update function. Two MTJ devices are needed to represent a ternary weight, where the weight is defined and stored as the combination of the resistances of the two MTJs. Table 1 lists the different values of the synapse weight as a function of the MTJs' resistances. To support the stochastic weight update, both MTJ devices might be switched during an update. To switch the state of the MTJ device, a voltage pulse $V_{up}$ is applied across the device for a time interval $\Delta t \in [0, T_{up}]$. For a fast update operation, the update is performed in the high-current regime guaranteed by $V_{up}$. The resultant current direction and the pulse time interval determine the switching probability. Using (15) and the voltage pulse, the switching probability of the MTJ is
$$P_{sw} = P(\Delta t > \tau) = 1 - \mathrm{erf}\left( \frac{\pi}{2\sqrt{2}\,\theta_0} \exp\left( -\frac{\Delta t}{C} \left( \frac{V_{up}}{R} - I_{c0} \right) \right) \right), \tag{16}$$
where $C = \frac{2 I_{c0}}{\alpha \gamma \mu_0 M_s}$, and $R$ is the device resistance. As indicated in Eq. (16), $P_{sw}$ is a function of the voltage pulse amplitude and time interval. Therefore, $T_{up}$ is set to guarantee that if $\Delta t = T_{up}$, then $P_{sw} \approx 1$. Moreover, $P_{sw}$ is a function of the direction of the current that flows through the MTJ and of the state of the MTJ device.
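The dependence of the switching probability on the pulse duration can be explored numerically. The sketch below is illustrative only: it assumes a high-current switching-time model of the form τ = C/(I − I_c0)·ln(π/(2|θ|)) with the initial angle θ drawn from a zero-mean normal distribution of standard deviation θ0, which yields a closed form P(τ < Δt) = 1 − erf(π/(2√2·θ0)·exp(−Δt(I − I_c0)/C)). All parameter values are dimensionless placeholders, not the device parameters of this work.

```python
import math
import random

def switching_probability(dt, current, i_c0, c, theta0):
    """Closed-form P(tau < dt) under the assumed model, theta ~ N(0, theta0)."""
    # Switching happens when |theta| exceeds this angle threshold.
    threshold = (math.pi / 2.0) * math.exp(-dt * (current - i_c0) / c)
    return 1.0 - math.erf(threshold / (math.sqrt(2.0) * theta0))

def monte_carlo_probability(dt, current, i_c0, c, theta0, trials, seed=0):
    """Estimate the same probability by sampling the initial angle."""
    rng = random.Random(seed)
    switched = 0
    for _ in range(trials):
        theta = abs(rng.gauss(0.0, theta0))
        if theta == 0.0:
            continue  # infinite switching time: counts as no switch
        tau = c / (current - i_c0) * math.log(math.pi / (2.0 * theta))
        if tau < dt:
            switched += 1
    return switched / trials

# Hypothetical normalized parameters: I = 3 * I_c0, C = 1, theta0 = 0.1.
params = dict(current=3.0, i_c0=1.0, c=1.0, theta0=0.1)
p_closed = switching_probability(1.574, **params)
p_mc = monte_carlo_probability(1.574, trials=20000, **params)
```

Under these placeholder values the probability rises monotonically from nearly 0 for very short pulses to nearly 1 for long pulses, which is the property the update scheme relies on when it maps the fractional part of the update to a pulse width.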
To better understand the update operation of a single synapse, in this section we consider the simplified synapse illustrated in Figure 2. Each MTJ update is independent; this is guaranteed by applying different voltage pulses $V_1$, $V_2$ on each synapse and connecting the node between the MTJs to the ground. In this manner, each MTJ is updated according to (16). To support the GXNOR update, we need to control the switching probability of each MTJ device according to the update value ($\Delta W$) and the synapse weight. To this end, we (i) define $V_{app} = V_1 − V_2$, (ii) enforce opposite polarities of $V_1$ and $V_2$ (i.e., $\mathrm{sign}(V_1) = −\mathrm{sign}(V_2)$), and (iii) set $\mathrm{sign}(V_{app}) = \mathrm{sign}(\Delta W)$. As listed in Table 1, each weight has four possible values, $w \in \{−1, 0_s, 0_w, 1\}$.
Following this work scheme, if ∆W > 0, the current directions guarantee that only a synapse with weight W = −1 or W = 0s can switch. Similarly, if ∆W < 0, the current directions guarantee that only a synapse with weight W = 1 or W = 0w can switch.
Next, we need to ensure that the switching probability will follow Eq. (13) and will be a function of the update value $\Delta w = \kappa + \mathrm{sign}(\nu)\,\mathrm{Bern}(\eta(\nu))$. As indicated by Eq. (16), the pulse duration $\Delta t$ sets the switching probability. To support (13), we set the pulse durations of $V_1$ and $V_2$ to be functions of $\kappa$ and $\nu$, where $\kappa \in \{0, 1, 2\}$ and $\nu \in [0, 1]$. If $\Delta W > 0$, the pulse duration of $V_1$ is set by $\kappa$ and that of $V_2$ by $\nu$. Hence, $\Delta t_{V_1} = \mathbb{1}_{\kappa \neq 0}\, T_{up}$ and $\Delta t_{V_2} = |\nu|\, T_{up}$; if $\Delta W < 0$, the roles of $V_1$ and $V_2$ are reversed. Following this methodology, at each weight update, one MTJ is updated as a function of $\kappa$, while the other is updated as a function of $\nu$, depending on the sign of $\Delta W$. Thus, if $\kappa \neq 0$, the MTJ switching probability is approximately 1, and the switching probability is given by the indicator variable $P_{sw,\kappa} = \mathbb{1}_{\kappa \neq 0}$. Since $\nu$ is a fraction, the switching probability of the other MTJ with respect to $\nu$ is a Bernoulli variable with probability $P_{sw,\nu} = P(\nu T_{up})$. Therefore, the MTJ-based synapse update is given by
$$\Delta w = \mathrm{sign}(\Delta W)\,(P_{sw,\kappa} + P_{sw,\nu}) = \mathrm{sign}(\Delta W)\left( \mathbb{1}_{\kappa \neq 0} + \mathrm{Bern}(P(\nu T_{up})) \right); \tag{17}$$
see examples in Section 4.4. The MTJ-based synapse update differs from the ideal GXNOR update in that it supports two zero states, and uses similar, but not identical, switching probabilities ($P_{sw} \approx \eta$).
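A behavioral model of the update in Eq. (17) can be written as a short function. This sketch only models the update value; it does not model the two-MTJ state or the boundary behavior of the circuit. The device switching curve is passed in as a callable `p_sw`, and the linear placeholder curve used in the example is purely hypothetical.

```python
import random

def mtj_synapse_update(dW, p_sw, t_up=1.0, rng=random):
    """Sample the discrete update sign(dW) * (1[kappa != 0] + Bern(P(|nu| * T_up))).

    p_sw maps a pulse duration to a switching probability (assumed callable).
    """
    sign = (dW > 0) - (dW < 0)
    kappa = int(abs(dW))            # integer part of |dW|
    nu = abs(dW) - kappa            # fractional part of |dW|
    full_step = 1 if kappa != 0 else 0                        # pulse of T_up
    frac_step = 1 if rng.random() < p_sw(nu * t_up) else 0    # pulse of |nu| * T_up
    return sign * (full_step + frac_step)

# Placeholder switching curve (hypothetical): probability grows linearly with time.
def linear_p(dt):
    return min(1.0, max(0.0, dt))

rng = random.Random(0)
draws = [mtj_synapse_update(1.5, linear_p, rng=rng) for _ in range(1000)]
```

For ΔW = 1.5 the model always takes the full step (κ = 1) and adds a second step with probability P(0.5·T_up), so every draw is either 1 or 2; a real device curve would replace `linear_p`.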

Proposed Synapse Circuit and Synapse Array
Synapse Circuit
A schematic of the proposed ternary synapse is shown in Figure 3a. The ternary synapse is composed of two MTJ devices connected via their fixed layer port. The free layer port of each MTJ is connected to two access transistors. This synapse is inspired by previous work [4,21], but we replace the RRAM with the MTJ device, and two synapse structures are added together to support the ternary weight. In contrast to [4,21], which support full-precision analog weight values, the MTJ-based synapse supports quantized weights and stochastic weight updates. Sections 4.3 and 4.5 describe how our design is optimized to support quantized weights.

Synapse Array
The synapse circuit shown in Figure 3a is the basic cell of an array structure, as shown in Figure 3b. The synapses are arranged in an M × N array, where

Stochastic Weight Update
We now explain how the synapse circuit is designed and how the input and control signals are set to support the GXNOR stochastic update scheme. Unlike weight updates in standard DNN, the proposed synapse supports the quantized update scheme suggested in [6].

Weight Update Step
Similar to [4,22], four transistors are added to support parallel synapse updates. The control signals of these transistors dictate the weight update functionality by controlling the current direction and the voltage pulse time interval in the synapse array. The update step can be performed in parallel for all the synapses in the same array, depending on the optimization algorithm used. Since the GXNOR algorithm can use any optimization algorithm to compute the gradient-based update value (Section 2.3.3), we consider two update cases: (i) supporting general optimization algorithms, such as ADAM, and (ii) supporting the SGD algorithm. Table 2 summarizes the circuit level signals as a function of the GXNOR variables.

Support of General Optimization Algorithms
To support general optimization algorithms, the update value $\Delta W$ is computed in a peripheral circuit to the synapse array; thus, $\Delta W$ is given as an input to the array. The array columns are updated sequentially, i.e., a single column is updated per iteration. During this operation, the input voltages are set to $u_1 = u_2 = V_{up} > 0$ in the active column and $u_1 = u_2 = −V_{dd}$ for the rest of the columns, and the output row interface connects the rows to ground. To support the stochastic update (Section 4.1), the control signals are set as a function of $\Delta W$, $\nu$, and $\kappa$, which are as defined in Section 2.3 (see Table 2). Hence, the MTJ is updated proportionally to $\kappa = \lfloor |\Delta W| \rfloor$ and $\nu = \mathrm{remainder}(\Delta W / \Delta z_1)$, meaning that for a single synapse, one MTJ is updated using a pulse width of $\Delta t = \mathbb{1}_{|\kappa| > 0}\, T_{up}$, while the other is updated using a pulse width of $\Delta t = |\nu|\, T_{up}$. We assume that $\kappa$ and $\nu$ are inputs to the synapse array.

Support of Stochastic Gradient Descent
This update scheme is similar to the update scheme proposed in [4]. When the SGD algorithm is used to train the network, all the synapses in the array are updated in parallel. Therefore, in this section, we denote the array row and column indexes by $i$ and $j$, respectively. To support SGD training, minor changes need to be made to the general update scheme. Using SGD, the update is given by the gradient value, and is equal to $\Delta W = x y^{T}$, where $y$ is the error propagated back to the layer, obtained using the backpropagation algorithm, and $x$ is the input.
The functionality of the control signals remains unchanged compared to the general update scheme, except that the voltage source is selected according to y, and the voltage sign and the effective update duration are set as a function of the integer κ and the fraction ν values of y, respectively. Therefore, the update equation is given by where sign(yi)sign(uj) = sign(∆Wij).

Ternary Synapse Update Examples
Example 1: W = −1 and positive update value
Figure 4a shows the case where a synapse weight is $W = −1$ and the update value is $\Delta W = 1.5$. Thus, $\kappa = 1$ and $\nu = 0.5$. Hence, $e_{1,p} = −e_{2,p} = −V_{dd}$; therefore, P1 is ON and P2 is OFF for the time interval $T_{up}$. Likewise, $e_{1,n} = −V_{dd}$ for $T_{up}$ and $e_{2,n}$ is ON for $0.5T_{up}$. Hence, the probabilities of $R_1$ and $R_2$ switching are $P_{sw,1} \approx 1$ and $P_{sw,2} = P(0.5T_{up})$, respectively. In this example, the synapse weight will be updated from $W = −1$ to $W = 0$ with a probability of $P_{−1 \to 0} = P_{sw,1}(1 − P_{sw,2}) \approx 1 − P_{sw,2}$ and can switch to 1 with a probability of $P_{−1 \to 1} = P_{sw,1} P_{sw,2} \approx P_{sw,2}$.
Note that when $W = −1$, $\{R_1, R_2\} = \{R_{off}, R_{on}\}$. Thus, if $\Delta W < 0$, due to the current that flows from $R_2$ to $R_1$, neither MTJ can switch and the state will remain unchanged.

Example 2: W = 0 and negative update value
Figure 4b shows the case where a synapse weight is $W = 0_w$ and the update value is $\Delta W = −0.5$. Thus, $\kappa = 0$ and $\nu = −0.5$. Consequently, $e_{1,p} = e_{2,p} = V_{dd}$, so both P1 and P2 are closed for $T_{up}$. Likewise, $e_{2,n} = −V_{dd}$ for $T_{up}$ and the transistor connected to $e_{1,n}$ is open for $0.5T_{up}$. Therefore, only $R_1$ can switch, with a probability of $P_{sw,1} = P(0.5T_{up})$. In this example, the synapse weight is updated from $W = 0_w$ to $W = −1$ with a probability $P = P_{sw,1}$.
Although, theoretically, no current should flow through $R_2$, it might switch from $R_{on}$ to $R_{off}$ due to leakage currents with probability $P_{sw,2} \ll 1$.

Feedforward and Backpropagation
TNN training requires the circuits to support the feedforward and backpropagation stages [21]. The feedforward stage requires computing a matrix-vector or matrix-matrix multiplication; in TNNs, the multiplication is replaced with the gated XNOR (GXNOR) operation. The GXNOR logic outputs zero if one of the inputs is zero; otherwise, it outputs the XNOR operation between the inputs. In this section, we first explain how our synapse performs the GXNOR operation and then how the synapse array is used to perform the near-memory quantized matrix-vector multiplication. When supporting training, the feedforward stage is followed by computation of the error of each layer, known as the backpropagation stage. This stage requires computation of the matrix-vector multiplication between the transposed weight matrix ($W^{T}$) and the layer error vector ($y$). In the following sections, we denote the matrix-vector multiplication $W^{T} y$ as "backpropagation".

Gated XNOR
To perform the GXNOR logic operation between the synapse and activation values [6], we represent the input neuron values as voltage sources. Accordingly, $u = V(a)$ is the voltage representing input value $a$. The logic values of the input neuron $a \in \{−1, 0, 1\}$ are represented by $u \in \{−V_{rd}, 0, V_{rd}\}$. $V_{rd}$ is set to guarantee the low-current regime of an MTJ, so the switching probability is negligible. During this operation, $u_1 = u$, $u_2 = −u$, $\{e_{1,n}, e_{1,p}, e_{2,n}, e_{2,p}\} = \{−V_{dd}, −V_{dd}, V_{dd}, V_{dd}\}$, and the synapse output node is grounded. The result is given by the sign of the output current,
$$I_{out} = (G_1 − G_2)\, u,$$
where $G_{\{1,2\}}$ is the conductance of each MTJ device. As listed in Table 1, the polarity of $I_{out}$ depends on the input voltage and the synapse weight. If $u = 0$ or $W \in \{0_w, 0_s\}$, the output current is $I_{out} \approx 0$. If the weight and input have the same polarity, then $\mathrm{sign}(I_{out}) = 1$; otherwise, $\mathrm{sign}(I_{out}) = −1$.
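The relation between the GXNOR logic and the sign of the synapse output current can be checked with an idealized model. In this sketch the output current is taken as the conductance difference of the two MTJs times the input voltage. The resistance values are hypothetical, and the weight-to-resistance mapping is an assumption: W = −1 is stored as {R_off, R_on} as stated in the text, W = +1 is assumed to be the mirrored pair, and both zero states are assumed to use two equal resistances.

```python
R_ON = 1500.0   # hypothetical low resistance (ohms)
R_OFF = 2500.0  # hypothetical high resistance (ohms)
V_RD = 0.1      # read voltage (volts)

# Assumed mapping from stored weight to the resistances (R1, R2).
WEIGHT_TO_RESISTANCES = {
    1: (R_ON, R_OFF),
    -1: (R_OFF, R_ON),
    "0a": (R_ON, R_ON),    # one zero state (assumed equal resistances)
    "0b": (R_OFF, R_OFF),  # the other zero state (assumed equal resistances)
}

def gxnor(w, a):
    """Gated XNOR on ternary values: zero if either input is zero."""
    if w == 0 or a == 0:
        return 0
    return 1 if w == a else -1

def synapse_current(weight_key, a):
    """Ideal output current for u1 = u, u2 = -u and a grounded output node."""
    r1, r2 = WEIGHT_TO_RESISTANCES[weight_key]
    u = a * V_RD
    return (1.0 / r1 - 1.0 / r2) * u

def sign(v):
    return (v > 0) - (v < 0)

# Compare the current sign with the logic value for every input combination.
results = []
for key in WEIGHT_TO_RESISTANCES:
    w = key if key in (1, -1) else 0
    for a in (-1, 0, 1):
        results.append((sign(synapse_current(key, a)), gxnor(w, a)))
```

In this ideal model the zero states give exactly zero current; in the circuit, as the text notes, the current is only approximately zero.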

Feedforward
The quantized feedforward operation is given by
$$O_m = \sum_{n=1}^{N} \mathrm{GXNOR}(w_{mn}, a_n),$$
where $O_m$ is the result of row $m$. During this operation, each column voltage is mapped to the corresponding input activation ($u_n = V(a_n), \forall n \in [1, N]$), and the output row interface connects the rows to ground. Thus, each synapse computes the GXNOR operation between its input and stored weight, and the output currents from all synapses are summed based on KCL. Thus, the current through row $i$ is
$$I_{row,i} = \sum_{j=1}^{N} \left( G_{ij,R_1} − G_{ij,R_2} \right) u_j \approx \left( N_{+1,i} − N_{−1,i} \right) \left( \frac{1}{R_{on}} − \frac{1}{R_{off}} \right) V_{rd},$$
where $G_{ij,R_{\{1,2\}}}$ is the conductivity of each MTJ, $N$ is the number of synapses per row, $N_{+1,i}$ is the total number of positive products in row $i$, and $N_{−1,i}$ is the total number of negative products in row $i$.
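The KCL summation along a row can be illustrated with the same idealized current model: every synapse contributes a current proportional to its conductance difference times its column voltage, so the row current is proportional to the number of positive products minus the number of negative products. The resistance values and the weight-to-conductance mapping below are assumptions made for the example.

```python
import math

R_ON, R_OFF = 1500.0, 2500.0  # hypothetical resistances (ohms)
V_RD = 0.1                    # read voltage (volts)

def conductances(w):
    """Assumed (G1, G2) for a ternary weight; zero is two equal devices."""
    if w == 1:
        return 1.0 / R_ON, 1.0 / R_OFF
    if w == -1:
        return 1.0 / R_OFF, 1.0 / R_ON
    return 1.0 / R_ON, 1.0 / R_ON

def row_current(weights, activations):
    """Sum the synapse currents of one row (KCL at the grounded row line)."""
    total = 0.0
    for w, a in zip(weights, activations):
        g1, g2 = conductances(w)
        total += (g1 - g2) * a * V_RD
    return total

def count_products(weights, activations):
    """Return (number of +1 products, number of -1 products) in the row."""
    products = [w * a for w, a in zip(weights, activations)]
    return products.count(1), products.count(-1)

weights = [1, -1, 0, 1, -1]
activations = [1, 1, 1, -1, 0]
i_row = row_current(weights, activations)
n_pos, n_neg = count_products(weights, activations)
expected = (n_pos - n_neg) * (1.0 / R_ON - 1.0 / R_OFF) * V_RD
```

Here the row holds one positive and two negative products, so the row current is negative and its sign gives the quantized output.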

Backpropagation
To train the TNN, backpropagation of the error must be performed. Therefore, the synapse array also supports the matrix-vector multiplication W T y.
Rather than storing W T in a dedicated array, we reuse the same array, which stores W , similarly to [4]. During this operation, the output row interface is used as an input and the output is given by the current measured in the columns. Due to the synapse structure, the current is separated into two columns, as shown in Figure 3c. Therefore, the operation result is given by the current difference where yi is the layer's error. The current through each column pair is converted to voltage, and the result is computed using a voltage comparator.

Evaluation and Design Considerations
This section presents the performance evaluation of the MTJ-based QNN training. The functionality, power and area of the synapse circuit and array were evaluated in Cadence Virtuoso and used for the training simulations. The MTJ-based design of the GXNOR algorithm (MTJ-GXNOR) is compared to a software implementation of the algorithm (GXNOR in our terminology).

Methodology
Our proposed circuit is a hardware implementation of the GXNOR framework used to train QNNs [6]. We evaluate our design using four metrics: (i) Circuit Evaluation (Section 5.2). We validated the circuit operations needed to support the MTJ-GXNOR framework. Our circuit needs to support a stochastic update, the GXNOR, and backpropagation operations. The MTJ switching operation was evaluated by running Monte Carlo simulation of the MTJ transient response. We evaluated the GXNOR and backpropagation operations using SPICE simulations. (ii) Training Simulation (Section 5.3). To validate that our proposed synapse can be used to train QNNs and reach comparable results to the GXNOR algorithms and other state-of-the-art QNN frameworks [6,7,8], we simulated MTJ-GXNOR training in PyTorch with the hardware circuit parameters extracted from the circuit evaluation. (iii) MTJ-GXNOR Sensitivity to Process Variation (Section 5.4). The MTJ-GXNOR training performance is influenced by the device variation and environmental changes. Hence, we evaluated the sensitivity of the MTJ-GXNOR test accuracy considering process and environmental variations. (iv) Hardware Performance Evaluation (Section 5.5). Our design can be integrated into different architectures, each leading to a different performance. Here, we report on the performance of our basic cells: the hardware synapse and the synapse array. We also consider a simple system test case comparable with previous solutions.

Circuit Evaluation
The synapse circuit was designed and evaluated in Cadence Virtuoso for the GlobalFoundries 28nm FD-SOI process. The MTJ device is based on device C from [18] and its parameters are listed in Table 3. To achieve higher switching probability, the magnetization saturation ($\mu_0 M_s$) was changed according to [23]. The read voltage, $V_{rd}$, was set to guarantee a low-current regime and negligible switching probability for the feedforward and backpropagation operations. Likewise, the update voltage, $V_{up}$, was set to guarantee a high-current regime. The update time period was set to match $P_{sw}(T_{up}) \approx 1$.

Circuit Schematic Model
The transistors and interconnect affect the circuit's functionality and performance. Therefore, we adopted a circuit model that considers the parasitic resistance and capacitance of wires and transistors. The model also considers the location of the synapse. An illustration of the schematic circuit model appears in the supplementary material (Section 2). Using the schematic circuit model and SPICE simulations, we evaluate the circuit array and operations. We considered the corner cases, i.e., the synapses located at the four corners of the synapse array, to evaluate the worst-case scenario for the effect of the wires and transistors on operation results, latency, and power consumption. For all circuit simulations, we considered the worst case, where the wire resistance and capacitance are the most significant, i.e., for an array of size M × N, the synapse located at [M,1].

MTJ Switching Simulation
To evaluate the transition in the resistance of the MTJ and its impact on the operation of the synapse, we performed a Monte Carlo simulation of the MTJ operation. The simulation numerically solves the Landau-Lifshitz-Gilbert (LLG) [24,11] differential equation (assuming the MTJ is a single magnetic domain) with the addition of a stochastic term for the thermal fluctuations [25] and Slonczewski's STT term [26]. For each iteration of the Monte Carlo simulation, a different random sequence was introduced to the LLG equation and the resulting MTJ resistance trace was retrieved. The equation was solved using a standard midpoint scheme [27] and was interpreted in the sense of Stratonovich, assuming no external magnetic field [28] and a voltage pulse waveform. The resistance of the MTJ was taken as $R = R_{on}\,\frac{1+P^2}{1+P^2\cos\theta}$ [29], where θ is the angle between the magnetization moments of the free and fixed layers and P is the spin polarization of the current. To approximate the time-varying resistance of an MTJ during the switch between states, all the traces from the Monte Carlo simulation were aligned using the first time that the resistance of the MTJ reached $\frac{R_{on}+R_{off}}{2}$. After the alignment, a mean trace was extracted and used for the fit. This fit was used as the time-varying resistance when the MTJ made a state switch.
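The resistance expression and the alignment step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code used in this work: the LLG solver is not included, each trace is assumed to be a list of resistance samples on a common time grid, the switch is assumed to go from low to high resistance, and all function names are hypothetical.

```python
import math

def mtj_resistance(theta, r_on, p):
    """Angle-dependent resistance R_on * (1 + P^2) / (1 + P^2 * cos(theta))."""
    return r_on * (1.0 + p * p) / (1.0 + p * p * math.cos(theta))

def first_crossing(trace, level):
    """Index of the first sample at or above `level`, or None if never reached.

    Assumption: the trace rises from R_on towards R_off.
    """
    for i, r in enumerate(trace):
        if r >= level:
            return i
    return None

def aligned_mean(traces, r_on, r_off, before, after):
    """Align traces at their first crossing of (R_on + R_off) / 2 and average.

    Only traces that cross the midpoint and have `before` samples ahead of
    the crossing and `after` samples from the crossing onward are used.
    """
    mid = 0.5 * (r_on + r_off)
    windows = []
    for trace in traces:
        k = first_crossing(trace, mid)
        if k is None or k < before or k + after > len(trace):
            continue
        windows.append(trace[k - before:k + after])
    if not windows:
        raise ValueError("no trace crossed the midpoint with enough margin")
    n = len(windows)
    return [sum(w[j] for w in windows) / n for j in range(before + after)]
```

With this expression, θ = 0 gives R = Ron and θ = π gives R = Ron(1+P²)/(1−P²), the high-resistance state. The mean trace returned by `aligned_mean` is what a subsequent curve fit would operate on.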

GXNOR Operation
The GXNOR operation for a single synapse is shown in Figure 5. When either the activation (input) or the weight (W) is zero, the output current is one order of magnitude lower than in the other cases.
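For reference, the ideal logical behavior that the synapse current approximates can be sketched as below. This is an illustrative model only. It assumes ternary operands encoded as -1, 0, and +1 and an ideal output equal to their product, as in the GXNOR framework [6]. The analog circuit produces a small nonzero current rather than an exact zero. The function names are hypothetical.

```python
def gxnor(activation, weight):
    """Ideal gated XNOR for ternary operands in {-1, 0, +1}."""
    for v in (activation, weight):
        if v not in (-1, 0, 1):
            raise ValueError("operands must be ternary: -1, 0, or +1")
    if activation == 0 or weight == 0:
        return 0  # gate closed: ideally no contribution to the sum
    return 1 if activation == weight else -1  # XNOR of the two signs

def row_sum(activations, weights):
    """Ideal accumulated output of one row of synapses."""
    return sum(gxnor(a, w) for a, w in zip(activations, weights))
```

For example, `row_sum([1, -1, 0, 1], [1, 1, 1, -1])` evaluates to 1 - 1 + 0 - 1 = -1.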

MTJ-GXNOR Training Simulation
To evaluate the training performance of our solution, we determined the test accuracy and compared it to the original GXNOR algorithm implemented in software and to other state-of-the-art frameworks. We denote our results as 'MTJ-GXNOR' and the ideal GXNOR algorithm as 'GXNOR'. We tested the quantized networks over the MNIST, SVHN and CIFAR10 datasets [14,15,16]. The following three quantization resolutions were simulated in PyTorch: (i) a full ternary network ('MTJ-GXNOR TNN'), (ii) a full binary network ('MTJ-GXNOR BNN'), and (iii) a network with ternary weights and binary activations ('MTJ-GXNOR Bin-Activation'). For the MNIST and SVHN datasets we trained the same convolutional neural networks described in [6]. The network architecture is "32CONV5-MP2-64CONV5-MP2-512FC" for the MNIST dataset, and "2×(128CONV3)-MP2-2×(256CONV3)-MP2-2×(512CONV3)-MP2-1024FC" for the SVHN dataset, where CONV, MP and FC are the convolution layer, maxpool layer and fully connected layer, respectively. For CIFAR10, we trained the VGG16 network [30]. We trained the networks using the ADAM optimization algorithm with batch sizes of 100 for MNIST, 1000 for SVHN and 750 for CIFAR10. Table 4 lists the test accuracy of MTJ-GXNOR compared to GXNOR and other state-of-the-art algorithms (ideal software implementations). For the MNIST and SVHN datasets our solution achieved accuracy similar to that of the state-of-the-art algorithms implemented in software. For CIFAR10, the MTJ-GXNOR reached accuracy comparable to that of GXNOR, but lower than that of the other algorithms. Nevertheless, considering the application and the hardware performance improvement, some accuracy degradation might be acceptable. Although the BNN [7] and BWN [8] frameworks achieve better results than the GXNOR BNN [6], they retain the full-precision weights during the training phase, which increases the frequency of memory accesses and requires support for full-precision arithmetic. Hence, their potential hardware implementation will be much less efficient than GXNOR. The MTJ-GXNOR TNN results are similar to the results of the GXNOR training.

Figure 5: During the GXNOR operation (read operation), Vrd is 0.1V to guarantee a low-current domain and low switching probability. For Vin = 0 and W = 0w/s, the output current is not zero. This is a source of error when the GXNOR results are summed to compute the activation value. Limiting the dimensions of the synapse array can mitigate this effect.

Table 4: Accuracy of State-of-the-art Algorithms

Training Performance Sensitivity to Process Variation
Device variation and environmental changes may affect the performance of the proposed circuits, including their training performance. In this section, we evaluate the sensitivity of the TNN training performance to process variation.

Resistance Variation and θ Distribution Variation
Two cases of process variation were considered: (i) resistance variation and (ii) variation in the distribution of θ. These variations may lead to a different switching probability for each MTJ device. To evaluate the sensitivity of the training to the device-to-device variation, we simulated the MNIST-architecture training with variations in the resistance and θ distributions. Several Gaussian variabilities were examined with different relative standard deviations (RSD), where the mean values are shown in Table 3. Table 5 lists the training accuracy for resistance variation and variation in θ0. Typically, the resistance RSD is approximately 5% [11], while our simulations show that the training accuracy is robust to resistance variation even for higher RSD values (e.g., only 0.46% accuracy degradation for RSD = 30%). The training accuracy is more sensitive to variations in θ0. Nevertheless, a high standard deviation of θ values resulted in better test accuracy. To further evaluate the dependency of the test accuracy on θ0, we simulated training for different θ0 values. Table 6 lists the training results. Larger θ0 values, which correspond to higher switching probability, resulted in better test accuracy. Thus, we conclude that the performance of the MTJ-GXNOR algorithm improves for higher switching probabilities, which correspond to larger θ0 values.
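Device-to-device Gaussian variation with a given RSD can be sketched as below. This is only an illustration of the sampling step, assuming each device parameter (a resistance or θ0) is drawn independently from a normal distribution centered on its nominal value. The function name and the nominal value in the usage note are hypothetical, and physically invalid samples (for example, negative resistance at very large RSD) are not truncated here.

```python
import random
import statistics

def sample_device_values(mean, rsd, n_devices, seed=None):
    """Draw one parameter value per device from N(mean, (rsd * |mean|)^2).

    `rsd` is the relative standard deviation, e.g. 0.05 for 5%.
    """
    rng = random.Random(seed)
    sigma = rsd * abs(mean)
    return [rng.gauss(mean, sigma) for _ in range(n_devices)]

def relative_std(values):
    """Sample standard deviation divided by the absolute sample mean."""
    return statistics.stdev(values) / abs(statistics.mean(values))
```

For instance, `sample_device_values(1500.0, 0.30, 127 * 127, seed=0)` would assign one resistance per synapse of a 127 × 127 array at RSD = 30% around a hypothetical nominal value of 1500 Ω. A training simulation would then derive each device's switching probability from its own sampled parameters.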

Sensitivity to Voltage Non-Ideality
Since the weight update probability is a function of the voltage drop across the MTJ device (Vup), it is sensitive to voltage source variation. Higher voltage leads to higher current. We tested training with Vup in the range of [0.5V, 2.5V]. Our results show that the test accuracy improves when increasing Vup. The voltage magnitude can, therefore, be used to control the stochastic switching process and to improve the network training performance when using an MTJ device with low θ0 variance. In our setup, this effect is bounded and diminishes once Vup exceeds 1.1V.

Sensitivity to Temperature
The MTJ dependency on temperature has several aspects. First, the switching behavior depends on the ambient temperature (15). For higher temperatures, the mean switching time τ is shorter [32]. Second, higher temperatures lower Roff. Ron, however, has a much weaker temperature dependency and is nearly constant [33]. The transistors are also influenced by the temperature. For high temperatures, the current drivability of the MOS transistors is degraded since the electron mobility is lower. Hence, ambient temperature affects the switching probability by lowering Roff and degrading the CMOS current drivability. Furthermore, the initial magnetization angle, θ, depends on the temperature through the normal distribution θ ∼ N(0, θ0), where the standard deviation is θ0 = kBT/(µ0HkMsV). Hence, θ0 increases for higher temperatures. As shown in Section 5.4.1 and Table 5, the training performance is influenced by the value of θ0. In this work, we do not include the effect of the transistors. Thus, to evaluate the sensitivity of the MTJ-GXNOR training to ambient temperature, we focused on θ0. We simulated MTJ-GXNOR training with different temperatures in the range [260K, 373K], with the associated resistance based on [32,33]. Table 7 lists the test accuracy obtained for different temperatures. Although better accuracy was obtained for higher temperatures, the improvement was less than 1%. The minor variations in accuracy imply that the test accuracy is largely insensitive to temperature in this range. Figure 6 shows the test accuracy over the training phase for the MNIST network. Higher temperatures increased the convergence rate of the network, while the network converged to similar test accuracy for all the temperatures in the examined range.

Performance Evaluation
QNNs aim to reduce the computation and memory-capacity requirements of DNNs; therefore, in this section, we evaluate the potential performance benefits of our solution. The overall performance is highly dependent on the exact system structure and functionality. For example, our solution can be integrated into a fully analog or digital architecture and can support different general optimization algorithms, including SGD, as in GXNOR [6]. Each configuration will produce different performance and should be compared to a similar configuration. First, we evaluate the performance of our circuit when it is not connected to the peripheral circuit. Then, we consider a simple test system to evaluate the potential of our solution with its associated peripheral circuits and supporting units.

Figure 6: Test accuracy during the training phase for temperature range [273K, 373K]. Increasing the temperature leads to higher θ0 variance, thereby increasing the randomness of the MTJ switching time. Therefore, higher temperature leads to faster convergence.

TNN Power and Area
The power consumption and area were evaluated for a single synapse and for different synapse arrays simulated in Cadence Virtuoso, including the interconnect parasitics. The results are given in Table 8. During the read operation, all the synapses are read in parallel, whereas during the write operation each column is updated separately. Therefore, the feedforward power is higher than the write power.

System Performance (Test Case)
To evaluate the performance of the synapse array when integrating our design in a full system, we consider the following setup, illustrated in Figure 7.
(i) The synapse array stores the ternary weights. The array can perform the GXNOR, backpropagation, and weight update operations, as described in Section 4.
(ii) A 127 × 127 synapse array. This size is broadly accepted as mitigating the parasitic effects on the circuit performance that is limited by the ADC resolution [17,3,34].
Since the size of a single DNN layer is larger than the array size, the layer will be divided between different arrays and the partial results are accumulated. To support multiple arrays per layer, the binary activation is done after accumulating the results from the arrays. Thus, this system requires conversion of the partial results (Iout) from each array to digital values using analog-to-digital converters (ADCs).
Figure 7: System configuration - test case. A 127 × 127 synapse array is connected to the peripheral circuits through the row and column interfaces. The FF, BP, and UP control signals configure the interfaces to support the feedforward, backpropagation and weight update, respectively. κ and ν are inputs to the array. The IsZero signal indicates if κ is zero.
(iii) For the feedforward, 1-bit DACs are used to support the inputs per column, and 8-bit ADCs are used to convert the row current to digital outputs. For a 127 × 127 array, the output of each row is an integer value in the range [−127, 127]; thus an 8-bit ADC is sufficient. Furthermore, due to the high energy consumption of the ADC, we use only eight ADCs, which are shared among the 127 output rows [17]. Accordingly, the overall latency to produce 127 sum results is 8ns.
(iv) For the backpropagation, we consider the bit-streaming method with 8-bit precision as suggested in [17]; thus, we used a 1-bit DAC in the row interface and an 8-bit ADC in the column outputs.
(v) To generate the control signals, an 8-bit DAC and voltage comparators are needed. To generate the sawtooth signal, we use the circuit from [22].
(vi) The system supports an in-situ SGD algorithm. Therefore, no additional circuit is needed to compute the update values. The columns are updated sequentially.
Table 9 lists the power of the additional peripheral circuits. The energy efficiency of each stage for this setup is listed and compared to previous works in Table 10. The power consumption of the data converters significantly limits the overall performance.
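The sizing arithmetic in the setup above (ADC resolution, ADC sharing, and accumulation of partial results across arrays) can be checked with a short sketch. This is an illustrative calculation only. The helper names are hypothetical, the ternary operands are assumed to be encoded as -1, 0, and +1, and the binary activation is assumed to be a sign threshold at zero.

```python
ARRAY_DIM = 127  # rows and columns of one synapse array in the test case

def adc_bits(n):
    """Bits needed to encode every integer in [-n, n], i.e. 2n + 1 levels."""
    return (2 * n).bit_length()

def conversion_rounds(n_outputs, n_adcs):
    """Sequential conversion rounds when n_adcs are shared by n_outputs."""
    return -(-n_outputs // n_adcs)  # ceiling division

def tiled_partial_sums(activations, weights, tile=ARRAY_DIM):
    """Split a long ternary dot product into per-array partial sums."""
    partials = []
    for start in range(0, len(weights), tile):
        a = activations[start:start + tile]
        w = weights[start:start + tile]
        partials.append(sum(x * y for x, y in zip(a, w)))
    return partials

def binary_activation(total):
    """Assumed sign activation applied after accumulating all partial sums."""
    return 1 if total >= 0 else -1
```

Here `adc_bits(127)` is 8, which matches the 8-bit ADC, and `conversion_rounds(127, 8)` is 16, the number of rounds needed when eight ADCs are shared among 127 rows. Each partial sum stays within [-127, 127], so it fits the ADC range, while the accumulated total may exceed it. This is why the activation is applied digitally after accumulation.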

Comparison to Previous Work
Most previous work on in-situ hardware implementations of BNNs and TNNs supports only inference. In [37], a CMOS-based computation-near-memory engine was designed and fabricated. The design's energy efficiency during inference is 532 TOPs/W. The authors assumed that the binary activation can be done immediately after the convolution, thereby eliminating the ADC. A similar assumption for our setup would increase the inference energy efficiency of our design to 180 TOPs/W. Supporting training in such an accelerator would require additional arithmetic units, lead to frequent memory accesses to fetch the next layer, and require larger memory capacity to store the intermediate results. BNN inference without the need for an ADC is also supported in [34], where an energy efficiency of 1326 TOPs/W was reported. In that work, an RRAM device was used instead of an MTJ. The RRAM-based synapse can use smaller access transistors than the MTJ-based device, which is current-driven. Moreover, a 1T1R synapse is sufficient when supporting only inference, thereby reducing the complexity and overall power consumption of each synapse. In [38], an MTJ-based memory that supports digital XNOR and XOR operations was simulated. By modifying the array drivers, the authors performed an XNOR or XOR operation between operands given to the write driver. The result is written into the memory cell and read by the sense amplifier. Thus, this solution requires three stages to perform the XNOR or XOR operation (preset, XNOR (write), and read) and can perform the operation on only a single row at a time. The MTJ was used only as a memory cell, and its stochastic behavior was not exploited.

Table 9 (partially recovered; each row gives component, quantity, and power as extracted):
... [35], 2 × 127, 5.47
DAC 1-bit [17], 2 × 127, 0.5
Voltage Comparator [36], 127, 0.455

Table note: FF, BP, and WU are acronyms for feedforward, backpropagation, and weight update, respectively. The delay and power cells are formatted as <read value>/<write value>.
Other works exploit the stochastic behavior of the MTJ device. In [39], the MTJ stochastic switching is exploited to design a stochastic neuron. In that work, the training is done off-line, and the weights can be stored in any memristive technology, while the neuron circuit includes an MTJ device. In [40], an STT-MTJ-based synapse is used to support BNN training. This solution works in the low-current regime; thus, the MTJ switching follows an exponential distribution. Although such a distribution is mathematically suitable for training QNNs, our simulations showed that working in the low-current regime requires long update periods (on the order of milliseconds). Our approach is to train in the high-current regime, so that the stochastic update occurs in a realistic time period. In [12], 1T1R and 1R structures were proposed, and the stochastic behavior of the MTJ was leveraged to support in-situ training of DNNs with binary weights and full-precision activation. Two update operations are required for the

Conclusions
In this paper, we demonstrated the potential of an MTJ-based synapse to support in-situ TNN and BNN stochastic training without sacrificing accuracy. The proposed circuit enables highly parallel and low-power execution of weight-related computation. We demonstrated its potential to achieve high energy efficiency in different DNN systems. To fulfill the potential of the MTJ-based synapse, the next step is to integrate it into a full system design.
The stochastic behavior of the MTJ can support different training algorithms. For example, while in this work we used MTJ stochastic switching to quantize the gradients, it can be used in algorithms that use stochastic quantization of the weights and activations. Moreover, other optimization algorithms, such as simulated annealing, might benefit from these properties. The high energy efficiency and the flexibility in functionality enable different algorithms and systems that can accelerate QNN inference and training on low-power devices such as IoT and consumer devices.