On-chip learning of a domain-wall-synapse-crossbar-array-based convolutional neural network

Domain-wall-synapse-based crossbar arrays have been shown to be very efficient, in terms of speed and energy consumption, in implementing fully connected neural-network algorithms for simple data-classification tasks, both in inference and on-chip-learning modes. But for more complex and realistic data-classification tasks, convolutional neural networks (CNNs) need to be trained through such crossbar arrays. In this paper, we carry out device–circuit–system co-design and co-simulation of on-chip learning of a CNN using a domain-wall-synapse-based crossbar array. For this purpose, we use a combination of micromagnetic-physics-based synapse-device modeling, SPICE simulation of a crossbar-array circuit using such synapse devices, and system-level coding in a high-level language. In our design, each synaptic weight of the convolutional kernel is 15 bits wide; one domain-wall-synapse crossbar array is dedicated to the five least significant bits (LSBs), and two crossbar arrays are dedicated to the other bits. The crossbar arrays accelerate the matrix-vector multiplication operation involved in the forward computation of the CNN. The synaptic weights of the LSB crossbar are updated after forward computation on every training sample, while the weights of the other crossbars are updated after forward computation on every ten samples, to achieve on-chip learning. We report high classification-accuracy numbers for different machine-learning data sets using our method. We also study how the classification accuracy of our designed CNN is affected by device-to-device variations, cycle-to-cycle variations, bit precision of the synaptic weights, and the frequency of weight updates.


Introduction
Neural-network algorithms are now routinely and widely used by the artificial-intelligence and machine-learning community for various tasks like data classification, recognition of objects in images, transcribing speech into text, showing news items and product advertisements on a user's feed based on the user's interests, etc. [1]. In a traditional computer, which follows the von Neumann architecture, the memory and computing units are physically separate, with a large part of the memory unit on a different chip from the computing/processing unit [2]. So, while running neural-network algorithms, a large amount of time and energy is spent shuffling data, like the weight parameters of the neural network, between the computing unit and the off-chip memory unit. This high latency of memory fetches is known in the literature as the von Neumann bottleneck [2,3]. The matrix-vector multiplication (MVM) operation, fundamental to all neural-network algorithms, requires frequent interaction between the memory and computing units and is hence subject to the von Neumann bottleneck.
In order to overcome this von Neumann bottleneck, a novel neuromorphic-computing paradigm (or in-memory-computing paradigm) has been proposed and implemented through crossbar arrays of non-volatile memory devices known as synaptic devices. In these analog crossbar arrays, memory and computing units are essentially intertwined, and hence they can implement the MVM operation much faster and with much more energy efficiency than traditional computing units [3][4][5][6][7][8][9][10][11].

Figure 1. (a) R_read is the resistance of the read path (MTJ structure) and I_read is the read current; R_write is the resistance of the write path (heavy-metal layer) and I_write is the write current. (b) Snapshots of the magnetic moments of the ferromagnetic free layer in the domain-wall device, obtained from micromagnetic simulation and depicting the motion of the domain wall upon application of a 'write' current of magnitude I. All snapshots are taken after a current pulse of magnitude I and duration 3 ns has been applied to the device from the same initial condition: the domain wall at the center of the device. In all snapshots, the color corresponds to the out-of-plane component of magnetization (m_z). Dark blue means the moments point into the plane, m_z = −1 (vertically downward in (a)); yellow means the moments point out of the plane, m_z = 1 (vertically upward in (a)). The region where the color changes from blue to yellow corresponds to the domain wall. The domain wall moves by distances in integral multiples of 17 nm for current-pulse magnitudes in integral multiples of 100 μA.
Among these non-volatile devices, the ferromagnetic domain-wall device (figure 1(a)), essentially a spintronic device, has been shown to provide a faster and more energy-efficient crossbar solution for on-chip learning (training of the neural network on the hardware itself), compared to other non-volatile memory devices like resistive random access memory (RRAM) devices and phase-change memory (PCM) devices [12][13][14][15][16][17][18]. This is because the domain-wall device has a much more linear and symmetric conductance response, or synaptic characteristic, than the RRAM or PCM device [16]. On-chip learning is known to provide several advantages for edge devices in terms of data security and the like [6,7]. This makes the domain-wall synapse an important device in the context of ongoing research on spintronics-based neuromorphic computing [19][20][21][22][23][24][25][26].
Thus far, simulation studies of on-chip learning on domain-wall-synapse-based crossbar arrays have been mostly restricted to non-spiking fully connected neural networks (FCNNs) and some simple spiking-neural-network architectures [14-17, 22, 23, 27-34]. An FCNN is good at solving simple data-classification and image-recognition problems and shows high classification accuracy for simpler data sets like Fisher's Iris set, the Wisconsin breast-cancer set, and MNIST images [1, 15-17, 35, 36].
But more realistic tasks, for real applications, make use of much more complex images (e.g. the CIFAR-10 data set [37]), videos, and speech. Convolutional neural networks (CNNs) have been employed regularly for such tasks in recent years with much success [1,3]. These CNNs have typically been implemented on conventional computing units like CPUs and GPUs. So, for higher-speed and lower-energy implementation, if crossbar arrays are to be used, then on-chip learning of a crossbar-array-based CNN needs to be explored.
However, the operations involved in a CNN are much more involved than those in an FCNN [1,3]. An FCNN uses a series of alternating MVM and non-linear activation-function operations and can be completely implemented in analog through crossbar arrays and transistor-based analog peripheral circuits, as shown in Bhowmik et al [15] and Kaushik et al [17]. A CNN, on the contrary, uses various stages of the convolution operation, which needs to be implemented as repeated MVM operations (more details in later sections), non-linear activation functions, pooling operations, and drop-out operations [1,36,38]. This makes it extremely challenging to implement a CNN on a purely analog system.
As a result, the implementation of inference (forward computation of a pre-trained network on neuromorphic/in-memory computing hardware) and on-chip learning (training on the neuromorphic/in-memory computing hardware itself) of CNNs has recently been proposed and simulated on analog-digital hybrid systems in [7,39,40]. Here, the MVM operation for convolution is carried out on crossbar arrays of synapse devices because the crossbar executes O(n²) multiplications and additions for an n × n matrix in just one computational step [39,40]. This leads to much higher computational speed compared to conventional computers, where such massive parallelism is absent. The rest of the operations in the CNN are carried out in the digital unit, and ADCs and DACs are employed to convert between the analog and digital signals [7,39,40].
The crossbar designs in [39,40] use memristive RRAM devices as synapses, while the crossbar design in [7] uses a combination of an RRAM device and a capacitor as a synapse. In this paper, however, we implement on-chip learning of a CNN on a domain-wall-synapse-based crossbar array instead (figure 4), since the domain-wall synapse is known to be a higher-speed and lower-energy alternative to the RRAM synapse [16]. But we use a similar design approach as in [7,39,40]: the MVM of the convolution operation happens on the crossbar array, and the rest of the operations on the digital unit (figures 4 and 5). We use a modified version of the thresholding algorithm proposed in [17] to achieve the training.
We next present the layout of the paper, and through that, we show how we carry out device-circuit-system co-design in the paper and how the different components of our co-design are interconnected. In section 2, we discuss our device-level modeling of the domain-wall synapse device on the micromagnetic-physics package 'mumax3' (figure 1) and show its conductance response to current pulses of different magnitudes, as appropriate for our CNN-training scheme (figure 3). In section 3, we discuss our CNN-training algorithm and how we have implemented our scheme on a crossbar array of domain-wall synapses (figure 4). We present our circuit-level SPICE simulation of the crossbar array; the conductance response obtained from the device-level simulation in section 2 is used inside the circuit simulation of the crossbar array as a Verilog A module (figures 4 and 5). In section 4, we present our training results of the CNN for different data sets, including the CIFAR-10 data set, and analyze our results (figure 7). Besides providing classification-accuracy numbers, we also comment on the speed and energy performance of the proposed CNN implementation. In section 6, we carry out a study of how the classification accuracy of our designed CNN is affected by device-to-device variations, cycle-to-cycle variations, bit precision of the synaptic weights, and the frequency of weight updates. We also discuss some recent experimental works connected to the domain-wall-synapse device and compare our simulation results with those experimental findings. Some peripheral circuits in our design are also provided. In section 7, we conclude the paper.

Operating principle of the device
The three-terminal domain-wall synapse device used here is based on a heavy-metal/ferromagnetic-metal hetero-structure, which exhibits perpendicular magnetic anisotropy (PMA), as shown in figure 1(a) [12,13,15,16,26,[41][42][43][44]. When an in-plane 'write' current flows through the heavy-metal layer (the 'write' path, from terminal T3 to T1 in figure 1(a)), the domain wall present in the ferromagnetic free layer above it moves due to the spin-orbit torque experienced by the magnetic moments of the free layer at the interface with the heavy metal. This changes the conductance of the device along the 'read' path (between terminals T2 and T1 in figure 1(a)) due to the tunneling magnetoresistance (TMR) effect in the magnetic-tunnel-junction structure of the 'read' path (ferromagnetic free layer/oxide/ferromagnetic fixed layer). More details on the operation of this device can be found in [15,16].

Micromagnetic simulation parameters for the device
The domain wall considered here is of Néel chirality because of the presence of Dzyaloshinskii-Moriya interaction (DMI) at the interface [41,42,45]. For such a Néel wall, it has been shown experimentally that in-plane current pulses can drive the domain wall even in the absence of an external magnetic field [15,16,41,42,[45][46][47]. For our synaptic device, we model a nanotrack based on the Pt (heavy metal)/CoFe (ferromagnetic metal)/MgO structure, with dimensions 1000 nm × 200 nm, on the micromagnetic simulation package 'mumax3' [48] (figure 1(a)). The heavy-metal layer is of 10 nm thickness, and the ferromagnetic-metal layer of 1 nm thickness. The parameters used in our micromagnetic simulation are the same as those used in [45], where the model is benchmarked against experimentally obtained data. Thus, in our simulation, saturation magnetization (M_s) = 7 × 10⁵ A m⁻¹, PMA constant (K) = 8 × 10⁵ J m⁻³, exchange-correlation constant (A) = 1 × 10⁻¹¹ J m⁻¹, DMI constant = 1.2 × 10⁻³ J m⁻², and damping factor = 0.3.
We consider the heavy metal here to be Pt, so we take the spin Hall angle to be 0.07, in accordance with experimentally reported values [49][50][51]. The thickness of the heavy-metal layer (10 nm) is higher than the spin-diffusion length (≈ 2-4 nm) [52,53]. Hence, the in-plane spin current density injected into the free layer by the heavy-metal layer (J_s) = in-plane charge current density (J_c) × spin Hall angle. The dynamics of the magnetic moments of the ferromagnetic layer, in which the domain wall moves, have been simulated through micromagnetics here, under the influence of this spin current density (J_s) [54,55]. Pt is chosen as the heavy metal because of its low resistivity; this reduces the energy consumption per 'write' current pulse for conductance change/weight update and hence reduces the overall energy consumption for on-chip learning on a crossbar array [17]. The antiferromagnetic layer present at either edge above the ferromagnetic layer pins the magnetic moments of the ferromagnetic layer there and prevents the domain wall from getting destroyed at the edge [12,13,56].

Micromagnetic simulation results
In our micromagnetic simulation of the domain-wall device on 'mumax3', we start from the domain wall initially being at the center of the device (figure 1(b)). The domain wall located at the extreme left edge of the device corresponds to the minimum conductance (G_min) of the MTJ, since the moments of the fixed and free layers are oriented anti-parallel to each other in this configuration. The domain wall located at the extreme right edge corresponds to the maximum conductance of the device (G_max), owing to the parallel orientation of the moments of the fixed and free layers in this configuration. A positive-polarity current pulse moves the domain wall toward the right edge of the device, increasing the conductance/weight. A negative-polarity current pulse moves the domain wall toward the left edge, decreasing the conductance/synaptic weight.
We will see in the next section that the algorithm we use to train our CNN involves increasing or decreasing each synaptic weight by integral multiples of a fixed positive value. So in this section, we show through micromagnetics that such kind of conductance modulation is indeed possible in our domain-wall device.
From our micromagnetic simulation of the device, we observe that when a current pulse of magnitude 100 μA and duration 3 ns is applied, the corresponding spin current density (J_s) lets the magnetic moments evolve over time such that the domain wall moves by approximately 17 nm (figure 1(b)). For current pulses of 3 ns duration and a magnitude which is a multiple of 100 μA (n × 100 μA), the domain wall moves by a distance which is again a multiple of 17 nm (n × 17 nm). For example, when a current pulse of magnitude 200 μA is applied for 3 ns, the domain wall moves by approximately 34 nm.
This happens because the velocity of the domain wall is linearly proportional to the charge current density, and in turn to the 'write' current magnitude (if the cross-sectional area through which the in-plane current flows does not change, as is the case here), over a wide range of current densities/'write' current magnitudes. This has been reported experimentally in [41,42,47] and verified through micromagnetic simulations and one-dimensional domain-wall theory in [46]. We observe this in our micromagnetic simulation as well; for a given pulse duration (3 ns), we apply current pulses of different magnitudes and record the distances moved by the domain wall. The ratio of the distance moved by the domain wall to the pulse duration yields the domain-wall velocity. We plot the domain-wall velocity as a function of current density in figure 2 and show that the domain-wall velocity is indeed a linear function of the 'write' current magnitude. The analytical expression corresponding to this linear dependence has been obtained through one-dimensional domain-wall theory and reported in [46].
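The quantized pulse-to-displacement relation described above can be captured in a small helper. This is an illustrative sketch, not the authors' code; the function name and the assumption of a fixed 3 ns pulse are ours:

```python
def displacement_nm(pulse_current_uA: float) -> float:
    """Domain-wall displacement for a 3 ns write pulse: the wall moves in
    17 nm steps, one step per 100 uA quantum of pulse magnitude (signed,
    so negative pulses move the wall in the opposite direction)."""
    n = round(pulse_current_uA / 100.0)  # number of 100 uA quanta
    return n * 17.0
```

For example, a 200 μA pulse corresponds to n = 2 and hence a displacement of about 34 nm, matching the micromagnetic result quoted above.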
The tilting of the domain wall with the application of current, as observed in figure 1(b), has also been explained through the one-dimensional domain-wall theory in [46]. For the MTJ structure considered here, based on an R-A product of 4.04 × 10⁻¹² Ω m² and a TMR ratio of 120% [57], G_min ≈ 2.9 × 10⁻³ mho and G_max ≈ 6.1 × 10⁻³ mho. Since the lowest distance moved by the domain wall here is 17 nm (figure 1(b)) and the motion always happens in integral multiples of 17 nm, the number of conductance states in the device ≈ 60. So the device is considered to store five bits of synaptic weight. The lowest change in conductance = (G_max − G_min)/60 ≈ 5.33 × 10⁻⁵ mho. Thus, the conductance change in the 'read' path (MTJ structure) of the device in figure 1(a) occurs in multiples of 5.33 × 10⁻⁵ mho.
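As a quick numerical check of the conductance quantization above (plain arithmetic on the quoted device values; the variable names are ours):

```python
G_MIN = 2.9e-3   # mho, domain wall at the extreme left edge (anti-parallel MTJ)
G_MAX = 6.1e-3   # mho, domain wall at the extreme right edge (parallel MTJ)
N_STATES = 60    # conductance states (1000 nm track, 17 nm minimum wall step)

# smallest conductance change, i.e. one weight quantum
delta_G = (G_MAX - G_MIN) / N_STATES
# delta_G comes out to roughly 5.33e-5 mho, matching the value quoted above
```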
We embed this information in a Verilog A module for the device [58]. When we simulate a circuit component corresponding to this module on the Cadence Virtuoso circuit simulator, we observe that the conductance changes in multiples of 5.33 × 10⁻⁵ mho for current pulses with magnitudes in multiples of 100 μA (figure 3). Positive current pulses increase the conductance, while negative current pulses decrease it. All the aforementioned simulation parameters (both for 'mumax3' and Verilog A) and their respective values are presented again in table 1.

Figure 4. The LSB crossbar for this set of filters stores the LSB part of the weights and is updated in every iteration of the training algorithm. Circuits X and Q associated with every domain-wall synapse help in the weight-update calculation: X is the multiplier, and Q thresholds the input. The common part of the weight update fed back by the output nodes (red dashed lines) is not thresholded in our algorithm.

Forward computation
Forward computation needs to be implemented on the proposed neuromorphic hardware both in the inference mode (the CNN is first trained on a conventional computer, and then only the forward computation of the trained CNN on new/test data is implemented on the neuromorphic hardware) and in the on-chip learning mode (the CNN is both trained and tested on the neuromorphic hardware). Among the different operations in forward computation, the convolution operation needs to be implemented on the domain-wall-synapse-based crossbar array, as mentioned in section 1. Figure 4 shows our implementation scheme for the same.
Following the method shown in [7], we load different convolutional filters onto different crossbar arrays and apply different segments of the input images to them. Different parts of the outputs are computed, some of them are summed, and then they are reconstructed to obtain the convolution. Since each synaptic weight in the convolutional kernel is of 15 bits in our algorithm (this helps us achieve the desired level of accuracy, as we discuss in the next section), we dedicate the five least significant bits (LSBs) to an LSB crossbar and the other ten bits to two MSB crossbars (figures 4(c) and (d)). We need three separate crossbars because each domain-wall synapse device can only store five bits of weight, as explained in the previous section.
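The bit-slicing across the three crossbars can be sketched as follows. This is an illustrative sketch using unsigned 15-bit weights for simplicity (the paper's scheme also handles signed values, which is not modeled here):

```python
def split_weight(w15: int):
    """Slice a 15-bit weight into three 5-bit fields, one per crossbar:
    (MSB crossbar, middle crossbar, LSB crossbar)."""
    assert 0 <= w15 < 2 ** 15
    return (w15 >> 10) & 0x1F, (w15 >> 5) & 0x1F, w15 & 0x1F

def merge_weight(w_m: int, w_n: int, w_l: int) -> int:
    """Reassemble the full 15-bit weight from the three 5-bit slices."""
    return (w_m << 10) | (w_n << 5) | w_l
```

Splitting and merging are exact inverses, so the three crossbars together hold the full-precision weight.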
As shown in figure 4(a), when the input images are in RGB mode (e.g., the images in the CIFAR-10 data set [37]), one portion of each input image (of dimensions 3 × 3 × 3) is extracted at a time (figure 4(a)). Then voltages corresponding to these 27 (3 × 3 × 3) pixel intensities are applied to all three aforementioned crossbars simultaneously. For the CIFAR-10 data set, 64 different convolutional filters, each of dimensions 3 × 3 × 3, are used. Each convolutional filter represents a distinct feature. u^i_j, v^i_j and w^i_j denote the synaptic weights of the convolutional kernels, where i represents the convolutional-kernel number (ranging from 1 to 64), j represents the position in the kernel matrix (ranging from 1 to 9), and u, v, and w correspond to R, G, and B respectively (figure 4(b)).
For the first convolutional kernel, inputs x_1, y_1 and z_1 (of the input image) are multiplied with u^1_1, v^1_1 and w^1_1 respectively, x_2, y_2 and z_2 are multiplied with u^1_2, v^1_2 and w^1_2 respectively, and so on. The synaptic weights corresponding to the first kernel are shown in red font in figure 4: u^1_1, v^1_1, w^1_1, ..., u^1_9, v^1_9, w^1_9. These multiplications are carried out in our designed crossbar array along just one vertical bar (the leftmost bar). The conductance of each domain-wall synapse is proportional to a synaptic weight, so the current flowing through each synapse is proportional to the product of the corresponding input value and weight. When these currents add up at the leftmost vertical bar, following Kirchhoff's current law, we essentially implement:

output_1 = Σ_{j=1}^{9} (u^1_j x_j + v^1_j y_j + w^1_j z_j).

All these multiplications taking place simultaneously is what provides the speed advantage of the crossbar array over conventional computing units.
Similarly, the second vertical column of the crossbar array, from the left, corresponds to the second convolutional kernel (its weights shown in violet font in figure 4) and executes the following multiplication:

output_2 = Σ_{j=1}^{9} (u^2_j x_j + v^2_j y_j + w^2_j z_j).

The other 62 vertical bars of the crossbar array take care of the other 62 convolutional kernels. It is to be noted that these multiplication operations happen simultaneously on the three crossbar arrays mentioned earlier: the five MSBs of the weights are handled in the first MSB crossbar (figure 4(c)), the next five bits in the second MSB crossbar, and the five LSBs in the LSB crossbar (figure 4(d)).
At the end of these simultaneous operations on the three crossbar arrays, the convolution of the first segment of the image (figure 4(a)) with the weight kernel is accomplished. Then the voltages corresponding to the next segment of the image are applied to the crossbar, and so on. Thus, unlike in the implementation of forward computation of an FCNN on a crossbar array, where the entire image is loaded into the crossbar in one go, the image is loaded into the crossbar array in segments for a CNN. Voltages corresponding to only one segment are applied at a time to the horizontal bars of the crossbar array in figures 4(c) and (d), then voltages corresponding to the next segment of the image, and so on. Further operations on the output signals obtained from the crossbar, corresponding to the forward computation in the CNN, are carried out in the digital unit of the system, after conversion from analog to digital using an ADC of 15-bit precision.
In our work, we simulate the aforementioned analog convolution operations on three crossbar arrays (two for the MSBs, one for the LSBs) which we design on the Cadence Virtuoso SPICE circuit simulator. Verilog A modules for the domain-wall synapses, based on the micromagnetic simulation of the devices on 'mumax3' (figures 1 and 3), are inserted in our circuit design on SPICE, as mentioned in section 1. After every iteration, the outputs generated by the circuit simulator are passed on to a code we write in a high-level language (Python), which takes care of the digital operations in the forward computation as well as the weight-update calculation. The conductance values of the Verilog A synaptic modules in the SPICE crossbar design are updated in accordance with the weight-update calculation, and the same process is repeated for other samples to carry out training/learning of the CNN. Thus, we carry out device ('mumax3')-circuit (SPICE)-system (Python) co-design and co-simulation in this work to simulate on-chip learning of a CNN (figure 5). We discuss our training algorithm in detail next.

Training
In the training algorithm that was used to train the FCNN in [17], a thresholding function f is applied both to the inputs and to the common part of the weight update, calculated at the output node of each layer, where:

f(q) = θ if q > q_thresh, f(q) = −θ if q < −q_thresh, and f(q) = 0 otherwise

(θ and q_thresh are hyper-parameters of the network; refer to [17] for more details). But thresholding both the input and the common part of the weight update in an extremely deep network like the CNNs used here leads to a significant drop in classification accuracy. In fact, for the CIFAR-10 data set, we observe that the CNN does not get trained at all under such heavy thresholding. So in the modified algorithm used here, meant to train CNNs on crossbar arrays, we threshold the input with the aforementioned function f, but not the common part of the weight update. We discuss our weight-update rule for training next.
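A minimal sketch of the three-level thresholding function f follows; the hyper-parameter values below are placeholders of our choosing, not the ones used in [17]:

```python
THETA = 1.0      # theta: output magnitude (hyper-parameter, value illustrative)
Q_THRESH = 0.5   # q_thresh: dead-zone half-width (hyper-parameter, illustrative)

def f(q: float, theta: float = THETA, q_thresh: float = Q_THRESH) -> float:
    """Map q to one of three levels: theta, 0, or -theta."""
    if q > q_thresh:
        return theta
    if q < -q_thresh:
        return -theta
    return 0.0
```

Because f has only three output levels, a thresholded quantity can be represented with very little analog circuitry, which is what makes the two-transistor synapse cell of [17] possible.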
As mentioned earlier, any particular weight in the convolutional kernel is split into three different weight strings, of five bits each, and stored on three different synapse devices, corresponding to the three crossbars. Let W M be the decimal number corresponding to the leftmost five bits (five MSBs), W N be the decimal number corresponding to the middle five bits, and W L be the decimal number corresponding to the rightmost five bits (five LSBs). W is the entire weight string of 15 bits in decimal.
In our modified training algorithm, only the LSB part of the weight (W_L) is updated after every iteration. From the stochastic gradient descent (SGD) algorithm [1], this weight update is supposed to be:

W ← W − α (∂L/∂W),

where L is the loss function computed for the network after one cycle of forward computation (corresponding to one sample in the training set), α is the learning rate, and the symbol ← means value transfer/update. But since the weight is quantized (to 15 bits), we cannot perform an update of arbitrary precision. Moreover, this update needs to be thresholded as well, since the LSB-synapse device can only store weight values between −W_L^max and W_L^max. So, after approximating the update to the nearest quantization level, we can write the update as:

W_L ← W_L − k W_L^min,

where k is the rounded-off integer of α(∂L/∂W)/W_L^min, and W_L^min is the minimum non-zero value that can be stored in the LSB-synapse device (the decimal equivalent of one bit).
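The quantized, clamped LSB update can be sketched as follows (our reading of the rule above; the function and argument names are ours, and whether the learning rate enters the rounding exactly this way is an assumption):

```python
def update_lsb(w_l: float, grad: float, lr: float,
               w_l_min: float, w_l_max: float) -> float:
    """One LSB-weight update step: round the scaled gradient lr*grad to an
    integer number k of quanta (w_l_min each), subtract k quanta, and clamp
    the result to the storable range [-w_l_max, w_l_max]."""
    k = round(lr * grad / w_l_min)     # number of conductance quanta
    w_new = w_l - k * w_l_min
    return max(-w_l_max, min(w_l_max, w_new))
```

The clamp models the finite conductance range of the LSB synapse; the carry into the middle crossbar, handled separately in the paper, is not modeled here.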
The middle part of the weight is updated only once every ten iterations (one iteration corresponds to one sample in the training set):

W_N ← W_N + δW_N.

If W_L has reached its maximum value, W_L is updated to −W_L^max and δW_N is set to W_N^min. Similarly, if W_L has reached its lowest value, δW_N is set to −W_N^min and W_L is updated to W_L^max. Unless either of these two scenarios occurs, δW_N is zero. W_N^min is the minimum positive value that W_N can take. If W_N reaches its maximum or minimum value, then W_M is updated in the same way; otherwise, W_M is not updated.
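The carry rule above can be sketched as follows (an illustrative helper with our own names; the W_N → W_M stage follows the same pattern and is not repeated here):

```python
def carry_lsb_to_mid(w_l: float, w_n: float,
                     w_l_max: float, w_n_min: float):
    """When W_L saturates at +w_l_max, wrap it to -w_l_max and add one quantum
    w_n_min to W_N (carry); the mirror case subtracts one quantum (borrow).
    Otherwise both weights are returned unchanged."""
    if w_l >= w_l_max:
        return -w_l_max, w_n + w_n_min
    if w_l <= -w_l_max:
        return w_l_max, w_n - w_n_min
    return w_l, w_n
```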
Since the weights of the LSB crossbar need to be updated after every iteration, a dedicated weight-update circuit needs to be present with every domain-wall synapse, just like in the synapse-cell design of the FCNN crossbar array in [17]. Just as in [17], the input needs to be multiplied with the common part of the weight update generated at the output node of that stage. However, more transistors are needed here than in the synapse cell of [17] (only two were needed there), because in [17] both the input and the common part of the weight update are thresholded by the aforementioned function f and hence can only take three values (θ, 0, −θ). Here, on the contrary, the input is thresholded but the common part of the weight update is not. We do not use a dedicated update circuit with each synapse of the two MSB crossbars because the weights there are updated only once in ten iterations. In this case, the weight update of each synapse can be calculated in the digital computing unit. The synapses can then be accessed sequentially, and their weights can be updated by passing appropriate current pulses and moving the domain walls.

Performance: classification accuracy
Following the device-circuit-system co-simulation method discussed in the previous section and shown in figure 5, we carry out on-chip learning of the CNN on three popular machine-learning data sets: MNIST [35], Fashion MNIST [59] and CIFAR-10 [37]. All operations inside the forward computation that we carry out on the Python code, including the ones that need to be computed on the digital unit of the final neuromorphic hardware, are shown in figure 6.
Classification accuracy as a function of training-epoch number is plotted in figure 7. The final accuracy numbers, along with information on the number of images used in the training and test sets in each case, are provided in table 2. The performance metrics obtained here are on par with conventional, state-of-the-art methods for the MNIST and Fashion MNIST data sets [36,60,61].
For CIFAR-10 data set, our classification accuracy on the test set (test accuracy) is comparable to that reported for a conventional implementation of a CNN on a GPU in [38] and slightly lower than a similar conventional implementation in [60]. But our accuracy number is comparable to that reported for neuromorphic implementation on an analog-digital hybrid system, which uses a combination of a RRAM device and a capacitor as a synapse [7].
Possible ways of improving the test accuracy of our implementation further, making it comparable to that of a CNN trained on a conventional computer, are as follows:

(a) We use the plain SGD algorithm for the weight update here (equation (3)). Using a more complex optimizer like the Adam optimizer [63], which has a momentum term in the weight update, may yield a better result.

(b) We use a batch size of 1 here for the LSBs of the synaptic weights, i.e., we update the LSBs of the weights after forward computation on every sample of the training set. A larger batch size may yield better results, but a larger batch size also means that the weight updates for several iterations need to be added up and the sum needs to be stored in the digital unit. This implies more digital-memory consumption.

(c) In a conventional CPU or GPU implementation, the weights can have a precision of 32-64 bits. But here, since each domain-wall device can store only five bits and we use three crossbars, we are limited to a weight resolution of just 15 bits. This reduces the accuracy number for our algorithm. Increasing the bit resolution will increase the classification accuracy; similarly, lowering the bit resolution below 15 will reduce the accuracy significantly, as we show in the next section.
(d) While the test accuracy is ≈ 75%, the classification accuracy on the training set (training accuracy) is 100%. This may be a consequence of over-fitting. Introducing regularization techniques like 'dropout' in our proposed hardware can reduce over-fitting [38].

Performance: training time per sample
The training time can be calculated in terms of the time taken for the most fundamental operation in the crossbar: the MVM operation. In fact, the very reason to use crossbar arrays is that they make the MVM operation much faster than digital implementations. If we assume the time taken for one such operation to be t_0, then the time for one complete forward computation can be calculated from the architecture of the CNN. For any particular layer of the CNN, if the output shape is given by n_H × n_W × n_C, then the time taken to compute this output is t_0 n_H n_W for our design, since all the computations for a particular output pixel (all channels taken together) are carried out in parallel, as depicted in figure 4. Moreover, the time taken to calculate the output of a fully connected layer is simply t_0, since only one MVM takes place. Thus, the time taken for one forward computation can be estimated by adding up the products n_H × n_W for the outputs of the different layers (plus t_0 per fully connected layer). For our architecture, using this method, the delay is calculated to be nearly 2.7 × 10³ t_0.
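The delay bookkeeping above can be sketched as follows. The layer shapes in the example are hypothetical, chosen only to exercise the formula; they are not the paper's actual architecture:

```python
def forward_delay(conv_output_shapes, n_fc_layers: int, t0: float = 1.0) -> float:
    """Estimate forward-computation delay in units of t0: a conv layer whose
    output is n_H x n_W x n_C costs t0 * n_H * n_W (all channels of one output
    pixel computed in parallel); each fully connected layer costs a single t0."""
    conv_steps = sum(n_h * n_w for (n_h, n_w, n_c) in conv_output_shapes)
    return (conv_steps + n_fc_layers) * t0

# hypothetical example: two conv layers and two fully connected layers
delay = forward_delay([(32, 32, 64), (16, 16, 64)], n_fc_layers=2)
```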
Since the computations which occur in the backward pass (gradient calculation part) are of the same nature as the forward part (the outer-product calculation needs to be done on the crossbar array), the time taken for the backward pass will also be close to this figure. Thus, the total time for one sample (forward and backward pass) can be estimated to be nearly 5.4 × 10 3 t 0 + delay during weight update. Calculating t 0 and the delay during weight update, however, is challenging, and we discuss possible ways to do that in the next section.
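The counting rule above can be sketched as a short calculation. The layer output shapes below are placeholders for illustration, not the paper's actual architecture; only the rule itself (one t0 per output pixel of a convolutional layer, one t0 per fully connected layer) is taken from the text:

```python
def forward_delay_in_t0(conv_output_shapes, n_fc_layers):
    """Estimate the forward-pass delay in units of t0.

    Each convolutional layer with output shape (nH, nW, nC) costs
    t0 * nH * nW, since all channels of one output pixel are computed
    in parallel on the crossbar; each fully connected layer costs t0.
    """
    conv_cost = sum(nH * nW for (nH, nW, nC) in conv_output_shapes)
    return conv_cost + n_fc_layers

# Hypothetical output shapes, for illustration only:
shapes = [(30, 30, 8), (13, 13, 16), (5, 5, 32)]
delay = forward_delay_in_t0(shapes, n_fc_layers=2)  # in units of t0
```

Doubling this estimate (for the backward pass) and adding the weight-update delay gives the per-sample training time discussed above.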

Performance: training energy per sample
Next, the total energy for carrying out the weight updates during the training process, across all the crossbar arrays in the designed CNN, is calculated. This is a significant contributor to the total energy and total power consumed for on-chip learning. Figure 3 shows that to change the weight by the smallest amount in our synapse device, say one quantum (for which the conductance changes by 5.33 × 10⁻⁵ mho), a 'write' current pulse of magnitude I = 100 μA is required. For a change of n quanta, a current pulse of magnitude nI needs to be applied (as mentioned before), since the conductance change is linearly proportional to the input current. Thus, the energy dissipated in this weight update is given by

E = (nI)² R_write Δt,    (6)

where R_write is the resistance of the write path of the particular synapse, and Δt is the length of the write pulse. For our domain-wall device (shown in figure 1), R_write = 50 Ω for the given length and breadth of the device (1000 nm × 200 nm), under the assumption that the majority of the current flows through the heavy-metal (Pt) layer. This is a valid assumption because the resistivity of the ferromagnetic layer (CoFe, say) is about 170 μΩ cm [49], which is much higher than that of Pt (≈10 μΩ cm [64]). Also, the thickness of the ferromagnetic layer is only 1 nm, whereas the thickness of the heavy-metal layer is 10 nm. So the ferromagnetic-layer path is much more resistive than the heavy-metal-layer path, and the majority of the current flows through the latter. The duration of the 'write' current pulse is Δt = 3 ns, as mentioned before (figure 3).
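The write-energy formula can be checked numerically. A minimal sketch using the device values quoted above (I = 100 μA per quantum, R_write = 50 Ω, Δt = 3 ns):

```python
I_QUANTUM = 100e-6   # A, write current for a one-quantum weight change
R_WRITE = 50.0       # ohm, resistance of the write path (heavy-metal layer)
DT_WRITE = 3e-9      # s, write-pulse duration

def write_energy(n_quanta):
    """Joule heating E = (n*I)^2 * R_write * dt for an n-quantum update."""
    i = abs(n_quanta) * I_QUANTUM
    return i * i * R_WRITE * DT_WRITE
```

A one-quantum update then costs (10⁻⁴)² × 50 × 3 × 10⁻⁹ = 1.5 fJ, and the quadratic dependence on n means large weight changes dominate the per-epoch energy early in training.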
In figure 8(a), the energy consumed per training sample for all weight updates, across all the crossbar arrays in the designed CNN, is plotted as a function of training epochs for different data sets (equation (6) has been used for the calculation). As training progresses, the magnitude of the weight updates decreases, and the energy consumption per epoch goes down. For the most challenging data set, CIFAR-10, the highest amount of energy is consumed, as expected.
A similar trend is observed in figure 8(b), where the cumulative energy (energy for the current epoch plus all previous epochs) is plotted as a function of epochs. The final cumulative energy, which is the total write energy consumed for the training process, is also mentioned in figure 8(b) for all the data sets. This energy includes training over all the samples in the training set. Dividing it by the number of training samples yields the per-sample energy consumed for weight updates across the crossbar arrays in the CNN, which we tabulate in table 2. Dividing the energy by the training time yields the corresponding power consumption.
It is to be noted that the energy consumed in forward computation has not been considered in this calculation (figure 8). Typically, the 'read' current is several orders of magnitude smaller than the 'write' current because, unlike the 'write' current, the 'read' current is not supposed to move the domain wall from its position. So the energy consumed per input-vector-weight-matrix multiplication on the crossbar array in the forward computation can be considered negligible compared to that for a weight update. But unlike in a FCNN, in a CNN there is not just one input-vector-weight-matrix multiplication on the crossbar array per weight-update operation. In a CNN, the same crossbar array is used several times for input-vector-weight-matrix multiplication between the weight kernel and several segments of the image (as explained in section 3) before the synapses in the crossbar array are updated once. Once the energy consumed in all these multiplications during the forward computation is added up, the net forward-computation energy can be comparable to the weight-update energy we have calculated and plotted in figure 8. Estimating the energy consumed for forward computation on the crossbar array is a part of our future work.

Discussion
In this section, we study the dependence of the classification accuracy of our designed CNN on process/device-level variations, cycle-to-cycle variations due to write noise, bit precision, and the number of samples/cycles after which the other bits of the weight are updated (the five LSBs are always updated every sample/cycle). Since CIFAR-10 is the most challenging of the three data sets used for this work (as mentioned earlier), we carry out this study for the CIFAR-10 data set only. For the other data sets, our accuracy numbers are already above 90% and will not be affected much by the aforementioned variations.
Figure 9. Classification accuracy of the CNN of figure 6, shown as a function of training epochs for the CIFAR-10 data set, taking device-to-device variations into account.

Process/device-to-device variations
It has been shown in [65,66] that in spin-transfer-torque magnetic-random-access-memory (STT-MRAM) devices, process variations, like variations in material parameters (saturation magnetization, PMA, damping factor) as well as variations in device dimensions, lead to more write errors. A similar phenomenon will occur in domain-wall devices, which have the same material stack and similar properties as the STT-MRAM devices.
Thus, process variation can lead to the following kind of device variation: we apply 'write' current pulses to the domain-wall synapses, but for a given pulse, the actual motion of the domain wall, and hence the actual change in the weight/conductance, differs from what is expected by a certain factor. This factor is constant for all weight changes of a particular synaptic device but varies from device to device. At the same time, it does not vary from cycle to cycle, unlike in the case of the 'write' noise, which we discuss next.
Following this principle, we include random errors in synaptic-weight changes in our algorithm for CNN training. We carry out Monte Carlo simulations for the same, considering different extents of device-level variations from their expected values: 4%, 8%, 12%, 16%, and 20%. We use the CIFAR-10 data set for this purpose and plot our results in figure 9.
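The variation model used in these Monte Carlo runs can be sketched as follows. The uniform distribution and the function names are our illustrative assumptions; the key point, taken from the text, is that each device's error factor is drawn once and then held fixed for the whole training run:

```python
import numpy as np

def make_device_factors(shape, sigma, seed=0):
    """One fixed multiplicative error factor per synapse device.

    sigma is the variation extent (e.g. 0.20 for the 20% case);
    the factors are drawn once and reused for every weight update
    of the corresponding device.
    """
    rng = np.random.default_rng(seed)
    return 1.0 + rng.uniform(-sigma, sigma, size=shape)

def apply_update_d2d(weights, intended_delta, factors):
    # The programmed change deviates from the intended one by the
    # device's fixed factor (constant across cycles for that device).
    return weights + intended_delta * factors
```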
In all cases, we observe that the classification accuracy is not affected (figure 9). This may be due to the negative-feedback mechanism in the learning process. Learning is achieved by minimizing the global error of the network, obtained by comparing the actual output with the desired output. As long as the error decreases, the network trains itself independent of the exact values of the weights. Also, in the physical implementation of the CNN through crossbars, the same crossbar is used both for forward computation (vector-matrix multiplication) and for feedback (the outer-product calculation in the back-propagation algorithm during weight update). Thus, the error that creeps in due to variations in the synaptic-weight values during the forward computation can get nullified when the same weight values are used for the outer-product calculation during the weight update [17,40].
Thus our study shows that device-to-device conductance/weight variations of up to 20% do not have much of an impact on the classification accuracy. Whether, in an actual experimental implementation of the proposed crossbar array of domain-wall devices, the conductance variation will be limited to 20% cannot be conclusively determined at the moment. Since the domain-wall device is relatively new and less commercially mature than the STT-MRAM device, such a study has not been reported yet for the domain-wall device, to the best of our knowledge. So this will be the subject of our future study.

Cycle-to-cycle variations
The Monte Carlo simulation we carry out to account for cycle-to-cycle variations is similar to the one we carry out for device-to-device variations. The key difference is that in the case of device-to-device variations, the error in the weight update of one particular synapse device is constant throughout the training process, though it may differ from that of another synapse in the crossbar array. In the case of cycle-to-cycle variations, the error in the weight update of any synapse device changes randomly from cycle to cycle, to take into account noise in the 'writing' process of the domain-wall device ('write' noise).
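In code, the only change relative to the device-to-device sketch is that the error factor is redrawn for every synapse at every update; a minimal illustration (the uniform noise distribution is again our assumption):

```python
import numpy as np

def apply_update_c2c(weights, intended_delta, sigma, rng):
    """'Write'-noise model: a fresh random factor per synapse per cycle,
    unlike the fixed per-device factor of device-to-device variation."""
    noise = 1.0 + rng.uniform(-sigma, sigma, size=weights.shape)
    return weights + intended_delta * noise

rng = np.random.default_rng(0)
w = np.zeros((4, 4))
w1 = apply_update_c2c(w, 0.1, 0.2, rng)  # first cycle
w2 = apply_update_c2c(w, 0.1, 0.2, rng)  # second cycle, new noise
```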
Based on these simulations, we plot how cycle-to-cycle variations affect the classification accuracy for the CIFAR-10 data set in figure 10. We observe that, much like in the device-to-device-variation case, cycle-to-cycle variation does not affect the classification accuracy much (figure 10). One possible reason for this is again the negative-feedback nature of the error-minimization/learning process. Another possible reason is that cycle-to-cycle variation ('write' noise) does not affect the MSBs of the synaptic weight much, since the MSBs are not updated every cycle but only after a fixed number of cycles.

Bit precision of synaptic weights
We next study how the classification accuracy varies with the bit precision of the synaptic weights in the CNN. Since CIFAR-10 is the most complex data set we handle in this paper, we show our results for the CIFAR-10 data set only, as we did for the cases of device-to-device and cycle-to-cycle variations. Classification-accuracy-vs-epoch plots for different values of the bit precision of the weights are shown in figure 11.
We observe that to obtain a high classification accuracy, we need a precision of around 15 bits. This happens because a CNN uses an extremely complex network architecture with several layers of weight-matrix and input vector multiplication, along with several other added functions: non-linear activation functions, max pool, etc. The classification task we want to accomplish is also quite difficult: images are taken from the CIFAR-10 data set. If the bit precision of the synaptic weights is low, information is lost at every stage of the CNN and the accuracy suffers.
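The information loss from low bit precision can be illustrated with a simple uniform quantizer; the symmetric weight range and rounding scheme here are our assumptions for illustration:

```python
import numpy as np

def quantize(w, n_bits, w_max=1.0):
    """Uniformly quantize weights in [-w_max, w_max] to 2**n_bits - 1 levels."""
    step = 2.0 * w_max / (2 ** n_bits - 1)
    return np.clip(np.round(w / step) * step, -w_max, w_max)

w = np.array([0.3333, -0.7071])
err_15 = np.abs(quantize(w, 15) - w).max()  # fine grid, tiny error
err_4 = np.abs(quantize(w, 4) - w).max()    # coarse grid, large error
```

At 15 bits the worst-case rounding error per weight is about 3 × 10⁻⁵, while at 4 bits it grows to several percent of the weight range; in a deep network these per-layer errors compound, which is consistent with the accuracy drop seen in figure 11.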
However, for simpler network architectures like FCNN and simpler classification tasks, having only 4-8 bits per synapse is sufficient to obtain high classification accuracy. This has been shown in [17,67].

Frequency of weight update
In the results presented thus far, we update the middle parts/middle bits of the synaptic weights after every ten cycles/iterations (the five LSBs are updated every cycle). Figure 12 shows how the classification accuracy changes for the CIFAR-10 data set for different weight-update frequencies (different numbers of cycles/iterations after which the middle part of the weight is updated). We observe that if the number of iterations after which the middle part of the weight is updated is increased, the accuracy goes down, which matches intuition.
Thus, carrying out the weight update after fewer iterations is a good idea from the accuracy perspective, but it also increases the number of write pulses applied to the synaptic devices corresponding to the middle part of the weight, and hence the power and energy consumption for on-chip learning. Also, as mentioned before, the crossbar corresponding to the five LSBs has dedicated weight-update circuitry, so all its weights can be updated simultaneously. On the contrary, in the crossbars corresponding to the other parts of the synaptic weights, the weights can be updated only sequentially through the peripheral digital unit. So reducing the number of cycles between updates of the middle parts of the weights, i.e., increasing the frequency of weight update, may lead to significant delay in training.
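The digital bookkeeping for the infrequently updated bits can be sketched as a simple accumulator; the class and parameter names are ours, while the scheme (accumulate updates in the digital unit, flush every k cycles, k = 10 in our design) is as described above:

```python
import numpy as np

class MiddleBitAccumulator:
    """Accumulate weight updates for the non-LSB crossbars in the
    digital unit and flush them every k cycles; the LSB crossbar
    is updated every cycle, in parallel, by dedicated circuitry."""

    def __init__(self, shape, k=10):
        self.k = k
        self.cycle = 0
        self.acc = np.zeros(shape)

    def add(self, delta):
        self.acc += delta
        self.cycle += 1
        if self.cycle % self.k == 0:
            flush, self.acc = self.acc, np.zeros_like(self.acc)
            return flush   # written sequentially to middle/MSB crossbars
        return None        # nothing written to those crossbars this cycle
```

Larger k means fewer write pulses (less energy) but staler middle/MSB bits (lower accuracy), which is the trade-off figure 12 quantifies.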

Experimental calibration
Figure 10. Classification accuracy of the CNN of figure 6, shown as a function of training epochs for the CIFAR-10 data set, taking cycle-to-cycle variations into account.
Having presented classification-accuracy results for our domain-wall-synapse-based crossbar arrays through device-circuit-system co-design, it is imperative that we compare the behavior of our simulated domain-wall device with recent experimental results. The physics of in-plane-current-driven domain-wall motion in heavy-metal/ferromagnetic-metal hetero-structure-based devices has been thoroughly studied before through various experiments, as we have mentioned in section 2 [41,42,45,47]. As also mentioned before, the values of our micromagnetic-simulation parameters (table 1) are the same as those used in another micromagnetic study carried out by Emori et al [45], which is bench-marked against experiments on similar devices.
The domain-wall velocity we obtain from our micromagnetic simulation is plotted as a function of the in-plane 'write' current in figure 2. The linear variation of domain-wall velocity with current matches qualitatively with that reported in experiments [41,42,47]. Since the width of the device is 200 nm and the thickness of the heavy-metal layer is 10 nm, the current density through the heavy-metal layer can be obtained from the current values in figure 2. After that conversion, it can be observed that, for the same value of current density, the domain-wall velocity obtained in our simulation is of the same order of magnitude as that reported experimentally by Lo Conte et al [47]. For example, for a current density of 4 × 10¹¹ A m⁻², Lo Conte et al experimentally report a domain-wall velocity of 15 m s⁻¹. In our micromagnetic simulation, for 800 μA of in-plane 'write' current, which translates to 4 × 10¹¹ A m⁻², the domain-wall velocity we obtain is 40 m s⁻¹, which is of the same order of magnitude as the experimentally observed value.
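The current-to-current-density conversion quoted above is a one-line check, using the device cross-section from the text (width 200 nm, heavy-metal thickness 10 nm) and assuming essentially all of the write current flows through the heavy-metal layer:

```python
def current_density(i_amps, width_m=200e-9, thickness_m=10e-9):
    """J = I / (w * t), with the write current confined to the
    heavy-metal layer's cross-section (the stated assumption)."""
    return i_amps / (width_m * thickness_m)

j = current_density(800e-6)  # 800 uA -> 4e11 A/m^2, as in the text
```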
However, while there are several experimental reports on such current-driven domain-wall motion, integrating a magnetic tunnel junction (MTJ) with the domain-wall device and studying its conductance response through the TMR effect of the tunnel junction has become a topic of experimental research only very recently, with the emerging possibility of using such a device as a synapse in a crossbar array for neuromorphic computing (as in this paper). Lequeux et al experimentally demonstrated current-based control of multiple conductance states (measured through TMR) of such a tunnel-junction device [68]. Zhang et al, very recently, experimentally demonstrated current-controlled multiple conductance states in a chain of MTJs (as opposed to a single domain-wall device) and showed that the chain can act as a synapse unit for neural-network crossbars [23].
In the conductance response of our domain-wall device, shown in figure 3, the domain wall moves in multiples of 17 nm in response to current pulses with magnitudes in multiples of 100 μA and a duration of 3 ns. Such precise domain-wall motion has been experimentally demonstrated, very recently, by Zhang et al [69] and by Kumar et al [70].
Figure 4(d) shows how the thresholding/quantizer (Q) and multiplier (X) circuits are used in the crossbar array for the update of the five LSBs of the weights after every iteration. The design of the thresholding and multiplier circuits can be seen in figure 13. Both circuits have been designed using transistor-based amplifiers; more details on their design and input-vs-output relationships, as obtained through SPICE simulations, can be found in [17,18]: the thresholding circuit's design is shown in [17], and the multiplier circuit's design in [18].

Design of peripheral circuits
The multiplier circuit in [17], however, cannot be used for this purpose, since it contains only two transistors and can carry out the multiplication only when both the input and the common part of the weight update are thresholded before the multiplication, which is the case in [17] but not in this work. Thresholding both the input and the common part of the weight update leads to a significant reduction in classification accuracy, since we train a CNN here, unlike the FCNN trained in [17]. So the more complicated multiplier circuit shown in [18] needs to be used here (figure 13(b)).
Figure 12. Classification accuracy of the CNN of figure 6, shown as a function of training epochs for the CIFAR-10 data set, while we vary the number of cycles/iterations after which the middle parts/middle bits of the synaptic weight are updated (the five LSBs are updated every cycle).
The major delay during the weight update is caused by the multiplier and thresholding circuits (the domain-wall motion within the synaptic device happens within just 3 ns, as mentioned before). The speed of these circuits ultimately depends on the switching time of the transistors used and on the parasitic resistances and capacitances. Carrying out rigorous SPICE simulations to determine this delay is a part of our future work; the focus of the current paper is on the modification of the existing CNN-training algorithm in accordance with the requirements of domain-wall-synapse-based crossbar arrays.
The time for input-vector-weight-matrix multiplication (t0 in subsection 4.2) is also difficult to calculate at the moment, because similar delays due to parasitic resistances and capacitances occur there as well. The multiplier and thresholding circuits do not contribute to the delay here, but the ADC and the DAC (which interface between the digital unit and the crossbar) can cause significant delay. Calculating this t0 will be a part of our future work. Together, the write-time delay and t0 determine the time taken for training, and thus the power consumed for training.
Figure 13. (a) Design of the thresholding circuit, as also shown in [17]; (b) design of the multiplier circuit, as also shown in [18].

Conclusion
Thus, in this paper, we carry out device-circuit-system co-design and co-simulation of on-chip learning of a CNN using a domain-wall-synapse-based crossbar array. We use a combination of micromagnetic-physics-based synapse-device modeling (using the micromagnetic package 'mumax3'), SPICE simulation of a crossbar-array circuit using such synapse devices (using the Cadence Virtuoso SPICE simulator), and system-level coding in a high-level language (Python). We report very high classification-accuracy numbers for data sets like MNIST and Fashion-MNIST, and fairly high classification-accuracy numbers for a more complex data set like CIFAR-10.
We have also shown that device-to-device variations and cycle-to-cycle variations do not affect the classification accuracy much. In addition, we show that increasing the bit precision of the synaptic weights and increasing the weight-update frequency of the middle part of the synaptic weight (the five LSBs are updated after every sample/cycle) improve the classification accuracy significantly.