STT-BSNN: An In-Memory Deep Binary Spiking Neural Network Based on STT-MRAM

This paper proposes an in-memory binary spiking neural network (BSNN) based on spin-transfer-torque magnetoresistive RAM (STT-MRAM). We propose residual BSNN learning using a surrogate gradient that shortens the time steps in the BSNN while maintaining sufficient accuracy. At the circuit level, presynaptic spikes are fed to memory units through differential bit lines (BLs), while binarized weights are stored in a subarray of nonvolatile STT-MRAM. When the common inputs are fed through BLs, vector-to-matrix multiplication can be performed in a single memory sensing phase, hence achieving massive parallelism with low power and low latency. We further introduce the concept of a dynamic threshold to reduce the implementation complexity of synapses and neuron circuitry. This adjustable threshold also permits a nonlinear batch normalization (BN) function to be incorporated into the integrate-and-fire (IF) neuron circuit. The circuitry greatly improves the overall performance and enables high regularity in circuit layouts. Our proposed netlist circuits are built on a 65-nm CMOS with a fitted magnetic tunnel junction (MTJ) model for performance evaluation. The hardware/software cosimulation results indicate that the proposed design can deliver a performance of 176.6 TOPS/W for an in-memory computing (IMC) subarray size of 1×288. The classification accuracy reaches 97.92% (83.85%) on the MNIST (CIFAR-10) dataset. The impacts of the device nonidealities and process variations are also thoroughly covered in the analysis.


I. INTRODUCTION
Deep neural networks (DNNs) in edge AI devices face the challenge of high energy consumption due to the large number of required tensor operations, which incur not only a high computational workload but also many memory accesses [1]-[5]. The conventional computer architecture, with its limited memory bandwidth and sequential computing framework, is not ideal for such operations, especially for DNNs in edge AI devices such as Internet of Things (IoT) or mobile systems, which have strict resource and power budgets. In-memory computing (IMC) has recently been introduced as a revolutionary approach to the memory-bottleneck challenge [6]. This approach partially shifts the processing load from the central processing unit to distributed processing elements in memory, which greatly reduces memory accesses while increasing performance and energy efficiency. However, IMC struggles to support complex processing operations such as multiply-accumulate (MAC); the binary neural network (BNN) has therefore recently emerged [7] with the aim of simplifying network operations. Interestingly, IMC approaches and BNNs exhibit considerable synergy. Indeed, a BNN typically performs calculations based on bitwise XNOR operations (for multiplication) and bit counting (for accumulation), both of which can be fully implemented in memory, as proposed in several prior works. The IMC designs in [8], [9] employ resistive RAM (RRAM), yielding low standby power and high array density. However, in these works, the accumulation process based on current summing could require significant energy in large networks. As a further development, STT-MRAM was also adopted for IMC-based neural networks. STT-MRAM exhibits outstanding advantages over other types of memory in terms of endurance [10]; thus, it is well suited for use as on-chip memory.
The STT-MRAM-based BNN accelerator architecture in [11] also adopts the current-sensing approach and is practical only for small arrays (8×4). Furthermore, in [12], the authors introduced a scalable and fully parallel in-memory BNN structure that supports MAC operations in a single memory access phase via a voltage-sensing technique, achieving improved scalability and energy efficiency. Nonetheless, the accuracy of BNNs depends strongly on process variations, which can be quite severe in finer technologies. Spiking neural networks (SNNs), known as the third generation of DNNs, not only better mimic biological neural behaviors but also exhibit great fault tolerance and can potentially overcome this persistent drawback of binarized networks [13]. Therefore, SNNs have been actively studied recently, and MTJs have been adopted for SNNs as well [14]-[19]. In [14]-[16], the authors leveraged the binary switching of an MTJ to map the sigmoid function of an artificial neural network (ANN) to an SNN. However, this mapping may cause significant accuracy degradation. Additionally, the accuracy suffers from unstable MTJ switching and bias-current variation. Inhibitory and excitatory spike-timing-dependent plasticity was processed for on-chip learning with MTJ-synapse [17] and MTJ-neuron [18] implementations. In [17], a stochastic synapse was introduced, in which synaptic propagation was modulated stochastically by a full-precision weight. Each neuron then accumulated incoming synaptic events sequentially using a spike counter, which significantly affected the network latency. In [18], the leaky integrate-and-fire spiking model was used to emulate biological neuron dynamics, but that work focused only on the neuron circuitry and did not cover the impact of variation.
On the other hand, in [19], the author presented a compact probabilistic Poisson method based on back-hopping oscillation in an MTJ, where the number of spikes was exponentially proportional to the synaptic current in the utilized sampling time (within a time step). However, the classification accuracy of this approach is highly sensitive to the sampling period.
In this work, we propose a complete design approach with multilevel optimization for a BSNN that leverages highly energy-efficient IMC based on STT-MRAM. For network training, we first propose a direct BSNN training approach using a surrogate gradient augmented with residual connections. While directly training a deep SNN with binary weights can cause accuracy degradation, as reported in [20], our training method improves the accuracy and reduces the network latency. For the BSNN inference process, we also provide a mathematical formulation to justify that the MAC function of the BSNN can be fully realized in memory with XNOR-based computation. Specifically, we propose an equivalent IF model with a dynamic threshold instead of a fixed threshold. This permits the MAC operation, the most computationally intensive operation of the synapse, to be realized solely in the XNOR IMC array. In addition, the fundamental operation of the IF neuron then reduces to an accumulation function, which not only simplifies the neuron circuitry but also enhances its calculation accuracy. Finally, since the threshold is dynamic, we adjust its initial value (i.e., after firing) to emulate the effect of BN.
Our contributions can be summarized as follows:
• We propose a BSNN with residual connections and train the network with a surrogate gradient, which enables higher classification accuracy with fewer time steps.
• The proposed dynamic threshold mechanism allows neural synapses to be mapped to XNOR cells based on the STT-MRAM subarray in [12] and reduces the complexity of the neuron circuit by incorporating BN.
• The proposed approaches are built for circuit-level simulations. The accuracy and other performance metrics of the network are then evaluated based on parameters realistically extracted from circuit simulations, including the nonlinearity and process variations.

The rest of this paper is organized as follows. Section II presents the BSNN training model. The BSNN IMC model with a novel MAC operation using an XNOR cell circuit is introduced in Section III. Section IV describes the STT-BSNN array and introduces the sense amplifier used in the proposed architecture. Section V evaluates the accuracy of the proposed training method and compares the energy efficiency of our design with that of previous studies. Section VI concludes this work.

II. TRAINING A BSNN WITH A SURROGATE GRADIENT AND RESIDUAL CONNECTIONS
In this section, we present the training process for BSNNs using surrogate gradient backpropagation with residual connections. In our training model, the binarized weights are represented in bipolar format (i.e., ±1), as introduced in [21].

A. BACKGROUND
In the conventional IF model, the membrane potential $u_i^{t,l}$ of the $i$-th neuron at time step $t$ in layer $l$ is defined as follows:

$$u_i^{t,l} = u_i^{t-1,l} + \frac{\gamma^l}{\sigma^l}\left(\sum_{j=1}^{n} w_{ij}^{l}\, o_j^{t,l-1} - \mu^l\right) + \beta^l. \tag{1}$$

Here, $\gamma^l$ and $\beta^l$ are the scaling and shifting parameters, respectively, and $\sigma^l$ and $\mu^l$ correspond to the standard deviation and mean of BN, respectively. $n$ denotes the number of presynaptic spikes, and $o_j^{t,l-1}$ is the presynaptic spike of the $j$-th neuron. $w_{ij}^{l} = \alpha^l \cdot w_{ij}^{b,l}$ represents the latent weight of the BSNN, where $w_{ij}^{b,l} = \mathrm{sign}(w_{ij}^{l})$ is the corresponding binary weight and $\alpha^l = \mathbb{E}\,|w_{ij}^{l}|$ is the latent-weight scaling factor. For the spike representation, we use the rate-encoding method introduced in [22]. Specifically, the real input data are converted into spike format using a Poisson random number generator, and the generated value is proportional to the total number of spikes within the $T$ time steps. According to the IF model in (1), if the membrane potential $u_i^{t,l}$ surpasses the firing threshold $V_{th}$, a postsynaptic spike $o_i^{t,l}$ is generated; the membrane potential is then reset to zero before being accumulated again. Furthermore, the cross-entropy loss function is calculated from the membrane potential $u^{T,L}$ of the last output layer:

$$\mathcal{L} = -\sum_{i=1}^{N} y_i \log\!\left(\mathrm{softmax}\!\left(u_i^{T,L}\right)\right), \tag{2}$$

where $y = (y_1, y_2, \dots, y_N)$ is the label vector and $N$ is the total number of network outputs. During the training process, the loss function $\mathcal{L}$ is minimized by gradient descent, and the latent weight is updated as follows [22]:

$$w \leftarrow w - \eta \sum_{t} \frac{\partial \mathcal{L}}{\partial w}\bigg|_{t}, \tag{3}$$

where $\eta$ is the learning rate and $\sum_t \partial\mathcal{L}/\partial w\,|_t$ is the gradient accumulated over all time steps, which is calculated as in [22]:

$$\frac{\partial \mathcal{L}}{\partial w}\bigg|_{t} = \frac{\partial \mathcal{L}}{\partial o^{t,l}}\, \frac{\partial o^{t,l}}{\partial u^{t,l}}\, \frac{\partial u^{t,l}}{\partial w}. \tag{4}$$

In (4), the gradient calculation suffers from the nondifferentiable spiking activity. To address this issue, an approximate (surrogate) gradient was introduced in [22], [23], which is formally expressed as

$$\frac{\partial o^{t,l}}{\partial u^{t,l}} \approx \frac{1}{a}\,\mathrm{sign}\!\left(\left|u^{t,l} - V_{th}\right| < \frac{a}{2}\right), \tag{5}$$

where $a$ is typically set to 0.3. The surrogate gradient is effective for handling the nondifferentiable spiking activity. However, when the network gets deeper, the training process based on gradient descent in (3)-(4) suffers from the degradation problem [20], [24].
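As a concrete illustration, the IF dynamics with a hard reset and the rectangular surrogate gradient in (5) can be sketched in PyTorch (the framework used for the models in Section V). This is a minimal sketch under our own naming; only the surrogate window a = 0.3 is taken from the text:

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate in the backward pass."""
    a = 0.3  # surrogate window width, as in (5)

    @staticmethod
    def forward(ctx, u, v_th):
        ctx.save_for_backward(u)
        ctx.v_th = v_th
        return (u > v_th).float()

    @staticmethod
    def backward(ctx, grad_out):
        (u,) = ctx.saved_tensors
        # d(spike)/du ~ 1/a inside |u - v_th| < a/2, zero elsewhere
        surrogate = (torch.abs(u - ctx.v_th) < SpikeFn.a / 2).float() / SpikeFn.a
        return grad_out * surrogate, None  # no gradient w.r.t. the threshold

def if_step(u, x, v_th=1.0):
    """One IF time step: accumulate the input x, fire, then hard-reset fired neurons."""
    u = u + x
    spike = SpikeFn.apply(u, v_th)
    u = u * (1.0 - spike)  # reset membrane potential to zero after firing
    return u, spike
```

In a full training loop, `if_step` would be applied per layer and per time step, with the loss accumulated over all time steps as in (3)-(4).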
In the following, we present how a residual connection [20], [24] can be adopted for our BSNN to tackle the degradation issue.

B. PROPOSED BSNN WITH RESIDUAL CONNECTIONS
Residual connections are an effective technique that helps stabilize the training process and improve the classification accuracy of deep networks [20], [24]. We hence propose a BSNN training model using a surrogate gradient in conjunction with residual connections. As illustrated in Fig. 1, relative to the conventional BSNN in Fig. 1(a), each convolutional (Conv) layer in the residual structure in Fig. 1(b) has an additional connection from layer $(l-1)$ to layer $l$ via the inverter-AND spike-element-wise (SEW) function $g$. The spike $o_i^{t,l}$ of layer $l$ now depends on the IF output $s_i^{t,l}$ and the spike $o_i^{t,l-1}$ as follows:

$$o_i^{t,l} = g\!\left(s_i^{t,l},\, o_i^{t,l-1}\right) = \left(\lnot s_i^{t,l}\right) \wedge o_i^{t,l-1}. \tag{6}$$

With the SEW function in (6), if the output of (1) is $s_i^{t,l} = 0$, the output of the element-wise function becomes $o_i^{t,l} = o_i^{t,l-1}$, which satisfies the identity-mapping condition. The algorithm that describes the whole training process can be found in Appendix A.
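The inverter-AND SEW function in (6) is one line of element-wise logic; a minimal sketch (function name ours):

```python
import numpy as np

def sew_iand(s, o_prev):
    """Inverter-AND spike-element-wise function: g(s, o) = (NOT s) AND o.

    When the Conv path emits no spike (s == 0), the shortcut spike o_prev
    passes through unchanged, satisfying the identity-mapping condition.
    Works on scalars in {0, 1} or on binary NumPy arrays.
    """
    return (1 - s) * o_prev
```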

III. XNOR-BASED BSNN INFERENCE WITH A DYNAMIC THRESHOLD
This section presents a BSNN inference model that performs the MAC operation with an XNOR array; the latter is well suited to in-memory implementation using emerging technologies [12], [25]. To simplify the inference model in (1) without accuracy degradation, we set $\gamma^l = 1$ and $\beta^l = 0$ in the BN layers [26]. The original IF model for a BSNN is then expressed as

$$u_i^{t,l} = u_i^{t-1,l} + \frac{\alpha^l}{\sigma^l}\left(\sum_{j=1}^{n} w_{ij}^{b}\, o_j^{t,l-1} - \mu^l\right). \tag{7}$$

In this model, during every time step, the membrane potential $u_i^{t,l}$ accumulates $\frac{\alpha^l}{\sigma^l}\{\sum_{j=1}^{n} w_{ij}^{b}\, o_j^{t,l-1} - \mu^l\}$ and is then compared with a threshold for the firing decision. To avoid the multiplication in (7), which could be costly to implement at the circuit level, we scale both the membrane potential and the threshold by a factor of $\sigma^l/\alpha^l$. Given that, the scaled membrane potential $\hat{u}_i^{t,l}$ and the scaled threshold $\hat{V}_{th}$ can be rewritten as

$$\hat{u}_i^{t,l} = \hat{u}_i^{t-1,l} + \sum_{j=1}^{n} w_{ij}^{b}\, o_j^{t,l-1} - \mu^l, \qquad \hat{V}_{th} = \frac{\sigma^l}{\alpha^l} V_{th}. \tag{8}$$

The MAC component $\sum_{j=1}^{n} w_{ij}^{b}\, o_j^{t,l-1}$ in (8) is essentially the most computationally intensive operation. On the other hand, in the binary spiking model, a spike signal is represented in unipolar format (0, 1) [27], [28], while the weights $w_{ij}^{b}$ are trained in bipolar format (−1, +1). Therefore, unlike the case of a BNN [12], [25], an XNOR array cannot be directly utilized for the BSNN MAC function. To overcome this issue, assuming that the weights in (8) comprise $n_1$ negative weights ($w^{-}$) and $n - n_1$ positive weights ($w^{+}$), the MAC function in (8) can be reformulated as

$$\sum_{j=1}^{n} w_{ij}^{b}\, o_j^{t,l-1} = \sum_{j=1}^{n} \left(\bar{w}_{ij} \odot o_j^{t,l-1}\right) - n_1, \tag{9}$$

where $\bar{w}_{ij}$ is the weight in unipolar format ($\bar{w}_{ij} = (w_{ij}^{b} + 1)/2$) and $\odot$ denotes the XNOR operation. Substituting (9) into (8) and denoting $\bar{\mu}^l = n_1 + \mu^l$, the scaled membrane potential in (8) becomes

$$\hat{u}_i^{t,l} = \hat{u}_i^{t-1,l} + \sum_{j=1}^{n} \left(\bar{w}_{ij} \odot o_j^{t,l-1}\right) - \bar{\mu}^l. \tag{10}$$

In (10), the component $\sum_{j=1}^{n} \bar{w}_{ij} \odot o_j^{t,l-1}$ can be realized entirely in memory by using an XNOR array.
However, the transformation introduces the constant $\bar{\mu}^l$ in (10), which implies that the hardware implementation must support both accumulation and subtraction; the latter leads to undesirable complexity in the circuit implementation. To solve this problem, we propose an equivalent IF model with a dynamic threshold mechanism. Specifically, instead of using a fixed threshold, we transform the negative component of the scaled membrane potential in (10) into a positive component of the threshold $\hat{V}_{th}$, which is now treated as the time-dependent quantity $\hat{V}_{th}^{t,l}$. The proposed IF model can therefore be formulated as follows:

$$\hat{u}_i^{t,l} = \hat{u}_i^{t-1,l} + \sum_{j=1}^{n} \left(\bar{w}_{ij} \odot o_j^{t,l-1}\right), \qquad \hat{V}_{th}^{t,l} = \hat{V}_{th}^{t-1,l} + \bar{\mu}^l, \tag{11}$$

with a postsynaptic spike fired when $\hat{u}_i^{t,l} > \hat{V}_{th}^{t,l}$. Compared to the IF model in (7), the IF model in (11) requires only accumulation operations, which helps to simplify the subsequent hardware implementation. However, the model in (11) is correct only when $\bar{\mu}^l$ is positive. Although it rarely occurs, the value of $\bar{\mu}^l$ can theoretically be negative, i.e., the subtraction in (10) is actually an addition. In that case, we retain the original expression in (10) and keep the threshold constant. Accordingly, a small modification is required in the IF neuron circuit implementation (detailed in Section IV) to cover both positive and negative values of $\bar{\mu}^l$.
The pseudocode of the XNOR-based BSNN inference process with a dynamic threshold is presented in Appendix B.
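At the model level, the XNOR-based inference with a dynamic threshold can be sketched as a software analogue of the in-memory computation. This is a minimal sketch under our own naming, with both accumulators hard-reset after firing (the reset behavior follows the circuit description in Section IV) and assuming $\bar{\mu}^l > 0$:

```python
import numpy as np

def xnor_mac(w_bar, o):
    """Count of XNOR(w_bar_j, o_j) outputs equal to 1 -- the quantity K
    produced by one row access of the XNOR array."""
    return int(np.sum(w_bar == o))

def if_dynamic_threshold(spikes_in, w_bar, mu_bar, v_th0, T):
    """IF inference per (11): the membrane potential accumulates only K,
    while the threshold grows by mu_bar each time step (mu_bar > 0 assumed).
    Both quantities are reset after a firing event."""
    u_hat, v_hat = 0.0, float(v_th0)
    out = []
    for t in range(T):
        u_hat += xnor_mac(w_bar, spikes_in[t])   # accumulation only, no subtraction
        v_hat += mu_bar                          # dynamic threshold update
        fire = u_hat > v_hat
        out.append(int(fire))
        if fire:
            u_hat, v_hat = 0.0, float(v_th0)     # reset membrane and threshold
    return out
```

Note the identity behind (9): for any spike vector, the bipolar MAC equals K − n₁, which the test below checks on a small example.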

IV. IN-MEMORY STT-MRAM BITCELL AND SUBARRAY CIRCUITS
This section presents the circuit implementation of the BSNN model described in Sections II and III. The general architecture for intralayer processing using the proposed model is shown in Fig. 2. First, the binarized weights $\bar{w}_{ij}$ are mapped into the memory of the $m \times n$ STT subarray. Then, the digital presynaptic spikes $o_j^{t,l-1}$ are encoded by the column decoder and fed to the array through the bit lines BL$_j$ ($j = 1, \dots, n$) to fit the XNOR-MAC computation. Finally, the voltage on the source line SL$_i$ ($i = 1, \dots, m$), which represents the output of the MAC operation, is passed to the IF model in (11) to generate the postsynaptic spikes $o_i^{t,l}$. All of the circuit implementation and simulation steps are completed on a 65-nm CMOS process with the MTJ parameters presented in Table I.
The detailed implementations of individual subcircuits are subsequently described in the following sections.

A. MAC OPERATIONS USING A COMPLEMENTARY 2T-2R STT-MRAM BITCELL AND SUBARRAY
To map a MAC computational unit onto the IMC memory, we employ a 2T-2R STT-MRAM-based XNOR cell. The details of this circuit and its model can be found in [12]; here, we briefly describe the bitcell and row operations.
Updating the binary weights is performed at the beginning. Since the SL is shared among the cells in the same row, each MTJ has to be written individually. Specifically, apart from the BL corresponding to the active MTJ, the other BLs are left in the high-impedance state while an appropriate voltage is applied across (BL, SL) to flip the MTJ magnetization. The write peripheral circuit is omitted for clarity; details can be found in [29], [30]. As seen in Fig. 3(a), for a single XNOR bitcell, the binarized weight (0, 1) is encoded by the states of the MTJ pair ($R_P$ and $R_{AP}$), and the presynaptic spike is encoded as a pair of complementary BL voltages. Here, the BL voltage $V_{BL}$ is set to 0.3 V to limit the bitcell reference current to less than 50 µA, avoiding a high disturbance rate and high power consumption [31]. The BL driver encodes incoming spikes using a pair of complementary transistors (an n-channel MOS (NMOS) and a p-channel MOS (PMOS)). The SL voltage represents the output of a single XNOR operation (see Fig. 3(b)). Fig. 3(a) shows the circuit implementation of a single-row XNOR-based BSNN using STT-MRAM. Each high-load word line (WL) is driven by a buffer (a 4-stage inverter chain) that guarantees a fast transition and a stable voltage level during the MAC operation on the memory row. Additionally, following [12], the MAC product at the output of the $i$-th row connection is represented by the merged SL voltage $V_{SL,i}$ (i.e., all bitcells in a row share the same SL), which depends linearly on $K$, the number of XNOR outputs (i.e., $\bar{w}_{ij} \odot o_j^{t,l-1}$) equal to +1 across the entire row of bitcells; the overall resistance seen from the SL terminal of each bitcell is data-independent and equal to $(R_P + R_{on})\,\|\,(R_{AP} + R_{on})$.

FIGURE 3. (a) A single STT-MRAM row based on 2T-2R STT-MRAM bitcells for realizing binarized MAC operations, (b) the SL voltage level corresponding to the XNOR operation for a single 2T-2R bitcell [12], and (c) the dependence of the SL voltage on the number of +1 values among the XNOR outputs (K) in the circuit simulation of a row of 288 bitcells with $V_{BL}$ = 0.3 V and $V_{DD}$ = 0.9 V.
As shown in Fig. 3(c), the simulated SL voltage closely matches the ideal linear dependence on $K$ [12]; therefore, the MAC operation in (11) can be performed within a single in-memory access phase.
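As a rough behavioral model of this row, the linear dependence of the merged SL voltage on K can be written in a few lines. The endpoint voltages (110 mV and 184 mV) are the simulation range quoted later in the text; their assignment to K = 0 and K = n, respectively, is our assumption:

```python
def v_sl(K, n=288, v_min=0.110, v_max=0.184):
    """Merged-SL voltage (in volts), linear in the number K of XNOR
    outputs equal to +1 across a row of n bitcells."""
    assert 0 <= K <= n
    return v_min + (v_max - v_min) * K / n
```

Such a linear map is what allows a single analog read of the SL to stand in for the entire 288-input MAC.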
In the following, we introduce an approach for implementing the IF neuron mechanism (the model in (11)) using circuit computation in the charge domain, whose input is the SL voltage $V_{SL,i}$ from the MAC operation. This voltage must be accumulated in every time step, following the IF model in (11). However, since $V_{SL,i}$ varies only from 110 mV to 184 mV, it must be amplified to an adequate level to control the charging current through M1. The amplification is performed by the CB (voltage booster) circuit introduced in [31]: the amplification factor does not require high precision, and the CB circuit is very compact and energy efficient. As seen in Fig. 5, when ENacc1 is activated, the boosted SL voltage drives M1 to charge C1, and the resulting voltage increment at the output $V_1^t$ of capacitor C1 accumulates the membrane potential (ACC1). Similarly, ACC2 is utilized for the dynamic threshold accumulation. Assume that $\bar{\mu}^l > 0$ and that $V_{\bar{\mu}}$ represents the value of $\bar{\mu}^l$ in the voltage domain. As shown in Fig. 5, when ENacc2p is activated, $V_{\bar{\mu}}$ is used to charge C2 through M2 (ENacc2p ≡ ENacc1). The amount of additional charge corresponds to the voltage increment at the output $V_2^t$ of capacitor C2. In this way, $V_2^t$ represents the dynamic threshold accumulation corresponding to the model in (11).

1) GENERAL ARCHITECTURE
Finally, $V_1^t$ and $V_2^t$ are fed into the current-latched sense amplifier (CLSA) [32] for a voltage-level comparison. If the firing condition in (11) is satisfied ($V_1^t > V_2^t$), a spike is generated at the output of the CLSA ($s_i^{t,l}$), as illustrated in Fig. 5. Subsequently, $s_i^{t,l}$ is shaped by two inverters before being fed into a D flip-flop (D-FF) for the generation of the postsynaptic spike $o_i^{t,l}$. The frequency of the clock (CLK) signal determines the postsynaptic spike period ($T$ = 6 ns). Additionally, the signal D, which is $s_i^{t,l}$ delayed by the two inverters, is used to reset ACC1 and ACC2 before the next time step begins.
To reset the dynamic threshold (after firing and during initialization), C2 is precharged by $V_{th,pre}$, which represents the threshold $\hat{V}_{th}$ (see (11)) in the voltage domain. As shown in Fig. 4, C2 is precharged when both STE and the output of the D-FF are equal to 1, where STE is a periodic signal that enables the threshold-precharging circuit for a fixed duration within a time step. In the normal working mode, the precharge is thus triggered when a postsynaptic spike is generated. For example, in Fig. 5, $V_2^t$ is precharged to $V_{th,pre}$ (14.7 mV) within a 0.5-ns window after firing. During the initialization process, the D-FF output is manually set to '1' to preset the threshold. After IF processing, the output of the D-FF is fed into the SEW circuit to perform the residual connection according to the model in (6): $o_i^{t,l}$ is combined with the spike $o_i^{t,l-1}$ from the previous layer using a single inverter-AND gate, as shown in Fig. 4. Note that in the rare cases in which $\bar{\mu}^l < 0$ (see (11)), ENacc2p is deactivated and ENacc2n is activated. In such cases, $V_{\bar{\mu}}$ accumulates in ACC1 instead of ACC2; in other words, the output of ACC2 is fixed at $V_{th,pre}$. This switching mechanism implements the handling of negative $\bar{\mu}^l$ discussed in Section III.
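Putting the pieces together, one time step of this neuron (ACC1/ACC2 accumulation, CLSA comparison, and post-fire reset/precharge) can be modeled behaviorally as below. All names are ours, the charging increments are assumed simply additive (quasi-linear region), and the voltage values used in the test are arbitrary illustrative numbers, not the circuit's:

```python
def neuron_time_step(state, dv_mac, dv_mu, v_th_pre):
    """One behavioral IF-neuron time step in the charge domain.

    state = (V1, V2): ACC1 (membrane) and ACC2 (dynamic-threshold)
    capacitor voltages. dv_mac is the increment from the boosted SL
    voltage; dv_mu is the increment representing mu_bar.
    """
    v1, v2 = state
    v1 += dv_mac            # ACC1 accumulation (ENacc1 active)
    v2 += dv_mu             # ACC2 accumulation (ENacc2p active)
    spike = 1 if v1 > v2 else 0   # CLSA comparison, latched by the D-FF
    if spike:               # delayed D resets ACC1; STE precharges C2
        v1, v2 = 0.0, v_th_pre
    return (v1, v2), spike
```

Running this model over several time steps reproduces the intended behavior: the neuron stays silent while the threshold accumulator leads, then fires and resets once the membrane accumulator overtakes it.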
The detailed charge-based accumulation, analysis and circuit implementation of the core functions are described below.

2) CHARGE-BASED ACCUMULATION
As shown in Fig. 4, the charge-based accumulation architecture consists of a PMOS transistor and a capacitor. According to the well-known $\alpha$-power-law model [33], the charging current passing through transistors M1 and M2 is proportional to $(V_{GS} - V_{TH})^{\alpha}$, where $\alpha$ is the model power index, which ranges from 1 to 2 depending on the adopted technology. In our chosen technology (a 65-nm CMOS), the transistor is considered short-channel; i.e., theoretically, $\alpha \approx 1$.
We further verify experimentally that the charging current is quasi-linear in $V_{SL}$ over the range of interest (the boosted gate voltage ranges from −702 mV to −443 mV). This fact is very important and allows the proposed model in (11) to be mapped directly to the circuit solution. With the boosted value of $V_{SL,i}$ applied to the gate of M1 for the sampling time, the accumulated voltage on C1 at time step $t$ is

$$V_1^t \approx V_1^{t-1} + \Delta V_1^t, \tag{17}$$

where $\Delta V_1^t$ is the voltage increment in ACC1 when $V_{SL,i}$ is applied to the analog accumulation input; by the quasi-linearity above, $\Delta V_1^t$ is proportional to the MAC output. For the dynamic threshold, the circuit implementation is almost identical, but $V_{SL,i}$ is replaced with $V_{\bar{\mu}}$, giving

$$V_2^t \approx V_2^{t-1} + \Delta V_2^t, \tag{19}$$

where $\Delta V_2^t$ is the increment representing $\bar{\mu}^l$. The expressions for $V_1^t$ and $V_2^t$ in (17) and (19), respectively, hence map the circuit domain directly to the model in (11).
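For reference, the α-power charging relation and the resulting capacitor update can be written as a small numerical model. The device constant k, capacitance C, and sampling time t_s below are placeholder values chosen only to make the example run, not extracted device parameters:

```python
def drain_current(v_ov, k=1e-4, alpha=1.0):
    """Alpha-power-law drain current: I_D = k * (V_GS - V_TH)**alpha.
    With alpha ~ 1 (short-channel 65-nm device), the charging current
    is quasi-linear in the overdrive voltage v_ov."""
    return k * v_ov ** alpha

def acc_update(v_cap, v_ov, t_s=1e-9, C=1e-15):
    """One charge-based accumulation: V^t = V^{t-1} + I_D * t_s / C."""
    return v_cap + drain_current(v_ov) * t_s / C
```

With α = 1 the increment is exactly proportional to the overdrive voltage, which is the property that lets the circuit realize the linear accumulation in (17) and (19).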

3) THE EFFECTS OF NONLINEARITY AND PROCESS VARIATION
As introduced in (17) and (19), the IF model in (11) can be directly mapped to the charge-based accumulation circuit shown in Fig. 4. However, this implementation suffers from inevitable nonlinearity and process variation, which originate from both the CMOS and MTJ devices. These effects degrade the IF accuracy, as well as the overall system performance. In this section, we quantify these nonideal impacts based on actual circuit simulations, aiming at a realistic evaluation of the proposed BSNN model at the system level in the subsequent section.
To capture the effect of process variation, we run Monte Carlo simulations for a synaptic array (a row of STT-MRAM) in Fig. 3(a) and the proposed IF neuron circuit in Fig. 4. The variation of the CMOS devices is set according to the models provided by the foundry. For the MTJ, the variability of the MTJ resistance is approximately 5%, according to [34]. The accumulated output $\Delta V_1^t$ depends on $K$ (the number of XNOR outputs equal to +1), as presented in Section IV.A. Each IMC row is designed with 288 bitcells to fit the BSNN inference model; Monte Carlo simulations were therefore performed for all 289 values of $K$. The results indicate that the standard deviation $\sigma_1(K)$ is not the same for every $K$. This effect is understandable from the circuit perspective, as the working point of the M1 transistor differs for different values of $K$. For example, $\sigma_1(K)$ reaches its maximum of 3.60 mV at $K$ = 2 and its minimum of 1.15 mV at $K$ = 221. The noise profile, represented by a normal distribution $\mathcal{N}\{0, \sigma_1(K)\}$, is added to the ideal value of $\Delta V_1^t$. The actual voltage level at the output of ACC1 can thus be approximated as

$$\Delta V_1^t \approx \Delta V_1(K) + \varepsilon_1(K) + \mathcal{N}\{0, \sigma_1(K)\}, \tag{20}$$

where $\Delta V_1(K)$ is the ideal linear increment and $\varepsilon_1(K)$ is the $K$-dependent nonlinear error. A similar analysis and model are applied to $\Delta V_2^t$; the difference is that the variation of $\Delta V_2^t$ comes from the ACC2 circuit itself, without contributions from the synapse circuits (i.e., the memory array):

$$\Delta V_2^t \approx \Delta V_2 + \varepsilon_2 + \mathcal{N}\{0, \sigma_2\}, \tag{21}$$

where $\Delta V_2$ is the ideal linear increment, and $\varepsilon_2$ and $\mathcal{N}\{0, \sigma_2\}$ are the nonlinear error and random variation of $\Delta V_2^t$, respectively. Finally, the nonlinearity and variation effects on $\Delta V_1^t$ and $\Delta V_2^t$ are injected into the accumulations in (17) and (19). Their effects on the full circuit simulation are studied in the following subsection.
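The error-injection model of (20) can be sketched with per-K lookup tables. The table contents below are placeholders (zero bias, a constant 2-mV sigma); in the actual flow they would hold the 289 bias and sigma values extracted from the Monte Carlo circuit simulations:

```python
import numpy as np

rng = np.random.default_rng(0)

K_MAX = 288
# Placeholder lookup tables for K = 0..288; real entries come from
# Monte Carlo circuit simulations (Fig. 6 in the text).
eps_lut = np.zeros(K_MAX + 1)            # nonlinear bias epsilon_1(K), volts
sigma_lut = np.full(K_MAX + 1, 2.0e-3)   # standard deviation sigma_1(K), volts

def dv1_actual(K, dv_ideal):
    """Nonideal ACC1 increment per (20): ideal value + K-dependent
    nonlinear bias + zero-mean Gaussian noise."""
    return dv_ideal + eps_lut[K] + rng.normal(0.0, sigma_lut[K])
```

During system-level inference, this function replaces the ideal increment so that every accumulation carries a realistic, K-dependent error sample.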

4) IF NEURON CIRCUIT SIMULATION
In this subsection, we conduct a full circuit simulation of an IF neuron circuit directly connected to a synaptic row of 288 STT memory cells. We extract the trained network parameters (i.e., the weights, BN parameters, and threshold) from the PyTorch model and feed them into the circuit model, which is encapsulated in an HSPICE netlist. The network hyperparameters are set so that the kernel size is 3×3 and the number of input channels is 32, corresponding to an IMC array size of 1×288. Fig. 7 shows an example waveform of the CLSA output ($s_i^{t,l}$) for a single time step considering the process variation effect. From the figure, the delay can fluctuate from $t_{d,1}$ (~170 ps) to $t_{d,2}$ (~740 ps). However, this variation barely affects the output spike because $s_i^{t,l}$ is shaped and latched at a fairly stable position.
Furthermore, over multiple time steps, process variation also affects the spike positions and the total number of spikes. However, since we use rate encoding, shifted spike positions do not affect the final result, which is determined by the total number of spikes within the $T$ time steps. It can also be observed that the output pattern may miss or introduce one spike, as shown in Figs. 8(b) and 8(c). Nevertheless, the undesired loss or addition of one spike has little impact on the final result, again thanks to the use of rate encoding. In rare cases, we also observe two missing or two introduced spikes, but that is the maximum number of incorrect spikes observed thus far in our simulations. To quantify the impact of process variation on the robustness of the neuron implementation, we extract the mean square error (MSE) of the output bitstreams from the Monte Carlo runs as

$$\mathrm{MSE} = \frac{1}{N_{MC}} \sum_{k=1}^{N_{MC}} \left(o_k - o_{ideal}\right)^2, \tag{22}$$

where $N_{MC}$ is the number of Monte Carlo simulations and $o_k$ and $o_{ideal}$ are the simulated and variation-free output bitstreams, respectively. According to (22), the MSE over 500 Monte Carlo simulations in this experiment is 1.45%. Thus, process variation does have an impact on the classification accuracy, but this impact is small and can reasonably be accepted. The above analysis provides solid evidence of the fault-tolerant nature of SNNs [13].
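The bitstream MSE computation is straightforward; the sketch below averages the squared bitwise deviation of each Monte Carlo run from the variation-free reference bitstream, which is our reading of (22):

```python
import numpy as np

def bitstream_mse(runs, reference):
    """MSE of output bitstreams across Monte Carlo runs: average squared
    deviation (per bit, per run) from the variation-free reference."""
    runs = np.asarray(runs, dtype=float)          # shape (N_MC, T)
    reference = np.asarray(reference, dtype=float)  # shape (T,)
    return float(np.mean((runs - reference) ** 2))
```

For binary bitstreams, this quantity is simply the fraction of flipped bits across all runs.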
In fact, it is not possible to simulate the whole network at the circuit level with massive input patterns. However, as mentioned earlier, the statistical error models of $\Delta V_1^t$ and $\Delta V_2^t$ in (20) and (21) can be injected into the system model to realistically estimate the system accuracy.
The details of the model and evaluation process are discussed in the next section.

A. BSNN TRAINING MODEL
To evaluate the performance of the BSNN at the system level, we use the training method proposed in Section II and Appendix A. The network structures and major parameters are shown in Table II. The training and inference models are written in Python using the PyTorch library. We use the MNIST and CIFAR-10 [35] datasets for training and evaluation. The networks are trained for 300 (50) epochs with a batch size of 128 (100) for the CIFAR-10 (MNIST) dataset. The base learning rate is set to 0.3, and the stochastic gradient descent (SGD) optimizer uses a momentum of 0.9. The learning rate is decayed by a factor of 10 at 50%, 70%, and 90% of the total epochs. For the CIFAR-10 dataset, we adopt BSNNs with 7 Conv layers for both networks (with and without residual connections). Since the Conv layers occupy most of the computational workload and latency (more than 90%) [36], [37] in a deep neural network, only the hidden Conv layers are binarized in the BSNN to keep the accuracy comparable to that of the conventional SNN [16]. Therefore, in our work, the IMC XNOR-based STT subarrays substitute for all hidden Conv layers in the BSNN evaluation. Fig. 9 shows the mapping of a BSNN Conv layer to the STT subarray. The main computational workload of transforming presynaptic spikes into postsynaptic spikes is carried out by the IMC and IF circuits described in Section IV. As seen in the figure, we unroll each sliding-window calculation for the given input into a vector of presynaptic spikes. The unrolled vector is then passed to an $m \times n$ subarray through the BL decoders. The kernels are mapped and stored in the bitcells to perform the convolution between the unrolled spikes and the kernel weights. If $m$ output channels are generated simultaneously, the $m \times n$ subarray performs one sliding window per access, as illustrated in Fig. 9. Furthermore, intralayer parallelism can be realized by utilizing $p$ subarrays that calculate $p$ sliding windows in parallel [38]; $p$ hence indicates the level of parallelism, reflecting the tradeoff between hardware cost and computation speed. In the practical implementation in this work, each Conv layer has 32 input and output channels, and a kernel size of 3×3 corresponds to a subarray size of 32×288 ($m$ = 32, $n$ = 288).
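The sliding-window unrolling in Fig. 9 is the standard im2col transformation; a compact sketch for a (C, H, W) binary spike map (function name ours), where each unrolled row is the presynaptic vector driven onto the 288 bit lines of one subarray row:

```python
import numpy as np

def unroll_windows(spikes, k=3):
    """Unroll every k x k sliding window of a (C, H, W) spike map into a
    row vector of length C*k*k. With C = 32 and k = 3 this yields the
    288-bit presynaptic vector fed to one IMC row (stride 1, no padding)."""
    C, H, W = spikes.shape
    rows = []
    for y in range(H - k + 1):
        for x in range(W - k + 1):
            rows.append(spikes[:, y:y + k, x:x + k].reshape(-1))
    return np.stack(rows)   # shape: (num_windows, C*k*k)
```

Each of the m rows of the subarray then holds one unrolled 288-bit kernel, so a single array access computes all m output channels for one window.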

C. THE IMPACTS OF THE NUMBER OF TIME STEPS AND PROCESS VARIATION ON CLASSIFICATION ACCURACY
We investigate the impact of the number of time steps on the classification accuracy for both the MNIST and CIFAR-10 datasets. The classification accuracies of the BSNN inferences with six network configurations are shown in Table III. From the table, using more time steps essentially improves the accuracy. Specifically, for the networks without SEW, increasing the number of time steps from 4 to 8 improves the accuracy by 0.64% (3.24%) for MNIST (CIFAR-10). Additionally, the networks with SEW achieve 2.52% (1.82%) better accuracy with 4 (8) time steps on CIFAR-10 than the conventional networks [27]. These results confirm that the training method using a surrogate gradient with SEW significantly reduces the required number of time steps compared with that of the ANN-SNN conversion method [28].
Furthermore, the nonideal charge increments in (20) and (21) are injected into the Python BSNN inference model to evaluate the effect of process variation on the classification accuracy. This is done by replacing the linear models of the membrane potential and threshold with the actual models incorporating the nonlinearity bias and a Gaussian random quantity², as characterized by the Monte Carlo simulations in Section IV.B.3. The models are evaluated on the test set multiple times (a 100-variation netlist) with different variation seeds. The means ($\mu$) and standard deviations ($\sigma$) are reported in Table IV. From the table, the mean classification accuracy is reduced only slightly, by 0.22% (0.91%) for MNIST (CIFAR-10) with 8 time steps, compared to the reported accuracies of the models without variation. Additionally, the accuracies evaluated with different configurations differ little, with standard deviations of 0.23% (MNIST) and 0.03% (CIFAR-10). In the extreme case at the 6$\sigma$ point, the classification accuracy is degraded by 1.38% (MNIST) and 0.18% (CIFAR-10). Overall, these results permit us to conclude that process variation has a minor impact on the classification accuracy and that the proposed BSNN exhibits a very good level of fault tolerance.

² To ensure a highly accurate model, we use look-up tables storing the nonlinear bias and standard deviation for the 289 points in Fig. 6 to implement the corresponding Python models.
³ The area is estimated only roughly in this work because such a design with this emerging technology is not fully available for utilization in its current state.

D. EVALUATION OF THE ENERGY, THROUGHPUT, AND AREA OF THE SUBARRAY
In the proposed STT-BSNN architecture, the energy per spiking operation is the sum of the energy for the synapses and that for the IF neuron circuit, E_spike = E_syn + E_neu, where E_syn is the energy consumed by the synapse operations of the IMC circuit (i.e., the MAC operations). This includes the precharge energy E_WL required for the high-load word lines and the energy E_cell consumed by the bitcells. The latter accounts for the main portion of the total energy and is proportional to the sampling time and the accumulated current drawn from the BL source voltage. E_neu is the energy spent on the IF neuron circuit, which includes the energy needed for the CB, ACC1, ACC2, and FS subcircuits in Fig. 4. Since these circuits are all charge-based, they consume energy only during switching, i.e., without any direct currents. Therefore, their energy contributions are small compared to E_syn, which normally accounts for the main share of E_spike. Specifically, for an IMC array row of size 288, E_syn for the 288 synaptic elements is found to be 1.58 pJ (where E_WL = 0.064 pJ and E_cell = 1.52 pJ), with the optimal sampling time set to 1 ns when using the precharge technique for the booster [31].
E_neu accounts for only 0.052 pJ, which results in a total spiking operation energy of E_spike = 1.63 pJ. In this work, we define the number of operations to be equal to the size of the MAC function; i.e., 288 operations are executed within one time step. Therefore, the energy efficiency for an IMC array row of size 288 is estimated to be 288 ops/1.63 pJ = 176.6 TOPS/W.
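The energy-efficiency figure follows from the component energies above and the identity that 1 TOPS/W equals one operation per picojoule; the short calculation below reproduces this arithmetic (the picojoule units are as stated in the text).

```python
# Energy-efficiency arithmetic for one IMC row of 288 synapses.
# 1 TOPS/W == 1e12 ops / 1 J == 1 operation per picojoule.
E_wl = 0.064e-12       # word-line precharge energy (J)
E_cell = 1.52e-12      # bitcell energy over the sampling time (J)
E_syn = E_wl + E_cell  # ~1.58 pJ for 288 synaptic elements
E_neu = 0.052e-12      # IF neuron circuit energy (J)
E_spike = E_syn + E_neu

ops = 288                                   # one 288-wide MAC
tops_per_watt = ops / (E_spike * 1e12)      # ops per pJ == TOPS/W
print(round(tops_per_watt, 1))
```

Because E_cell dominates E_spike, the efficiency is governed almost entirely by the bitcell sensing energy, which is why shortening the sampling time is the main optimization lever.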
The subarray area is roughly estimated 3 to be 608 μm² for a single-row implementation with 288 bitcells, and the estimated area of a neuron is 115 μm². For a subarray of size 32×288, the number of operations equals 32×288 = 9216 operations over 8 time steps with a period of 6 ns. The throughput is therefore 9216/(8 × 6 ns) = 192 GOPS.
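The throughput figure is likewise a direct calculation from the subarray geometry and the time-step period; the sketch below reproduces it (the 6 ns period is the value stated in the text).

```python
# Throughput of a 32x288 subarray: 9216 operations are completed
# over 8 time steps, each with a 6 ns period.
rows, cols = 32, 288
time_steps = 8
period_s = 6e-9                 # time-step period in seconds

ops = rows * cols               # 9216 operations
gops = ops / (time_steps * period_s) / 1e9
print(gops)
```

Processing several sliding windows in parallel multiplies this figure by the number of concurrently mapped windows, since each window reuses the same single-phase MAC sensing.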
If multiple sliding windows are processed simultaneously, the throughput increases proportionally with the number of windows [39], [40].
(Table note c: the full-precision weights are converted into stochastic bits (+1/0) in each time step.)

E. COMPARISON WITH RELATED WORKS
Regarding accuracy, by using a deeper network and a direct training method, the BSNN in this work achieves an accuracy ~13.8% higher than that in [16] on the CIFAR-10 dataset. The implementation in [17], with a spintronic synapse and a digital neuron, also achieves lower accuracy (by 6.3% on the MNIST dataset) than our work, partly because their results are reported for a fully connected network. The all-spin SNN in [19] exhibits similar accuracy on the MNIST dataset, although its spiking rate is much lower than ours; the reason is that their method requires a large sampling time to convert the synaptic current into a spike duty cycle in each time step.
Regarding energy, the energy consumption per synapse calculated for our BSNN is ~6.5 times lower than that of the MTJ synapse reported in [16], even though their synapse is implemented in a smaller technology node (45 nm). Similarly, our work is still 1.6× more energy-efficient than the synapse reported in [17], even though their design is implemented in a 28-nm CMOS process, which is theoretically ~2 times more energy-efficient than the same implementation in 65 nm.
Finally, for the area comparison reported in F² (where F is the technology feature size), the area per neuron of our design is much smaller than that of the digital implementation in [17]. Still, it is not as small as that of the all-spin design in [19]; that advantage, however, comes with clear trade-offs in energy and latency, as mentioned earlier.

VI. CONCLUSION
This paper presents an in-memory BSNN based on STT-MRAM for low-power, low-latency on-edge AI applications. We propose a direct BSNN training approach using a surrogate gradient with residual connections that achieves high classification accuracy with far fewer time steps than required in prior works. Furthermore, we propose a full-circuit solution for IMC MAC operations based on an STT-MRAM array, which allows ultrafast vector multiplication to be performed within one memory access phase. We also introduce a dynamic threshold approach for the IF circuit that mimics neuron spiking behavior while consuming very low power. The BSNN system model is then reevaluated using realistic circuit parameters and exact circuit simulations. The results indicate that device mismatches and nonlinearity do affect the classification accuracy of the model; nonetheless, the accuracy degradation is insignificant, and the proposed BSNN still offers decent performance in comparison with other methods. The proposed design approach with a practical circuit solution could pave the way for ultralow-power DNNs in on-edge AI applications. Future work will improve the architecture and design to fit deeper and larger networks.