The viability of analog-based accelerators for neuromorphic computing: a survey

Deep neural network hardware research aimed at reducing memory-fetch latencies has steered in the direction of analog-based artificial neural networks (ANN). The promise of decreased latencies, increased computational parallelism, and higher storage densities with crossbar non-volatile memory (NVM) based in-memory-computing/processing-in-memory techniques is not without its caveats. This paper surveys this rich landscape and highlights the advantages and challenges of emerging NVMs as multi-level synaptic emulators in various neural network types and applications. Current and potential methods for reliably programming these devices in a crossbar matrix are discussed, as well as techniques for reliably integrating and propagating matrix products to emulate the well-known MAC-like operations throughout the neural network. This paper complements previous surveys, but most importantly uncovers further areas of ongoing research relating to the viability


Introduction
Artificial intelligence is everywhere, using variants of deep neural network (DNN) architectures for text prediction, object detection, and speech and image recognition, to name a few. The computational tasks involved in conventional implementations of these neural networks require large data movements between memory and processing units. While there is continued development of dedicated hardware for these types of workloads, the latency and power demands of this data traffic present a well-known bottleneck and a significant disadvantage, especially for edge applications. Alternative architectures that perform matrix-vector-multiplication (MVM) in-memory using existing non-volatile memory (NVM) technologies may provide a solution to this bottleneck. The advantages and challenges of these analog NVM-based architectures are the main topic of this review paper.
As a complement to other surveys [1][2][3][4][5][6], the aim of this paper is to give a conceptual view of an analog-based artificial neural network (ANN) [1,3,6] from the perspective of the crossbar architecture and the individual NVM candidates for realizing the synaptic weights, as well as the means to propagate these weight products throughout the array in both directions. Also explored are the methods for en masse synaptic weight updates. The paper includes an overview of ANNs and DNNs for machine learning (ML) workloads, and discusses ongoing research into analog-based ML hardware using existing NVM technologies and crossbar architectures, along with their limitations. A detailed review of the MVM macro explores candidate choices for synaptic weight storage to meet the requirements of different ANN applications. A cast of supporting peripheral circuitry follows: driving, sensing and data conversion architectures (ADCs, DACs) with their architectural requirements and limitations. Concept analog-based accelerator architectures with their unique challenges are also evaluated and discussed.
The paper is organized as follows:
• Section 2 gives an overview of ANNs and current hardware challenges, and presents arguments for why explorations in analog-based accelerators for ML are gaining traction.
• Section 3 describes current explorations in ML workload acceleration using analog hardware and the means by which the features introduced in section 2 are realized. It gives a snapshot of conventional memory storage and how the structure can be mapped into an ANN using analog devices in a conventional crossbar memory array.
• Section 4 describes the backbone of the analog hardware accelerator framework, the conventional memory crossbar. This section describes why the crossbar is attractive for matrix vector multiplication (MVM) and relates the size limitations of the crossbar array to driver, interconnect wire and source (synapse) resistances.
• Section 5 describes the requirements for ANN synaptic (weight storage) devices and presents candidates and qualifications.
• Section 6 presents the support circuitry required to drive, sense and modulate the synaptic devices and to perform computations based on the NVM synaptic weights. Circuits include data converters, drivers, sensors and, where needed, simple approximations of these circuits.
• Section 7 contains architectural and system considerations that address the unique analog-based ANN challenges: device variation and unresponsiveness, circuit non-idealities, precision control, effective multi-level signaling, signal regeneration and buffering, throughput, energy, area savings and latency. It also addresses the challenges faced by current accelerators, whether (and how) they affect an analog-based NVM approach, and what residual or new challenges remain for further research.

Background on deep neural networks
In simple terms, an ANN is a computing system formed by a collection of artificial neurons arranged in layers. Within each layer, each neuron takes inputs from all neurons in the previous layer, weighted by a scaling factor called the synaptic weight, constructing a weighted sum which for classification tasks is passed through a nonlinear activation function as shown in figure 1. This nonlinear (squashing) function, typically a sigmoid or a similar smooth function, can also be implemented as a hyperbolic tangent or a ReLU (or variations thereof) for improved classification accuracy or, as discussed in the next section, as a simpler approximation [5].
The first and last neuron layers are called the input and output layers, and all intermediate layers are called hidden layers. In a single layer network the output neurons are simply a function of the weighted sum of the inputs. An ANN with multiple hidden layers is a deeper network, hence the term 'deep neural network' or DNN. The number of elements in a layer, especially the input layer, is defined by the number of inputs or features, and can be reduced to remove redundancy through various algorithms [7,8] that break the inputs down into just the principal components that affect the intended output. The propagation of information from the input stimulus through the synapses from one layer to the next is referred to as forward propagation. Figure 1 illustrates forward propagation, where the matrix of weights θ that maps from one layer to the next is multiplied by the inputs (from the previous layer), summed, and passed through the activation function to form an output matrix of activations $a_i^{(j)}$. The weights, or synaptic values, represent the strength of the connection from one neuron to the next. The equations in figure 1 illustrate this mapping between layers; for simplicity only the mapping from the input to the first hidden layer and from the last hidden layer to the output layer are shown. In a multiclass classification task there would be several outputs, represented by the hypothesis function $h_\theta$ in figure 1. The hypothesis function is a predictor that approximately maps the inputs to the outputs and is modeled or 'learned' from the training data provided in supervised learning.
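The forward propagation described above can be sketched in a few lines. This is a minimal illustration rather than the formulation of figure 1: the bias term in column 0 of each weight matrix, the sigmoid activation, and the example weights `THETA1` and `THETA2` are all assumptions chosen for readability.

```python
import math


def sigmoid(z):
    """Logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward_layer(theta, a_prev, activation=sigmoid):
    """Compute one layer of activations.

    theta[i][k] is the weight from input k to neuron i; column 0 is an
    assumed bias weight that multiplies a constant input of 1.0.
    """
    inputs = [1.0] + list(a_prev)
    return [activation(sum(w * x for w, x in zip(row, inputs)))
            for row in theta]


def forward(thetas, x):
    """Propagate the input x through every layer in turn."""
    a = list(x)
    for theta in thetas:
        a = forward_layer(theta, a)
    return a


# Hypothetical 2-input, 2-hidden-neuron, 1-output network.
THETA1 = [[0.0, 1.0, -1.0], [0.5, -0.5, 0.5]]
THETA2 = [[0.0, 1.0, 1.0]]
h = forward([THETA1, THETA2], [1.0, 1.0])
print(h)
```

Each call to `forward_layer` is one weighted sum followed by the activation function, which is the operation that the analog crossbar discussed later performs in place.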
The neural network cost function is $J(\theta) = \mathrm{cost}(h_\theta(x), y)$, and training must find the weights θ that minimize it. Gradient descent is the general method used to minimize the cost function and obtain the optimal synaptic weight matrix θ for each layer. In the case of supervised learning, the delta between the known output (the ground truth or 'labels' $y_k$) and the calculated output $h_\theta$ is propagated backwards through the network. Back-propagation actually performs two sweeps:
• The first sweep calculates and accumulates the error deltas for each layer, multiplied by the partial derivative of the activation function.
• The second sweep is the synaptic weight update process.
This process is illustrated in figure 2.
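A minimal sketch of the two sweeps for a network with one hidden layer follows. The squared-error cost, the sigmoid activation, the absence of bias terms, and the learning rate `lr` are assumptions made for illustration; they are not taken from figure 2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    """Return hidden and output activations (no bias terms, for brevity)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    out = [sigmoid(sum(w * hj for w, hj in zip(row, hidden))) for row in w2]
    return hidden, out


def train_step(w1, w2, x, y, lr=0.5):
    """One gradient-descent step; returns the cost before the update."""
    hidden, out = forward(w1, w2, x)

    # Sweep 1: error deltas per layer, each multiplied by the derivative
    # of the sigmoid, a * (1 - a).
    d_out = [(o - t) * o * (1.0 - o) for o, t in zip(out, y)]
    d_hid = [hj * (1.0 - hj) * sum(w2[k][j] * d_out[k]
                                   for k in range(len(w2)))
             for j, hj in enumerate(hidden)]

    # Sweep 2: weight update, proportional to upstream activation
    # times downstream delta.
    for k, row in enumerate(w2):
        for j in range(len(row)):
            row[j] -= lr * d_out[k] * hidden[j]
    for j, row in enumerate(w1):
        for i in range(len(row)):
            row[i] -= lr * d_hid[j] * x[i]

    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))
```

Note that the hidden-layer deltas are computed before any weight is changed, so both sweeps see the same set of weights.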

Types of networks
Figure 1 illustrates a fully connected (FC) DNN or multilayer perceptron (MLP), where each output activation corresponds to a weighted sum of all the inputs from its previous layer. The impractically large storage and computation requirements of these FC networks have prompted the exploration of sparsely connected network architectures. Convolutional neural networks (CNN) are an example of such sparsely connected architectures, where weight sharing is used across an input feature map. The 'sharing' occurs because a single filter of weights is convolved over a large input data matrix (see figure 3). The filters in figure 3 are simply pattern detectors; in DNN terminology, the filters correspond to the synaptic weights, while input and output feature maps correspond to input and output neurons. The feature maps, which represent abstraction levels of the input data, are convolved with several filters in each layer, hence the multiple channels found in various implementations [4]. As one gets deeper into the network, these feature maps generate a higher level of abstraction of the input data. For example, the low-level filters in the initial layers of CNNs used for image recognition may correspond to image edges (e.g. horizontal and vertical); with deeper progression into the network, the filters correspond to more sophisticated shapes, and in the latter part to full objects. FC layers are found at the end of the convolutional network, usually in the last one to three layers; because they are used for classification, they are also called 'classifier layers', as illustrated in figure 3.
An additional means of saving storage memory in CNNs is subsampling or 'pooling' to reduce feature map dimensions, as illustrated in figure 3. Solutions use an average or max operation over the stride [4] in order to further reduce the input matrix to the next layer. In summary, there are three main layer types in a CNN: convolutional layers, pooling (or sub-sampling) layers, and a few fully connected classifier layers at the end of the network.
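The weight sharing and pooling steps can be illustrated with a small sketch. The 2 × 2 vertical-edge filter, the 5 × 5 input and the non-overlapping pooling window are illustrative assumptions; as is common in DNN practice, the 'convolution' below does not flip the filter.

```python
def conv2d_valid(image, kernel):
    """Slide one shared filter over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def pool2d(fmap, size=2, mode="max"):
    """Reduce a feature map with non-overlapping size x size windows."""
    if mode == "max":
        reduce_window = max
    else:
        def reduce_window(values):
            return sum(values) / len(values)
    return [[reduce_window([fmap[r + a][c + b]
                            for a in range(size) for b in range(size)])
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


# Input with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, vertical_edge)   # 4 x 4 output
pooled_max = pool2d(feature_map, mode="max")       # 2 x 2 output
pooled_avg = pool2d(feature_map, mode="average")   # 2 x 2 output
```

The same four filter weights are reused at every position of the input, which is where the storage saving relative to an FC layer comes from.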
Recurrent neural networks (RNN) are FC networks with large internal memory requirements to capture long term effects thus creating a computational bottleneck for today's hardware accelerators.They require storing outputs from intermediate operations within the network to be used in processing of subsequent inputs, for example in natural language processing (NLP) algorithms.A feedback from the output to the input of the network allows for inhibiting or promoting parts of the input data based on history.
For event-driven processing where information is spatio-temporal, e.g. in event-based vision sensors, spiking neural networks (SNN) are more favorable because active power becomes directly proportional to spiking activity. In an SNN, an output neuron fires when the sum of its connections overcomes a threshold. An output potential travels along the connected synapses, whose strengths can be inhibitory or excitatory, after which the firing neuron's potential is reset. For forward inference systems one needs to factor in the firing frequencies and the timing between the pre- and postsynaptic spikes. For training, to avoid the complexities of applying gradient descent directly to spiking neurons, conventional DNNs are typically trained using back-propagation and the neurons are subsequently converted to spiking ones [9]. Local learning rules such as spike timing dependent plasticity (STDP) [10] are also used. SNNs can also be convolutional (SCNN), in which case pooling has to be restricted to average pooling rather than, e.g., max pooling due to the spiking nature of the stimuli.
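The threshold-and-reset behavior of a spiking neuron can be sketched as follows. This is a deliberately simplified, non-leaky integrate-and-fire model in discrete time steps; the function name, the threshold value and the example weights are assumptions for illustration only.

```python
def integrate_and_fire(spike_trains, weights, threshold=1.0, v_reset=0.0):
    """Simulate one non-leaky integrate-and-fire neuron.

    spike_trains[t][i] is 1 if input i spikes at time step t, else 0.
    Positive weights are excitatory and negative weights inhibitory.
    Returns the output spike train (one 0/1 value per time step).
    """
    v = v_reset
    output = []
    for spikes in spike_trains:
        # Accumulate the weighted input spikes on the membrane potential.
        v += sum(w * s for w, s in zip(weights, spikes))
        if v >= threshold:
            output.append(1)   # fire ...
            v = v_reset        # ... and reset the potential
        else:
            output.append(0)
    return output


# One excitatory and one inhibitory synapse.
trains = [[1, 0], [1, 1], [1, 0], [0, 0], [1, 0]]
out = integrate_and_fire(trains, weights=[0.5, -0.25])
```

Because work is only done when an input spike arrives, activity (and hence power) scales with the spike rate, which is the property the text highlights.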

Popular models and data sets
Different architectures for ANN models have been studied, and many are now featured as reference models for benchmarking inference and training AI hardware implementations [11][12][13]. A network architecture model is defined in terms of the number of layers, depth, layer shape (filter number and size, number of channels) and layer connectivity (e.g. FCNs vs RNNs vs CNNs), and thus has different memory capacity and configuration requirements, as seen in the submitter benchmarking data in [11]. Various popular models exist (e.g. LeNet5 [14], AlexNet [15], VGG [16], GoogleNet [17], ResNet50 [18], Bert-99 [19]) and are compared and tabulated in [4].
Table 1 lists some of the datasets used in the works discussed in this paper. These are the datasets specifically used for analyzing novel analog-based ANN implementations; because they serve as proof of concept, they tend to be smaller and more rudimentary than the ones used for conventional/commercial purposes [20]. They include versions of the MNIST, ImageNet, and CIFAR-10/100 datasets.

Hardware
A brief historical timeline of neural networks is provided in table 2 to give context to this paper. As indicated in the table, current trends are toward custom ASIC implementations to improve the computational and power efficiencies and convergence rates of modern hardware accelerators. Accelerators are used for two main applications: forward inference and training. For each case, the hardware requirements are different and attract different applications [3]. Forward inference tends to be in a more power-constrained envelope for use in edge, internet-of-things, and autonomous vehicle applications, as well as in the server room. These forward inference applications favor hardware architectures with reduced latency over increased throughput (especially in edge computing). Training, which typically happens in the cloud [33,34], relies on hardware designed for throughput (ops/sec) over latency, using multiple distributed compute nodes and optimizing the intercommunication between them. Edge in-the-field training is gaining more traction [35][36][37], not only due to the latency of training in the cloud, but also due to privacy/security risk concerns, and to reduce reliance on connectivity.
Table 1. Datasets discussed in this paper.
• MNIST [23]. Application: handwriting and character recognition, classification. Description: database of handwritten digits. Size: 60 000.
• CIFAR-10/100 [24]. Application: object detection and recognition, image classification. Description: low resolution images of 10 (CIFAR-10) or 100 (CIFAR-100) object classes. Size: 60 000.
• KITTI [25]. Application: object detection and recognition.
Existing hardware used for the implementation of neural networks includes CPUs, graphics processing units (GPU), and tensor calculation specific ASICs [30,33,36]. These are generally enhanced using special software drivers and stacks provided through various libraries [4,38,39]. GPUs accelerate ANN implementations using massive parallelism of processing cores optimized for computing applications. This is different from traditional CPU multi-core processors, which are more generic. The handling of floating point operations in GPUs is also attractive for the implementation of neural networks, as it enables larger and deeper networks with many neuron computations performed in parallel. While GPUs were created to accelerate graphic rendering, TPUs are AI accelerator ASICs specifically designed for tensor calculations and developed to accelerate deep learning workloads.
To provide a means of benchmarking performance for ML workloads, a consortium called MLPerf [12,13] specifies reference model architectures and data-sets to provide industry standards for measuring and comparing ML performance.

Current challenges
Key metrics and challenges for today's ML accelerators are latency, energy consumption, and throughput. Within inference applications, where latency is crucial (especially for online applications), there are allowances for reduced precision in matrix calculations while still maintaining classification (prediction) accuracies. To address these challenges, data is encoded using smaller bit-widths, with fixed-point rather than floating point representations [3,40] for synaptic and activation function precision. Pruning the network removes neurons that are not important, either by using sparse matrix methods or, as described in [4], by studying weight saliency and setting the less significant weights to zero or skipping over these weights entirely during computations. The usual trend to gain sparsity is to increase the number of convolutional layers and decrease the number of fully connected (FC) layers, which additionally decreases memory fetches and memory bandwidth. This is not a viable option for applications that require FC networks (e.g. RNNs). So, with the increased latency of memory fetches (with growing depth in neural networks), other means of increasing memory bandwidth need to be investigated.
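Two of the memory-light techniques mentioned above, fixed-point encoding and pruning, can be sketched as follows. The 8-bit word with 4 fractional bits and the use of weight magnitude as a stand-in for saliency are assumptions for illustration; [4] describes more refined saliency criteria.

```python
def quantize_fixed_point(w, frac_bits=4, total_bits=8):
    """Round w to a signed fixed-point value and return it as a float.

    The value is stored as an integer count of 2**-frac_bits steps and
    saturates at the limits of a signed total_bits-wide word.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = max(q_min, min(q_max, round(w * scale)))
    return q / scale


def prune_by_magnitude(weights, threshold):
    """Set weights whose magnitude is below the threshold to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


quantized = [quantize_fixed_point(w) for w in (0.30, -0.26, 100.0)]
pruned = prune_by_magnitude([0.8, -0.05, 0.02, -0.6], threshold=0.1)
```

With these settings the step size is 1/16, so 0.30 is stored as 0.3125, and out-of-range values saturate at 127/16. Zeroed weights can then be skipped during computation.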
In training applications the aforementioned techniques must be applied with care, as higher precision is needed for gradient descent and other optimization approaches. One approach is to replace stochastic gradient descent (SGD) with batch or mini-batch gradient descent, where the loss is calculated from multiple sets of data before doing a weight update, to stabilize and speed up the process [4]. Sparsity can also be gained by reducing the complexity of the sigmoid function to a ReLU function, which sets negative values to zero. Another method is feature extraction down to principal feature components, or other means of compression [40]. These challenges ultimately mean that changes need to be made to the hardware architecture to ensure that advances in throughput, latency and energy consumption stay in lockstep with the inevitable complexities of growing datasets [41].
Table 2. Historical timeline of neural networks.
• … [17]. NVM based accelerator exploration growth [28,29].
• 2015: Processor limitations cause a growth in DNN accelerator research optimized for neural network applications: specific ASICs, tensor processing unit [30] (Google), (Neuflow [31], DianNao [32]). Continued exploration in non-conventional approaches.
• 2016+: Edge specific AI/computing, internet of conscious things.

Near data processing and the promise of in memory processing
As suggested in the previous section, existing solutions favor memory-light approaches with reduced precision (where possible), and rely on pruning, data compression, and structured sparsity techniques. Current accelerators [4] integrate different levels of local memory along processing element (PE) routes, as shown in figure 4. In these 'near memory' implementations, data can be routed between ALU, register file, and PEs for cheaper memory accesses. In examples like [42], where local memories are interspersed through the tensor processing cores and larger high-bandwidth memories sit around the periphery, there is limited capacity for these low cost memories, so the trend is to exploit data reuse to reduce memory fetches by using convolutional architectures where relevant. However, with growing demand for higher throughput, larger data sets, and reduced latencies, these types of implementations will no longer be enough and will face a familiar memory bottleneck. If computation can be done within the storage unit, significant improvements can be achieved in latency, throughput, and energy consumption (i.e., the three main challenges for today's accelerators). This approach, known as in-memory computing, and its own novel challenges (such as data regeneration, data conversion, device and circuit variability, etc.) are discussed in the subsequent sections.

Analog hardware for in-memory processing of ML workloads
Constant fetches to memory to access weights and partial sums when performing matrix calculations introduce latencies due to the high data movement. While attempts have been made to mitigate this bottleneck with specialized AI accelerators [30,42] based on near-memory computing, growing data-sets and computational requirements have driven the development of in-memory computing systems [43]. As discussed in section 2, even with memory-light NN solutions such as convolutional networks (CNN), some applications, such as RNNs (LSTM, GRU), still need FC layers.
Figure 5 shows a concept diagram of an analog-based DNN with resistor processing elements (RPE) driven and sensed by peripheral circuitry in both directions.
In essence, the 2D matrix calculation from the equation in figure 1 is mapped onto a physical RPE array, where the conductance element (RPE) at each crosspoint represents the synaptic weight between the row and column. This configuration is typically called a crossbar array or simply a crossbar. The weight is encoded into the device conductance, and in many cases a multi-bit value is required for higher accuracy and resolution. The two floating point operations (multiply and accumulate) can be condensed into one parallel operation, as shown in figure 6. Moreover, these operations can be done in parallel for all columns in the crossbar array, so that the multiplication of an input vector with the weight matrix (vector-matrix-multiplication, or VMM) is implemented in one step. Thus, this in-memory analog implementation of VMM avoids moving weights from memory to separate processing units and enables large parallelism in the computations.
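The crossbar VMM can be sketched numerically under ideal assumptions. Each device contributes a current equal to its row voltage times its conductance (Ohm's law), and the currents on a shared column add (Kirchhoff's current law). The function name and the example values are illustrative, and line resistance, sneak paths and device non-idealities are deliberately ignored.

```python
def crossbar_vmm(voltages, conductances):
    """Ideal crossbar read.

    voltages[i] is the voltage applied to row i (volts).
    conductances[i][j] is the device at row i, column j (siemens).
    Returns the current collected on each column (amperes):
        I_j = sum_i V_i * G_ij
    """
    n_cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(n_cols)]


# 2 x 2 example: two row voltages and four programmed conductances.
V = [0.2, 0.1]
G = [[1e-6, 2e-6],
     [3e-6, 4e-6]]
I = crossbar_vmm(V, G)   # column currents, roughly 0.5 uA and 0.8 uA
```

In hardware every column sum forms at the same time, so the loop over `j` above corresponds to a single analog step.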
Several existing storage memory crossbar hardware implementations have already been shown to model the above matrix operations and can be used to do matrix vector multiplications in situ [44][45][46][47][48]. These use various storage devices to implement the weights. In these crosspoint technologies, each memory cell at a row-column intersection holds the weight of a synapse and can be manipulated, based on the device characteristic, to provide multiple states. These states typically correspond to device conductance states (e.g., in filamentary or charge-based resistive switching memory). Figure 7 from [5] further illustrates a generic architecture for DNN training using NVM-based arrays, where the architecture is split into array blocks (large NVM arrays) that are interconnected by a flexible routing network. The routing fabric transfers input data and weight updates from the chip inputs into the device array, and carries updated chip information and inference classifications out. Its flexibility allows reconfiguration into multiple layers to control the depth of the neural network. The design grid connects input neurons on the west side of the array block to output neurons on the south side, each fed by peripheral circuitry to drive and sense. Local storage is required for the activation excitation and error value during an in situ training application so that they can be used and compared later for weight updates. This is discussed further in the architecture section 7.

Forward inference networks
Inference solutions begin with physical synaptic elements/devices being programmed with weights obtained from an ex situ training solution (typically done in software). The details and methods of mapping are briefly discussed in section 7. One of the earliest inference accelerators was IBM's TrueNorth [28], where a large SNN was implemented using an SRAM crossbar array to perform forward propagation. The weights were trained offline and transferred onto the SRAM array, which corresponded to 256 million synapses and 1 million neurons. Such attempts at CMOS-based synapses and neurons in neuromorphic systems [28,49,50] are not area efficient due to the large number of transistors needed for their implementation.
In analog-based implementations, the focus of this paper, a more area-efficient solution is explored. There, SRAM cells are replaced with analog memory, not only to save area but also to extend beyond binary weights (i.e., dual-state representation of weights) [5,45,51] to allow more granularity and precision, as well as to enable in-memory neuromorphic computing architectures using crossbar configurations. The architecture in [45] takes this further to provide an on-demand solution that can be dynamically reconfigured between accelerator and memory, supporting MLPs (FC NN) and CNNs using resistive RAM (ReRAM or RRAM). Once synaptic weights are written and verified to be mapped correctly, the inference phase drives read signals from a DAC (non-disturbing signals) in order to read the current 'setting' of the synaptic weight element. For example, in a ReRAM crossbar array, a driving voltage signal is applied to the rows of the crossbar, activating current flow through each resistive element. The sum of these currents is collected at the end of the crossbar column and integrated on a capacitor; the result can then be passed directly to an analog approximation of the activation function [51] (or converted into a digital signal for a more logic-based approach [45]) prior to driving the next hidden layer. The synaptic elements can be stimulated in different ways for a read operation depending on the type of element. Encoding from the DAC can be amplitude-modulated or time-modulated depending on the type of device, for example ReRAM [45,52] or phase change memory (PCM) [51], respectively. Note that each column is driven by a combination of the various elements and, in turn, the drivers feeding these elements. Each column therefore has its own calculation, and the same goes for rows in the reverse direction for backpropagation. In propagating to the next hidden layer, architectures can save energy through circuit sharing by time multiplexing the ADC and/or activation implementations.
To realize positive and negative weights, device pairing can be used [53], since the physical storage mechanism typically corresponds to a positive value. For example, in ReRAM implementations [45] two crossbar arrays are used to store positive and negative weights respectively, and their difference is obtained using a subtraction unit prior to passing over to the activation unit. Similarly, in [51] two PCM devices are used, one as positive (long-term potentiation, LTP) and one as negative (long-term depression, LTD), contributing opposite effects at the integrator during a read.
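The device-pairing idea can be sketched as follows. The linear mapping from a signed weight to a conductance pair, the function names, and the conductance range are assumptions for illustration; the cited works differ in how the pair is programmed and how the subtraction is implemented.

```python
def weight_to_pair(w, w_max, g_min, g_max):
    """Map a signed weight to a (G_plus, G_minus) conductance pair.

    The magnitude is mapped linearly onto [g_min, g_max] on one device
    while the other device is left at g_min, so that the signed weight
    is proportional to G_plus - G_minus.
    """
    g = g_min + (g_max - g_min) * min(abs(w), w_max) / w_max
    return (g, g_min) if w >= 0 else (g_min, g)


def read_column(voltages, pairs):
    """Read one column: subtract the negative-device current sum
    from the positive-device current sum."""
    i_pos = sum(v * g_plus for v, (g_plus, _) in zip(voltages, pairs))
    i_neg = sum(v * g_minus for v, (_, g_minus) in zip(voltages, pairs))
    return i_pos - i_neg


# Two synapses on one column with weights +0.5 and -1.0.
pairs = [weight_to_pair(w, w_max=1.0, g_min=1e-6, g_max=11e-6)
         for w in (0.5, -1.0)]
i_net = read_column([0.1, 0.1], pairs)   # roughly -0.5 uA
```

The common `g_min` offset appears in both sums and cancels in the subtraction, leaving a net current proportional to the signed weighted sum.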

Back propagation
As discussed in section 2, supervised learning driven by back-propagation of error terms is used to adjust the weights. In an analog-based training solution this learning happens in situ as the crossbar element states are adjusted, so hardware-friendly approaches are required to implement learning algorithms such as those based on gradient descent [54]. As described in section 2, back-propagation is triggered by a calculation of errors that are propagated through the network from one layer to another. The column drivers propagate the error values through the synaptic weights in order to do a 'forward propagation in the opposite direction', and in a resistive solution the current is accumulated on the row capacitor [51]. The value accumulated on this row capacitor represents the accumulated error for propagation to the next neuron. This value can be sent to an ADC and further processed digitally by combining it with the derivative of the activation function, or with a simple circuit approximation of it such as a step function [5,51], to connect to the preceding upstream layer and create the accumulated error value for that neuron. Classification accuracy can be improved by mitigating the vanishing gradient problem, creating a leaky derivative emulation by redefining the 'zero' level of this step function [51]. Figures 8 and 9 illustrate this process. Note that for back-propagation some sort of local storage is needed for the activation and calculated error for use in the weight update calculations.
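The backward pass through the same array can be sketched as a read in the transposed direction followed by a step-function derivative. The exact shape of the step function, the `leak` parameter used to model the redefined 'zero' level, and the use of signed effective weights instead of raw conductances are assumptions made for this illustration and may differ from the circuits in [5,51].

```python
def backpropagate_errors(weights, col_errors, row_pre_activations, leak=0.0):
    """Propagate errors from the columns back to the rows.

    weights[i][j] is the effective signed weight between upstream row i
    and downstream column j. col_errors[j] is the error driven onto
    column j. row_pre_activations[i] is the stored weighted sum of the
    upstream neuron i, used to pick the derivative value.
    """
    deltas = []
    for w_row, z in zip(weights, row_pre_activations):
        # Accumulation on the row (the row capacitor in a resistive array).
        accumulated = sum(w * e for w, e in zip(w_row, col_errors))
        # Step-function approximation of the activation derivative;
        # a non-zero leak keeps a small gradient where it would vanish.
        derivative = 1.0 if z > 0 else leak
        deltas.append(accumulated * derivative)
    return deltas


W = [[1.0, 2.0],
     [3.0, 4.0]]
hard = backpropagate_errors(W, [0.1, -0.2], [0.5, -0.5])
leaky = backpropagate_errors(W, [0.1, -0.2], [0.5, -0.5], leak=0.1)
```

With `leak=0.0` the second neuron passes no error upstream; with a small leak it passes a scaled-down error, which is the effect the leaky derivative is meant to achieve.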

Weight update
In the MLP DNNs of [5,51], the upstream neuron sends a signal based on its activation value and the downstream neuron sends a signal based on its back-propagated error value; the overlap of these signals is used to program the synapse. The relative temporal difference between the two determines the magnitude of the update and whether it is a potentiation or a depression of the synapse. A more detailed study in [55] pairs synaptic LTP with synaptic LTD using a two-PCM synapse (operated in the crystallization phase to allow gradual conductance changes and avoid the abruptness of LTD in the amorphous phase), essentially a PCM-LTP device in parallel with a PCM-LTD device (see figure 10). The method is referred to as a modified STDP update rule.
During the LTP time window, the interaction of a write pulse with the feedback pulse 'potentiates' (increases) the conductivity of the LTP device. During the LTD time window, the lone feedback pulse by itself only increases the conductivity of the LTD device, thus depressing the equivalent synapse. Accommodating the two phases means longer write times, but the split is required due to driver/sensor stability problems at the endpoints of a particular synapse, and this remains an open area for further research. One means of reducing this latency is to investigate devices that support shorter set-pulse times [52,56]. The effective change in conductance is studied in [57] by applying 1000 pulses to a phase change element; the study explores how the change depends on the initial conductance value and on the extent of causal and anti-causal firings, which mimic the relative time slots of the row and column drivers. One could also use the selector device turn-on [58] as an additional knob to control the amount of overlap. A similar means of parallel write/update is described in [59], where the encoding on either side of the ReRAM device is different: pulse amplitude modulation for the column driver and pulse width modulation for the row driver, which together effect the change in ReRAM conductance. The larger the amplitude and the longer the pulse duration, the larger the weight change. A spike-based read integrate-and-fire circuit is proposed in [60] to convert the input current into digital spikes; in write mode, this spike train overlaps with a duty-cycled feedback pulse to potentiate or depress the device (the polarity being controlled by the sign of the spike pulses). In [61], positive and negative weights are represented as a deviation from the 'zero weight state', described as the mid-point between the RON and ROFF states, thus avoiding (where possible) the need for device pairing as in the solutions mentioned earlier. A memristive FET crossbar structure is investigated in [62] using pre- and post-synaptic spikes on the drain and gate FET terminals. Modulating the FET threshold voltage by changing the gate-to-drain voltage creates the positive and negative STDP updates. The 'shape' of the spike can be used as an added hyperparameter knob to implement a faster or slower learning process as needed. The effectiveness of writes degrades with the number of pulses, in that the effective change in conductance decreases over successive pulses [63]; section 5 discusses how systems handle 'stuck-ats' and reset strategies for saturated paired conductances. While time consuming for online updates, offline training solutions (ex situ training) can write reliably by using a read-verify-write scheme to account for this prior to device mapping. A hybrid training approach is discussed in [64] for a memristor-CNN where only the final FC layers are trained in situ.
Synaptic weight update pulsing and decisions on how many pulses, amplitude and shape are dependent on memory device type and technology.The next section will explore use of NVM 2D crosspoint technology for ML workload acceleration.
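At an abstract level, the parallel updates surveyed above all apply a change at each crosspoint that depends on the signal from its row (upstream activation) and the signal from its column (downstream error). The sketch below models this as an idealized outer-product update with saturation; the linear dependence, the clipping limits `w_min` and `w_max`, and the learning rate `lr` are assumptions, and real devices show the nonlinear, pulse-dependent behavior described in the text.

```python
def update_crossbar(weights, activations, errors,
                    lr=0.1, w_min=-1.0, w_max=1.0):
    """Idealized in situ update of every crosspoint.

    weights[i][j] changes by -lr * activations[i] * errors[j], then is
    clipped to [w_min, w_max] to model a saturated device. The list is
    modified in place and also returned.
    """
    for i, x in enumerate(activations):
        for j, d in enumerate(errors):
            target = weights[i][j] - lr * x * d
            weights[i][j] = max(w_min, min(w_max, target))
    return weights


w = update_crossbar([[0.0, 0.95]], activations=[1.0], errors=[0.5, -1.0])
```

In this example the first weight is depressed slightly, while the second would exceed `w_max` and is held at the saturation limit, a simple stand-in for a device that no longer responds to further potentiation.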

2D crosspoint for ML acceleration
The support frame of analog-based accelerator architectures is the 2D crossbar array, the size of which is determined by its line resistance, synaptic resistance and driver resistance [65][66][67]. The arrays are driven on either end of the synapses by drivers and sensors, fed by DACs and ADCs respectively, to allow for bidirectional signaling. The crossbar size depends on the synaptic device's low resistance ($R_{LRS}$), its high-to-low resistance ratio ($R_{HRS}/R_{LRS}$), and the number of states that can be programmed and read reliably (which also affects latency and the required switching energy [63]).

Crossbar size limits
The ratio of the memristor resistance to the driver resistance also determines how large the crossbar can get, as shown in figure 11 from [65]. A relation that predicts the maximum crossbar size from the driver-transistor-to-memristor resistance ratio, the write voltage relative to the memristor threshold, and the number of devices to be written in parallel, W, is given in [65], based on a large data set of 2000 points. In figure 11, increasing the ratio of $R_{LRS}$ (the synaptic 'on' resistance) to driver resistance ($R_m/R_t$) allows for a larger crossbar given the reduced effective load resistance that results from a greater number of memristors (interconnect resistance was not accounted for in this analysis). However, an increased $R_{LRS}$ reduces the synaptic resistance window and hence the number of realizable states/levels, limiting multi-state capability and classification accuracy [67]. The effect of the number of devices written in parallel from one driver is also analyzed in the second figure, which illustrates that only a limited number of devices can be supported above the write threshold.

Reducing effective crossbar line resistance
The minimum voltage required for the worst-case memory cell (with its selector device, if used) to switch is discussed in [66] and used as the minimum threshold for the write voltage shown in figure 12. A figure of merit called the normalized write window is used to evaluate crossbar reliability. It is described as the write disturbance voltage V dis (the maximum voltage drop on an unselected cell, which happens to be the one closest to the driver) subtracted from the voltage at the selected cell (V cell ), divided by the disturbance voltage. This normalized write window [66] decreases with increasing array size due to the increase in interconnect resistance and hence the reduced effective write voltage on the selected cell. This can be mitigated by using multiple drivers to reduce the effective interconnect length, thus increasing the effective write voltage at the selected cell and reducing write latency and switching energy, as illustrated in the case study from [66] in table 3. Using a dual row driver effectively changes the array from an N × N array to one of N/2 − 1 columns by N − 1 rows. In the quad driver case this is further reduced to N/2 − 1 columns by N/2 − 1 rows, as shown in figure 13 from [66]. While this increases write power, the gains in switching speed are substantial. Increasing the driver voltage in an attempt to achieve similar speed gains (table 3) increases write power and gate driver breakdown susceptibility, and causes cells proximate to the driver to become 'over reset', resulting in 'stuck-at faults'. An alternative way to reduce wire resistance, in [68], uses a double-sided ground biasing scheme. Reducing the longest IR drop path with this method means reduced latencies. Both methods require additional drivers and decoders. Also discussed in [68] are data pattern effects on write latency (1-0 transitions versus 0-1 transitions).
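Written out, the normalized write window described in words above takes the following form. The symbol NWW is a label introduced here for convenience, and the numbers in the worked example are purely illustrative, not values from [66].

```latex
\[
\mathrm{NWW} = \frac{V_{\mathrm{cell}} - V_{\mathrm{dis}}}{V_{\mathrm{dis}}}
\]
% Illustrative example (hypothetical voltages):
% V_cell = 2.0 V, V_dis = 1.0 V  =>  NWW = (2.0 - 1.0) / 1.0 = 1.0
% If interconnect IR drop lowers V_cell to 1.5 V with the same V_dis,
% NWW = (1.5 - 1.0) / 1.0 = 0.5, i.e. the window shrinks as the array grows.
```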

Worst case latency
The switching time of an ReRAM crossbar depends on the array size, write current, wire metal resistance, and the number of bits being written in parallel. ReRAM switching time is inversely exponentially related to the applied voltage, so the closer the selected cell is to the driver, the shorter its switching time, as it receives the full write voltage. Further away, due to interconnect IR drop and sneak currents, ReRAM cells see different voltage drops, as illustrated in figure 12. A significant timing constraint to track is ensuring that the switching time of the furthest (worst-case path) cell is less than the minimum reset/set latency [68]. The write latency of the furthest selected cell is proportional to $\tau \times e^{K V_d}$, where τ is the switching time and K is a fitting constant [68].
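The effect of IR drop on worst-case latency can be illustrated with a minimal Python sketch. It assumes that the switching time τ and the cell voltage V d are tied by τ × e^(K·V d) = C with C constant, which is one reading of the expression above that matches the stated inverse-exponential dependence. The constants k and c, the per-cell wire resistance and the line current are hypothetical values chosen only for illustration, and sneak currents are ignored.

```python
import math

def switching_time(v_cell, k=5.0, c=1.0):
    """Switching time from tau * exp(k * v_cell) = c.

    k and c are illustrative fitting constants (assumptions), so the
    result is in arbitrary units.
    """
    return c * math.exp(-k * v_cell)

def cell_voltage(v_drive, distance_cells, r_wire_per_cell=2.0, i_line=100e-6):
    """Voltage left at a cell after a simple IR drop along the line.

    Assumes a constant line current and a fixed wire resistance per
    cell pitch; sneak currents are not modeled.
    """
    return v_drive - distance_cells * r_wire_per_cell * i_line

v_drive = 3.0
near = switching_time(cell_voltage(v_drive, 1))    # cell next to the driver
far = switching_time(cell_voltage(v_drive, 512))   # furthest cell on the line

print(f"nearest cell : {near:.3e} (arbitrary units)")
print(f"furthest cell: {far:.3e} (arbitrary units)")
print(f"worst-case slowdown: {far / near:.2f}x")
```

Even a drop of roughly 0.1 V along the line produces a noticeable slowdown in this toy model, which is why the furthest cell sets the write timing.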

Sneak path 'crosstalk' current mitigation
A major challenge for crossbar memory design is interference from leakage currents in adjacent unselected cells, which can cause write failures and misinterpretation in readouts. The sneak resistance can be modeled as a resistor in parallel with the desired cell resistance, with the worst-case scenario being when the unselected devices are in their lowest resistive states [69]. It is most commonly mitigated by limiting the current with selection devices, such as transistors in a 1T1R configuration or diodes in a 1D1R configuration [70], or by other novel means [58]. However, there is a penalty in the compromise between the selection device conductance and the synaptic resistance, as the selector reduces the resistance window and hence the number of multilevel synaptic states (especially when the selection device is too resistive). Conversely, if it is not resistive enough, e.g. a leaky device, the selector cannot act as an effective current limiter. One selector-free architecture proposal is to raise driver voltages to overcome the sneak currents, while another uses fully-selected and half-selected cells, the purpose of the latter being to limit the voltage drop across the non-active (half-selected) cells [68,71]. A more modular selector-free approach [72] to the sneak-path problem divides the crossbar into smaller modules and sums the currents from each of these modular crossbars prior to entering the activation unit, at the cost of more energy. A selector-free crossbar opens up higher density solutions for 3D growth of the resistive crossbar array. There are other means to achieve this, as research is promising in the area of two-terminal selector devices [58]. The ideal selector requirements are listed in table 4, gathered from several works surveyed in [58]. Illustrated on the left of the table is the ideal two-terminal selector device and how two of the common types of these selectors match up. Research is ongoing, and multiple means of mitigating selector shortcomings have been proposed, for example for the high off current in NbO 2 that makes it unattractive for crossbar usage [73]. Similarly, work to improve the thermal stability of ovonic threshold switching (OTS) devices is underway [74]. Conductive bridge RAM (CBRAM)-type devices show great promise as selection devices, as long as the operating currents are well below the compliance current, so that the device remains in a volatile state while still aligning with the intended synaptic emulator device's operating currents.
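The parallel-resistor view of the sneak path can be sketched numerically. The Python snippet below uses a common three-stage series approximation for a selector-free n × n array with floating unselected lines, with all unselected cells at R LRS (the worst case noted above). This approximation, the function names and the resistance values are assumptions for illustration and are not taken from [69].

```python
def parallel(r_a, r_b):
    """Equivalent resistance of two resistors in parallel."""
    return r_a * r_b / (r_a + r_b)

def worst_case_sneak_resistance(n, r_lrs):
    """Approximate sneak-path resistance of an n x n selector-free array.

    Assumed model: the sneak current crosses (n - 1) cells in parallel,
    then (n - 1)**2 cells in parallel, then (n - 1) cells in parallel,
    with every unselected cell in its low-resistance state.
    """
    if n < 2:
        raise ValueError("need at least a 2 x 2 array for a sneak path")
    return 2 * r_lrs / (n - 1) + r_lrs / (n - 1) ** 2

def apparent_resistance(r_cell, n, r_lrs):
    """Resistance seen at the selected cell with the sneak path in parallel."""
    return parallel(r_cell, worst_case_sneak_resistance(n, r_lrs))

r_lrs, r_hrs = 10e3, 1e6  # hypothetical device resistances in ohms
for n in (4, 64, 512):
    seen = apparent_resistance(r_hrs, n, r_lrs)
    print(f"n={n:4d}: a selected HRS cell of {r_hrs:.0f} ohm reads as {seen:.1f} ohm")
```

As the array grows, the sneak branch swamps a selected high-resistance cell, which is the readout misinterpretation that selectors or modular sub-arrays are meant to prevent.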
This section looked at crossbar considerations for ML with analog neural networks. For overall system modeling, crossbar dependencies can be incorporated by extracting all the non-idealities in the crossbar and adding them into the aforementioned software ML frameworks (e.g. [39]) to create a fast crossbar model for ML evaluation [75]. This is a pseudo-emulation model, with the conductance non-idealities pushed into the weight tables to model resistive crossbars. A flowmap for configuring crossbar ReRAM-based arrays is proposed in [67], taking as input the driver's finite resistance, the application matrix, the technology node and the ReRAM model to optimize the matrix mapping to the crossbar array.

Synaptic device candidates
Crossbar devices act as synaptic emulators for neuromorphic computing and allow for the co-location of computation and memory. These devices can be split into:
• Conventional charge-based devices, most of them volatile, which store information in the presence or absence of charge, such as FLASH, SRAM and DRAM.
• Emerging non-volatile devices, in these cases resistance-based devices [1], which use a physical property that represents a conductance change (changes in device atomic arrangements or ferromagnetic layer orientation), such as ReRAM, PCM, STT-RAM and FeFET.
Conventional NVM memory requires large ratios between the R HRS and R LRS states to allow reliable explicit readout of a binary value. With deep learning, however, this window needs to have multi-state capabilities, and the readout becomes more of an accumulation of multiple device effects for matrix-vector multiplication. The primary focus of this paper is on the emerging non-volatile resistance-based devices rather than the charge-based, conventional CMOS-based memories (FLASH, DRAM, SRAM), which require a larger number of transistors and are thus not area-efficient. FLASH additionally requires much higher operating voltages, resulting in higher latencies, and has lower endurance due to gate oxide breakdown caused by the larger electric fields.
Synaptic weight updates can be positive or negative, and with a physical device there are various means to realize the negative update. One is to use two conductance elements at each crosspoint so that the equivalent synapse is differential [1,3]. This is especially important for unipolar devices such as PCM, where the set process is gradual while resets are abrupt, so synaptic weight updates rely only on the set process for both positive and negative updates. The device is paired with another device with matched linearity, and a reset strategy [3] is used to track saturation of one device over the other, restoring the differential resistance and preventing network freeze-out. With bipolar switching devices that allow gradual change in conductivity for both sets and resets, linearity requirements can be more relaxed, with preference given to symmetry between set and reset [1,52]. To create negative and positive updates, architectures can also use a local reference element, the same for all rows and columns, that sets the 'zero' threshold. Another method is to use the average conductance setting as the zero weight [61].
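The differential-pair idea can be sketched in a few lines of Python: a signed weight is represented as the difference of two non-negative conductances, and a pair is flagged for reset when either device approaches saturation. The linear mapping, the function names, the saturation margin and the conductance range are all illustrative assumptions rather than a scheme taken from [1,3].

```python
def weight_to_pair(w, w_max, g_min, g_max):
    """Map a signed weight in [-w_max, w_max] to (g_plus, g_minus).

    Assumed linear mapping: only one device of the pair is raised
    above g_min, so the weight is proportional to g_plus - g_minus.
    """
    if abs(w) > w_max:
        raise ValueError("weight outside representable range")
    delta = (g_max - g_min) * abs(w) / w_max
    if w >= 0:
        return g_min + delta, g_min
    return g_min, g_min + delta

def pair_to_weight(g_plus, g_minus, w_max, g_min, g_max):
    """Recover the signed weight from the conductance difference."""
    return (g_plus - g_minus) * w_max / (g_max - g_min)

def needs_reset(g_plus, g_minus, g_max, margin=0.9):
    """Flag a pair when either device nears saturation (margin is hypothetical)."""
    return max(g_plus, g_minus) >= margin * g_max

G_MIN, G_MAX, W_MAX = 1e-6, 1e-4, 1.0  # hypothetical conductances in siemens
gp, gm = weight_to_pair(-0.5, W_MAX, G_MIN, G_MAX)
print("pair:", gp, gm)
print("recovered weight:", pair_to_weight(gp, gm, W_MAX, G_MIN, G_MAX))
print("needs reset:", needs_reset(gp, gm, G_MAX))
```

When a pair is flagged, a reset strategy would reprogram both devices to lower conductances with the same difference, restoring headroom for further set-only updates.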

Requirements of analog synaptic devices
The primary requirements for analog synaptic devices in the crossbar architecture are:
• High on/off ratio, the window between the application's usable device high resistance state (R HRS ) and low resistance state (R LRS ). This determines whether the device can be used as a multi-level cell (MLC) and how many conductance states are realizable.
• Weight update linearity and symmetry.
• Distinct R LRS (on-state resistance, low resistance state) and R HRS (off-state resistance, high resistance state).
• An accommodating average resistance; a smaller average resistance relative to the crossbar interconnect means parasitic interconnect resistances dominate the IR drops.
• A reliable number of multi-bit states within the conductance window between R LRS and R HRS , allowing reliable, gradual and uniform conductance changes. With small windows, a binary multibit solution, where multiple devices are placed in parallel, is an alternative.
• Fast switching speed (low write latency).
• Fast access time (low read latency).
• High endurance for repeated programmability.
• Reduced cycle-to-cycle and die-to-die variation.
The requirements vary primarily with the deployment application (inference versus training, or edge versus cloud). Further requirements break down by expected workload (image classification, NLP, object detection), which then affects the type of neural network architecture and its depth. For example, a training solution will require multi-state devices allowing linear, gradual conductance updates, high endurance and faster programming speed than an inference solution, because backpropagation involves several epochs of writes. A forward inference solution can give these a lower priority, favoring faster access time for reads and higher-retention devices where read disturbance is limited. An inference solution would favor devices with one-time programming/mapping to sustain non-disturbing MVM reads for a prolonged period of time. The spider chart of figure 14 summarizes this and illustrates how well two popular NVM candidates [63], PCM and ReRAM, meet these requirements. An interesting part of the retention discussion is the effect of volatility on memory capacity, whereby memristors in [76] are not effectively trained due to the current limiter device; also, the larger the parallel network the weaker the memory, so the current supply limits need to be adjusted to accommodate. These parallel devices (while allowing discrete multi-states and being well suited to absorbing variation effects) mean that the intrinsic conductance decay of the devices is more concentrated, as there is now a competing natural decay, thus penalizing retention even further. Volatility effects are studied through the ratio of the RC decay time constant to the time needed for one epoch (forward propagation, reverse propagation and weight update); the higher this number, the greater the classification accuracy [6]. This was studied on 5000 examples of an MNIST data set using a PCMO ReRAM implementation. A low value means the devices are more volatile and so need larger learning rates (retraining many more of the weights).
As discussed in section 3, conductance can be modulated based on the history of signals applied to the device, and [77] looks into the variance of the synapse's response to the same pulse (width/amplitude) presented consecutively at different time stamps.
CMOS compatibility [52,78] is important to reduce the number of fabrication steps and ensure memristor operational voltages are aligned with other circuit expectations. ReRAM technology scaling results in higher programming voltages, compromising other circuitry; the approach in [79], with a monolithic 3-D IC stack, allows integration of two technology nodes at BEOL, where CMOS peripherals are kept at the more advanced node (16 nm). Each memristor-array 'tile' (40 nm) interfaces with the next through interior vias after processing through peripheral circuitry and logic (sense amplifiers, activation, pooling, buffering) in the 16 nm technology. The impact of the worst-case latency scenario (a single device activated in a column) through the interior via resistance on the ADC sensing capability is additionally investigated and shown to be minimal [79].

Device type and structure
Recent research has shown interest in ReRAM, PCM, STT-MRAM and FeRAM, where multilevel programmability can be applied using electrical pulses. Other devices, such as battery-like, capacitor-based and photonic devices, are also of interest to researchers. This section, and the paper as a whole, primarily focuses on emerging NVM devices in 2D crosspoint arrays which have shown potential for neuromorphic matrix-vector computations.
SRAM: in-memory computing in CMOS technology using SRAM is possible but with limited density [28,80] (higher-density NVM technologies are able to store multiple states in a 4F 2 footprint). The SRAM cell is built from back-to-back FETs and two selectors (6T SRAM) with no dedicated storage element, so it must always be connected to a power supply to hold its state [1].
DRAM: a capacitor acting as the storage node is placed in series with an FET and needs periodic refresh. The challenge for DRAM is its destructive reads; non-destructive attempts to overcome this degrade density [81].
FLASH: in FLASH devices, the storage node is coupled to an FET gate and allows for longer term data retention, but operating voltages are extremely high, with large latencies. FLASH has lower endurance due to oxide breakdown from the large electric fields [82].
ReRAM: ReRAM devices are the most mature device candidates and are already being fabricated commercially [3,83,84]. They have strong compatibility with CMOS fabrication, as they have BEOL-compatible temperatures and need only one extra lithography step, thus reducing costs. They also have a long development history for learning applications [52,85].
ReRAM can be split into filamentary and non-filamentary devices [86]. Filamentary devices can be further sub-categorized into cation-based or anion-based, according to the means by which the conductive filament is created [87]. In cation-based devices (CBRAM), when a positive voltage is applied to the top electrode metal (usually Ag or Cu), metal oxidation occurs and the resulting cations are attracted to and collected on the opposite, relatively inert electrode. The buildup of these ions with continued applied voltage eventually forms a conductive path between the electrodes. With anion-based devices (HfOx, TaOx, TiOx), however, the conductive filament (CF) is gradually formed through the metal oxide electrolyte insulator by the migration of oxygen vacancies through the electrolyte, as shown in figure 15. In the non-filamentary case, the electrodes' metal atoms form the conductive connection through oxygen vacancies [86]. Filamentary ReRAM (CBRAM) exhibits low programming energy, fast switching and high endurance, with a large resistance window (100×), but also intrinsic variability [3]. This compares with non-filamentary ReRAM's smaller resistance window of up to 50×.
Electrical pulses induce the set process, associated with CF formation, and the reset process, associated with dissolution of the CF. If both processes occur at the same voltage polarity the device is unipolar, and if at different polarities it is bipolar [88]. To control the multilevel states, gradual dielectric breakdown is achieved by controlling the number of CFs or the amount of oxidation [52]. In bipolar filamentary ReRAM, sets are usually abrupt while resets are gradual, calling for a 2-ReRAM synapse with differential readout, like the PrCaMnO devices in [6]. Another option for a device that does not show gradual conductance change is given in [89], where multiple 1 bit (binary) BNOx memristors are integrated in parallel to create a compound synapse representing a multi-bit solution. For all devices, including those that do show uniform gradual conductance change, 'single shot programming' cannot precisely set the conductance level [70,90]; a series of pulses is needed instead. Promising gradual bidirectional programming, allowing incremental resistance changes with voltage pulsing, has been shown in [52].
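Because a single pulse cannot land on a precise conductance, programming is typically iterated as a loop of pulse, read and compare. The Python sketch below illustrates such a write-verify loop on a toy device whose conductance change per pulse shrinks as it approaches its limits. The device model, step size, tolerance and conductance range are hypothetical and stand in for real device behavior; they are not taken from [70,90].

```python
def apply_pulse(g, direction, g_min=1e-6, g_max=1e-4, step=0.05):
    """Toy device response to one programming pulse.

    Each pulse moves the conductance by a fixed fraction of the
    remaining headroom, so updates saturate near g_min and g_max.
    """
    if direction > 0:
        return g + step * (g_max - g)   # potentiating (set) pulse
    return g - step * (g - g_min)       # depressing (reset) pulse

def program_with_verify(g_start, g_target, tolerance=0.05, max_pulses=200):
    """Pulse, read back, and repeat until within tolerance of the target."""
    g = g_start
    pulses = 0
    while abs(g - g_target) > tolerance * g_target and pulses < max_pulses:
        direction = 1 if g < g_target else -1
        g = apply_pulse(g, direction)
        pulses += 1
    return g, pulses

g_final, n_pulses = program_with_verify(g_start=1e-6, g_target=5e-5)
print(f"reached {g_final:.3e} S after {n_pulses} pulses")
```

The pulse count needed per device is what makes this approach costly for online updates, while remaining acceptable for one-time offline weight mapping.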
The number and size of CFs can vary, creating variations from device to device and cycle to cycle; [88] mitigates this by using buffer layers to confine CF paths. Changing the compliance current can also be used to alter the diameter of the CF. To ensure CF formation, electroforming or 'priming the oxide' is used for OxRAM [52,70], where a large electric field (>10 MV cm −1 ) is applied, causing soft dielectric breakdown and creating defects in the oxide that allow CFs to form during sets. Reliance on electroforming to form the conductive paths allows for lower driver voltages during set operations, which also avoids gate oxide breakdown in the drivers (the thicker gate oxide alternatives being slower). While forming enables the device to be controlled by smaller driver voltages to achieve the same resistance, it compromises the memory window (R HRS /R LRS ): while R LRS is reduced, R HRS is also reduced compared to the relative resistances of the initial fresh samples. A forming algorithm presented in [91] allows devices already preformed by the anneal process to be skipped, thus avoiding further device-to-device memory window variation.
The exponential dependence of current on applied voltage can be expressed [92] as $I = I_0 \exp(-d/d_0)\sinh(V/V_0)$, where d is the gap size between the CF tip and the electrode, and I 0 , d 0 and V 0 are fitting parameters. The linear range of the IV characteristic curve for the responsive devices of a 128 × 64 ReRAM array, down to a precision of 6 bit (64 levels), is illustrated in [93]; an additional data point to explore would be the temperature dependency of this curve [94]. A similar plot of the linear range using the differential conductance is provided in [90], which uses the 2-ReRAM synapse approach for reducing cycle-to-cycle and device-to-device variations.
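To see where a gap-dependent IV curve stays close to linear, the following Python sketch evaluates the commonly used compact form I = I 0 · exp(−d/d 0 ) · sinh(V/V 0 ). Taking this form for the relation in [92] is an assumption here, as are the parameter values, which are chosen only to give plausible magnitudes.

```python
import math

def reram_current(v, gap_nm, i0=1e-3, d0_nm=0.25, v0=0.25):
    """Current through a filamentary cell for a given tunneling gap.

    Assumed compact form: I = i0 * exp(-gap / d0) * sinh(v / v0).
    i0 (A), d0_nm (nm) and v0 (V) are illustrative fitting parameters.
    """
    return i0 * math.exp(-gap_nm / d0_nm) * math.sinh(v / v0)

def small_signal_conductance(gap_nm, i0=1e-3, d0_nm=0.25, v0=0.25):
    """Slope of the IV curve at v = 0, where sinh(x) is approximately x."""
    return i0 * math.exp(-gap_nm / d0_nm) / v0

gap = 1.0  # nm, hypothetical gap for one programmed state
g_lin = small_signal_conductance(gap)
for v in (0.05, 0.1, 0.2, 0.4):
    i_exact = reram_current(v, gap)
    i_linear = g_lin * v
    error = (i_exact - i_linear) / i_exact
    print(f"V={v:.2f} V: I={i_exact:.3e} A, linear-model error={error:.1%}")
```

In this toy model the deviation from a purely ohmic readout grows with read voltage, which is one reason read voltages are kept small relative to V 0 when many levels must be distinguished.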
The transition from short term memory (STM) to long term memory (LTM) is discussed in [95]. With repeated stimulation, the CFs become stronger, as there is a higher concentration of oxygen vacancies in the switching layer, and more resistant to the lateral diffusion that breaks the conductive path, resulting in higher retention. Diffusion of ions in the conducting channels causes the decay in retention that distinguishes STM from LTM. Also to consider are stuck-ats, or device unresponsiveness [88,96], where devices stuck at R LRS and unable to reset to R HRS occur due to too many defects in the switching layer. Several architectural means of avoiding these and building redundancy into the network can help and are discussed in section 7.
PCM: the second of the leading choices for analog-based accelerators is PCM. Of the listed emerging storage class NVM candidates, this two-terminal chalcogenide device has the highest on/off ratio, second only to 3D NAND FLASH [98]. Its amorphous phase exhibits high electrical resistivity while its crystalline phase shows a resistivity several orders of magnitude lower [55]. This opens up the space for multi-level cell operation. The amorphous phase is reached through an abrupt melt-quench process initiated by a large-amplitude, short voltage pulse, while the crystalline phase is reached when the material is heated using lower-amplitude, longer pulses. Because of this, the different multi-level states are realized by gradually changing the amorphous thickness through progressive crystallization [63], using controlled heating (electrical pulses) of the chalcogenide material. In the opposite direction, incremental reset of PCM is not possible because of the abrupt nature of amorphization (the 'reset process'), so, as with filament-based ReRAM (with its abrupt set [6]), a pair of devices is required as an equivalent weight to represent positive and negative weights and to mitigate the asymmetry in set/reset [3]. In the crystalline state, PCMs show ohmic behavior at lower voltages and non-ohmic behavior at higher voltages. The large currents needed to write a PCM cell [68] limit the number of parallel writes [89], as the crossbar needs to stay within electromigration limits.
Another challenge of PCM devices is resistance drift, caused by spontaneous structural relaxation after the melt-quench process: conductances initially decrease rapidly, then more slowly [6]. This is studied in [99], where the relaxation is investigated over time and temperature (considering also array-level impacts, where arrayed devices exhibit different drift components). The drift follows $G \propto t^{-v}$, where t is time, G is conductance and v is the drift coefficient [6]. With strong resets, where cells are fully amorphous, drift components are larger, thus affecting data retention and hence network classification accuracies [63].
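The power-law drift can be made concrete with a short Python sketch. It normalizes the relation as G(t) = G 0 · (t/t 0 )^(−v), where the reference time t 0 , the initial conductance and the two drift coefficients are hypothetical values used only to contrast a weakly and a strongly drifting state.

```python
def drifted_conductance(g0, t, t0=1.0, v=0.05):
    """Conductance after drift, G(t) = g0 * (t / t0) ** (-v).

    g0 is the conductance read at reference time t0 (seconds);
    v is the drift coefficient. Valid for t >= t0 > 0.
    """
    if t0 <= 0 or t < t0:
        raise ValueError("require t >= t0 > 0")
    return g0 * (t / t0) ** (-v)

g0 = 1e-5  # siemens, hypothetical programmed conductance
for label, v in (("partially crystallized", 0.01), ("fully amorphous", 0.1)):
    for t in (1.0, 1e3, 1e6):
        g = drifted_conductance(g0, t, v=v)
        print(f"{label:22s} v={v:4.2f} t={t:9.0f} s  G/G0={g / g0:.3f}")
```

The larger coefficient loses a far greater fraction of the programmed conductance over the same interval, which illustrates why strongly reset cells weigh more heavily on retention.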
STT-RAM: a two-terminal NVM device based on magnetic materials that has been widely studied for neuromorphic applications, due to its promise of high density and low leakage, is the STT-RAM [43,63], which uses electron spin to store its resistive state. A magnetic tunnel junction (MTJ) is created by a spacer between two ferromagnetic (FM) layers, one called the free layer and the other the reference layer. The relative orientation of these layers is controlled by passing a current through the FM layers so that their magnetizations are either parallel or anti-parallel, creating the resistive states. For R LRS , a current is applied from the reference to the free layer so that the magnetic orientations of the free and reference layers are the same; this is referred to as 'parallel'. The opposite is used to realize a logic 1, or R HRS . This is how a single-bit, or single-level, cell works. Extending to a multilevel cell would require stacking differently sized MTJs [100]; challenges include the low tunneling magnetoresistance [63], as well as reliability problems with process and thermal fluctuations in the MTJ. Write reliability can be improved by using higher currents, which also results in faster switching times but could adversely affect reliability for MLCs, as several MTJs are involved. Researchers have looked at techniques such as early write termination, hybrid SRAM/STT-RAM architectures and read-preemptive write-buffer designs [68] to mitigate the long write latency of STT-RAM. Reading has its own challenges: with smaller R HRS /R LRS ratios the distinction between the states becomes difficult and, coupled with thermal fluctuations, worsens read disturbance effects.
FeFET: while the aforementioned NVM cells have two terminals, the FeFET is a three-terminal transistor device acting as its own selector, thus allowing for more compact memory arrays [101]. It is a MOSFET with a ferroelectric gate dielectric (commonly HfO 2 based). The cell has two distinct stable polarization states and can be switched using an external electric field ('coercive field'), the strength of which determines the extent of polarization as each crystal domain within the structure is polarized. The remnant polarization after the electric field is removed allows data storage through these two polarization states. The two states are referred to as the low threshold (low V t ) and high threshold (high V t ) states, and the memory window is defined as the difference between them. As the device ages, the memory window closes due to charge injection from the substrate, caused by wearing of the thin interfacial layer between the ferroelectric dielectric and the silicon substrate shown in figure 16 [97]. Various means to reduce this are discussed, including changes in process flows [102] and the use of a series resistor, 1FeFET1R (1F1R), to reduce V t variation [103]. Research is ongoing into improving device-to-device variation and endurance, and increasing the memory window for multi-level performance [104].

Device reliability
In section 4 the effects of crossbar non-idealities were discussed, from wire interconnect, source and sink resistances that create linear and non-linear non-idealities in read accesses and thus output current inaccuracies. Discussed in this section are the contributions from device behavior; PCM, for example, has IV characteristics that go from ohmic behavior at lower voltages to exponential behavior at higher voltages. In contrast, ReRAM current is exponentially dependent on voltage. In addition, access devices play a role, as they are in series with the device: if they are not efficient current limiters, they contribute further to inaccuracies in the read output current and/or reduce the available resistance window for MLC use. Careful consideration is thus required of the voltage ranges for read and write pulses. Pairing devices to overcome asymmetries in set/reset was discussed in [63] and can be extended to PCM and ReRAM. In an offline-trained NN, device non-idealities can be overcome, as conductance values can be programmed reliably with a write-verify-write sequence that offers opportunities to correct [105]. With online training (in situ training), however, it is paramount that the conductance updates are symmetric and linear.
Larger crossbars mean a higher impact of interconnect resistance relative to the effective device resistance, thus impacting accuracy, whereas limiting the crossbar to a small size means fewer errors but more power, as more crossbars are needed to represent each layer. Lowering the average device resistance, (R LRS + R HRS )/2, has a similar effect to increasing crossbar size, as it means a greater impact of interconnect parasitics, so higher average values are preferred. A small R HRS /R LRS ratio also means fewer bits per device and lower area efficiency.
Reliability effects on ReRAM technology are studied in [70], where the CF growth of PMC devices shows high tolerance to ionizing radiation exposure. CF rupture shows less tolerance, and thus R HRS is slightly higher than in non-exposed devices. This means exposed parts have a higher R HRS /R LRS ratio. Exposure, however, has little impact on retention for both resistance states up to a total ionizing dose of 2.6 Mrad, opening up usage in more environments. Time-dependent variation [3] is more pronounced in R HRS states, so accuracy can be affected during backpropagation. Compared with digital memory applications, which make large conductance changes, higher endurance is needed for the many small conductance changes. With the multiple conductance update steps in backpropagation (and the asymmetry between increasing and decreasing conductance), reaching convergence becomes challenging. Several tools [75,106,107] are used to model NVM-based networks and evaluate system performance; they provide some direction on circuit area, leakage power, latency and energy consumption.
Random initialization to break symmetry [90] aids convergence and is easily provided by intrinsic device-to-device variation. Another suggestion is to additionally set the memristors somewhere in the middle of the conductance range [108] and within the useful section of the squashing function. Further, [109] discusses how to locate and initialize memristor synapses, as this initial value is argued to affect memristor variation. Several models have been proposed to describe memristor behavior and initial state, along with mapping simulator software that maps a DNN to a resistive crossbar to aid the analyses [110]. A detailed look at weight initialization and distribution methods [3], from centroid initialization to random, density-based and linear, is given in [111] and may be applicable to NVM-based weight initialization and weight quantization binning [112]. Similar approaches are used in mapping offline-trained weights onto the ReRAM crossbar [3,113].
Several means of extending device life and improving retention and endurance are discussed in various works, such as 'periodic carry' [63], where a set of parallel devices represents a single synapse, each having a different weight toward the total synaptic conductance. When the LSB saturates to its maximum or minimum, the next least significant device is updated to account for the information from the LSB (while the original LSB is then 'reset' away from its saturated value). This technique avoids 'overuse' of a single device, thus extending usage. A take on the 'periodic carry' concept is also presented in [114], where a different device is introduced for the less significant device pairs. The only demands on this device are high linearity of conductance updates and high endurance, thus protecting the larger, more significant NVM-based device weights from 'overuse' degradation and relaxing their linearity requirements. A similar option to prevent overuse, and hence extend device lifetime and average out device variability, is to use multiple conductances of equal weight contribution, with updates done by programming one device at a time to reach the conductance step needed. An arbitrator timer makes sure that they all receive a similar number of requests, avoiding saturation or endurance failure of any one NVM, so each device is only programmed once per several updates. This can also be extended to form a single equivalent synapse conductance made up of several NVMs in parallel to distribute variation effects [54]. A hybrid structure combining different memory devices to extend the conductance range and improve the linearity of weight updates could also be considered; for example, [63] suggests a PCM for the MSB and transistors with a capacitor for the LSB, thus relieving the PCM of training, given its high write latencies and resistance drift. After training, the final conductance value can be scaled and stored on the PCM device. This hybrid approach can be used with other NVMs to take advantage of the positive contributions a device has for DNN processing (see spider plot in figure 14).
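The 'periodic carry' idea can be sketched as a two-device synapse in Python. Small updates land on a low-significance element; when it saturates, its content is transferred to the high-significance element and it is reset. The class name, the gain factor, the saturation limit and the use of signed values (standing in for differential pairs) are simplifying assumptions for illustration, not the scheme of [63] or [114].

```python
class CarrySynapse:
    """Toy two-element synapse with periodic carry.

    weight = gain * msb + lsb, where msb and lsb are signed values
    standing in for differential conductance pairs (an assumption).
    """

    def __init__(self, gain=4.0, lsb_limit=1.0):
        self.gain = gain            # significance of the MSB relative to the LSB
        self.lsb_limit = lsb_limit  # saturation level of the LSB element
        self.msb = 0.0
        self.lsb = 0.0
        self.carries = 0            # how often the MSB had to be written

    @property
    def weight(self):
        return self.gain * self.msb + self.lsb

    def update(self, delta):
        """Apply a training update to the LSB only, clipping at saturation."""
        self.lsb = max(-self.lsb_limit, min(self.lsb_limit, self.lsb + delta))
        if abs(self.lsb) >= self.lsb_limit:
            self._carry()

    def _carry(self):
        """Move the saturated LSB content into the MSB and reset the LSB."""
        self.msb += self.lsb / self.gain
        self.lsb = 0.0
        self.carries += 1

syn = CarrySynapse()
for _ in range(8):
    syn.update(0.25)
print("weight:", syn.weight, "MSB writes:", syn.carries, "of 8 updates")
```

In this sketch the high-significance element is written only on carries, which is the mechanism by which such schemes spare the more significant NVM devices from frequent programming.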

Peripheral circuitry support
This section discusses supporting circuitry and design considerations for the crossbar-array-based neural network implementation. Support circuitry discussed in this section includes DACs, ADCs, drivers and sensors. Additional structures such as multiplexers, switches, various circuit sigmoid implementations, sign control and weight update circuits are featured in [72], with accommodation for binary neural networks (BNN), which avoid the prolific use of power-hungry DACs and ADCs. Major considerations for robust design center around area efficiency, low power, and precision/resolution.

Data converter circuitry
DACs: DACs are needed to drive analog voltages to rows or columns to allow forward and backpropagation in the crossbar-based neural network. In both cases the DAC-driven drivers should have the ability and range to drive read and write-update voltages to the synaptic memory cells (figure 17). Encoding architectures using time encoding tend to be low-speed, as several cycles are required to generate the various pulses to effect reads or write updates in the synapse.
Investigated in [75] is the error from DAC non-idealities and how a DAC output voltage can change with the average equivalent synaptic load resistance (R load should also include the effect of wire resistances) and as a function of the applied input. Also illustrated in [75] and in figure 18 is that the sensitivity to crossbar dimensions is greater when DAC and driver non-idealities are included, making them a larger contribution toward classification errors (especially for crossbars smaller than 512 × 512) than sensor or wire non-idealities. This is due to the nonlinearities of the DAC and driver coupled with lower effective drive voltages in larger crossbars.
ADCs: ADCs tend to be area- and energy-intensive, consuming up to 80% of total crossbar energy and about 60% of total crossbar area, the former increasing with the precision required [63,115]. One means to increase power and area efficiency is time multiplexing to share the ADC across multiple columns [46], but this reduces throughput because parallelism is reduced. Another is reduced precision: some studies have shown that networks can tolerate reduced ADC precision without severe accuracy degradation [116]. With in situ training, however, this may not be a reliable knob, as these solutions must prioritize the precision (and range) of neuron computations and subsequent activation [5] from each neuron to be supported by the hardware while offering a fast ADC response. Turning precision into a hyper-parameter knob for each neural network layer may regain some energy savings while preserving classification accuracy. BNNs, which allow faster inference (MACs become bit-wise operations) and faster updates since there are only two levels [R LRS (on) and R HRS (off)], circumvent the need for ADCs but lead to accuracy degradation over time [72].
It is clear that fast ADC responses are needed to propagate through the various layers, and time multiplexing of ADCs across rows and columns compounds this need. A simple high-speed option is the flash ADC, but its large loading on the input voltage and the area and energy required for all the comparators (an n-bit ADC requires 2^n − 1 comparators) mean that resolution must be traded off against area. Most ML solutions suggest up to 8 bits of resolution (255 comparators), which is not area- or power-efficient. Also, as resolution increases, the difference between adjacent reference voltages becomes smaller than the inherent offsets of the individual comparators. Pipelined ADCs provide a solution but are limited in the number of stages, to reduce the buildup of error from mismatch in the internal DAC stages and residue amplifiers. A common compromise in analog DNN studies is the SAR ADC [44,46], which presents lower capacitive loading at its input, higher resolution and lower power, but at the cost of lower conversion speed (limited by the internal comparator and DAC speed divided by the required bit resolution). A scheme providing a better power-delay product than SAR and flash ADCs is investigated in [117], using an analog shift-add ADC to perform the weighted sum for up to six-bit precision with area comparable to the SAR ADC.
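The flash versus SAR trade-off can be made concrete with a small behavioural sketch: a flash converter needs 2^n − 1 comparators but resolves in one step, whereas an idealized SAR converter reuses a single comparator over n successive-approximation cycles. The functions below are illustrative only, assuming an ideal comparator and internal DAC with no offsets, noise or settling effects; the names are hypothetical.

```python
def flash_comparators(n_bits):
    """Number of comparators in an n-bit flash ADC."""
    return 2**n_bits - 1


def sar_convert(v_in, v_ref, n_bits):
    """Ideal SAR conversion: one comparison per bit, MSB first."""
    code = 0
    cycles = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)                # tentatively set this bit
        v_dac = trial * v_ref / 2**n_bits        # ideal internal DAC output
        if v_in >= v_dac:                        # single comparator decision
            code = trial
        cycles += 1
    return code, cycles


code, cycles = sar_convert(0.3, 1.0, 8)
```

For an 8-bit conversion the SAR loop takes eight comparator decisions, against 255 parallel comparators for the flash converter.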
In addition to conversion speed, designs need to factor in ADC settling times and the time to latch outputs for further post-processing (digitally handled ReLUs, activation storage and pooling). One means of circumventing this is to use interleaved ADCs, where multiple ADCs operate in parallel with staggered clocks and their outputs are then time re-aligned. This, however, means more loading on the input signal, additional clocking circuitry, and additional care for non-ideal interleaving effects such as clock distortion, phase errors, and mismatched ADC core offsets and gain errors, each of which needs its own correction technique, incurring further area and energy.
Research in [46] discusses the read ADC pipeline and the reduction of overhead by sharing ADCs within an IMA (in situ multiply accumulate) cell across multiple crossbars, with a 1.28 GSps ADC unit sampling the 128 bit-line currents of its 128 × 128 crossbar. Also proposed in [46] is a method that mimics common multiplication algorithms by splitting the computation into, e.g., 16 cycles of 1 bit each, in order to keep high precision while limiting DAC and ADC size to n bits (n < 16): instead of representing the 16 bit value by a single voltage level, the input is applied as a stream of levels. The partial results are then accumulated in an output register after the ADC. This means 16 cycles are required to complete the 16 bit input. The number of cycles can be halved by splitting the computation across crossbars: one crossbar for the 8 bit MSB and one for the 8 bit LSB.
Conversely, an ADC-free scheme for sensing a PCM cell resistance with up to 8 bits of precision, which dynamically changes the reference levels and reduces access latency to 5 μs, is proposed in [118].
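The bit-serial input streaming described for [46] can be sketched as a shift-and-add loop. This is a simplified digital model, assuming unsigned integer inputs and integer weights, one input bit per cycle, and an ideal column sum standing in for the crossbar and ADC; the function name and data layout are hypothetical.

```python
def bit_serial_mvm(weights, inputs, n_bits=16):
    """Matrix-vector product with inputs streamed one bit per cycle.

    weights: list of rows, one row per input line, one entry per column.
    inputs:  unsigned integers smaller than 2**n_bits.
    """
    n_cols = len(weights[0])
    acc = [0] * n_cols                              # output register after the ADC
    for cycle in range(n_bits):                     # LSB first
        bits = [(x >> cycle) & 1 for x in inputs]   # 1-bit "DAC" levels
        for c in range(n_cols):
            # Column current: sum of weights on rows whose input bit is 1.
            col_sum = sum(b * row[c] for b, row in zip(bits, weights))
            acc[c] += col_sum << cycle              # shift-and-add accumulation
    return acc


weights = [[1, 2], [3, 4]]
inputs = [5, 6]
result = bit_serial_mvm(weights, inputs)
```

The accumulated result equals the full-precision product, while each cycle only ever drives a 1-bit input level and digitizes a small partial sum.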

Drivers
DAC circuitry is usually modified to serve multiple functions, as seen later in the architecture section 7. Before the driver and DAC are designed, the type of encoding required for the particular architecture has to be decided, depending on the device characteristics and the intended operating range on the device IV curve: whether the stimuli will be amplitude or pulse-width modulated, and at what amplitude levels and pulse durations. This also includes considerations of load versus driver resistance characteristics, discussed earlier in section 4, and power constraints. Several categories of driver circuits can be considered, namely voltage-mode versus current-mode drivers; these can borrow ideas [119][120][121][122] that allow amplitude swing adjustment and slew rate control to pulse NVM devices efficiently (the latter feature being an added plus for controlling crosstalk). Impedance control within the driver can also allow dynamic configuration of crossbar dimensions [65,119,120]. Methods of reducing the effect of crossbar wire resistance seen by the driver, by increasing driver voltages or using multiple-driver configurations [66,68] to reduce latencies, were discussed in section 4.

Sensing circuits
Accumulated currents at the end of each crossbar column (or row in the case of backpropagation) need to be sensed prior to digital conversion and subsequent storage or further processing (e.g. digital activation or pooling). The accumulated currents are either converted to analog voltages (voltage sensing) or sensed directly (current sensing) using various means [122][123][124][125]. In the former case the drop in bit line voltage is sensed against a reference voltage after a pre-charge and development phase. In the latter, the cell current is compared with a reference current generated by a reference/dummy array (sometimes with dynamic clamping of the bit line (BL) for a faster precharge phase), as in figures 19 and 20. Simple circuit diagrams of the voltage- and current-mode comparators are shown in figure 21. The choice of sensing mode depends on the size of the array, specifically the loading on the BL, as more cells per BL (and higher R LRS ) mean a longer access time, as shown in figure 22. With large BL loading (larger R LRS ) and long BL lengths it is best to choose current sensing for faster accesses. Challenges for current-mode circuitry are variations in the reference current, which can cause read failures when it overlaps with the read currents, and fluctuations in the BL clamp voltage, which translate into fluctuations in the voltage drop across the memristors. Voltage mode also has its challenges: memristor variations result in a wide range of BL voltages, so the reference must be selected to accommodate this range and/or schemes are needed to track it. Lower supply voltages affect both cases: with current sensing the headroom of the clamping device is compromised, and in voltage-mode sense circuits the lower drive voltage causes longer access times. The data pattern also affects access times, as the RC delay presented to the discharge path changes depending on how many memristor cells are selected; in [124] a simplified RC model of the bit line illustrates how the bit line discharge time increases with the percentage of R LRS and discusses various techniques to mitigate these effects.
Speed is key in sensing circuits [126,127], as it allows propagation to the next stage and/or muxing to share sensor circuitry with other crossbar columns. A low-latency current sensing technique is proposed in [123], together with a discussion of the effects of crosstalk and supply noise.
In a TIA circuit where current is integrated onto a capacitor and then sensed, the integration time is based on the acceptable noise tolerance of the integrator circuitry and the on/off ratio of the synaptic emulator [2,128]. Noise tolerance can be increased by increasing this integration time, albeit at the cost of more latency. Here R device is the average device resistance, N is the number of contributing devices, β is the ratio R LRS /R HRS , and V out is the voltage at the output of the op-amp. With decreasing integration time, more throughput (operations/second) can be achieved.

Activation function and derivative circuitry
Circuit approximations of neural functions that reduce area and complexity are studied in [5,6,51]. The implementation replaces the activation unit (a sigmoid-type function such as tanh, or a ReLU), which requires high-precision A-to-D and D-to-A circuitry, with an approximation circuit realizing a piecewise-linear (PWL) function (comparator and ramp voltage). Similarly, for backpropagation of correction errors, the MAC sum needs to be scaled by the derivative of the activation unit. The derivative of the PWL, which is a step function, is implemented by specifying two distinct states of the step function. It is shown in [6,51] that on the 60k MNIST data set the training and test accuracies of tanh() and PWL activation functions are comparable, and can be further improved by changing the low value of the derivative of the approximation function to a non-zero value, as discussed in section 3.
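A minimal behavioural sketch of such a PWL activation and its two-state derivative is given below. The clipping bounds of ±1, the unit slope, and the parameter names `low_value` and `high_value` are assumptions made for illustration and are not the circuit parameters of [5,6,51].

```python
def pwl(x, lo=-1.0, hi=1.0):
    """Piecewise-linear stand-in for tanh: linear between lo and hi, clipped outside."""
    return max(lo, min(hi, x))


def pwl_derivative(x, lo=-1.0, hi=1.0, low_value=0.0, high_value=1.0):
    """Two-state step function used to scale the backpropagated error.

    high_value applies in the linear region, low_value in the clipped regions.
    Setting low_value slightly above zero keeps clipped neurons trainable.
    """
    return high_value if lo < x < hi else low_value


error = 0.5
scaled_linear = error * pwl_derivative(0.3)                    # linear region
scaled_clipped = error * pwl_derivative(2.0)                   # clipped, zero gradient
scaled_leaky = error * pwl_derivative(2.0, low_value=0.01)     # non-zero low state
```

With a zero low state the error vanishes for saturated neurons, whereas a small non-zero low state lets a reduced error signal through.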

Other support circuitry
It was mentioned earlier that write voltages for emerging NVMs (PCM and RRAM) are much higher than logic supply voltages (a more significant challenge at advanced nodes [79]), so provisions are needed for high-voltage supplies, level-shift circuitry and charge pumps [125].

Architecture and system considerations
A high-level view of the processing unit architecture of Google's TPU ASIC [30,33,36] is shown in figure 23.
The custom ASIC fetches weights from nearby DRAM and inputs through the high bandwidth memory (HBM) interface, passes them through the matrix units (MXU) for multiplication and subsequent accumulation, and then applies activation, normalization, and pooling before writing the results back into the HBM for use in the next layer. There can be multiple instances of the MXU in each core and multiple cores within the ASIC. Other current commercial accelerators follow a similar architecture [42]. Analog-based accelerators need different architectural approaches because of their NVM-crossbar-based neural networks, and so must provision for:
• Avoiding weight saturation [3,5,61,129]
• Enhancing weight endurance and retention [56]
• Synapse weight inline calibration [47]
• Dual polarity based weight programming [5,130,131]
• High precision techniques (especially for training applications [61]) while keeping ADC overhead at a minimum [29,46,121,122]
• Pipelining to minimize hazard conditions and reduce buffering from one layer to the next
• Synapse suppression to allow for structured sparsity
• A solution to not only avoid but accommodate 'stuck at' faults [132]
• Network flexibility to dynamically reconfigure network shape and size [5]
• Network pruning for sparser representation of the crossbar (even for FC layers) [4,113,133].
The aforementioned requirements all have to be coordinated by a robust instruction set architecture. A chronology of analog-based accelerators is shown in figure 24, and their features are described in the following text. Some are architectural concepts [44,61,135] based on a particular technology node, while others are full [27,28] or partial implementations in silicon. One of the earliest analog-based accelerators, the ETANN [27], uses EEPROM-driven floating gate devices to store and adjust its 10 240 synaptic weights; Gilbert multipliers perform the multiplications, and their outputs are routed, after current summation, to a sigmoidal function emulator. This is a static architecture that depends on weight changes and voltages being calculated externally and then applied to modify the weights.
ANNA [130,134] (0.9 μm CMOS process) uses capacitive charge, refreshed from external RAM, to store synaptic weights for an optical character recognition application. Additionally, in ANNA, as in many of the architectures that follow [29,131], the data converters provide combined functions: the DAC also serves as a multiplier, multiplying a charge-driven weight bias with the digital inputs. The large voltage range on the capacitance minimizes errors due to charge leakage, while refresh circuitry is provided to compensate for this leakage. A means of handling the positive and negative weight contributions is provided within each multiplier cell. The SAR ADC not only incorporates a current comparator to compare the sampled, signed, summed current to a reference but also provides a squashing function characteristic to form the neuron body circuit. The overall ANNA [130,134] architecture provides a speed advantage of several orders of magnitude over conventional hardware in use at the time.
Fast forwarding more than 13 years, with growing interest in in-memory computing to overcome the von Neumann bottleneck, TrueNorth [28] (arguably analog-based) was built on a 28 nm process and provides 256 million synapses and 1 million neurons with a neurosynaptic SNN core network. It provides more flexibility and scalability than its predecessors owing to its tiled crossbars, and provides time multiplexing between its cores, a feature that continues in subsequent accelerators [29,44-47,61]. It is an inference-only design in which weights are trained offline and transferred onto an SRAM array for forward propagation.
Near-memory computing approaches are discussed in the 28 nm concept model DaDianNao [135], which supplies synaptic weights from EDRAM banks adjacent to the computational units. In this architecture strategy, neuron transfers are preferred over synapse weight transfers from memory, since there are fewer neurons than synapses and hence fewer external memory fetches. There are several tiles, and within each tile are 4 EDRAM banks which hold the synapses. Unlike the SRAM of TrueNorth [28], EDRAM requires periodic refresh and has higher latency. Interleaving among the 4 EDRAM banks is therefore used to overcome the destructive-read nature of the EDRAMs [135].
In section 5, emerging devices using electron spin to store information are discussed as synaptic emulator candidates, but in SPINDLE [29] a crossbar spin neuron is proposed, in which a memristive synaptic crossbar feeds a spintronic comparator incorporated within an enhanced SAR ADC. In figure 25 the neuron output is generated by sensing the resistance of the MTJ (which represents the comparison of a bias against the summed input current from the memristor crossbar). As in [130], this enhanced SAR ADC additionally performs an approximation of the activation function, in this case a hyperbolic tangent (tanh). SPINDLE [29] provides a hierarchical three-tiered architecture composed of spin neuromorphic arrays (SNAs), spin neuromorphic cores (SNCs) and SNC clusters. Each array (SNA), shown in figure 25, contains a memristive array, spin neurons, and peripheral circuitry for driving and conv-pooling operations. The SNAs are arranged within cores (SNCs), which also contain a local scratchpad memory (to store the input features that the SNAs need and the output features they generate) and a dispatch block to transfer the input features (thus SNAs that share input features sit inside the same core). Cores are then grouped into clusters, where each cluster has a global interconnect to a shared memory and a global control unit. There is a two-level memory hierarchy: on-chip distributed scratchpad memories local to the SNCs, and off-chip shared memory.
The routing fabric needed for data movement across the memristor crossbar arrays, and for their communication and synchronization with the CPU, is conceptualized in RENO [47] as a centralized mesh of crossbar arrays. Each group of four (64 × 64) arrays is connected to a group router, which is in turn connected to a central router. A routing management solution is proposed for MLP and AAM (auto-associative memory) architectures, including how the looping fields are created for the latter in order to determine the destination router address. Switched op-amp based sample-and-hold circuitry buffers the analog signal across the network through a multiplexer. In-line calibration is also provided to monitor the resistance shift of the memristor arrays, along with a means to restore them.
Buffering of analog signals between the various neural network layers is reduced in ISAAC [46] using pipelining, at the expense of more power consumption as all layers are simultaneously active. Using a VGG1 [16] implementation, a 16× throughput improvement over a non-pipelined ISAAC was obtained. ISAAC's computational efficiency (479-1707 GOPS/(s × mm^2)) is also superior to that of DaDianNao [135] (63-344 GOPS/(s × mm^2)). To reduce ADC size, an encoding scheme [46] is introduced where, if the sum of products would give an MSB of '1' due to large synaptic weights, the weights are stored in flipped form so that the MSB is zero, thus reducing the required ADC resolution (a flag indicating whether a column is in its original or flipped form is also stored). This type of encoding improves ADC efficiency and cell density.
PRIME [45,131], a dynamically morphable processing-in-memory architecture, boosts performance and energy efficiency by allowing memory cells to be switched on demand between neural network computation and storage. Also provided is a software/hardware interface with APIs that let developers map neural networks to the ReRAM subarrays, program weights and configure data paths (figure 26). The compile phase optimizes both the code and the mapping of the neural networks to the subarrays to realize small- to large-scale neural network specifications (which might require interconnection between banks). Since the PRIME architecture supports both MLP and CNN, an implementation of the pooling layer is discussed, favoring mean pooling (over max pooling) for its ease: a ReRAM subarray is simply programmed with weights of 1/n to achieve the desired n:1 mean pooling ratio.
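The reason mean pooling maps so naturally onto a crossbar is that it is itself a dot product with constant weights. The short sketch below models a single crossbar column programmed with weights of 1/n; it is a numerical illustration only, with a hypothetical function name, and ignores device quantization and non-idealities.

```python
def mean_pool_column(window):
    """n:1 mean pooling expressed as a dot product with weights of 1/n."""
    n = len(window)
    weights = [1.0 / n] * n          # values programmed into the subarray column
    return sum(w * x for w, x in zip(weights, window))


pooled = mean_pool_column([1.0, 2.0, 3.0, 6.0])   # one 2 x 2 window, flattened
```

Max pooling, by contrast, is not a linear operation and cannot be obtained from the column sum alone.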
Support for LSTMs is provided in the training accelerator PUMA [44], which, like [61], offers its own special instruction set architecture and compiler. The bit-slicing technique from the inference-only ISAAC [46] is enhanced in [61] to support its training implementation, where the high precision needed for ReRAM outer product accumulation is accommodated while reducing the overhead on ADCs and DACs. The operation is split into time slices spread over several crossbars of the same layer to achieve a 32 bit matrix value. This implementation introduces the concept of heterogeneous weight slicing, in which higher precision is allocated to the more frequently updated slices to reduce the likelihood of device saturation.
A means of handling the carries from these operations, and the frequency with which they are propagated, is also implemented in [61] (carry resolution step). Variants of SGD are also discussed in [61]; depending on batch size, these require replicated copies of a crossbar to prevent structural hazards and avoid additional use of shared memory.
Similarly, as described in [5], each copy of the DNN model processes and trains the same DNN in parallel, but each observes a different portion of the training database. The weights tend to diverge between copies, as each reacts to its own unique training sequence. An overseer engine is needed to coordinate the processing nodes and update a master copy based on their feedback in order to provide DNN training speedups.
Fast programming strategies (FPS), such as using longer pulses as coarse control steps for fast resistance change and then switching to shorter pulses for finer control [52], can be used. Other means of reducing write latency are investigated in [56]: the current state of the synaptic cell is compared with the target state to determine whether it is faster to reach the target by resetting and programming from the ReRAM R HRS state or by issuing a set and programming from the ReRAM R LRS state, so that fewer programming iterations are needed to meet the target resistance (figure 27). This method comes at the expense of retention, so a more reliable (albeit slower) means of controlling the strength of the CFs, which favors rupturing CFs to hit the target value, is also proposed in [56] (figure 28). Both methods are advantageous depending on the application: FPS suits in situ training, where retention requirements are less stringent and reduced-latency writes are needed, while the more reliable, slower form of programming suits inference applications, where device retention is required (figure 14).
Architectures will also require continuous monitoring of synaptic weights so that they can be reset when they near the danger zone of saturation, to prevent network freeze. Such a method is proposed in [129], where training is paused and the conductances are measured to indicate which conductance of the synaptic pair requires a reset; a reset is then issued, followed by an incremental partial set to restore the original conductance difference (now far from the saturation zone). An additional verify step may be needed for highly variable devices, adding write latency. Synapse suppression [51] is another means of 'removing' devices that dither under frequent updates. Of note, one network configuration may 'wear' devices more than another: for example, in convolutional nets the number of set and reset cycles on devices is three orders of magnitude larger than in FC nets, hence faster device degradation [3]. In the case of [61], the bit-slicing technique causes more updates in central slices than in edge slices.
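The reset-and-restore idea for a differential synaptic pair can be sketched as follows. The sketch assumes normalized conductances, a weight given by the difference G+ − G−, a hypothetical saturation threshold of 80% of the range, and a simplified policy in which both devices are reset and the larger one is then partially set; it is not the exact procedure of [129].

```python
def refresh_pair(g_plus, g_minus, g_min=0.0, g_max=1.0, threshold=0.8):
    """Reset a saturating conductance pair while preserving g_plus - g_minus."""
    danger = g_min + threshold * (g_max - g_min)   # assumed saturation zone boundary
    if max(g_plus, g_minus) < danger:
        return g_plus, g_minus                     # no action needed
    diff = g_plus - g_minus                        # weight to preserve
    if diff >= 0:
        # Reset both devices, then partially set g_plus to restore the difference.
        return g_min + diff, g_min
    # Same procedure with the roles of the two devices exchanged.
    return g_min, g_min - diff


refreshed = refresh_pair(0.875, 0.75)    # near saturation: weight 0.125 preserved
untouched = refresh_pair(0.5, 0.25)      # below threshold: left as is
negative = refresh_pair(0.25, 0.875)     # negative weight -0.625 preserved
```

In a real array the partial set would be followed by the verify step mentioned above, since the restored conductance is subject to device variability.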
This section by no means encompasses all the architectural considerations for NVM based accelerators but provides some insight as to the complexities and multidisciplinary approaches required to make these analog based accelerator architectures viable.

Summary
Analog-based accelerators can only be adopted if they provide significant advantages over current processing techniques, benchmarked across similar data sets and models. At this stage, while there is compelling evidence of reduced memory fetches through processing directly in the memory crossbar, and of increased throughput and parallelism by condensing the number of operations, there are still questions to be answered regarding the handling of non-idealities and variations in the circuitry and devices. In this paper we reviewed various synaptic emulator NVM candidates for in-memory computing and the device development required to meet the proposed solution-specific ideal requirements of a synaptic emulator. A hybrid of these devices, either within the same layer or across different layers, could approach the ideal emulator specifications and is an open area for research. The paper also reviews crossbar sizing limitations, effective line resistance and sneak paths, and the mitigation of these effects to allow high-density growth. Constant regeneration of the analog signal is needed to propagate it through the network, using supporting data converter circuitry that is itself a bottleneck requiring compromises in precision. The various architectures so far seem to form a consensus around SAR ADCs and the use of pipelining through several hierarchies of crossbars, with co-located memories distributed among them to allow re-configurability and communication between crossbar nodes and off-chip through global bus networks. Current concept analog-based accelerators propose techniques to overcome some of the challenges of device variation, retention, endurance, and circuit non-idealities while still maintaining classification accuracies comparable to their digital counterparts, and offer learnings for further research. They are not yet at the stage of accommodating the larger, more elaborate models and data sets in use today to benchmark against current commercial accelerators.
(a) Forward inference of pre-trained DNN, (b) To accelerate the DNN training.

Figure 4. Basic diagram and network inside a hardware accelerator, showing the internals of the PE doing a MAC operation on the left. Its instantiation within the matrix multiply unit can be presented as convolutional, FC, or both.

Figure 5. Matrix vector macro unit and its mapping to a matrix.

Figure 6. The condensation of two floating point operations, multiply and accumulate, into a single parallel operation in the crossbar.

Figure 8. A conceptual diagram of a 2D crossbar during backpropagation error calculation.

Figure 9. A diagrammatic view of a 2D crossbar illustrating a pipeline of backpropagation error calculation and weight update.

Figure 11. (a) Plots of write voltage to write threshold (V w /V t = 1.9) versus crossbar size for three memristor to driver ratios (R m /R T ). (b) Given V w /V t = 1.9, how many devices can be written in parallel. © 2016 IEEE. Reprinted, with permission from author, from [65].

Figure 12. Effective voltage at the selected cell (memristor device + selector if any), which is degraded from the disturbance voltage at the driver due to sneak paths and metal line resistance (source and sink resistances on the driver and current sense circuitry also play their part in IR drop but can be handled in the peripheral circuit design).

Figure 16. FeFET ferroelectric polarization and eventual charge trapping. With increased trap density over the operating life, the memory window (MW) decreases. © 2016 IEEE. Reprinted, with permission from author, from [97].

Figure 17. (a) Equivalent circuit of crossbar with data converters and contributors to errors. (b) Synapses programmed to G MIN : sensitivity to crossbar dimensions. (c) Synapses programmed to G MAX : sensitivity to crossbar dimensions. © 2020 IEEE. Reprinted, with permission from author, from [75].

Figure 18. Non-ideal DAC output due to R load (effective load resistance of the crossbar array). © 2020 IEEE. Reprinted, with permission from author, from [75].

Figure 19. (a) Read path and (b) conceptual rendering of waveforms produced by nominal voltage-mode sensing scheme. © 2015 IEEE. Reprinted, with permission from author, from [124].

Figure 20. (a) Read path and (b) conceptual rendering of waveforms produced by nominal current-mode sensing scheme. © 2015 IEEE. Reprinted, with permission from author, from [124].

Figure 22. Access time versus BL length for common voltage-mode sense amplifiers and current-mode sense amplifiers. © 2020 IEEE. Reprinted, with permission from author, from [124].

Table 1. Sample popular data sets.

Table 2. A timeline history of neural networks.

Table 4. Two-terminal selectors for emerging memories versus promising selector devices in development; information gathered from the survey in [58].