Recent progress in analog memory-based accelerators for deep learning

We survey recent progress in the use of analog memory devices to build neuromorphic hardware accelerators for deep learning applications. After an overview of deep learning and the application opportunities for deep neural network (DNN) hardware accelerators, we briefly discuss the research area of customized digital accelerators for deep learning. We discuss how the strengths and weaknesses of analog memory-based accelerators match well to the weaknesses and strengths of digital accelerators, and attempt to identify where the future hardware opportunities might be found. We survey the extensive but rapidly developing literature on what would be needed from an analog memory device to enable such a DNN accelerator, and summarize progress with various analog memory candidates including non-volatile memory such as resistive RAM, phase change memory, Li-ion-based devices, capacitor-based and other CMOS devices, as well as photonics-based devices and systems. After surveying recent work on circuits and systems, we conclude with a description of the next research steps that will be needed in order to move closer to the commercialization of viable analog-memory-based DNN hardware accelerators.


Introduction
Over the past five decades, information technology has been transformed by the virtuous combination of three intersecting trends: Moore's law, Dennard scaling, and the von Neumann (VN) architecture. Moore's law described exponential reductions in the cost per transistor [1] that then drove similarly exponential increases in the number of transistors per wafer, making multi-billion-transistor microprocessors (CPUs) and graphical processing units (GPUs) not just possible but profitable. Dennard scaling supplied a set of 'scaling laws' [2] that allowed those smaller transistors to be, fortuitously, both faster and lower in power. Also, the flexibility of the 'stored program' VN architecture allowed programmers to build a wide diversity of complex computational systems by leveraging these CPUs and GPUs as modularized 'building blocks'.
Over the past few years, this intersection of virtuous trends has begun to dissipate. Device scaling has slowed due to power and voltage considerations. It has become difficult (and thus extremely costly) to guarantee perfect functionality across billions of devices. Finally, the time and energy spent transporting data between memory and processor (across the so-called 'von Neumann bottleneck') has become problematic, especially for data-centric applications such as real-time image recognition and natural language processing [3].
One avenue for continuing to evolve the capabilities of future computing systems is to take inspiration from the human brain. Characterized by its massively parallel architecture connecting myriad low-power computing elements (neurons) and adaptive memory elements (synapses), the brain can readily outperform modern VN processors on many tasks involving unstructured data classification and pattern recognition. By taking design cues from the human brain, neuromorphic hardware systems could potentially offer an intriguing non-VN computing paradigm supporting fault-tolerant, massively parallel, and energy-efficient computation [4].
However, the number of different projects and proposals (many of them completely distinct from each other) that are now described as 'neuromorphic' computing has grown very large. (Recent surveys of neuromorphic hardware end up listing hundreds or even thousands of different citations [5,6].) Many of these efforts involve circuits, sometimes including novel devices, that attempt to carefully mimic something that we can currently observe with moderate accuracy within the brain, usually at the scale of a few neurons. This could be the exact neural/synaptic response, precise local connection patterns, or local learning rules such as spike-timing-dependent plasticity [7][8][9][10][11][12][13][14]. Other efforts involve new software algorithms partly or completely inspired by the architecture of the brain [15,16]. The motivations for such neuromorphic hardware research can range from improving our fairly limited understanding of exactly how our brains function, to the hope of engineering computers that could potentially operate at ultra-low power through sparse utilization (in both time and space) of computational resources that are tremendously large, in both sheer number and interconnectivity.
We recently surveyed the potential role that novel analog memory devices could play in these areas [17] (also see earlier assessments by other authors [18][19][20]). However, in this paper, we refine our scope much more tightly, focusing on the use of analog memory devices to build neuromorphic hardware accelerators for deep learning [21].
Artificial neural networks (ANNs), first conceived in the mid-1940s to mimic what was then known about neural systems [22,23], perform computations in a naturally parallel fashion. Modern graphical processing units (GPUs) have greatly increased the size of both the networks and the datasets that can be trained in reasonable time, giving rise to deep learning, or deep neural networks (DNNs) [24][25][26]: essentially, ANNs with many layers of neurons. Over the past few years, DNN performance has improved to near-human (or sometimes even better than human) capabilities on tasks such as classifying images [27], understanding speech [28], playing video games [29] and complex board games [30], and translating between languages [31][32][33]. More importantly, these developments have allowed DNN systems to become commercially pervasive, influencing social media sites, shopping and recommender systems, automated call centers, banking and finance, numerous cloud-based computing applications, and even our mobile phones and living rooms.
While some researchers occasionally attempt to connect DNNs back to biology [34], most deep learning practitioners do not concern themselves too much with neuromorphism. They are primarily focused on maximizing performance while finessing the limitations of commercially available VN hardware, which up until recently has meant hardware that was originally designed for something other than deep learning [35]. However, the intense interest in deep learning has led to research on [36,37] and the introduction of [38] custom-ASIC (application specific integrated circuit) chips for deep learning. While some of these chips also double as neuromorphic hardware [39], these efforts primarily focus on re-imagining the GPU as if it had been expressly designed for deep learning. Leveraging conventional digital circuit design techniques, numerous design teams are seeking to deliver hardware acceleration for high energy-efficiency, high throughput (Tbit/second), and low latency DNN computation without sacrificing neural network accuracy [37,40]. Thus, it is critical for researchers working on analog-memory-based hardware accelerators to both understand and take into account the advances that can be expected to arrive soon with such digital accelerators.
In the first section of this paper, we briefly overview deep learning and the major application opportunities for DNN hardware accelerators. Then we briefly discuss the research area of customized digital accelerators [36,37,40]. We discuss how the strengths and weaknesses of analog-memory-based accelerators match up to digital accelerators, and attempt to identify where the most promising future hardware opportunities might be found. We survey the extensive but rapidly developing literature on what would be needed from an analog memory device to enable such a DNN accelerator, and then summarize recent progress using various analog memory candidates. These include NVMs such as resistive RAM (RRAM) and memristors, phase change memory (PCM) devices, Li-ion-based devices, capacitor-based and other CMOS devices. In addition to these efforts focused on integrating analog resistive-type electronic memories onto CMOS wafers, we also survey recent work on photonics-based devices and systems and discuss their potential impact on deep learning accelerators. After surveying recent work in advancing and demonstrating the circuits and systems behind analog-memory-based accelerators, we conclude with some thoughts on the next research steps that will be needed to move closer to commercializing viable analog-memory-based DNN hardware accelerators.

Overview of deep learning
In this section, we briefly discuss the basic computational needs of deep learning, including both forward inference and training with the backpropagation algorithm. Readers interested in a more complete overview should consult other recent tutorials in the hardware accelerator space [21,37,40]; those interested in truly learning the field should consult one of the many excellent online resources [41][42][43][44].
In general, the topology of a deep neural network is fixed by a designer before any training occurs. The size of the first layer (i.e. the number of neurons in the input layer) is typically chosen to match the size of the incoming data of interest: the pixels of a standard-size image, reduced audio content from digitized speech, encoded letters or words from a written document, etc. The size of the output layer is chosen based on the task the DNN should accomplish, such as classifying an image into one of a number of pre-defined classes ('this is a handwritten seven', 'this is a border collie') or producing an output vector of interest. Examples of the latter include Mel-frequency cepstral coefficients (MFCCs) for speech recognition [28] or a vector of predicted probabilities of the next letter or word in a sentence.
Forward inference is the evaluation of what an already-trained neural network 'thinks' of one or more new data-examples, using weights that have already been optimized by training. It turns out that, despite the apparent complexity of deep learning [44], the set of computational tasks involved in this computation is not very large. Better yet, most of the computational effort is spent implementing only a small subset of those tasks. Figure 1 shows one of the most important of these computational tasks: the vector-matrix-multiply, or VMM. (Here we note that while there are many different DNN architectures, most differ only in how many 'balls' and 'sticks' (neurons and weights) are involved and how they are organized. In contrast, the actual operations performed with those resources are reasonably consistent across different DNN models.) Many DNNs contain some variant of a VMM, in which a vector of neuron excitations, $x_i$, must be multiplied by a matrix of weights, $w_{ij}$, generating a new vector of neuron excitations for the next layer, $y_j$. This breaks down into a series of multiply-accumulate (MAC) operations ($\sum_i w_{ij} x_i$), followed by a nonlinear squashing function, $f()$.
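The VMM described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `forward_layer`, the list-of-lists weight layout `w[i][j]`, and the example values are assumptions made here, and tanh is chosen arbitrarily as the squashing function.

```python
import math

def forward_layer(x, w, f):
    """Compute y_j = f(sum_i w[i][j] * x[i]) for every downstream neuron j."""
    n_in, n_out = len(w), len(w[0])
    assert len(x) == n_in, "one excitation per upstream neuron"
    y = []
    for j in range(n_out):
        acc = 0.0
        for i in range(n_in):
            acc += w[i][j] * x[i]  # one multiply-accumulate (MAC) operation
        y.append(f(acc))           # squashing function applied once per neuron
    return y

# Hypothetical example: 3 upstream neurons feeding 2 downstream neurons.
x = [1.0, 2.0, 3.0]
w = [[0.5, -1.0],
     [0.25, 0.0],
     [-0.5, 1.0]]
y = forward_layer(x, w, math.tanh)
```

Note that the inner loop performs N × M MACs but only M evaluations of the squashing function, which is why the MAC dominates the computational effort.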
From a neural network perspective, $f$ is very important. Without this nonlinearity, forward evaluation of a multi-layer neural network of any number of layers would simply collapse into a single linear equation. From a computational perspective, evaluating the squashing function takes much less effort than the preceding MAC. The newest DNNs tend to use many 'rectified linear units', or ReLU functions. ReLU is a simple piece-wise linear function with only two segments: one along the x-axis, outputting zero for any input sum that is negative; and a second segment along the diagonal $f(x) = x$, directly passing any positive sum as the output. The ReLU helps avoid problems stemming from saturating excitations, and also helps keep gradients from vanishing for deep networks. However, recurrent networks such as long short-term memory (LSTM) [45] and gated recurrent units (GRUs) [32], which tend to suffer from gradients that explode instead of vanish, still tend to use saturating nonlinearities such as the logistic or hyperbolic tangent functions.
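A small numerical check can make the 'collapse' argument concrete. The sketch below uses made-up 2 × 2 and 2 × 1 weight matrices (assumptions for illustration only) to show that two layers without a squashing function give the same result as one layer with the merged weight matrix, whereas inserting a ReLU between them breaks that equivalence.

```python
def relu(s):
    # Two segments: zero for negative sums, identity for positive sums.
    return s if s > 0.0 else 0.0

def matvec(w, x):
    # y_j = sum_i w[i][j] * x[i]
    return [sum(w[i][j] * x[i] for i in range(len(x))) for j in range(len(w[0]))]

def matmul(a, b):
    # (a b)[i][k] = sum_j a[i][j] * b[j][k]
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]

w1 = [[1.0, -2.0], [0.5, 1.0]]   # hypothetical first-layer weights
w2 = [[2.0], [1.0]]              # hypothetical second-layer weights
x = [1.0, 1.0]

# Without f(): two layers are equivalent to one layer using the product w1 w2.
two_linear = matvec(w2, matvec(w1, x))
one_linear = matvec(matmul(w1, w2), x)

# With ReLU between the layers, the network no longer reduces to one matrix.
with_relu = matvec(w2, [relu(s) for s in matvec(w1, x)])
```

Here the hidden sums are 1.5 and -1.0; ReLU zeroes the negative one, so the nonlinear network outputs 3.0 where the purely linear one outputs 2.0.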
The neuron excitations at the input layer come directly from the data-example being evaluated, including any preprocessing so that it looks 'just like' the data-examples that were used for training. The network is evaluated serially from input to output. (Pipelining can be introduced, so that layer #1 can already start working on data-example #2 while layer #2 is still working on the excitations just passed to it from layer #1.) At the output layer, a softmax operation is frequently performed. Here, each raw excitation $y_j$ is put through an expanding nonlinearity (such as an exponential), and the intermediate result, $q_j = \exp(y_j)$, is then normalized by the sum of all such intermediate results across the entire output layer. This produces outputs that are guaranteed both to fall between zero and one, and to sum up to one as well. The softmax operation produces a vector of probabilities, representing the predictions (or the guess/vote) of the pre-trained DNN, given that particular data-example.
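As a minimal sketch, the softmax described above might be written as follows. Subtracting the maximum excitation before exponentiating is an assumption added here for numerical safety (it is not mentioned in the text, and it leaves the normalized result unchanged); the input values are arbitrary.

```python
import math

def softmax(y):
    """Return q_j / sum_k q_k with q_j = exp(y_j)."""
    # Assumption: shift by max(y) to avoid overflow; the ratio is unaffected.
    m = max(y)
    q = [math.exp(v - m) for v in y]
    total = sum(q)
    return [v / total for v in q]

# Hypothetical raw output excitations for a 3-class problem.
p = softmax([2.0, 1.0, 0.1])
```

The resulting values lie strictly between zero and one and sum to one, so they can be read as the network's predicted class probabilities.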
Training is the process of tuning the weight matrices to a set of values that can provide good performance (e.g. accurate classifications, predictions, translations, high game scores, etc.). This begins with the same forward inference step described above. At first, however, since the weights are randomly chosen, the output result is likely to be nowhere near the desired target. The backpropagation algorithm [46] is a supervised training algorithm that attempts to tune the weights by cycling through three steps: forward inference of training data for which the desired output vector (or label) is known; reverse propagation of errors based on the difference between the current guess of the DNN and the correct label or other 'ground truth'; and weight update of each weight based on the excitation it saw during forward inference and the error it induced using that excitation, as computed by reverse propagation.
Figure 1. An important computational task for DNN forward inference is a vector-matrix-multiply (VMM), in which a vector of neuron excitations, $x_i$, is multiplied by a matrix of weights, $w_{ij}$, via a multiply-accumulate (MAC) ($\sum_i w_{ij} x_i$), and then put through a nonlinear squashing function ($f()$), to generate the new vector of neuron excitations for the next layer, $y_j$.
The additional computation associated with training is also dominated by a MAC or VMM, this time proceeding from right to left. Figure 2 shows the reverse propagate step, as a vector of errors ($\delta_j$) is multiplied by the transpose of the original weights $w_{ij}$. Instead of putting this sum through a nonlinear squashing function, however, the sum is multiplied by the derivative of the squashing function as evaluated at the original excitation, $x_i$. This formula arises from the use of the chain-rule to compute the derivative of an 'energy function', $E$, for the overall DNN as a function of each individual weight, $w_{ij}$ [44]. If forward inference produces a guess, $y$, which should have been $g$, we can choose an energy function that is minimized only when $y = g$. Backpropagation then allows the DNN to compute the derivative with respect to each weight, or how that weight needs to change in order to allow the DNN to do a better job the next time it sees that same training example.
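The reverse propagate step can be sketched as below. The function names and example values are hypothetical. The sketch also assumes a tanh squashing function, whose derivative can conveniently be written in terms of the stored excitation itself (for $x = \tanh(s)$, $f'(s) = 1 - x^2$); other squashing functions would need their own derivative.

```python
def tanh_prime_from_excitation(x):
    # For x = tanh(s), the derivative f'(s) equals 1 - x**2.
    return 1.0 - x * x

def reverse_propagate(delta_down, w, x_up, f_prime):
    """Return upstream errors: delta_i = f'(x_i) * sum_j w[i][j] * delta_j."""
    delta_up = []
    for i in range(len(w)):
        acc = 0.0
        for j in range(len(delta_down)):
            acc += w[i][j] * delta_down[j]  # MAC against the transposed weights
        delta_up.append(f_prime(x_up[i]) * acc)
    return delta_up

# Hypothetical layer: 3 upstream neurons, 2 downstream neurons.
w = [[0.5, -1.0],
     [0.25, 0.0],
     [-0.5, 1.0]]
delta_up = reverse_propagate([0.1, -0.2], w, [0.0, 0.5, -0.5],
                             tanh_prime_from_excitation)
```

The same weight matrix is traversed as in forward inference, only along its other dimension, which is why the text describes this step as a VMM with the transposed weights.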
As we proceed from the output of the network back towards the input, the very first δ vector typically comes from the raw difference between the network output y and the label vector, g. Interestingly, it turns out that multiplying this difference by the derivative of the squashing function used at the output neuron can cause problems. One way to deal with this is to train against 'cross-entropy loss', so that the underlying energy function applies a logarithm to the softmax outputs. The advantage of this is that the derivative at this output neuron cancels out, and thus training does not get stuck even if the initial choice of weights produces excitations at the output layer that are not just wrong, but large in magnitude. Since such large output excitations start out already in the regime where the derivative of f gets very small, multiplying by this derivative would strongly suppress the very corrections that the network will be applying to fix this. As a result, training with this final-layer derivative can be very slow to get started.
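The cancellation mentioned above can be made explicit with a short derivation. The notation below is an assumption introduced for illustration: $p_j$ denotes the softmax output for raw excitation $y_j$, $g_j$ is the label vector (with $\sum_j g_j = 1$), and $E$ is the cross-entropy energy function.

```latex
\begin{align}
p_j &= \frac{\exp(y_j)}{\sum_k \exp(y_k)}, \qquad
E = -\sum_k g_k \ln p_k, \\
\frac{\partial \ln p_k}{\partial y_j} &= \mathbb{1}[k = j] - p_j, \\
\frac{\partial E}{\partial y_j}
  &= -\sum_k g_k \left( \mathbb{1}[k = j] - p_j \right)
   = p_j \sum_k g_k - g_j
   = p_j - g_j .
\end{align}
```

Under these assumptions, the error delivered to output neuron $j$ is simply the difference between the softmax output and the label, with no remaining derivative factor that could shrink the correction when the output excitations are large.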
Another method is to skip the softmax, the logarithm, and the derivative at the output layer. While this second approach technically does not implement the exact chain-rule, it seems to work fine for simple networks. Although it can simplify an implementation, one would need to confirm that it still works on the very deep networks that have become popular in the past few years.
Reverse propagation takes place throughout the network from output towards input, and can terminate once the accumulated error values ($\delta_j$) have been delivered to the first hidden layer. Weight update for each weight is then the product of the original upstream neuron excitation, $x_i$, and the downstream neuron's error, $\delta_j$. Typically this is scaled by a fairly small number, $\eta$, called the learning rate. Note that, during training, each neuron needs to hold onto its original excitation, $x_i$, until it is used for weight update. In contrast, if we are only performing forward inference, that same excitation can be discarded as soon as it has contributed to all the associated MAC operations. As a result, introducing pipelining is somewhat more complicated for a full training implementation than for a forward inference-only implementation.
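The weight update is an outer product between the stored excitations and the back-propagated errors. The sketch below is illustrative only: the function name and numbers are made up, and it assumes the sign convention in which $\delta_j$ is defined as (label minus guess), so that the scaled product is added to the weight.

```python
def weight_update(w, x_up, delta_down, eta):
    """Apply w[i][j] += eta * x_i * delta_j in place and return w."""
    for i in range(len(x_up)):
        for j in range(len(delta_down)):
            # Assumption: delta_j = (label - guess), so the update is added.
            w[i][j] += eta * x_up[i] * delta_down[j]
    return w

# Hypothetical 2 x 2 layer starting from zero weights.
w = [[0.0, 0.0],
     [0.0, 0.0]]
weight_update(w, x_up=[1.0, 2.0], delta_down=[0.5, -0.5], eta=0.1)
```

Because every weight needs both $x_i$ and $\delta_j$, the excitations from forward inference must still be available when the errors arrive, which is the storage requirement noted in the text.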
There are many subtleties to deep learning beyond the above discussion. The learning rate (and other 'hyperparameters') must be chosen carefully: not so small that learning is too slow, but not so large that the changes induced by each training example inhibit convergence. Often the learning rate is modified during training. The set of training data must be chosen extremely carefully in order to represent the intended test data, with a subset set aside as validation data in order to evaluate the performance of the network while it is being trained. The training data should preferably be supplied to the network in a random order (this is the 'stochastic' in 'stochastic gradient descent').
The initial distribution of random weights needs to be chosen carefully, so that the accumulated sums ($\sum_i w_{ij} x_i$) land in a useful region of the squashing function. Then, as the weights get trained, these 'internal covariate' distributions will shift [47], which can end up moving the sums out of that regime and thus requiring corrective steps such as 'weight normalization' [48] or 'batch normalization' [47].
Probably the most important aspect of deep learning to convey is the realization that the mathematics of the training algorithm is only guaranteed to compute the weight adjustments that will move the DNN towards better performance on exactly those example(s) that were just examined with forward inference. There is no guarantee that adjusting the weights for the next set of training examples (often referred to as a 'minibatch') will not completely ruin the improvements for the first set. Fortunately, if the learning rate is small and the training examples are repeatedly supplied to the network, stochastic gradient descent tends to get better on the entire training set.
At that point, however, the algorithm is going to attempt to perfect this, driving the energy function to zero, at which point the network has literally memorized the training dataset. In contrast, the commercial interest in DNNs stems from good generalization performance: how well can a trained DNN handle handwritten digits, pictures of dogs, spoken sentences, written sentences, etc., that it has never seen before? A deep learning practitioner spends considerable effort finessing the subtle but important distinction between the mathematical goal of the DNN (memorizing the training set) and the actual desired engineering goal (good generalization performance on a much larger and effectively unknowable 'test' set). Tricks such as dropout [49] and early stopping [28] are some of the many approaches used to maximize the generalization performance of DNNs. Much more information can be obtained from books [44], online resources [41][42][43] and conferences such as ICML and NIPS.
Table 1 lists the two main opportunities for hardware accelerators: those designed for just forward inference of pre-trained DNNs, and those designed to accelerate DNN training. For forward inference, there is a set of hardware opportunities in highly power-constrained environments such as internet-of-things (IoT) devices, edge-of-network devices and sensors, mobile phones, and autonomous vehicles. There are also numerous forward inference opportunities in the cloud or server room, as described quite well in [38]. Despite these distinct opportunities, the performance aspects that are likely to be more or less important for forward inference are relatively similar. While throughput in terms of tera-operations per second (TOP/s) or, equivalently, in data-examples per second, is always important, a forward inference application is quite likely to value low latency over throughput. This is likely to be as true in edge-computing (an IoT sensor reading just changed, an autonomous vehicle must respond to its sensors, etc.) as in the cloud (a customer is waiting for this particular search/recommendation/translation/recognition result). While both scenarios always profit from lower power, edge systems will require extremely power-efficient computation.
For the near future, it appears that training can be expected to take place mostly in the cloud. In the future, there may well be opportunities for training in the field, but this would be much easier if the problem of 'catastrophic forgetting' during DNN training [50] can be solved. This would allow an edge-based training chip to update a network on new data-examples without sacrificing the performance on training examples that are no longer readily available. Typically, training is now performed in a distributed manner using many parallel workers, either working on the same model with different data (data parallelism) or on multiple instances (model parallelism) that can improve performance by averaging the different model outputs [44]. For data parallelism, it is important that the necessary communication between the workers over an interconnecting network does not itself become the bottleneck that determines the total time needed for training [51]. This then favors approaches that can harness the improved network performance (e.g. generalization accuracy) offered by having multiple workers while using the interconnect between the workers wisely [52]. The overall goal of a hardware accelerator is to complete training in a shorter total time. Thus, in contrast to forward inference, latency on any one training example is not as critical as raw throughput. Power and area-efficiency are important simply as a means to packing as much compute as possible into each card-slot of a given standardized volume and power envelope (e.g. 75 W or 300 W).

Digital accelerators for deep learning
The recent history and the apparent emergence of deep learning owes much to graphical processing units (GPUs). Deep learning can be considered as the fortuitous convergence of a scalable learning algorithm (i.e. one that drives better performance as the models and the training data get larger), the easy availability of vast amounts of training data via the Internet, and the raw computation needed to train and implement very large networks. The first two components have been available for 30 and 20 years, respectively. The final ingredient was the fast, parallel computation provided by GPUs [26].
In a GPU implementation, the VMM operations described in figures 1 and 2 are turned into matrix-matrix, or even into tensor-tensor operations [44]. This allows mini-batches of examples to be computed at the same time, with the MACs for each layer taking place in parallel on the many SIMD (single-instruction multiple-data) processors within a modern GPU. GPUs are particularly efficient when multiplying large matrices of roughly unity aspect ratio, and thus the size of the mini-batch is chosen in order to fully utilize either the compute or memory resources of the GPU. (Note, however, the inherent tension between the large mini-batch sizes that optimize computation, and the small mini-batch sizes that would help keep latency low.) The advent of sophisticated layers of middleware and hardware drivers such as cuDNN has allowed deep learning practitioners to focus solely on high-level frameworks such as TensorFlow and Caffe, yet still harness the full computational capabilities of GPUs. As we mentioned earlier, the fact that only a fairly small set of fundamental operations is involved has helped greatly.
Research in custom digital accelerators primarily focuses on re-designing a GPU-like processor, but as if it had been designed explicitly for deep learning. This can either be done with full ASIC designs [38,53,54] or with more flexible field-programmable gate arrays (FPGAs) [55]. The fundamental building block for the critical MAC operation looks something like figure 3: a processing element that receives three pieces of data ($x_i$, $w_{ij}$, and the partial sum so far, $y_j|_{i-1}$) and outputs the new partial sum, $y_j|_i$. While this seems rather simple, there is a strong incentive to carefully organize the complex 'systolic' data-flow into and among these processing elements [37,40].
Table 1. Hardware accelerator opportunities break into two major application areas: hardware for the evaluation of pre-trained DNNs (forward inference), either in extreme-power-constrained environments (IoT, edge-of-network, autonomous vehicles, etc.) or in the server room [38]; and hardware for DNN training, typically performed in a distributed manner in the server room, harnessing many compute nodes working in either a data- or model-parallel fashion [44].
Figure 3. In a digital accelerator, MAC operations are implemented by processing elements that work with three pieces of data ($x_i$, $w_{ij}$, and the partial sum so far, $y_j|_{i-1}$) in order to produce the new partial sum, $y_j|_i$.
The overarching concern driving all deep learning accelerators is the enormous cost of moving large amounts of data over any long distance. For example, bringing data onto a processor chip from off-chip memory is much more expensive than retrieving it from a local register. One way to reduce the volume of incoming data is to reduce the precision (number of bits) with which the data is encoded. This can be done with fixed-point arithmetic (integers with a scale divider to help tune the dynamic range where it is needed) or fewer bits in the mantissa and/or exponent in a floating-point number. Precision in forward inference implementations has been aggressively tuned all the way down to 1 or 2 bits, using binary (0, 1) or trinary (−1, 0, 1) encoding [56]. Much more typical are weights encoded using 8 bits. One of the advantages here is that the encoding can be introduced during the training process and its impact both measured and minimized during training. A similar approach is to 'prune' the network, eliminating neurons during the final stages of training that can be identified as unimportant [57]. Individual weights that are unimportant can even be removed, if the matrix can be stored and delivered to the accelerator efficiently using sparse matrix techniques. Alternatively, unimportant weights that cannot be removed can be set to zero and the circuitry simply instructed to skip over such weights, eliminating unnecessary computations. Compression techniques can reduce the on-chip bandwidth, at some increase in computation associated with decompression [57]. All these approaches can help reduce the amount of data that must be brought on-chip in order to feed the MAC units shown in figure 3.
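The fixed-point encoding mentioned above can be illustrated with a short sketch. Everything here is an assumption chosen for illustration: the function names, the signed 8-bit range, and the scale of 1/64 are not taken from any particular accelerator.

```python
def quantize(value, n_bits, scale):
    """Encode `value` as a signed n_bits integer, in units of `scale`."""
    q_max = 2 ** (n_bits - 1) - 1      # e.g. +127 for 8 bits
    q_min = -(2 ** (n_bits - 1))       # e.g. -128 for 8 bits
    q = round(value / scale)           # nearest integer number of steps
    return max(q_min, min(q_max, q))   # clip values outside the dynamic range

def dequantize(q, scale):
    """Recover the approximate real value represented by integer q."""
    return q * scale

# Hypothetical weights encoded with 8 bits and a step size of 1/64.
scale = 1.0 / 64
codes = [quantize(v, 8, scale) for v in [0.8, -0.31, 3.0, -3.0]]
approx = [dequantize(q, scale) for q in codes]
```

The scale trades resolution against range: a smaller step represents small weights more faithfully but clips large ones sooner, as the values 3.0 and -3.0 show here.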
While forward inference appears to work for many DNNs even at low precision, DNN training appears to call for higher precision in order to avoid sacrificing significant accuracy. One issue with DNN training is the large contrast between the absolute magnitude of the weights and the magnitude of the tiny weight changes requested by a large mini-batch. As training proceeds, weight updates naturally get smaller, both because the learning rate is typically reduced during training and, inherently, because the errors get smaller as the network does a better job on each example. At any given precision, there will be a requested weight update that is effectively smaller than the least significant bit (LSB). Various tricks such as stochastic rounding can help reduce the precision beyond this limit while still achieving good training accuracy [58]. Other tricks are being developed to help reduce the amount of data conveyed between the various chips ('workers') participating in distributed training.
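The sub-LSB problem, and why stochastic rounding helps, can be seen in a small experiment. This is a sketch under stated assumptions rather than the scheme of [58]: the function names, the LSB of $2^{-8}$, the update size of one tenth of an LSB, and the fixed random seed are all illustrative choices.

```python
import math
import random

def round_to_nearest(value, lsb):
    """Deterministic rounding to the nearest multiple of lsb."""
    return round(value / lsb) * lsb

def stochastic_round(value, lsb, rng):
    """Round up with probability equal to the fractional remainder."""
    scaled = value / lsb
    lower = math.floor(scaled)
    round_up = rng.random() < (scaled - lower)
    return (lower + (1 if round_up else 0)) * lsb

lsb = 2.0 ** -8
update = 0.1 * lsb            # requested update is only a tenth of an LSB
rng = random.Random(0)        # fixed seed so the sketch is repeatable
n = 10000

nearest_total = sum(round_to_nearest(update, lsb) for _ in range(n))
stochastic_total = sum(stochastic_round(update, lsb, rng) for _ in range(n))
mean_stochastic_in_lsb = stochastic_total / n / lsb
```

With round-to-nearest, every one of these updates is lost and the accumulated change is zero. With stochastic rounding, roughly one update in ten moves the weight by a full LSB, so the average applied change stays close to the requested 0.1 LSB.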
An important part of optimizing the data-flow into and among these processing elements is designing the hardware to match the inherent re-use of data within the algorithm. A family of DNNs offering many opportunities for such data re-use is the convolutional neural network (CONV-net) [24]. As discussed earlier, the main difference between various DNNs is how the 'balls' and 'sticks' are organized. Figure 4 shows two important types of layers within DNNs: the fully-connected (FC) layer (figure 4(a), at left), in which every pair of neurons across the two neighboring neuron layers shares a unique weight, and the CONV-layer (figure 4(b)). (Note that neither of these configurations has any connections within layers.) A CONV-layer contains many neurons, often organized into planes. For instance, the input color images to a CONV-net trained on ImageNet contain three planes (red, green, blue). In most CONV-nets, the number of planes increases rapidly as one moves away from the input layer. Instead of a unique weight between all possible upstream and downstream neurons, there are small weight kernels (frequently a 3 × 3 array for each input plane) which are convolved across the input planes to produce the output planes. Since the same kernel is needed in order to produce $y_j$ from $x_i$, $x_{i+1}$, $x_{i+2}$ and $y_k$ from a correspondingly shifted set of inputs, there is significant data reuse. As the number of planes increases from CONV-layer to CONV-layer, pooling layers and larger strides (e.g. stepping the convolutional kernel in jumps of 2 pixels at a time rather than just 1) help quickly reduce the lateral dimensions. Convolution makes enormous sense for image processing, inherently allowing a system to learn and apply specific kernels to recognize features within images independent of the specific location within the image. Much of the success of deep learning has come from the rapid progress of CONV-nets on very impressive image processing tasks [27,42].
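Kernel re-use and stride can be illustrated with a one-dimensional simplification of the CONV-layer described above. The sketch assumes a single input plane, a three-element kernel, and no padding or pooling; the function name and values are made up. As is conventional in DNNs, the kernel is applied without flipping.

```python
def conv1d(x, kernel, stride=1):
    """Slide one shared kernel across x: y = sum_t kernel[t] * x[start + t]."""
    k = len(kernel)
    return [sum(kernel[t] * x[start + t] for t in range(k))
            for start in range(0, len(x) - k + 1, stride)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical input row of 6 pixels
kernel = [1.0, 2.0, 1.0]             # the same 3 weights are used everywhere

y_stride1 = conv1d(x, kernel)            # 4 outputs
y_stride2 = conv1d(x, kernel, stride=2)  # larger stride: only 2 outputs

# A fully-connected layer between the same 6 inputs and 4 outputs would
# instead need one unique weight per input/output pair.
fc_weights = len(x) * len(y_stride1)
```

In this toy case the convolution uses 3 unique weights where the FC layer would use 24, and every weight (and most inputs) is re-used for several outputs, which is the data re-use that digital accelerators are designed to exploit.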
A few years ago, CONV-nets such as AlexNet [27] included multiple FC-layers near the output layer. The number of unique weights in a CONV-layer is quite low, sometimes 1000× smaller than the number of neurons, whereas an FC-layer needs a unique weight for every pair of neurons it connects. Given that memory and memory bandwidth are the first things one runs out of in a GPU or digital-accelerator implementation, the trend by DNN practitioners has been to increase the number of CONV-layers and decrease the number of FC-layers to the bare minimum [59].

Analog-based accelerators
As we noted in the previous section, the most important priority in designing any DNN accelerator is minimizing both the amount of data that needs to be moved, and the distance that it needs to be moved. As a result of this realization, a large fraction of the activity in digital accelerators has focused primarily on optimizing the computations behind memory-light DNN models such as CONV-nets [24]. In fact, recent reviews of digital accelerators have focused solely on forward inference-only accelerators for CONV-nets [37,40]. This is great for applications such as image processing, but less ideal for other applications that depend on FC-layers, including families of recurrent neural networks mentioned earlier such as LSTMs [45] and GRUs [32], which have fueled recent advances in machine translation, captioning, and other natural language processing. Fortunately, in the same way that digital accelerators seem uniquely well-suited for CONV-layers, analog-memory-based accelerators seem to be uniquely well-suited for FC-layers.
The heart of any analog-based accelerator is a memory array that can store the values of the weight matrix in an analog fashion (figure 5). Weights are encoded into device conductances (inverse of resistance), typically (but not always) using NVM devices. In analog-based accelerators, the MAC operations within each VMM are performed in parallel at the location of the data, using the physics of Ohm's law and Kirchhoff's current law. This can completely eliminate the need to move weight data at all.
Conventionally, NVM devices are used as digital memory devices. A high conductance or SET state might represent a digital '1' and a low conductance or RESET state might represent a '0'. In a crossbar array of such memory cells (figure 5), access devices allow addressing of a single memory cell by appropriate activation of word- and bit-lines, for reading device conductance to retrieve stored data and for programming device conductance to update the stored digital data values.
Such an NVM array can readily be used as an accelerator for deep neural networks. As shown in figure 5, each FC neural network layer, connecting N neurons to M neurons, maps reasonably well to a crossbar array of N × M weights. (As we will describe below, typically we use multiple conductances per weight.) For forward inference, signals are applied to the horizontal row-lines of the array-core, and a small trickle read current is generated in each device along the row, just as it would be in a memory application.
However, unlike the memory application, we do not activate just one row at a time and individually sense each small trickle current at the end of each column-line to retrieve digital data. Instead, we activate all the rows simultaneously, and allow these trickle currents to aggregate along the entire column-line. If we are careful to encode each upstream neuron activation into the voltage that is applied to 'its' row, then Ohm's law at each stored conductance implements the multiplication between neuron excitation x and weight w (figure 6). Once Ohm's law has performed the multiply operation, the summation along the column-lines via Kirchhoff's current law implements the accumulate operation.
In order to be able to encode signed weights w using positive-only conductances G, we typically take the difference between a pair of conductances, so that $w = G^{+} - G^{-}$. In some cases, we can use a 'shared' column of devices, or even a dedicated reference current instead of $G^{-}$. However, this requires that each device can be tuned both up and down in a gradual manner, which is not available for some well-known NVM devices such as PCM and filamentary RRAM.
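The multiply-accumulate described above can be sketched numerically. This is a minimal model under stated assumptions: the devices are perfectly ohmic, wire resistance and access devices are ignored, and the function name `crossbar_vmm`, the read voltage, and the conductance values are hypothetical.

```python
import numpy as np


def crossbar_vmm(x, g_plus, g_minus, v_read=0.2):
    """Differential column currents of an ideal crossbar with w = G+ - G-.

    x       : upstream activations in [0, 1], shape (N,)
    g_plus  : conductances in siemens, shape (N, M)
    g_minus : conductances in siemens, shape (N, M)
    """
    v = v_read * np.asarray(x)   # one voltage per row (amplitude encoding)
    i_plus = v @ g_plus          # Ohm's law per device, Kirchhoff sum per column
    i_minus = v @ g_minus
    return i_plus - i_minus      # proportional to x @ (G+ - G-)


rng = np.random.default_rng(0)
n_rows, n_cols = 4, 3
g_plus = rng.uniform(1e-6, 1e-5, size=(n_rows, n_cols))
g_minus = rng.uniform(1e-6, 1e-5, size=(n_rows, n_cols))
x = rng.random(n_rows)

currents = crossbar_vmm(x, g_plus, g_minus)
print(currents)
```

All N × M multiplications and the M summations occur in a single read step in this picture; the code only emulates that with a matrix product.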
Note that the neuron excitation can be encoded onto the voltages in one of two ways. If the x value is mapped to a unique voltage, then the instantaneous aggregated current along the column-line encodes the MAC result. While this can be measured as soon as the current stabilizes, there are a few drawbacks. First, a dedicated D/A converter is required at every row to supply the voltages, meaning that the target resolution must be specified at fabrication. Second, since the NVM devices could be read anywhere within a range of different read voltages, their I-V characteristics must now be highly linear. Finally, we have no remedy if the instantaneous power involved with activating all the row-lines simultaneously turns out to be excessive.
In contrast, by encoding the neuron excitation x into the duration for which a constant read voltage is supplied, many of these drawbacks are removed. We do not need any D/A converters, and the NVM device could be significantly non-ohmic because we are going to use only one read voltage. The signal pulse conveying the analog data within its duration can be manipulated across the chip using digital circuits, right up until the voltage conversion at the edge of the array to the desired read voltage. Since the data is no longer in the raw current, we do need to integrate the aggregated current (say, onto a capacitor) for some length of time. This also means that if instantaneous power were an issue, we could distribute the application of these pulses as needed within a slightly longer integration window. Additionally, resolution could be dynamically adjusted as needed by adjusting the maximum duration allowed. Note, however, that this does create an undesirable tradeoff between the effective resolution with which excitations can be encoded, and the speed and latency of the VMM operations.
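The difference between the two encodings can be illustrated with a deliberately non-ohmic device. The sketch below assumes a hypothetical sinh-shaped I-V characteristic (not a model of any specific device in this survey); the names `device_current`, `mac_amplitude` and `mac_duration`, and all numerical values, are illustrative assumptions.

```python
import numpy as np


def device_current(v, g, alpha=2.0):
    """Hypothetical non-ohmic device: I = g * sinh(alpha * v) / alpha."""
    return g * np.sinh(alpha * v) / alpha


def mac_amplitude(x, g, v_max=0.4):
    """Amplitude encoding: each row sees its own voltage v_max * x."""
    return np.sum(device_current(v_max * x, g))


def mac_duration(x, g, v_read=0.4, t_max=100e-9):
    """Duration encoding: fixed read voltage, pulse length x * t_max.

    Returns the charge integrated on the column capacitor.
    """
    i_on = device_current(v_read, g)   # constant current while the pulse is high
    return np.sum(i_on * x * t_max)


g = np.array([1e-6, 2e-6, 3e-6])       # conductances along one column
for x in (np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.0, 0.0])):
    ideal = x @ g
    print("amplitude / ideal:", mac_amplitude(x, g) / ideal,
          " duration / ideal:", mac_duration(x, g) / ideal)
```

With duration encoding the ratio to the ideal dot product is the same for every input, so the nonlinearity reduces to a fixed scale factor; with amplitude encoding the ratio depends on the input value.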
By turning an analog-memory-read operation into an in-memory-compute operation, we perform an entire VMM without any motion of weight data, and entirely in parallel. This is the most attractive feature of this analog-memory-based approach: it could potentially be both quite fast and quite energy-efficient. However, while most of the computations in deep learning are VMMs, further steps are needed to turn this simple VMM operation into a viable DNN accelerator.
During forward inference each such NVM array performs all the MAC operations constituting the VMM for one FC-layer of a deep network. The outputs of the array must then be processed by neuron circuitry that applies a nonlinear squashing function (the f () function from figure 1). We have now computed the excitation of this downstream neuron that is needed for the next layer of the neural network. However, the weights of this next layer are encoded within another crossbar array, sitting elsewhere on the chip. Thus, each hidden neuron within the network is implemented by circuitry sitting at the periphery of two different array-blocks. A first circuit collects column output and implements the nonlinear squashing function (such as a logistic function or its piecewise-linear (PWL) approximation); and a second circuit then introduces this neuron activation into the corresponding row of the second array-block. The former represents the 'output' to a neuron from an upstream layer; the latter, that same neuron's 'input' to the downstream layer. A routing network must then be able to connect all columns of a first array-block to the corresponding rows of a second array-block, connecting the two halves of each hidden neuron with each other, preferably in a flexible and reconfigurable manner [21].
Alternatively, the column-portions of each neuron circuit can include an analog-to-digital converter (ADC) to convert aggregate excitations to digital representations, which can then be bussed to digital logic for processing steps such as the nonlinear squashing function [60]. The resulting excitations would then be bussed to the row-neuron circuits and converted from digital representations back to appropriate excitation pulses. While this approach offers the flexibility and familiarity of a digital bus, the need for high parallelization in processing each neuron layer mandates ADCs that are very fast, leading to significant power dissipation and silicon real-estate.
If the application is the implementation of a forward-inference accelerator, then once the routing network is able to pass data from one crossbar array to another efficiently, one need only apply the softmax operation at the output neurons (if desired) to compute the network output.
For training, things get more complicated. First of all, if we intend to perform training of any sort, the neuron excitations need to be stored temporarily, preferably within the upstream neuron circuitry to minimize energy spent transporting and storing this data. The training label must be made available and the raw δ corrections computed by subtraction at the output layer. Then for the reverse propagation of errors, we perform a very similar operation to forward inference, except that δ corrections from downstream neurons are applied to the 'south' side of the array, and the errors for the upstream layer are accumulated on the 'west' side. (This is effectively a VMM using the transpose of the original weight matrix.) For stochastic gradient descent, each weight receives an update for each training example proportional to the backpropagated error for the downstream neuron and the activation of the upstream neuron during forward propagation. This is why these forward activations had to be stored: to have them available to combine with the backpropagated error in order to perform the weight update. For weights represented by pairs of NVM conductances, weight updates are typically performed by firing programming pulses at the NVM elements to increase or decrease their conductances. It is essential that this should be as fully parallel as possible, as the time required to individually program all conductances for each example would result in unacceptably long training times. Parallel weight updates are facilitated by schemes in which downstream and upstream neuron circuits independently fire programming pulses according to their knowledge of backpropagated error and upstream activation, respectively, resulting in the correct conductance programming when these pulses overlap in time [61,62].
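The three array operations involved in training (forward VMM, transpose VMM, and parallel outer-product update) can be written compactly. The following is a software sketch with ideal floating-point weights, assuming a logistic squashing function and a squared-error objective; the function name `train_step` and the learning rate are assumptions, and no device behavior or pulse-overlap scheme is modeled.

```python
import numpy as np


def train_step(w, x, target, lr=0.1):
    """One stochastic-gradient step for a single FC-layer.

    w      : weights, shape (N, M); rows = upstream, columns = downstream
    x      : stored upstream activations, shape (N,)
    target : training label for the downstream (output) neurons, shape (M,)
    """
    y = 1.0 / (1.0 + np.exp(-(x @ w)))       # forward: rows driven, columns summed
    delta = (target - y) * y * (1.0 - y)     # raw error times squashing derivative
    upstream_error = w @ delta               # reverse: columns driven, rows summed
    w_new = w + lr * np.outer(x, delta)      # all weights updated at once: x_i * delta_j
    return w_new, upstream_error


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(4, 3))
x = rng.random(4)
target = np.array([1.0, 0.0, 1.0])
w, upstream_error = train_step(w, x, target)
print(upstream_error)
```

The outer product is the step that a crossbar can perform in parallel when row and column circuits fire overlapping pulses; here it is simply computed arithmetically.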
It is during this weight-update step that the imperfections of real NVM devices can cause serious problems. As the neural network examines each example from the training dataset, the backpropagation algorithm computes the weight changes needed to improve classification performance on that example, implementing gradient descent along the objective function designed to force the network outputs to match the target labels. For any particular weight which gets increased during this step, the network is quite likely to request, during training of some later example, a counteracting decrease. Many thousands of increases may be requested, and over some period of time, nearly but not quite the same number of decreases.
In an ideal world, these increase and decrease requests would exactly cancel. When they do not cancel, serious problems can arise. It turns out that neural networks have a surprising degree of tolerance for stochastic variability. If the cancellation of increases and decreases were to be random from synaptic weight to synaptic weight, or better yet, random over time, accuracy could still be reasonably high. Unfortunately, nonlinearity in the conductance response of real NVM devices means that at a given conductance, each conductance-increase pulse might consistently be more effective than the conductance-decrease pulse starting from that same absolute conductance (or vice-versa). Since this is systematic across every single device in all the crossbar arrays, this means that the weight updates that are supposed to cancel do not. Worse yet, since the cancellation error has the same general trend on all weights at all times (typically towards weights of smaller absolute magnitude), touching a weight at all means that it invariably shifts in that same direction. And since the network is firing many hundreds if not thousands of update requests yet expecting that most of them will cancel, these weights are touched all the time. As a result of all this, neural network accuracy of NVM-based systems can markedly fail to match what would be expected of a GPU- or CPU-based system of the same network size. Our IBM colleague Tayfun Gokmen has shown that an asymmetry between the size of conductance increases and decreases as small as 5% can have a marked effect on accuracy [60].
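The systematic drift described above can be reproduced with a toy device model. The sketch below assumes a hypothetical bidirectional device with a normalized conductance in [0, 1] whose step size shrinks as it approaches either bound; the model, the step size, and the function names are illustrative assumptions, not measured device behavior.

```python
def pulse_linear(g, direction, step=0.01):
    """Ideal device: identical step size up and down at every conductance."""
    return min(1.0, max(0.0, g + direction * step))


def pulse_nonlinear(g, direction, step=0.01):
    """Hypothetical saturating device: step shrinks near the bound it approaches."""
    if direction > 0:
        return g + step * (1.0 - g)
    return g - step * g


def run(pulse, g0=0.9, n_pairs=1000):
    """Apply n_pairs of (increase, decrease) requests that should cancel exactly."""
    g = g0
    for _ in range(n_pairs):
        g = pulse(g, +1)
        g = pulse(g, -1)
    return g


g_lin = run(pulse_linear)
g_nl = run(pulse_nonlinear)
print("linear device   :", g_lin)
print("nonlinear device:", g_nl)
```

In this model the ideal device returns to its starting conductance, while the nonlinear device is pulled towards mid-range regardless of where it started. If the weight is read relative to a mid-range reference, that corresponds to a drift towards smaller absolute weight.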
The above discussion accentuated the efficiency of computing an entire VMM for each FC-layer in one time step. This situation, where each weight is used exactly once, turns out to exactly match the strengths of an analog-based accelerator. In contrast, an FC-layer is problematic for a digital-based accelerator, because the number of weights brought onto the chip is enormous yet there is minimal opportunity to be clever with data re-use. For a CONV-layer, the situation is exactly reversed. Since many excitations need to be multiplied by the same weights, an analog-based accelerator will either spend time to implement this (we apply each set of excitations one by one to the crossbar-array encoding the one copy of the weights), or area (we maintain multiple copies of the weights on different crossbar arrays, and route the excitations to the various copies). Either choice will inherently depress computational efficiency in units of TOP/s/mm². For training of a CONV-layer, since the weight update for each weight is actually the sum of the xδ products across all the copies, implementing all this efficiently (orchestrating the all-reduce of the various contributions to each weight update, and the broadcast of the accumulated weight-update back out to the various copies) is extremely daunting, to say the least.
So digital accelerators are naturally good for layers with a lot of neurons per weight (like CONV-layers). Conversely, analog accelerators, if the effective precision is suitable and the data routing does not sacrifice the inherent efficiency of the crossbar-based VMM, will be naturally good for layers with a lot of weights per neuron (like FC-layers). As a result, one can expect that a hybrid analog/digital accelerator would be an ideal blend of these complementary characteristics, leading to the best of both worlds for DNNs that can benefit from a mix of various types of layers. An example would be CONV-nets in which the first layers are CONV-layers naturally suited to applications such as image processing, implemented on digital cores, which then feed highly-efficient analog cores implementing FC-layers for the final layers of the DNN. In the near term, the yet-to-be-answered research questions that must be addressed for analog-based DNN acceleration are effectively identical, whether the final goal is an all-analog accelerator or a hybrid analog-digital accelerator.
In the next section, we discuss the specific requirements of analog memory devices for the application of deep learning accelerators.

Requirements of analog memory devices
Analog-based accelerators promise significant improvement in speed and power. However, such improvements are useful only if the performance in terms of accuracy is reasonable. Ideally, training or inference results with analog MAC operations should produce comparable accuracy to a full software implementation with high precision weights stored as digital bits. A common method for studying how analog memory devices affect deep learning accuracy is to substitute ideal weights with values predicted from a single or ensemble analog device model. Such a model can include a wide range of non-ideal properties. For example, conductance change per programming pulse can be a nonlinear function of the current conductance state of the analog memory device, with a conductance that typically saturates at some maximum value. The response to input pulses can be very asymmetric depending on whether conductance is increasing or decreasing. There are also variations from device to device, and from one programming event to another for each device. Some devices may be defective, resulting in no response and either 'stuck on' or 'open' conductance values.
In this section, we review how the specifications of analog memory devices affect accelerator performance. We survey various proposals on how to mitigate such device limitations through altered algorithm or more sophisticated circuit designs. Table 2 summarizes the underlying device specifications that can be expected to be more (or less) important when seeking an ideal analog memory device for deep learning accelerators for both forward inference and training.
In order to benchmark accuracy for analog-based accelerators, a deep learning dataset that can be solved reasonably well with FC networks, such as MNIST, is commonly used. The MNIST handwritten digit recognition dataset consists of 60 000 training examples and 10 000 test examples. The deep learning network chosen for benchmarking varies in terms of number of layers and neurons per layer, usually to accommodate the size of available device hardware. As a result, the target accuracy can range from around 90% to 99%, depending on which network is used, how many layers it has, and how many neurons there are per layer. It should be noted that MNIST is a much easier task to train than those addressed by cutting-edge DNNs. Thus, success at training or inferencing MNIST must be considered as absolutely necessary, but in no way sufficient to predict success as a generic DNN accelerator.

System-level simulations
Gokmen et al [60] introduced the concept of a resistive processing unit (RPU) and identified several RPU device and system specifications, including minimum/maximum conductances, number of conductance steps, device non-linearity, weight update asymmetry, device-to-device variation, and noise. The specifications differ significantly from parameters typical for NVM technologies, as the algorithm can tolerate up to 150% of noise in the weight updates and up to 10% reading noise. Even with the intrinsically high variability of states in RRAM due to the physical movement of ions, which limits its use as conventional memory, the intrinsic variability does not impose a major problem in this application [63]. Impact on accuracy from time-dependent variation (TDV) in RRAM is more severe for high resistance synapses, and during backpropagation, because of the narrow distribution of resistances in a trained network, accuracy can be affected by TDV [64]. Endurance requirements are also relaxed, as RPU devices only need high endurance to small incremental conductance changes, rather than the large conductance changes needed for digital memory applications. On the other hand, a large number of conductance steps are required and weight update asymmetry (between conductance increase and decrease) becomes the most demanding specification, which is quite unlike any of the restrictions typically imposed upon conventional memory devices [60].
Chen et al [65] also looked into the impact of device non-idealities with device models of Pr$_{0.7}$Ca$_{0.3}$MnO$_3$ (PCMO), conductive-bridging RAM (CBRAM) (Ag:a-Si), and TaO$_x$/TiO$_2$ RRAM. A sparse-coding feature extraction network was used as the benchmarking problem. The authors considered properties needed for array access/selection devices and looked at the effects of device nonlinearity, variation, stochasticity, and limited dynamic range. Multiple analog memory devices (up to nine in one example) were used as one weight element to average out variability.
NeuroSim+ [66] provides a framework for modeling NVM-based networks, including similar device properties, and aims at evaluating system-level performance. The simulator yields circuit area, leakage power, latency, and energy consumption during training. A comparison was conducted among SRAM-based synapses, 'analog' NVM synapses, and 'digital' NVM synapses, where weights are stored as digital bits in NVM devices. SRAM showed advantages for online learning, analog NVM was found suitable for offline classification, and digital NVM was judged to be better for low standby power design.
Finally, Gokmen et al [67] discussed implementation of CONV-nets with RPU arrays. Device variability, noise, optimal array size for best weight re-use, and power consumption were analyzed.

Device asymmetry
In real hardware demonstrations, device asymmetry is difficult to avoid. PCM and RRAM are the leading choices for implementing analog-based accelerators, but both exhibit asymmetric response between SET (increasing conductance) and RESET (decreasing conductance) operations. When PCM is programmed with SET pulses, it is possible to increase the conductance of the device in small enough increments to make weight updates reasonably effective in training networks. However, incremental RESET of PCM devices is difficult to achieve, as a pulse that produces any RESET response typically fully resets the device to the high resistance state. Filamentary RRAM has the opposite behavior, in that these devices can be incrementally RESET, but SET is abrupt to the low resistance state. As a result, it is common to use a pair of analog memory devices to represent one weight, not only to represent both positive and negative weights, but also to mitigate weight update asymmetry by choosing to program one or the other device in the same SET/RESET direction when applying positive/negative weight updates [61,62]. Efforts to improve device characteristics by engineering the device physics will be discussed in the next section.
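The pairing strategy can be sketched for a PCM-like device in which only gradual SET is available. In this minimal sketch, the function name `apply_update`, the normalized conductance scale, and the step size are assumptions; handling of devices that have saturated at maximum conductance is omitted.

```python
def apply_update(g_plus, g_minus, sign, dg=0.01, g_max=1.0):
    """Update a weight w = g_plus - g_minus using SET pulses only.

    A positive update increments g_plus; a negative update increments
    g_minus. Neither conductance is ever decreased.
    """
    if sign > 0:
        g_plus = min(g_plus + dg, g_max)
    elif sign < 0:
        g_minus = min(g_minus + dg, g_max)
    return g_plus, g_minus


g_plus, g_minus = 0.2, 0.2
history = [(g_plus, g_minus)]
for sign in (+1, +1, -1):
    g_plus, g_minus = apply_update(g_plus, g_minus, sign)
    history.append((g_plus, g_minus))

print("weight:", g_plus - g_minus)
```

Both update directions now use the same physical operation, so SET/RESET asymmetry is avoided, at the cost that both conductances creep upwards over time.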

Device dynamic range and weights of varying significance
An interesting approach to extend the device conductance range is the periodic carry method proposed by Agarwal et al [68]. This introduces a method for encoding a wider dynamic range for weights, as compared to the size of the smallest possible weight change. This helps increase the number of effective conductance steps, thus enabling training to higher DNN accuracies. Four devices with varying significance per weight were used. Weight updates were performed only on the least significant device, while weights were always read from all four devices combined. When the updated device saturates, either at its minimum or maximum conductance value, the second least significant device is updated to take into account the information from the least significant device. Training then continues on the least significant device after it is re-initialized to an intermediate conductance value well away from saturation.
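The carry mechanism can be sketched as follows. This is a simplified illustration rather than the implementation of [68]: each device is modeled as a signed, normalized value in [-1, 1] (where 0 stands for the intermediate conductance), the significance ratio `base` is an assumed parameter, and the class name `CarryWeight` is hypothetical.

```python
class CarryWeight:
    """One weight built from devices of increasing significance."""

    def __init__(self, n_devices=4, base=4.0, g_limit=1.0):
        self.g = [0.0] * n_devices   # g[0] is the least significant device
        self.base = base             # assumed significance ratio between devices
        self.g_limit = g_limit       # saturation level of a single device

    def read(self):
        """The weight is always read from all devices combined."""
        return sum(g * self.base ** k for k, g in enumerate(self.g))

    def update(self, dw):
        """Updates touch only the least significant device."""
        self.g[0] += dw
        k = 0
        # On saturation, carry the stored value to the next device and
        # return this device to its intermediate value.
        while k < len(self.g) - 1 and abs(self.g[k]) >= self.g_limit:
            self.g[k + 1] += self.g[k] / self.base
            self.g[k] = 0.0
            k += 1
        # The most significant device has nowhere to carry to, so it clips.
        self.g[-1] = max(-self.g_limit, min(self.g_limit, self.g[-1]))


w = CarryWeight()
for _ in range(250):
    w.update(0.01)
print(w.read(), w.g)
```

The total weight after 250 small updates exceeds the range of any single device, while the smallest possible weight change remains that of the least significant device.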
Similarly in [69], multiple RRAM cells along one vertical pillar electrode together define one weight value. Each layer in the 3D vertical RRAM crosspoint array represents a weight contribution of varying significance, allowing higher resolution and effective dynamic range. RRAM weights were only ternary, i.e. −1, 0, or 1. Parallel read is implemented for forward inference, but weight update is read-before-write, one row at a time.
In Ambrogio et al [70], our research group at IBM Almaden proposed a new weight structure exploiting the multiple conductances of varying significance using a combination of different analog memory devices to both extend available conductance range and improve weight-update linearity. A pair of PCM devices is used to represent the more significant contribution to the weight, while a pair of transistors with gates connected to a capacitor is used for the less significant part of the weight. Training is performed by adding and subtracting charge from the capacitor, thus avoiding PCM device endurance and non-linearity issues. After training with a certain number of examples, the entire weight from the transistor pair is transferred to the pair of PCM devices with a scaling factor, thereby extending the weight dynamic range beyond the limits of a single pair of PCM conductances. This is similar to the approach of Agarwal et al except that no additional ADCs are required as additional conductances are added.
Finally, a third option is to implement multiple conductances of equal significance [71]. Here, a single weight is computed from the sum of many PCM devices, typically 7. Since the weight update is performed by programming only one of the PCM devices at a time, more conductance steps can be achieved. An arbitration clock ensures that all PCM devices receive a similar number of programming requests, to avoid early saturation or endurance failure of any single PCM device. This method also improves linearity in weight update and allows a more gradual RESET transition. Weight update asymmetry can also be mitigated by controlling the relative update rate between positive and negative updates, at the expense of missing some update events. This architecture also reduces device degradation due to limited endurance since each device is only programmed once per seven updates. The downside of this technique is the increase in array size and power consumption.
Note that while the last two methods were demonstrated with PCM devices as the analog memory, the same concepts could readily be applied to many other types of NVM devices, with fairly minor modifications.
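The equal-significance scheme can be sketched with a simple round-robin selection. This is an illustrative simplification: the arbitration is modeled here as a per-weight counter that advances on every update, which is an assumption and may differ from the clocking used in [71]; the class name `MultiDeviceWeight` and the step size are hypothetical, and only conductance increases are modeled.

```python
class MultiDeviceWeight:
    """One weight formed by summing several devices of equal significance."""

    def __init__(self, n_devices=7, dg=0.01, g_max=1.0):
        self.g = [0.0] * n_devices
        self.pulses = [0] * n_devices   # programming events per device
        self.dg = dg
        self.g_max = g_max
        self.counter = 0                # assumed arbitration counter

    def read(self):
        return sum(self.g)

    def potentiate(self):
        """Program only the device currently selected by the counter."""
        k = self.counter % len(self.g)
        self.g[k] = min(self.g[k] + self.dg, self.g_max)
        self.pulses[k] += 1
        self.counter += 1


w = MultiDeviceWeight()
for _ in range(70):
    w.potentiate()
print(w.read(), w.pulses)
```

Each device receives one in seven of the programming pulses, and the combined weight offers seven times as many steps before saturation as a single device would.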

Non-linearity
Most analog memory devices exhibit some level of nonlinearity, either between measured conductance and device voltage, or between the amount of weight update and current conductance value.
This first type of non-linearity is particularly important when neuron excitations are encoded into analog read voltages [72]. Effects of non-linearity were shown to be more severe for deeper networks with many synapse layers. This effect can be mitigated by applying a nonlinear transformation of upstream activations before multiplying by weights, effectively linearizing the combined activation-device response, or by using pulse duration rather than amplitude to represent analog input to synapses. As mentioned earlier, the advantage of encoding analog signals as pulse duration comes at a cost of increased computation time, which could reduce accelerator speed.
The second type of non-linearity, which is the non-linearity in conductance update, has been identified by multiple researchers as the most restrictive requirement for analog memory devices [60,62,66]. This is because during training, each weight element sees numerous update pulses in both increasing and decreasing directions, yet it is critical that a positive update and a negative update with the same magnitude can cancel each other. When implementing synapse weights using one single analog memory device, this cancellation relies on the symmetry between positive and negative conductance updates. When implementing weights using a pair of memory devices, as most hardware implementations do, both positive and negative weight updates become conductance updates in the same direction, just on different devices. Therefore, the update symmetry requirement becomes a linearity requirement, i.e. the amount of conductance update should be independent of the particular conductance value. Studies using modeling with experimentally measured 'jump tables' (tabulating the induced conductance change as a function of the starting conductance) from a variety of devices, including TaO$_x$/TiO$_2$-based RRAM, AlO$_x$/HfO$_2$-based RRAM, PCMO, and Cu/Ag-based CBRAM [65,66,73,74], show the effect of non-linearity on training accuracy.
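A jump table of the kind mentioned above can be sketched as a binned lookup. This is a minimal illustration under stated assumptions: the 'measurements' are synthetic data from a hypothetical saturating device, only the mean conductance change per bin is stored (a fuller table would typically also capture the spread), and the function names are hypothetical.

```python
import numpy as np


def build_jump_table(g_before, g_after, n_bins=10):
    """Mean conductance change per bin of starting conductance in [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(g_before, edges) - 1, 0, n_bins - 1)
    dg = g_after - g_before
    table = np.array([dg[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return edges, table


def apply_pulse(g, edges, table):
    """Apply one programming pulse by looking up the change for the current bin."""
    b = int(np.clip(np.searchsorted(edges, g, side="right") - 1,
                    0, len(table) - 1))
    return min(1.0, g + table[b])


# Synthetic stand-in for measured data: the step shrinks as conductance grows.
g_before = np.linspace(0.0, 0.99, 500)
g_after = g_before + 0.05 * (1.0 - g_before)
edges, table = build_jump_table(g_before, g_after)

g = 0.0
trace = [g]
for _ in range(5):
    g = apply_pulse(g, edges, table)
    trace.append(g)
print(table)
print(trace)
```

A linear device would produce a flat table; the decreasing entries here are the signature of the conductance-dependent update discussed in the text.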

Weight mapping for inference only
When DNN weights are pre-trained offline in software and then loaded into the analog memory array for forward inference, inaccuracies in setting weight values lead to poor performance. The device requirements in this case are slightly more relaxed compared to the case where training takes place directly in memory, because there is no need to implement backpropagation and many fewer weight tuning steps are required. As a result, one can afford to be quite careful when tuning the resistances of individual weights, and one can apply more complicated mapping schemes.
By having a sparse collection of weights represented by the NVM conductance plus a value stored in digital memory, the in-memory values can be trained further to improve performance. Because only a small fraction (5%) of the weights are in the sparse collection, some portion of the inherent advantages of the NVM array can be retained [75]. Yan et al [76] contrast two weight mapping schemes for weight quantization: evenly-spaced levels in resistance are compared to equal conductance difference between levels. The authors also investigated resistance shift due to read disturb and proposed alternating read polarity to minimize this effect. Wang et al [77] considered the limitation of device dynamic range, i.e. how many distinguishable weight values are needed for accuracy. They considered networks with binary weights and proposed to assign different analog values to the binary weights in different layers of the network, according to the distributions the weights would have if continuous-valued.
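The two level-placement schemes can be compared directly. The sketch below uses an assumed conductance window of 1 to 10 μS and four levels purely for illustration; the function names are hypothetical and the nearest-level rule in `snap` is a simple assumption rather than the mapping procedure of [76].

```python
import numpy as np


def levels_even_conductance(g_min, g_max, n):
    """Levels separated by equal conductance differences."""
    return np.linspace(g_min, g_max, n)


def levels_even_resistance(g_min, g_max, n):
    """Levels evenly spaced in resistance, expressed as conductances."""
    r = np.linspace(1.0 / g_max, 1.0 / g_min, n)
    return np.sort(1.0 / r)


def snap(g_target, levels):
    """Map each target conductance to the nearest available level."""
    g_target = np.atleast_1d(g_target)
    idx = np.abs(g_target[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]


levels_g = levels_even_conductance(1e-6, 1e-5, 4)
levels_r = levels_even_resistance(1e-6, 1e-5, 4)
print(levels_g)
print(levels_r)
print(snap(5e-6, levels_g), snap(5e-6, levels_r))
```

Evenly spacing the levels in resistance crowds them towards the low-conductance end of the window, so a mid-range target conductance is represented more coarsely than with equal conductance steps.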

Analog memory device candidates
Established memories range from high-density, slow and low-cost NAND to low-density, fast and expensive DRAM and SRAM [78]. In recent years, the semiconductor industry has shown growing interest in the development of novel memories to replace or enhance functionalities of existing CMOS memory. Various candidates show multilevel programmability by applying electrical pulses, including RRAM, PCM, magnetic RAM (MRAM) and ferroelectric RAM [79][80][81][82]. This capability fits well with the basic needs of an analog-memory-based deep learning accelerator. Progress on other device options, including emerging battery-like devices, capacitor-based devices, photonics, and more exotic devices is also covered in this section.

Resistive RAM (RRAM)
RRAM is one of the more mature novel NVM device candidates, with commercially available memory arrays fabricated with CMOS technology (albeit in small size arrays, or at low density using older technology nodes). Filamentary RRAM offers promising properties such as very low programming energy, fast switching on the nanosecond timescale and relatively high endurance [83]. On the other hand, the resistance window of RRAM is generally not larger than a factor of 50×, which, together with its intrinsic variability, limits the implementation of a large number of intermediate levels at low programming currents [84]. Other types of filamentary RRAM include unipolar RAM, where transitions are thermally driven, but reliability and endurance are relatively poor [85]. Conductive-bridge RAM (CBRAM) usually shows a resistance window larger than a factor of 100× [86,87]. In a crossbar array, devices are typically located at the intersections between wordlines (WL) and bitlines (BL). When the memory device is in series with a select device, such as a diode, a selector or a transistor, the crossbar is active; otherwise the crossbar is passive. In the last few years, hardware demonstrations implementing FC networks have been limited by the number of available devices in a single crossbar array and by device variability.
6.1.1. Device optimization. RRAM comprises a family of devices which can be divided into two categories: filamentary switching devices and uniform (non-filamentary) switching devices [83].
Filamentary-RRAM typically consists of a metal-insulator-metal structure, where the formation of a conductive filament (CF) through the insulator (mostly metal oxide layers based on Hf, Ti, Si, Ta, but also chalcogenides) provides a high conductance state [83]. In many cases, the filament is composed of oxide defects; in some cases, however, the filament is composed of metal atoms, usually originally coming from one of the two metal electrodes. The CF formation (SET transition) and dissolution (RESET transition) are reversible and can be induced by electrical pulses, providing switching capability between high (SET state) and low (RESET state) conductance states. If CF formation and dissolution take place under the same voltage polarity, the device is defined as unipolar. If, instead, SET and RESET require different voltage polarity, then the device is bipolar.
Bipolar devices have shown superior performance in terms of endurance, variability and reliability. In bipolar RRAM, SET is temperature-accelerated and driven by the electric field [88], and the transition is typically abrupt, although non-abrupt transitions can be obtained with careful engineering of the oxide interface [89]. On the other hand, RESET transition is usually gradual due to the gradual dissolution of the conductive filament. This latter transition is of interest in deep learning applications, since it enables analog tuning of device conductances. Another option to gradually change the device conductance is to vary the CF diameter by changing the maximum allowed (or 'compliance') current that can flow into the device during SET transition. This leads to different SET conductance states with a higher degree of controllability, while RESET states typically show stronger non-linear dependence on applied voltage. This is caused by an exponential relationship between the conductance and the gap length during the RESET transition. In contrast, the dependence during SET state is linear in the area of the CF cross-section [90]. This asymmetry between conductance update during SET and RESET transitions is highly detrimental to deep learning accuracy, as we discussed in the previous section.
Several works have been published concerning improvements in RRAM device switching properties. Woo et al [89] developed a device stack based on Al/HfO 2 /Ti/TiN in order to symmetrize RRAM switching by slowing down the SET transition. Figure 7(a) shows the obtained IV curves with gradual SET and RESET transitions, while figure 7(b) shows the conductance evolution as a function of identical pulses of 100 μs width. The simulation of this device as inserted into a three layer FC network showed an accuracy around 90% on the MNIST dataset [24]. Other approaches involve a careful and more elaborate sequence of programming pulses [91], with gradual SET states obtained through the application of consecutive SET and RESET pulses. Wu et al [92] used a thermally resistive top layer to smooth out the temperature distribution during programming, allowing multiple filaments and smoother bidirectional SET-RESET response. A small network for face recognition and a one-hidden layer perceptron for MNIST with binarized weights in the hidden layer were demonstrated.
Uniform-switching, or non-filamentary, RRAM that could reach acceptable linearity and number of states has also been developed. The non-localized switching strongly reduces variability and enables gradual tuning of the conductance through electrical pulses [93]. Among these devices, Pr 0.7 Ca 0.3 MnO 3 (PCMO) devices and vacancy-modulated conductive oxide (VMCO) RAM [94] are the most promising, and have been used for neural network simulations [93,95,96]. PCMO devices show a conductance change due to migration of oxygen ions at the interface between electrode and PCMO layer [97]. In these devices, the adoption of molybdenum electrodes has been demonstrated to increase data retention [93], which is one of the important factors enabling multilevel programming in deep learning networks, together with other aspects such as low read noise, negligible conductance drift and resilience to device instabilities [98]. In addition, other architectures have been employed, such as 1T2R (one transistor-two resistors) weights, where one of the two resistors is the PCMO device. Here, a two-resistor voltage divider controls the transistor gate voltage. The application of pulses on this divider changes the resistance of the PCMO and this is reflected in a modified gate voltage. By reading the current from the transistor source, the number of conductance levels and the linearity strongly increase [96].
Another non-filamentary device is based on a TiO x oxide layer [99]. Here, limitations arise due to asymmetry between SET and RESET currents. To overcome this issue, Park et al [99] suggest the adoption of a Mo/TiO x /TiN stack. Since workfunctions for molybdenum and TiN are equal, the device shows enhanced SET/RESET symmetry.
In addition to engineering the physical switching mechanism, the application of dedicated voltage or current [99] pulse shapes can also relax the constraints on device characteristics. However, the benefits come at the cost of peripheral circuit overhead, energy dissipation and, in cases involving a large number of full RESETs, larger device endurance degradation [100]. Furthermore, improved device linearity was demonstrated with relatively long pulse-widths, around hundreds of μs or even ms, which are not practical for hardware accelerators. Thus, exploration of the proposed techniques with shorter pulse widths and large number of cycles will be important to prove feasibility of such methods.

Fully connected RRAM network demonstrations.
Hardware demonstrations fall into two major categories: those where the weights encoded into RRAM device conductances are trained in situ, directly within the crossbar array; and those where weights are trained in software (ex situ) and then programmed into the crossbar.
A first hardware implementation by Alibart et al [63] reports a 9 × 1 neuron one-layer classifier implemented in a passive crossbar, which was able to classify 3 × 3 images of letters X and T. Here, the output neuron was providing +1 for X and −1 for T. Weights were encoded in the conductance difference of a pair of devices, thus providing both positive and negative weights. The crossbar size was 10 × 2, implemented with Pt/TiO 2−x /Pt devices. Training was performed both offline in software and directly in the crossbar memory array.
A later work from Prezioso et al [101] demonstrated a more advanced implementation with a 12 × 12 Al 2 O 3 /TiO 2−x passive crossbar. The network was directly trained in the crossbar array with 3 × 3 input images, taken from three classes representing the letters z, v and n and their noisy versions. Training was performed with the Manhattan update rule, which is a simplified version of the usual Delta rule. The Manhattan rule takes into account the sign of the sum of all δ values obtained after the forward propagation of all the training images. Therefore, weight update is performed only once per training epoch. Figure 8(a) shows the experimental accuracy error during different training runs. The inset shows the distribution of weights before and after training, while figure 8(b) shows the average output neurons signals for inputs corresponding to z, v and n letters.
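As an illustration of the difference between the two rules, the following minimal sketch accumulates the Delta-rule terms over all training examples and then applies only their sign, once per epoch. The function names, the linear output neuron, the fixed step size and the toy data are assumptions made for this example; they are not taken from [101].

```python
# Sketch of a Manhattan-rule epoch for a single-layer network.
# Assumptions (not from the cited work): linear output neurons,
# a fixed step size, and weights stored as plain Python lists.

def sign(value):
    """Return +1, 0 or -1 according to the sign of value."""
    return (value > 0) - (value < 0)

def batch_delta(weights, inputs, targets):
    """Sum the Delta-rule term (target - output) * input over all examples."""
    n_out, n_in = len(weights), len(weights[0])
    acc = [[0.0] * n_in for _ in range(n_out)]
    for x, t in zip(inputs, targets):
        for j in range(n_out):
            y = sum(w * xi for w, xi in zip(weights[j], x))
            err = t[j] - y
            for i in range(n_in):
                acc[j][i] += err * x[i]
    return acc

def manhattan_epoch(weights, inputs, targets, step=0.01):
    """Apply one fixed-size update per weight, using only the sign of the sum."""
    acc = batch_delta(weights, inputs, targets)
    return [[w + step * sign(a) for w, a in zip(w_row, a_row)]
            for w_row, a_row in zip(weights, acc)]

new_weights = manhattan_epoch([[0.0, 0.0]],
                              [[1.0, 0.0], [1.0, 1.0]],
                              [[1.0], [1.0]])
```

Because every weight moves by the same fixed amount, the update needs only a direction per device, which is one reason such a rule is convenient when individual conductance steps cannot be controlled precisely.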
Recently, Bayat et al [102] reported a bilayer network with one crossbar array divided in two portions of 17 × 20 and 11 × 8 Pt/Al 2 O 3 /TiO 2−x /Pt devices. The network is able to classify 4 × 4 images representing letters A, B, C and D. Training was performed both in software and in the crossbar array, with software (ex situ) training yielding higher recognition accuracy.
Demonstrations of relatively large networks still do not employ crossbar arrays due to reliability issues. Yu et al [103] use a 16 Mb TaO x /HfO 2 RRAM macro where they implement a 400 × 200 × 10 neuron FC network for MNIST handwritten digit recognition. To overcome the variability issues which arise in the multilevel programming operation, weights are trained in software and then programmed into the crossbar array with 1-bit precision, thus providing a large error margin to device conductance. To perform training, weights are encoded in 6-bits and programmed into six different devices. Simulations show less than 1% discrepancy from the software case. The drawback of this implementation is that it requires six times more memory, which also leads to higher power consumption.
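The idea of spreading a 6-bit weight over six binary devices can be sketched as below. This is only an illustration under stated assumptions: the helper names are hypothetical, the weight is taken to be non-negative and bounded by `w_max`, and a simple binary-weighted significance per device is assumed rather than taken from [103].

```python
# Sketch: a 6-bit weight stored across six binary (two-state) devices.
# Assumptions: unsigned weights in [0, w_max] and binary significance,
# with device k contributing 2**k when it is in the high state.

def quantize(weight, w_max=1.0, n_bits=6):
    """Map a weight in [0, w_max] to an integer level in [0, 2**n_bits - 1]."""
    levels = 2 ** n_bits - 1
    clipped = min(max(weight, 0.0), w_max)
    return round(clipped / w_max * levels)

def encode_bits(level, n_bits=6):
    """Split an integer level into n_bits binary device states, LSB first."""
    if not 0 <= level < 2 ** n_bits:
        raise ValueError("level does not fit in the available devices")
    return [(level >> k) & 1 for k in range(n_bits)]

def decode_bits(bits):
    """Recombine the binary device states into the integer level."""
    return sum(bit << k for k, bit in enumerate(bits))

states = encode_bits(37)        # six binary device states
restored = decode_bits(states)  # integer level recovered from the devices
```

Each device only has to distinguish two states, which is what provides the large error margin, at the price of six devices per weight.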
All previous cases implemented weights as the difference between the conductances of a pair of devices. Instead, Yao et al [104] reported pattern classification with a 128 × 8 active crossbar where weights are encoded in a single, bidirectional device stacked as TiN/TaO x /HfAlO x /TiN. Figure 9(a) shows the adopted algorithm, which is the backpropagation algorithm with the Delta rule for write-and-verify tuning, or the Manhattan rule for write-only tuning. The crossbar implementation is shown in figure 9(b). The network was able to recognize faces extracted from the Yale face database (face images not shown here, [105]). In this demonstration, higher accuracy was obtained by using a write-and-verify procedure, which enables more precise conductance tuning, and therefore more accurate weights.

Dot-product accelerator.
Crossbar memory arrays can also be used to compute dot-product operations, x · y = Σ_i x_i y_i, in one clock step. Given vectors x and y of size n, the computational complexity of a dot product operation goes from O(n²) in digital hardware to O(1) in crossbars [106]. For this reason, researchers at HP Labs extensively studied how to develop an accelerator to efficiently map and calculate dot products in crossbar arrays. Due to device non-idealities, voltage drops on the wires, and circuit non-linearity, mapping of software weights into crossbar memory using a trivial linear conversion would degrade computational accuracy. Non-linear mapping techniques are developed to program crossbar weights in such a way that the final dot-product result in hardware matches the expected software value, as shown in figure 10. In [106], a first technique to reduce the voltage drop on the word lines consists of biasing a word line from both array edges (figure 10(b)). This leads to the highest error in central columns, which is then corrected with a static signal restoration that amplifies the read current from central columns (figure 10(c)). A major drawback is that this approach is data-dependent since the read current depends on the input signal. The image in figure 10(a) shows an input example with a Gaussian noise distribution.
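For reference, the ideal behavior that these mapping techniques try to recover can be written in a few lines. The sketch below assumes a perfect crossbar (no wire resistance, perfectly linear devices) and arbitrary units; the function names are illustrative only. Each column current is the sum of input voltage times device conductance, following Ohm's law and Kirchhoff's current law, and signed weights are obtained from a pair of arrays.

```python
# Sketch of an ideal crossbar multiply-accumulate (no voltage drops,
# linear devices). Values are in arbitrary units.

def crossbar_currents(voltages, conductances):
    """Column currents I_j = sum_i V_i * G[i][j] for an ideal crossbar."""
    n_cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(n_cols)]

def signed_outputs(voltages, g_plus, g_minus):
    """Signed result when each weight is the difference G+ - G-."""
    i_plus = crossbar_currents(voltages, g_plus)
    i_minus = crossbar_currents(voltages, g_minus)
    return [p - m for p, m in zip(i_plus, i_minus)]

currents = crossbar_currents([1.0, 2.0], [[1.0, 0.5], [0.25, 2.0]])
signed = signed_outputs([1.0, 2.0], [[1.0], [0.5]], [[0.5], [1.0]])
```

In a real array, all of these sums are produced simultaneously by the physics of the circuit; the non-idealities listed above are what make the measured currents deviate from this ideal result.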
A second approach is to use a non-linear conversion algorithm for weight programming. The ideal linear crossbar behavior is calculated in software. Then, by using a careful resistive device model of Pt/TaO x /Ta, the actual response is obtained. Finally, the devices are fine-tuned in order to close the gap between ideal and actual crossbar simulation results. This technique is still dependent on input data, but results are very accurate, reaching 99% accuracy on the MNIST dataset in simulations, with no degradation from full software implementation [107].

Convolutional RRAM network demonstrations.
Most hardware neural network implementations focus on FC networks because of the large number of weights employed in these networks. Here, the parallelism deriving from the large crossbars strongly accelerates speed for both training and forward inference. Instead, CONV-nets implement relatively fewer devices, which are organized in kernels and used several times during convolutions. Garbin et al demonstrated for the first time the impact of RRAM devices in CONV-nets by means of device modelling, figure 11(a), characterization and simulation [108]. The network was trained on the MNIST dataset where pixel intensity is encoded with a train of spikes whose frequency is proportional to pixel brightness, figure 11(b).
Accuracy was close to the software equivalent (98.3% against 98.5%) under strong programming conditions (thus maximizing the device resistance window to around a factor of 100×, but greatly reducing the device endurance), with weights encoded with 20 binary RRAM devices in parallel. Accuracy decreased to 94% under weak programming conditions, with a resistance window around 10× and a higher dependence on device variability [109]. These conditions, however, preserve the device from early failure [110]. Unlike FC networks, CONV-nets use fewer weights, thus reducing the impact of using many devices in parallel. On the other hand, the number of SET/RESET cycles on each device is three orders of magnitude larger, thus degrading the device faster [110]. In general, crossbar implementation of CONV-nets shows difficulty in achieving speed improvement over existing GPUs during training because weights need to be convolved with the entire input image, thus breaking the parallelism that exists in FC networks.
Figure 9. (b) Hardware implementation with one device per weight, and corresponding encoding of gray-scale image (face images from [105], not shown here). Reproduced from [104]. CC BY 4.0.

Phase-change memory (PCM)
PCM relies on the creation of different conductance levels by switching the material properties of chalcogenide layers, such as Ge 2 Sb 2 Te 5 , from amorphous, at low conductance, to crystalline, at high conductance. Different architectures exist, but they all rely on controlled heating of a chalcogenide material. The SET transition is gradual, since crystallization implies a local reordering of the atomic lattice, while RESET transition is abrupt, as the entire region needs to be melted and then quenched into the amorphous state [111]. Both SET and RESET processes can be driven by electrical pulses, enabling the implementation of analog acceleration for neural network training.

Fully connected PCM network demonstrations.
In recent years, our research group at IBM Almaden has developed an approach to accelerate on-chip training of FC networks with weights encoded as the difference in the conductances of a pair of PCM devices in an active crossbar array [61,62]. The backpropagation algorithm is implemented in three steps. Training images are forward propagated through a 528 × 250 × 125 × 10 neuron network and the results are compared against the correct labels; errors are backpropagated from the last to the first layer; and then the weights are updated following a crossbar-compatible procedure that enables parallel update of all the weights in the crossbar array [61,62]. Experiments performed on real PCM devices (but with the neuron circuitry simulated, not integrated with the NVM devices) reported an 82.2% test accuracy on the MNIST dataset. As mentioned before, later developments with the same PCM in a larger and more complex unit-cell have recently achieved software-equivalent training accuracies [70].
Figure 11. Simulated RRAM devices encoded the weights in the kernels. (b) Input image conversion with a train of spikes with frequency proportional to pixel brightness. Only the neuron corresponding to '8' fires at the last layer. © 2014 IEEE. Reprinted, with permission, from [108].

PCM for memory computing.
In addition to fully implementing neural networks on crossbars, hybrid solutions have also been adopted. Le Gallo et al [112] propose a general method for solving systems of linear equations in the form Ax = b where the solution comes from two interacting modules, namely a high-precision processing unit and a low-precision computational memory unit, i.e. the PCM crossbar. The main idea is to split the calculation of the solution into two parts: a low-precision z solution from Az = r, followed by high-precision calculation of the solution update x = x + z and error r = b − Ax. After that, successive iterations refine the solution to the desired degree of tolerance.
Interestingly, this method speeds up the overall computation time since the calculation of the inexact z, which involves the calculation of many multiplications and sums, represents the most computationally expensive operation in digital computation, thus fully exploiting the capabilities of analog-based acceleration.
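The interplay between the two units can be sketched as a simple iterative refinement loop. The example below is an assumption-laden illustration and not the implementation of [112]: the low-precision unit is replaced by a hypothetical stand-in, `inexact_solve_2x2`, which solves a 2 × 2 system exactly and then deliberately scales the answer by 10% to mimic an imprecise result, while the residual and the solution update are computed in ordinary floating point.

```python
# Sketch of mixed-precision iterative refinement for A x = b.
# The inexact solver is a hypothetical stand-in for the analog unit.

def matvec(matrix, vector):
    """High-precision matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def inexact_solve_2x2(a, r, rel_error=0.1):
    """Solve a 2x2 system by Cramer's rule, then add a relative error."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    z0 = (r[0] * a[1][1] - a[0][1] * r[1]) / det
    z1 = (a[0][0] * r[1] - r[0] * a[1][0]) / det
    return [z0 * (1.0 + rel_error), z1 * (1.0 + rel_error)]

def refine(a, b, tol=1e-9, max_iter=100):
    """Repeat: residual r = b - A x (precise), A z = r (inexact), x = x + z."""
    x = [0.0] * len(b)
    for iteration in range(max_iter):
        r = [bi - ai for bi, ai in zip(b, matvec(a, x))]
        if max(abs(v) for v in r) < tol:
            return x, iteration
        z = inexact_solve_2x2(a, r)
        x = [xi + zi for xi, zi in zip(x, z)]
    return x, max_iter

solution, iterations = refine([[4.0, 1.0], [2.0, 3.0]], [1.0, 2.0])
```

With a 10% error in every inexact solve, the remaining error shrinks by roughly a factor of ten per iteration, so the precise unit only has to perform the comparatively cheap residual and update steps a handful of times.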
This concept has also been applied in simulations of FC neural network for MNIST digit recognition by Nandakumar et al [113] (see also [114]). Here, the high-precision unit calculates forward, back-propagation and weight update. Therefore, the network is implemented in CMOS with the PCM array used to perform the compute-intensive multiply-accumulate operation, creating a hybrid architecture which accelerates training on the MNIST dataset. Weight updates are summed into a high precision variable χ. Since update on PCMs shows a certain granularity ε, meaning that it is not possible to program conductance changes smaller than ε, weight update is only performed when χ > ε. After the effective weight update, χ is updated to χ = χ_previous − nε, where n represents the number of steps the network asked to program into the crossbar. Simulations show test accuracy within 1% from full software implementation.
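The bookkeeping for the high-precision variable χ can be sketched as follows. This is a minimal illustration, assuming that the threshold is applied to the magnitude of χ so that negative updates are handled symmetrically, and that each programming pulse changes the weight by exactly one granule ε; the function name is hypothetical.

```python
# Sketch: accumulate small weight updates in a high-precision variable
# and transfer only whole multiples of the device granularity epsilon.

def accumulate_and_program(chi, delta_w, epsilon):
    """Return (new_chi, n_pulses) after adding delta_w to the accumulator.

    n_pulses is the signed number of epsilon-sized steps to program;
    the remainder stays in chi for later updates.
    """
    chi += delta_w
    n_pulses = int(chi / epsilon)   # truncates toward zero, keeps the sign
    chi -= n_pulses * epsilon
    return chi, n_pulses

chi, pulses = accumulate_and_program(0.0, 0.1, 0.25)    # too small: no pulse
chi, pulses = accumulate_and_program(chi, 0.7, 0.25)    # now three pulses fire
```

Updates that are individually too small to program are therefore not lost; they are carried forward until they add up to at least one programmable step.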

Battery-like devices
While well-known NVMs such as PCM and RRAM dominate the landscape of emerging technologies for deep learning, there have been attempts at exploring other devices with better linearity, symmetry, scalability, and higher dynamic range. Two recent papers [115,116] report novel devices exploiting electrochemical reactions derived from working principles of batteries [117]. Fuller et al [115] describe a Li-ion synaptic transistor (LISTA) based on the intercalation of Li-ion dopants into a channel of Li 1−x CoO 2 . A negative gate voltage V G recalls Li-ions from the channel region to the gate, providing additional electronic carriers and thus increasing the source-drain conductance. Similarly, a positive V G pushes ions into the channel region, decreasing the source-drain conductance and enabling a six-orders of magnitude dynamic range in conductance. Figure 12(a) shows the electrical characterization of this device, with application of many pulses. The corresponding jump-tables [61,62] for current-controlled ((b) and (c)) or voltage-controlled ((d) and (e)) positive ((b) and (d)) or negative ((c) and (e)) weight update reveal a highly linear device. This improvement in device characteristics translates into MNIST accuracy, achieving less than 1% accuracy degradation from full software implementation [115].
Similar results are obtained by van de Burgt et al [116] demonstrating an organic neuromorphic device with a similar experimental behavior (with H + as the mobile ion) and MNIST accuracy only 1% below its software baseline [115]. These novel devices show promising results for neural networks, but research on these devices is still at its early stage. Programming times are on the order of milliseconds, since shorter pulses induce only a short term conductance change [116]. Scalability of such devices, and operation within an array will also need to be explored.

Capacitor-based CMOS devices
Given the inherent non-linearity and asymmetry in existing NVMs that make on-chip training challenging, Kim et al [118] proposed an analog synapse based on capacitance (figure 13). The weight of the synapse is proportional to the capacitor voltage, and is sensed through a read transistor. The authors proposed using a logic block in every unit cell to make a local determination on whether an up or down pulse needs to be fired during weight update. While the proposed guideline of 1000 states per unit cell implies that the capacitor dominates unit-cell area in initial designs, this assumption could well change with either multiple conductances of varying significance and/or adopting other capacitance manufacturing processes such as deep trenches, stacks or metal insulator metal capacitors. In this case, the many transistors in the design would make achieving area-efficient unit cells (and consequently large numbers of synapses per die) a challenge. Furthermore, even with elimination of some of the logic devices, managing the random variation-induced asymmetry between the pull-up and pull-down FETs (P3 and N3 in figure 13) would still require very large devices and/or other circuitry techniques. Although the synaptic state is decaying continuously, it can be shown that at high learning rates, the network can accommodate this so long as the ratio between the RC time constant (governing the charge decay) and the time-per-trained-data-example is extremely large (10⁴) [118,119].
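To give a feel for what such a ratio means, the short calculation below assumes a plain exponential decay of the stored voltage, V(t) = V0 · exp(−t/RC), which is a simplification adopted only for this example. The function names are illustrative.

```python
# Sketch: how much of a capacitor-stored weight survives training,
# assuming simple exponential RC decay of the stored voltage.
import math

def retention_per_example(rc_over_example_time):
    """Fraction of the stored voltage left after one training example."""
    return math.exp(-1.0 / rc_over_example_time)

def examples_until_fraction(fraction_remaining, rc_over_example_time):
    """Number of training examples until the voltage decays to a fraction."""
    return -rc_over_example_time * math.log(fraction_remaining)

per_example = retention_per_example(1e4)          # loss per example is tiny
half_life = examples_until_fraction(0.5, 1e4)     # examples until 50% remains
```

Under this assumption, a ratio of 10⁴ means that an untouched weight loses about 0.01% of its value per training example and half of its value after several thousand examples, so the learning rule must refresh the weights at least on that timescale.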

Ferroelectric devices
Ferroelectric materials have also been studied for analog memory devices with hafnium-zirconium-oxygen (HZO) stoichiometries being a popular choice [120,121]. Applying short pulses can cause polarization domains to flip, changing the threshold voltage of a FeFET device. However, for gradual changes in conductance over a wide range, first domains with smaller coercive voltages and then domains with larger coercive voltage would need to be flipped. This implies that programming of weights would require a read-before-write scheme to choose the right programming pulse amplitude. This would severely hamper speed for on-chip training, but may still be applicable for inference, where weights are only programmed once.
In [122], the authors proposed using ferroelectric capacitors not as a continuously tunable analog device, but as strongly-ON or strongly-OFF switch devices allowing current to flow through resistive elements of varying significance. This reduces the requirements on ferroelectric devices, and the authors showed through simulation that they can achieve well-separated weights. Also, using a hardware-aware regularization approach during training led to good accuracy during inference. This is an insight that may be valuable for other inference researchers as well. Nevertheless, the area cost of building unit cells with multiple resistive elements (the suggested implementation is as distinct FETs), and power/performance benefits were not quantified.
Figure 13. © 2017 IEEE. Reprinted, with permission, from [118].

Photonics
The drive to reduce power consumption and increase throughput in the execution of deep neural networks has spurred novel approaches, including the emerging field of photonic networks. Photonic implementations promise high speed due to the high communication bandwidth of optics and low power consumption due to the low dissipation associated with the transmission of light in waveguides. Early efforts in this area included optical implementation of the Hopfield network and proposals for holographic MAC operations [123][124][125][126]. More recently, silicon nanophotonics is becoming a mature technology for producing versatile photonic integrated circuits. Although photonic devices are larger than CMOS logic and NVM memory devices, techniques such as wavelength division multiplexing allow large numbers of signals to be simultaneously transmitted through the same physical waveguides and devices.
Although the field is still emerging, several building blocks relevant for neuromorphic computing have been shown. These include optical versions of neurons with leaky-integrate-and-fire response [127,128], MAC operation using wavelength division multiplexing and optical filters [129], and adaptation of the intrinsic nonlinear dynamics of optical feedback networks for application to reservoir computing [130][131][132][133]. Here, we give some highlights of this work. For recent reviews focused on this area, see [134][135][136].
Optical gain media used in laser oscillators and amplifiers are intrinsically nonlinear, and this nonlinearity has been exploited to implement functions needed for neuromorphic computing. Using a semiconductor optical amplifier as an integrator and a nonlinear fiber loop mirror as thresholder, an optical leaky-integrate-and-fire neuron was demonstrated [127,128,137]. A similar approach was used to demonstrate a simple neuromorphic processor [138]. Non-linear microring resonators [139,140] could serve a similar role.
Wavelength division multiplexing (WDM) has played a key role in optical communications, allowing a single physical waveguide to carry many signals simultaneously. If the activation of a neuron is represented by the optical intensity of one of these wavelength channels, with each neuron assigned a different wavelength channel, WDM provides a means of transmitting a multiplicity of signals from one network layer to the next. A series of optical filters, implemented, for example, with silicon photonic microring resonators [141], can transmit individually chosen fractions of each wavelength, producing upon photodetection a weighted sum of the outputs of the upstream neurons [129,142]. In this scheme, the microring resonators are the optical synapses, with synaptic weights programmed by the detuning of the resonators.
In addition to ring resonators, several alternatives for realizing optical synapses are being explored using photonic technologies. Silicon photonics resonators have been fabricated on a ferroelectric barium titanate film [143,144]. The transmission of the resonator at a particular wavelength could be incrementally tuned by changing the domain configuration of the ferroelectric layer with in-plane electric field pulses. By integrating phase-change materials onto an integrated photonics chip, the analog multiplication of an incoming optical signal by a synaptic weight encoded in the state of the phase-change material was achieved [145]. In this device, the weight could be adjusted with optical write pulses carried by the same waveguide. This is one example of an optical synaptic element that can potentially have its weight tuned in situ for online learning. This scheme of embedding a phase change element as an optically programmable attenuator has also been used for another example of optical 'in-memory' computing, an 'optical abacus' that can perform numerical operations with optical pulses as inputs [146].
One relatively advanced photonic ANN implementation uses coherent optical nanophotonic circuits [147]. Processing is done by arrays of Mach-Zehnder interferometers and phase shifters to realize matrix multiplication of arbitrary real-valued matrices. In this case, the matrix of weights that represents the synaptic connections between neuron layers is factored via singular value decomposition into the product of two unitary (i.e. lossless) matrices that are implemented using Mach-Zehnder interferometers and phase shifters, and a diagonal matrix whose elements are represented by optical transmissions. Effectively, the diagonal matrix encodes the synaptic weights, represented as optical transmission, and the unitary matrices the connectivity. A simple four-layer network is shown that recognizes vowel sounds with 76.7% accuracy, compared to 91.7% for an ideal network, limited by the precision for controlling optical phase and photodetection noise. For this application, training is done offline and the network programmed with the resulting weight matrices.
The devices discussed above were used for forward inference only, the synaptic weights for a given application having been pre-computed offline. Given that forward propagation through an optical network is cheap, researchers have proposed computation of the gradient for each weight directly, one weight at a time, which would bypass the need to implement backpropagation [147]. Another approach is to use a neuromorphic computation model that requires relatively few tunable weights. Reservoir computing [136,148,149] is one such paradigm that uses a recurrent neural network with fixed weights, exhibiting nonlinear dynamics with a sufficiently rich state-space to effectively represent a large variety of inputs. This recurrent network is the reservoir. Typically, a small number of the reservoir neurons are coupled to output neurons to serve as a classifier, and only these output weights are adjusted during the learning phase. Optical systems with feedback are one possible implementation of this type of recurrent network and have been shown using semiconductor optical amplifiers [130,131,150] and nonlinear electro-optic oscillators using delayed feedback [151][152][153]. These have been applied to simple tasks such as spoken digit recognition [132,150,151] or time series prediction [132,133,150].
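The one-weight-at-a-time gradient computation mentioned at the start of this paragraph can be illustrated with a forward finite-difference estimate, in which each weight is perturbed in turn and the loss is re-evaluated by forward propagation only. The choice of a forward difference, the step size and the toy loss function are assumptions for this sketch, not details taken from [147].

```python
# Sketch: estimating the gradient with forward passes only, perturbing
# one weight at a time (forward finite difference, assumed here).

def finite_difference_gradient(loss_fn, weights, delta=1e-4):
    """Estimate d(loss)/d(w_k) for each k using two forward evaluations."""
    base_loss = loss_fn(weights)
    gradient = []
    for k in range(len(weights)):
        perturbed = list(weights)
        perturbed[k] += delta
        gradient.append((loss_fn(perturbed) - base_loss) / delta)
    return gradient

def toy_loss(w):
    """Squared error of a two-weight linear model on one fixed example."""
    return (2.0 * w[0] + 3.0 * w[1] - 1.0) ** 2

estimate = finite_difference_gradient(toy_loss, [0.0, 0.0])
```

The cost is one extra forward pass per weight, which is only attractive when, as in the optical case described above, a forward pass is very cheap.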
To date, many basic neural network operations have been demonstrated using photonic devices [134][135][136], but the numbers of neurons and synaptic elements are far from the scale of, for example, deep CONV-nets that embody today's state of the art. Implementing a network for forward inference is conceptually straightforward, and a significant amount of work has been done to understand the impact of issues like weight resolution, variability and noise on the expected performance. Online learning has not yet been addressed in a satisfactory way, nor has the widely used backpropagation algorithm. Reservoir computing is an area to which photonic networks seem to adapt well, and this network model may be useful in applications where recurrent networks could be important such as classifying sequences. The low power dissipation and high processing speed that photonics brings to ANNs will be attractive only if photonic implementations succeed at solving problems of strong interest to computer scientists and AI practitioners.

Other devices
In this section, we summarize other recent research on new device exploration, including but not limited to other CMOS devices, flash and organic devices.
Bae et al [154] propose using Schottky diodes whose work-function can be modified by charge-trapping using a back gate. The material stack proposed uses Si-SiO 2 -Si 3 N 4 , which involves well-established CMOS unit processes and can fit in a 6F 2 unit cell area, comparable to most DRAMs or 1T1R designs. However, the authors' proposal for dealing with non-linearity uses a read-before-write scheme which would be better suited to inference-only schemes as opposed to high performance training.
In [155], the authors use a charge-trapping HfSiO x layer as part of the gate dielectric stack to induce a threshold voltage shift on 28 nm CMOS planar SOI devices. This can modulate the current flow through the device, enabling analog synaptic behavior. The authors propose using this device as part of a forward inference engine, and include full mixed-signal circuit and architecture design to build a test prototype. However, at the time of writing, experimental results from the prototype are not available. Simulation results suggest 8-bit charge-trap-transistor (CTT) weight resolution may be needed for software-equivalent accuracies on MNIST, but could benefit from recent work on other inference engines demonstrating aggressive quantization of weights [156]. Also, it is not clear if the simulations capture threshold voltage increases due to non-zero source-to-body voltage, which is strongly dependent on the current being integrated in the array.
In [157], a single-crystalline SiGe layer epitaxially grown on Si was used as an analog memory device, called epiRAM. Conductance tuning is achieved through modulating confinement of conductive Ag filaments into dislocations in the SiGe layer. A defect-selective etch step was required before cation injection to widen dislocation pipes, to enhance ion transport in the confined paths and therefore increase the on/off ratio of the device. With the one-dimensional confinement for filament formation, the epiRAM devices showed improved set voltage variation both spatially and temporally. A three-layer MNIST FC network simulation with experimentally measured device characteristics showed online learning accuracy of 95.1% (97% in software).
As opposed to building new devices (albeit with existing unit processes) to exploit charge-trapping, Guo et al [158] used modified NOR flash memory arrays for inference, as shown in figure 14. They implemented a 784 × 64 × 10 neural network on a test site, and demonstrated <1 μs inference latency and ∼20 nJ average energy consumption on MNIST, and discussed prospects for further improving these numbers. They also demonstrated resilience to drift (in measured NN performance) over a timescale of 7 months, and temperature invariance. The reduced classification accuracy (∼94% in hardware versus 97.7% in software) may be attributed either to the weight tuning itself (only 30% of the weights were tuned to within 5% error), or to device and circuit variations, although it is unclear what the relative contributions were.
Lin et al [159] used organic memristive synapses based on Iron (tris-bipyridine) redox complexes. While the devices show gradual conductance change with both SET and RESET pulses, these devices still need considerable improvement and a compelling use case. The pulse width of 100 μs, as well as the need for increasing voltage amplitudes makes high-performance training difficult. The authors discuss a complete test setup, including an FPGA interface and different programming modes to tune the conductances using a Delta learning rule. Experimental demonstrations include successful learning of a three-input boolean function, along with simulations of other functions and MNIST under different assumptions of variability.

Computing-in-memory architectures
In addition to materials, devices and process integration efforts on building ideal analog memory devices for deep learning, an important research direction is the realization of larger-scale systems that can translate the raw benefits of analog computation to tangible improvements at the application level.
This includes several design challenges, e.g. area and power-efficiency in peripheral circuitry that handles communication and computations outside of the analog MAC operations, IO interfaces, resource allocation for maximizing throughput-per-unit area, control schemes, etc. It also requires circuit and/or architectural simulation frameworks to demonstrate speedup or power/energy benefits over competing CMOS CPUs, GPUs or ASICs designs on various benchmarks. Finally, an often overlooked yet equally important research challenge is achieving equivalent accuracies on these benchmark tasks in the presence of imperfect devices, circuit variability and analog noise. There is little point in being faster or more area-efficient if the hardware accelerator does not 'do the same job' as software.
This section presents an overview of several computing-in-memory architectures that address one or more of these aspects. We begin with approaches for forward inference on CONV-nets and multi-layer perceptrons (MLPs), and then discuss architectures for training.

Architectures for inference
The ISAAC accelerator [160] is positioned as a processing-in-memory architecture for forward inference of CONV-nets. MAC operations occur on 128 × 128 memristor arrays, with 2-bit memristors and eight memristors-per-synapse (16-bit weight). ADCs are used at the periphery of the arrays, with one ADC shared among 128 columns, and achieving a sampling rate of 1.28 Gigasamples/s to meet a target read time of 100 ns. Embedded DRAM (eDRAM) is used for buffering intermediate terms and digitized array outputs that are yet to be used in the next layer. For CONV-net forward inference, the authors propose a pipelining scheme that allows convolution operations on the next layer as soon as a sufficient number of pixels in that layer have been generated. They also observe that ADCs consume the most power in the design (58%), and present a weight flipping scheme that allows reduction in the ADC resolution. While the peak throughput-per-unit area of 479 GigaOps/s/mm² exceeds modern GPUs, it is somewhat unclear if one can achieve 100% utilization of the memristive arrays on more modern CONV-nets such as VGG [161], especially in the first few layers where the number of inputs is far larger than the number of weights. The impact of memristor imperfections on classification accuracies is also not discussed.
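The weight encoding and ADC sharing described above can be checked with a little arithmetic. The following Python sketch is illustrative only: the function names (slice_weight, sliced_mac, required_adc_rate) and the use of unsigned integer weights and inputs are assumptions made here, not details taken from [160]. It splits each 16-bit weight into eight 2-bit slices, performs the MAC one slice at a time, and recombines the partial sums by shift-and-add. It also confirms that a single ADC serving 128 columns within a 100 ns read must sample at about 1.28 Gigasamples/s.

```python
import numpy as np

BITS_PER_CELL = 2      # 2-bit memristors (from the text)
CELLS_PER_WEIGHT = 8   # eight cells per synapse, i.e. a 16-bit weight

def slice_weight(w):
    """Split unsigned 16-bit integer weights into eight 2-bit slices (LSB first)."""
    w = np.asarray(w, dtype=np.int64)
    mask = (1 << BITS_PER_CELL) - 1
    return [(w >> (BITS_PER_CELL * k)) & mask for k in range(CELLS_PER_WEIGHT)]

def sliced_mac(x, w):
    """Compute x @ w from per-slice MACs recombined by shift-and-add."""
    x = np.asarray(x, dtype=np.int64)
    total = np.zeros(np.asarray(w).shape[1], dtype=np.int64)
    for k, w_k in enumerate(slice_weight(w)):
        partial = x @ w_k                        # one slice's column sums
        total += partial << (BITS_PER_CELL * k)  # weight by slice significance
    return total

def required_adc_rate(columns_per_adc=128, read_time_s=100e-9):
    """Samples per second needed to digitize every shared column in one read."""
    return columns_per_adc / read_time_s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=128)             # hypothetical binary inputs
    w = rng.integers(0, 2**16, size=(128, 4))    # hypothetical 16-bit weights
    print(np.array_equal(sliced_mac(x, w), x @ w))
    print(required_adc_rate())
```

In this simplified picture each slice would occupy its own crossbar column, so eight column reads contribute to every output. That is one reason ADC count and resolution weigh so heavily on the power budget.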
The PRIME architecture [162] is a similar RRAM-based inference engine with some important distinctions. Firstly, device assumptions were more aggressive, including 4-bit (16 state) RRAMs, two of which are combined for an 8-bit weight, multiple analog voltage levels for read (which assumes I-V linearity over the entire span of read voltages), and eschewing all eDRAMs (which places a high demand on RRAM device endurance). Secondly, at the circuit-level, the authors proposed to repurpose the sense amplifiers as ADCs, and the write drivers as DACs (figure 15) in order to save area and power. They also provided a means for interfacing their architecture to a software stack, allowing mapping of several different NN topologies including MLPs and variants of VGG. Benchmarking showed potential for 3 orders of magnitude speedup on VGG over a 4-core CPU, but did not compare to GPU architectures. Classification accuracy and memristor imperfections/variability were not discussed.
In contrast to the above approaches that assume digital communication of signals between arrays, the RENO approach [163] presents a reconfigurable interconnect scheme (figure 16) that can be repurposed to transmit either analog or digital signals. ADCs or other digitizing schemes are not required except at the I/O interfaces. This approach still requires multiple analog read voltages and I-V linearity for the memristor device. The authors considered several small MLPs for benchmarks such as MNIST. However, classification accuracies are somewhat below their software counterpart. Furthermore, speed and power numbers are shown in comparison only to an Intel Atom CPU.
A paper by Fan et al [164] targeted low-power inference, as opposed to the other approaches where high performance is the point of emphasis. The use of STT-MRAM allows for several orders-of-magnitude higher endurance than either PCM or RRAM, which is necessary for the in-memory logic scheme that the authors proposed. To overcome the issue of low resistance contrast, the authors proposed using a binarized CONV-net, which has been shown to achieve comparable accuracies on benchmarks such as AlexNet. While the authors showed nearly two orders-of-magnitude less energy compared to GPUs, their reconfigurable computation scheme involves setting different reference voltages on the column sense-amps. This will likely be extremely challenging for analog-memory-based accelerators, given the aforementioned low resistance contrast and associated variability, and only gets exacerbated as higher fan-in functions are considered.
Finally, DRISA [165] is a CMOS-based approach that seeks to use 3T1C and 1T1C DRAM arrays for in-memory compute. The technological challenge here is integrating logic and DRAM, as opposed to using other NVMs. While this may seem more achievable, the upside for such a technology is low. The paper demonstrated one order-of-magnitude speedup and energy efficiency over GPUs at software-equivalent classification accuracies. However, the caveat is that this was evaluated at a mini-batch size of 1, which is inherently inefficient for GPUs. Increasing the mini-batch size to 64, which is standard for GPUs, nearly eliminated the benefits. Forward inference use cases where input data is infrequent (implying that it may not be trivial to fill up a mini-batch) yet latency and power consumption are critical, may benefit from the DRISA approach.

Architectures for training
In addition to forward inference, architectures for training also need to include mechanisms for backpropagation and open-loop weight update. This is especially challenging to implement for convolution layers, where typically multiple copies of the weights are needed for efficient forward inference, yet the same gradient needs to be applied to all copies of the weights. This requirement to consolidate weight updates from x and δ values that arrive at different points of the crossbar, together with the fact that many convolution layers may not be memory-bound in the first place, makes the prospects for hardware acceleration unclear. To our knowledge, no one has yet proposed an end-to-end training architecture for convolution layers. The papers below discuss variants of FC-networks, including MLPs and Boltzmann machines.
Figure 15. The PRIME memory bank from [162] showing a typical memory read write path (left-blue) and a typical compute path (left-brown). Schematics (A)-(E) on the right show repurposing of standard memory circuitry with additional components for implementing compute-in-memory. © 2016 IEEE. Reprinted, with permission, from [162].
In [166,167], the authors discuss an early architecture for training with backpropagation. The authors proposed using a separate training unit that needs to generate the weight updates required for all the pulses and transmit them back to the arrays. However, the challenge with implementing training separately is that the latency and temporary storage requirements for any intermediate terms need to be carefully considered. The authors also did not assume any access devices to cut off sneak path leakage, which will likely be a problem for weight update operations.
In [21], our research group at IBM Almaden described a generic architecture for training using backpropagation on NVM arrays, using approaches for circuit implementation of forward, reverse and weight update with input x, δ, and update signals all in the analog domain. We described several tradeoffs for peripheral circuitry, including several approximations to reduce area overhead and minimize time multiplexing of neuron circuits while supporting standard forward, reverse and weight update operations. In this approach, weight update is implemented directly on the crossbar array with upstream x and downstream δ firing a finite number of pulses based on their values and associated learning rate(s). This 'crossbar-compatible' and highly parallel weight update step (figure 17) was shown to achieve the same accuracy as the full weight update for the MNIST dataset [62]. In addition, we discussed the issue of 'network freeze-out' wherein NVMs whose conductance changes are gradual only in one direction (such as PCM or RRAM) eventually saturate to zero net weight. We described an occasional RESET procedure (occasional SET for RRAM) that would be needed in addition to the three NN modes, to allow training to continue.
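To make the idea of a parallel, pulse-based weight update concrete, the sketch below uses stochastic pulse coincidence. This scheme is swapped in here purely for illustration and is not the specific pulse pattern of [21]. Each upstream line fires in a time slot with probability |x|, each downstream line with probability |δ|, and a crosspoint changes by a fixed conductance step dg only when both fire in the same slot. The function name pulsed_update, the parameters n_slots and dg, and the restriction of x and δ to [−1, 1] are all assumptions. On average the result approximates the outer-product update with an effective learning rate of n_slots × dg.

```python
import numpy as np

def pulsed_update(x, delta, n_slots=1000, dg=1e-3, rng=None):
    """Approximate eta * outer(x, delta) using coincident pulses.

    Assumes |x| <= 1 and |delta| <= 1 so they can act as firing probabilities.
    The effective learning rate is eta = n_slots * dg.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=float)
    # Each neuron decides locally, slot by slot, whether to fire a pulse.
    up = rng.random((n_slots, x.size)) < np.abs(x)
    down = rng.random((n_slots, delta.size)) < np.abs(delta)
    # A crosspoint is programmed only when its row and column fire together.
    coincidences = up.astype(np.int64).T @ down.astype(np.int64)
    sign = np.sign(x)[:, None] * np.sign(delta)[None, :]
    return dg * sign * coincidences

if __name__ == "__main__":
    x = np.array([0.5, -0.25, 1.0])
    delta = np.array([0.8, -0.4])
    dw = pulsed_update(x, delta, n_slots=20000, dg=1e-4)
    print(dw)
    print(2.0 * np.outer(x, delta))  # ideal update for eta = 2.0
```

No neuron needs to know the other's value in this model, which is what allows every crosspoint in the array to be updated at once. The price is a noisy, quantized update whose variance falls as the number of slots grows.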
In [168], the authors proposed a memristive Boltzmann machine that uses RRAM crossbars to accelerate both the well-known restricted Boltzmann machines (RBMs) used in deep learning, and more general Boltzmann machines that are often applied to various optimization problems. Computation involves a three-way partial product between the downstream neuron (implemented as time gating on bit lines), an upstream neuron (implemented as time gating on word lines) and a crosspoint weight (implemented as RRAM conductances). A sense-amp-as-ADC approach similar to [162] was used. A 57× speedup was shown for some deep belief network configurations compared to a 4-core CPU (no comparison against GPUs was made). The use of this architecture to train using contrastive divergence is likely limited by the need for a separate controller to compute weight updates from the obtained energy, and to perform the write operations in the array. However, this is likely not a problem for some of the other Boltzmann machine problems, where energies are recalculated in every iteration yet the weights of the network do not change.
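The three-way partial product described above corresponds to the standard Boltzmann machine energy. The sketch below is a software reference, assuming binary neuron states and omitting bias terms; the function name crossbar_energy is hypothetical and the code is not taken from [168]. For an RBM with visible vector v, hidden vector h and weight matrix W, it evaluates E = −Σ v_i W_ij h_j by forming every row-gate × weight × column-gate product and summing them.

```python
import numpy as np

def crossbar_energy(row_states, weights, col_states):
    """Energy -sum_ij r_i * W_ij * c_j (bias terms omitted for brevity)."""
    rows = np.asarray(row_states, dtype=float)
    cols = np.asarray(col_states, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Word-line gate (rows) x crosspoint weight x bit-line gate (cols).
    partial = rows[:, None] * w * cols[None, :]
    return -partial.sum()

if __name__ == "__main__":
    v = [1, 0, 1]                    # upstream (word-line) states
    h = [1, 1]                       # downstream (bit-line) states
    W = [[1, 2], [3, 4], [5, 6]]     # hypothetical weights
    print(crossbar_energy(v, W, h))  # -14.0
```

Because the weights are only read in this computation, optimization-style problems that re-evaluate energies every iteration avoid the write bottleneck mentioned above.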

Conclusion
Innovations at the device level targeting improved conductance symmetry and linearity are still essential for hardware acceleration of training. At the same time, it is important to not view device development in isolation, according to some fixed set of criteria and specifications. Instead, device development must proceed in the context of integrated systems wherein different algorithms, architectures and circuit schemes can place different requirements on the underlying devices. For instance, techniques such as multiple conductances of varying significance [68,70], local gains [169] or other approaches can potentially change device requirements, both making them less challenging and subtly retargeting them.
In this system-centric view, the applicability of the device will be quantified based on whether it can achieve competitive machine learning accuracy not just on MNIST but on much larger datasets while accounting for the full extent of variability. While an eventual large-scale hardware demo may seem appealing, early efforts to demonstrate equivalent accuracy on commercially interesting scales would likely involve either mixed hardware-software approaches or simulations that can reasonably accurately capture real device behavior. Such experiments would greatly benefit from hardware-aware simulation frameworks that could allow mapping of networks from deep learning platforms such as TensorFlow and Caffe to real systems. Small-scale hardware demonstrations investigating the core critical modules or tasks that will eventually be an integral part of a large-scale system (for example, implementing parallel vector-vector multiplies needed in LSTMs, or block-wise reconfigurable routing) will be an extremely important stepping stone.
Identifying the right networks and tasks suitable for hardware acceleration is critical. Recurrent nets based on LSTM/GRU, RBMs and MLPs are all promising candidates, and further advancements in approximate computing for DNNs (e.g. reduced precision [58]) are a good fit with custom-hardware approaches. While convolution layers have widespread use in image classification systems, the relatively small number of weights with respect to neurons, especially in earlier layers, makes efficient implementation challenging. While it may be possible to use multiple copies of weights and/or pipelining schemes, this limits the effective throughput per unit area while also setting up additional difficulties for high-speed training. In that sense, it may be beneficial to use approaches such as transfer learning [170] that utilize pre-trained weights for convolution. It must also be noted that the evolution of deep learning has been closely tied to the evolution of the existing hardware paradigm, which is better suited to handle convolution. The emergence of reliable high-performance non-VN architectures could, in a similar fashion, fuel further innovations in the algorithms.
At the circuit and micro-architecture levels, there are several open avenues of research. For instance, analog-to-digital converters at the edge of each array might provide the maximum flexibility for implementing arbitrary neuron functions in the digital domain. However, the tradeoffs in power and performance need to be carefully quantified. There may still be good use cases for hybrid analog-digital chips, e.g. in the area of memcomputing [112]. Memcomputing applications that could use the exact same hybrid chips designed for forward-inference of deep learning acceleration would be particularly attractive.
Similarly, encoding DNN neuron activations in voltage levels as opposed to time durations at fixed voltage may seem promising; however, device non-linearity with respect to changing read voltages needs to be carefully considered. Furthermore, routing a large number of analog voltage levels between arrays would need specialized operational amplifier circuitry that would be both area- and power-inefficient, as opposed to simple buffering. Analog signal noise sources could negate improvements in device characteristics, especially in large arrays. Reconfigurable routing for mapping arbitrary NN topologies onto the same piece of hardware is necessary for reuse, and must potentially tie in with higher level software frameworks. This is closely tied to finding mechanisms for fast export and import of weight information from/to the chip, allowing accelerated distributed deep learning, which is a must for competing against GPUs for training applications.
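The concern about voltage-level encoding raised at the start of the preceding paragraph can be illustrated with a toy device model. The sketch below assumes a hypothetical sinh-shaped I-V curve, with the parameters g and alpha chosen arbitrarily rather than fitted to any device in this survey. It compares the charge collected when an activation x in [0, 1] is encoded as a pulse duration at a fixed read voltage against the charge collected when x is encoded as a read-voltage amplitude over a fixed duration.

```python
import numpy as np

def device_current(v, g=1e-6, alpha=10.0):
    """Hypothetical nonlinear I-V: I = g * sinh(alpha * v) / alpha."""
    return g * np.sinh(alpha * v) / alpha

def charge_time_encoded(x, v_read=0.2, t_max=100e-9):
    """Activation sets the pulse duration; the read voltage stays fixed."""
    return device_current(v_read) * (np.asarray(x, dtype=float) * t_max)

def charge_voltage_encoded(x, v_max=0.2, t_fixed=100e-9):
    """Activation sets the read amplitude; the pulse duration stays fixed."""
    return device_current(np.asarray(x, dtype=float) * v_max) * t_fixed

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 11)
    q_t = charge_time_encoded(x)
    q_v = charge_voltage_encoded(x)
    # Normalize to full scale and compare against the ideal linear response.
    print(np.max(np.abs(q_t / q_t[-1] - x)))  # essentially zero
    print(np.max(np.abs(q_v / q_v[-1] - x)))  # clearly non-zero
```

Under this model the duration-encoded response is linear in x whatever the shape of the I-V curve, because the device is always read at the same voltage. The amplitude-encoded response inherits the device non-linearity directly.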
The requirements for forward inference are in general less stringent, although there are one or two unique challenges. Firstly, linear and symmetric conductance response is not needed, as closed-loop weight tuning can be used. Secondly, inference could possibly work even with a limited dynamic range of weights, benefitting from recent work on weight quantization [156], as well as hardware-aware training regularization approaches such as the one described in [122]. However, as discussed in table 2, devices would need to demonstrate excellent long-term resilience to drift in conductance, even at elevated temperatures. Furthermore, defect rates should (eventually) be low enough that many thousands or even millions of chips can be programmed with the exact same set of pre-trained weights, with minimal provisioning for spare rows/columns. As with training, a key challenge will be demonstrating larger hardware demos that can show equivalent accuracy to software. However, achieving significantly lower throughput per unit area than GPUs may still be acceptable, if these chips can deliver on ultra-low power inference that could make them ubiquitous in the mobile/embedded/consumer space.
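A minimal sketch of the closed-loop weight tuning mentioned above follows, written as a program-and-verify loop. The device model is hypothetical, with asymmetric step sizes and multiplicative pulse-to-pulse noise, and the name closed_loop_tune and its default values are assumptions rather than a procedure from any cited work. It shows why symmetry and linearity matter less for inference: each pulse is followed by a read, so the loop stops once the conductance falls within a relative tolerance of its target.

```python
import numpy as np

def closed_loop_tune(target, g0=0.0, tol=0.05, max_iters=200, rng=None):
    """Iteratively pulse a simulated device until |g - target| <= tol * |target|.

    Returns the final conductance (arbitrary units) and the pulses used.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    g = g0
    for pulses in range(max_iters):
        error = target - g                 # verify step: read back and compare
        if abs(error) <= tol * abs(target):
            return g, pulses
        # Hypothetical device: potentiation steps are larger than depression
        # steps, and every pulse carries 50% relative noise.
        nominal = 0.02 if error > 0 else -0.01
        g += nominal * (1.0 + 0.5 * rng.standard_normal())
    return g, max_iters

if __name__ == "__main__":
    g, pulses = closed_loop_tune(0.5)
    print(g, pulses)
```

The 5% default tolerance mirrors the tuning criterion quoted earlier for the NOR flash demonstration. Tightening it trades programming time for accuracy.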