In-Memory Computing for Machine Learning and Deep Learning

In-memory computing (IMC) aims at executing numerical operations via physical processes, such as current summation and charge collection, thus accelerating common computing tasks including matrix-vector multiplication. While extremely promising for memory-intensive workloads such as machine learning and deep learning, the design and realization of IMC face significant challenges due to device and circuit nonidealities. This work provides an overview of the research trends and options for IMC-based implementations of deep learning accelerators with emerging memory technologies. The device technologies, the computing primitives, and the digital/analog/mixed design approaches are presented. Finally, the major device issues and metrics for IMC are discussed and benchmarked.


I. INTRODUCTION
Today, artificial intelligence and its enabling technology, deep neural networks (DNN), have become widely popular in applications such as image recognition, autonomous vehicles, speech recognition, and natural language processing. In the last five years, the number of parameters in state-of-the-art DNN models has increased by about four orders of magnitude, leading to a significant increase in computational and memory requirements for both training and inference [1], [2], [3], [4], [5], [6]. Traditional computing systems (Fig. 1a) typically store massive amounts of information in a memory unit that is physically connected to the computational unit by a data bus. The continuous data movement between the processing and memory units represents the main bottleneck due to limited bandwidth, long latency, sequential data processing, and high energy consumption [7], [8].
To minimize the latency and energy overhead of conventional von Neumann computers, in-memory computing (IMC) aims at performing the computation in close proximity to the memory or even in situ within the memory itself [9], [10]. The range of operations that can be executed within memory devices includes stateful logic [11], [12], pulse integration [13], [14], associative memory [15], [16], and stochastic computing [17]. The most popular and enabling IMC operation is, however, matrix-vector multiplication (MVM) via Ohm's and Kirchhoff's laws in a memory array [18], [19]. IMC has thus been largely targeted at hardware accelerators for DNNs, where MVM is by far the most intensive workload. The ability to execute an MVM in a single operation by activating all rows and all columns in parallel represents a key benefit of IMC that is unrivaled by other technologies. Despite the simplicity of the MVM concept and the potential advantages of IMC, the design options and the interaction between circuit operation and device nonidealities still represent a key open challenge.
This work provides an overview of IMC for DNN acceleration from the perspectives of device technology, circuit design, device-circuit interaction, and its impact on computing accuracy. Section II illustrates the emerging nonvolatile memory technologies that are currently considered for IMC. Section III presents an overview of various IMC circuit topologies for performing matrix-vector multiplication and their possible applications. Among these applications, the most promising one is the IMC acceleration of DNN inference, discussed in Section IV. Hence, Section V illustrates the most critical device nonidealities affecting the accuracy of IMC circuits. Section VI provides an overview of the open challenges for the research field, while Section VII concludes the work.

II. COMPUTATIONAL MEMORY TECHNOLOGIES
The main benefit of IMC is the improved energy efficiency thanks to the reduction or suppression of data movement. A first option to mitigate data movement is to bring the main memory core directly on the chip via high-density embedded DRAM [20] or embedded nonvolatile memory (NVM). This approach, called near-memory computing and depicted in Fig. 1b, allows the storage of even megabytes of model parameters, such as synaptic weights and activations, in close proximity to the processing unit. A second option [21], [22], [23] is true IMC, where computation is executed directly within the SRAM array as shown in Fig. 1c. A key limitation of this option is the volatile nature of SRAM and its relatively low density compared to DRAM and emerging NVM. In fact, each SRAM cell consists of at least six transistors, and the bit value remains stored only until the power supply is switched off. To overcome these limitations, the third option embraces emerging NVM devices for both nonvolatile storage of computational parameters and in situ MVM acceleration (Fig. 1d).
Here, we will focus on emerging NVM technologies that are suitable for the IMC concept of Fig. 1d. In general, these devices have three major advantages, namely (i) nonvolatile storage, which allows the persistence of synaptic weights even when the supply is disconnected, (ii) integration in the back-end of line (BEOL), which makes the NVM process compatible with virtually any front-end technology, and (iii) high density compared to SRAM. The major NVM technologies for IMC applications are sketched in Fig. 2.
The resistive-switching random access memory (RRAM in Fig. 2a) consists of a metal-insulator-metal (MIM) stack, where the insulator serves as the active switching material [24]. The memory operation relies on the activation and deactivation of a conductive filament across the switching layer [25]. RRAM generally displays binary states, referred to as the low resistance state (LRS) and the high resistance state (HRS) [26]. However, RRAM can also display multilevel operation [27], where the conductance can be tuned in the analog domain [28]. RRAM devices can be easily integrated into crosspoint arrays [25] and scaled down to the 22 nm CMOS node [29].
The phase change memory (PCM in Fig. 2b) relies on the ability to electrically change the crystalline/amorphous phase of an active chalcogenide material, whose resistance correspondingly changes by at least two orders of magnitude [68]. The most typical material is Ge2Sb2Te5 (GST) [69], although Ge-rich alloys are adopted for high-temperature retention in embedded solutions [70]. The phase change is induced by Joule heating via the application of voltage pulses. If the local temperature exceeds the melting temperature, the resulting phase is amorphous, corresponding to the HRS. If instead the local temperature is kept below the melting point, but above the crystallization temperature, for a sufficient time, the structure stabilizes to the crystalline phase, corresponding to the LRS [71]. Thanks to the relatively mature technology, these devices have been extensively used for IMC demonstrators [72].
The ferroelectric random access memory (FeRAM in Fig. 2c) consists of a metal-ferroelectric-metal (MFM) structure, where the ferroelectric layer exhibits a permanent and switchable electrical polarization [73]. FeRAM has received renewed interest after the discovery of ferroelectricity in hafnium oxide (HfO2) with orthorhombic structure [74]. A key issue with FeRAM is its destructive readout operation, since reading is performed above the coercive field. This limitation is overcome by the ferroelectric tunnel junction (FTJ), where different polarization states show different resistances even at low read voltages [75].
The spin-transfer torque magnetic random access memory (STT-MRAM in Fig. 2d) consists of an MIM stack where the top and bottom metals are ferromagnetic (FM), such as Fe, Co, Ni, and their alloys. The stack forms a magnetic tunnel junction (MTJ), where different orientations of the magnetic polarization in the two FM layers, namely the parallel (P) or antiparallel (AP) state, result in the LRS or HRS, respectively [76]. STT-MRAMs feature fast switching and good cycling endurance [77], despite suffering from a relatively small resistance window and difficult multilevel operation, which limits their use to binarized neural networks.

The devices in Fig. 2a-d have a two-terminal structure, which makes them suitable for high-density crosspoint architectures [10]. In many cases, two-terminal devices are connected to an access transistor, resulting in a one-transistor/one-resistor (1T1R) structure with improved control of the device current during programming and readout. Alternatively, three-terminal devices have been proposed. The ferroelectric field-effect transistor (FeFET in Fig. 2e) consists of a field-effect transistor whose gate stack contains a ferroelectric layer [78]. The ferroelectric polarization is reflected in the threshold voltage V_T of the device, resulting in a memory effect similar to that of floating-gate devices. FeFET arrays with ferroelectric HfO2 have been recently demonstrated [35], [79].
The spin-orbit torque magnetic random access memory (SOT-MRAM in Fig. 2f) consists of a magnetic tunnel junction (MTJ) structure deposited on top of a line of heavy metal, such as Pt or W [80]. The MTJ is programmed to the P/AP state by a current flowing across the heavy-metal line via spin-orbit coupling. The cell is read by sensing the MTJ resistance, as in the STT-MRAM. The three-terminal structure allows the separation of the programming and reading paths, improving the cycling endurance and the write speed [81].
The electrochemical random access memory (ECRAM in Fig. 2g) consists of a transistor device where the conductivity of the channel is modified in a nonvolatile, reversible way by injecting ionized dopants across an electrolyte layer [82]. ECRAM generally shows high endurance and extremely low power consumption thanks to the low-mobility channel, for instance, WO3 [83]. ECRAM also exhibits a controllable, linear weight update that is suitable for training accelerators [82], [84].
The memtransistor (Fig. 2h) consists of a transistor device with a 2D semiconductor material as the channel layer [85], [86], [87]. The memory behavior can be obtained by migration of dislocations in polycrystalline MoS2 [88], lateral migration of Ag across the source/drain electrodes [85], or charge trapping [89]. In some cases, MoS2 memtransistors display gradual weight-update characteristics that are useful for reservoir computing [89] and training accelerators [90].

A. COMPARISON OF NVM TECHNOLOGIES
To summarize and provide some quantitative information, Table 1 shows a comparison between the main emerging memories and charge-based CMOS memories [91]. Fig. 3a shows a correlation plot of speed, evaluated as the inverse of the read time, and density, evaluated as the inverse of the cell area. Data from the literature are compared to the typical ranges for conventional CMOS-based memory technologies, such as SRAM, DRAM, and NAND Flash. The performance and cost of emerging NVMs are usually intermediate: their speed approaches that of DRAM, whereas their density generally lies between SRAM and DRAM.
Fig. 3b shows the array size as a function of the technology node for various NVM demonstrators. The capacity spans the whole range from embedded memory (1-100 MB) to standalone memory (1-100 GB). Note that smaller technology nodes do not necessarily lead to higher array capacity, owing to the different maturity levels of the technologies. Fig. 3c shows the memory capacity of some NVM demonstrators as a function of the year, highlighting the continuous development of various memory technologies.

III. IN-MEMORY MATRIX-VECTOR MULTIPLICATION
Most IMC implementations aim at accelerating matrix-vector multiplication (MVM), which is by far the most essential computing primitive in deep learning and machine learning [92]. Fig. 4 shows a sketch of the MVM concept.
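As a minimal numerical sketch of this concept (all sizes and values are illustrative assumptions, not taken from the cited demonstrators), the ideal crosspoint MVM of Fig. 4 reduces to Ohm's law per cell and current summation per row:

```python
import numpy as np

# Illustrative 64x64 crosspoint array: one conductance per cell (siemens).
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 50e-6, size=(64, 64))   # 1-50 uS, arbitrary range

V = rng.uniform(0.0, 0.2, size=64)            # input voltage vector (V)

# Ohm's law gives I_ij = G_ij * V_j in each cell; Kirchhoff's current law
# sums the currents along each grounded row, yielding the MVM in one step.
I = G @ V                                     # output current vector (A)
```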
Depending on the required specifications and the memory devices, various IMC implementations of MVM accelerators are possible. Fig. 5a shows the resistive crosspoint array, similar to Fig. 4, where device conductances can be programmed in the binary [94], [95] or multilevel domain [96], [97]. The steady-state currents collected at the grounded rows are generally acquired by a readout chain consisting of a transimpedance amplifier (TIA) and an analog-to-digital converter (ADC) [98]. A major limitation of this architecture is the programming operation, where voltages and currents might be difficult to control [99]. In particular, when applying various programming schemes [100], [101], a certain number of half-selected cells experience a nonnegligible leakage current. To address these programming issues, an access device is normally added in series to the resistive element. Fig. 5b shows the 1T1R configuration, which allows finer control of the program/read current, at the cost of a larger cell footprint and an additional line for the transistor gate terminal [102], [103]. Fig. 5c shows the one-selector/one-resistor (1S1R) configuration [104], [105]. A selector is a nonlinear element capable of suppressing the leakage (also called sneak-path) currents of half-selected cells during the programming phase, while maintaining a small cell footprint and a compact two-terminal configuration [106], [107].
Fig. 5d illustrates a crosspoint array based on capacitive memory elements, whose small-signal capacitance can be programmed. In this configuration, the MVM is typically carried out in two distinct phases. First, the capacitors are precharged by applying a voltage proportional to the input vector. Then, the capacitors are discharged by switches placed at the end of the columns and rows, while the accumulated charges are collected by analog integrators [108]. In this case, multiplication is carried out by the characteristic law of the capacitance, namely Q_i,j = C_i,j · V_j, where C_i,j serves as the weight and V_j is the applied input/activation.
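A corresponding sketch for the capacitive array, under the same illustrative assumptions, models the two phases as an element-wise charging step followed by a row-wise charge collection:

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.uniform(1e-15, 10e-15, size=(32, 32))  # programmable capacitances (F)
V = rng.uniform(0.0, 0.5, size=32)             # input voltages (V)

# Phase 1: precharge, each cell stores Q_ij = C_ij * V_j.
Q = C * V                                      # broadcasts V over the columns

# Phase 2: discharge, row integrators collect the accumulated charges,
# so the total charge per row equals the MVM result C @ V.
Q_row = Q.sum(axis=1)
```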
The input signals can generally be encoded either in the voltage amplitude, through amplitude encoding, or in the pulse width, through temporal encoding. The latter approach, shown in Fig. 5e, is typically implemented in 1T1R arrays, where the memory elements are subject to a fixed voltage V_READ while the input signals are applied to the transistor gates. By integrating the transient currents on a capacitance or through the adoption of analog integrators, the resulting output voltage is proportional to the MVM result [72].
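In a simplified model of temporal encoding (again with assumed, arbitrary values), each cell conducts a fixed-voltage current for a time proportional to its input, and the integrated charge carries the MVM result:

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.uniform(1e-6, 50e-6, size=(16, 16))   # cell conductances (S)
t = rng.uniform(0.0, 100e-9, size=16)         # pulse widths encoding inputs (s)
V_READ = 0.2                                  # fixed read voltage (V)

# Each cell sinks I_ij = G_ij * V_READ for a time t_j; the integrator on each
# row accumulates the charge, which is proportional to the MVM output.
Q = V_READ * (G @ t)                          # accumulated charge per row (C)
```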
Kirchhoff's voltage law (KVL) can be used instead of KCL for accumulation [109]. This is shown in Fig. 5f, where the adoption of a 2T2R cell configuration enables a binary XNOR multiplication between the input voltage and the conductance. The multiplication activates only one of the two paths, exposing an LRS or an HRS depending on the result of the multiplication. By sensing the series resistance summation at each column, it is possible to collect the results of the MVM.

A. APPLICATIONS OF IMC MVM ACCELERATORS
MVM is ubiquitous in a variety of algorithms and workloads, and IMC circuits to accelerate MVM have thus been demonstrated in several data-intensive computing tasks, as schematically depicted in Fig. 6.
Applications include image processing and image compression (Fig. 6a) via the discrete cosine transform (DCT). Here, image processing/compression is achieved by applying the MVM between a fixed DCT matrix and the pixel-intensity input vector, preserving only the frequencies within a desired band according to the compression ratio [19], [111].
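As a sketch of this use case, the following Python fragment builds an orthonormal DCT-II matrix (standing in for the programmed conductances), applies it to a pixel vector as an MVM, and discards the high-frequency coefficients; the vector length and cutoff are arbitrary choices:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix; its rows play the role of fixed weights."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

n = 8
D = dct_matrix(n)               # matrix programmed once into the array
x = np.linspace(0.0, 1.0, n)    # pixel-intensity input vector

y = D @ x                       # in-memory MVM: DCT coefficients
y[n // 2:] = 0.0                # keep low frequencies only (compression)
x_rec = D.T @ y                 # approximate reconstruction (inverse DCT)
```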
In closed-loop IMC (CL-IMC), the MVM array core is connected in the feedback loop of an array of operational amplifiers (OAs), as shown in Fig. 6b [112]. This class of circuits allows the acceleration of a broad range of linear algebra operations, such as matrix inversion [113], eigenvector extraction [114], linear regression [115], and ridge regression [116], with a significant reduction in time complexity.
Combinatorial optimization (Fig. 6c) relies on the intrinsic noise of the memory elements and the peripheral circuit as an on-chip source of entropy to carry out a physical simulated annealing, allowing the iterative search to escape from local minima [117]. In these applications, MVM accelerators are typically used in recurrent architectures to map restricted Boltzmann machines (RBM) [13], [118], [119] or Hopfield neural networks [120], [121], [122]. Similarly, Bayesian neural networks (Fig. 6d) rely on the intrinsic variations of the programmed conductance to model the probability distributions of a Bayesian network [123].
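The following sketch illustrates the principle for a Hopfield network: the recurrent MVM is perturbed by a noise term whose amplitude is gradually annealed, mimicking the on-chip entropy source (the weight matrix, noise schedule, and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 16
W = rng.standard_normal((n, n))
W = (W + W.T) / 2.0                 # symmetric Hopfield weight matrix
np.fill_diagonal(W, 0.0)
s = rng.choice([-1.0, 1.0], size=n) # random initial state

steps = 2000
for step in range(steps):
    # Annealed noise stands in for the intrinsic device/circuit noise.
    noise = 0.5 * (1.0 - step / steps) * rng.standard_normal(n)
    h = W @ s + noise               # recurrent in-memory MVM plus noise
    i = rng.integers(n)
    s[i] = 1.0 if h[i] >= 0 else -1.0   # asynchronous threshold update

energy = -0.5 * s @ W @ s           # Hopfield energy of the final state
```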
The most popular applications for MVM remain DNN inference (Fig. 6e) and training (Fig. 6f). A key difference between these applications is that synaptic weights are obtained from ex situ software-based training in the case of inference accelerators, while they are trained in situ via iterative gradient-descent algorithms in the case of DNN training accelerators. Typically, a training accelerator is capable of performing inference via forward propagation, while also featuring an in situ weight-update scheme, generally via a vector-vector outer product within the crosspoint array [124]. Weight update requires linearity and symmetry of the conductance update under the application of a sequence of identical pulses, in line with the backpropagation algorithm. In this respect, ECRAM and memtransistors, with their gradual and linear weight-update characteristics (Section II), are among the best candidate technologies.

FIGURE 7. DNN inference workload mainly consists of MVM, which is basically a Multiply-and-Accumulate operation. Crosspoint accelerators of DNN inference can be classified depending on the way these two operations are performed. A fully digital approach relies on memory logic gates implementing an XNOR-Multiply and on a counter for the accumulation. A mixed digital-analog approach requires an analog accumulation via Kirchhoff's current law (KCL).
A fully analog approach relies on resistive elements that allow the encoding of multilevel weights and activations. Going from digital to analog, the parallelism and the information density of the accelerator increase, at the expense of more severe parasitic effects and more complex peripheral circuits. Further exploration of the fully analog approach is needed to unleash the potential of IMC for DNN inference acceleration.

IV. IN-MEMORY ACCELERATION OF DNN INFERENCE
The computational workload of a DNN mostly consists of MVMs with variable input vectors and stationary weight matrices, which can be directly accelerated by a memory array. Depending on whether the multiply and accumulate operations are performed in the analog or digital domain, three different options can be identified for MVM accelerators, as depicted in Fig. 7.

A. FULLY DIGITAL CIRCUITS
The fully digital approach relies on memory logic gates to perform the multiplication and on counters to perform the sequential accumulation. To encode the binary alphabet of a binary neural network (BNN) [126], where activations and weights can be −1 or 1, the logic gates usually implement an XNOR operation, which allows mapping a (−1, 1) multiplication into the classical (0, 1) binary domain [127], as schematically shown in Fig. 8.
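The mapping can be verified with a few lines of Python (a self-contained sketch, not tied to any specific chip): the popcount of the XNOR outputs recovers the signed dot product up to a linear rescaling:

```python
# Signed values b in {-1, +1} are encoded as bits in {0, 1}; the product of
# two signed values is +1 exactly when the two bits are equal, i.e., XNOR.
def xnor(a: int, b: int) -> int:
    return 1 - (a ^ b)

weights     = [1, 0, 1, 1, 0]   # bit encoding: 0 <-> -1, 1 <-> +1
activations = [1, 1, 0, 1, 0]

popcount = sum(xnor(a, w) for a, w in zip(activations, weights))
dot = 2 * popcount - len(weights)   # signed dot product; here equals 1
```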
Fig. 9 shows a building block based on a differential 2T2R cell, also displaying the XNOR gate and the sense amplifier (from bottom to top). The binary weight is stored as a resistive pair (HRS, LRS) or (LRS, HRS) in the 2T2R cell. For instance, to map a weight equal to 1, the memory element corresponding to B is programmed to the LRS while its complement B̄ is programmed to the HRS. The activation (input) signal A and its complement Ā connect the 2T2R cell to the sense amplifier in a straight path, for input A = 1, or a crossed path, for input A = 0. When the clock signal closes the conductive path to ground, the cross-coupled latch of the sense amplifier compares the resistive states of the memory elements and raises the voltage at one of the two output nodes, while lowering the other. For instance, assuming A = 1 and B = 1, the XNOR node potential increases while its complement decreases. The XNOR output is then digitally counted by a popcount operation. Thanks to the binary comparison of the two device resistances in the 2T2R structure, the memory cell is resilient to drift, noise, device variability, and temperature variations [128], [130].
SRAM-based digital accelerators have also been demonstrated with various memory cells, ranging from six-transistor (6T) cells to twelve-transistor (12T) cells [21], [22], [23], [133]. While providing only volatile storage of weights, SRAMs offer the advantage of fully-CMOS integration, which can be manufactured even at extremely scaled technology nodes, such as 5 nm [133].
In general, the fully digital approach is exceptionally robust to various nonidealities, such as device variability, drift, noise, and IR drop, and it can offer higher reconfigurability [134], [135], [136]. However, because the accumulation proceeds by counting, the parallelism of the computation is limited to just one row at a time, thus limiting the available throughput.

B. MIXED DIGITAL-ANALOG CIRCUITS
In a mixed digital-analog circuit for DNN acceleration, accumulation is performed in the analog domain by KCL, thus avoiding the sequential counting of the pulses, while multiplication remains implemented in the digital domain by an XNOR gate.
Fig. 10 shows the computing core of a mixed digital-analog accelerator based on SRAM. The XNOR is implemented in an eight-transistor/one-capacitor (8T1C) cell, where the weight a and its complement ab are stored in the SRAM, while the activation x and its complement xb are applied to the PMOS transistors connected to the cell capacitance [145]. Assuming a = 1 and x = 0, the complementary node ab is shorted to the capacitance, setting the output voltage to 0 V. The accumulation is then performed through charge sharing of all cell capacitors onto the shared bitline [145], [151], [152]. Alternatively, charge accumulation has been proposed by charge redistribution on weighted capacitances [150], [153], [154].
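A minimal model of this charge-sharing accumulation (with assumed values, and neglecting the bitline's own capacitance) treats each cell capacitor as holding either 0 V or VDD according to its XNOR result; sharing among identical capacitors makes the bitline settle to the mean, which encodes the popcount:

```python
import numpy as np

VDD = 0.8                                  # assumed supply voltage (V)
xnor_out = np.array([1, 0, 1, 1, 0, 1])    # per-cell XNOR results (example)
v_cell = xnor_out * VDD                    # each local capacitor: 0 V or VDD

# Charge sharing among identical capacitors: the shared bitline settles to
# the mean cell voltage, proportional to the popcount (analog accumulation).
v_bitline = v_cell.mean()                  # here (4/6) * VDD
```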
When the multiplication results are produced in the form of steady-state currents instead of charge, it is sufficient to collect them at a common node, exploiting KCL, and acquire the output current sums through a readout circuit [137], [138], [139], [140]. Depending on the BNN, the resulting current sum can also be directly compared to a reference current by means of a sense amplifier [138], thus implementing a threshold-type activation function. When adopting differential NVM cells, another proposed accumulation method is to implement a voltage divider composed of pull-up or pull-down resistances according to the XNOR results [141], [147], [148], and then acquire the common-node voltages, which are proportional to the result of the MVM.
Overall, a mixed digital-analog approach takes advantage of the inherent parallelism of IMC, virtually reaching a computational complexity of O(1). However, the analog accumulation requires more complex peripheral circuitry, often involving a bulky and energy-hungry readout chain, and is more sensitive to parasitic effects, such as IR drop and noise. Furthermore, when the multiplication relies on a single NVM, without conductance comparisons or error-resilient circuits, device variability and drift can also affect the computation.
C. FULLY ANALOG CIRCUITS
Thanks to their multilevel operation, resistive memories are suitable for implementing non-binary weights in the same circuit footprint, thus enabling a higher area efficiency, defined as the number of operations performed per unit area. Indeed, memory elements can be programmed in binary [103], [163] or multilevel mode [102], [160]. Alternatively, multilevel weights are obtained through bit-slicing techniques [156], [157], [162] (illustrated in the sketch at the end of this subsection), differential implementations, or more complex cell structures, allowing several conductive levels to be obtained [159]. Also, a hybrid binary-multilevel accelerator has been proposed to achieve the best trade-off between accuracy and area efficiency [161]. Alongside the increase in the number of conductive levels, memory cells can contain a variable number of elements, for instance, the 1T1R cell [95], [102], [164], the differential 2T2R cell [158], or higher-complexity cells such as 8T4R [159]. In addition to multilevel weights, analog accelerators typically feature multilevel or analog activation signals that can be modulated through amplitude [102], [159] or temporal encoding [72], [156], [165].

Fig. 11 shows two possible implementations of fully analog circuits, relying on either current or charge accumulation. Current-mode sensing requires applying a clamped voltage to the source lines, thus generating current contributions in each 1T1R cell that are collected and converted to a voltage by the current ADC. Voltage-mode sensing, on the other hand, consists of two separate phases: first, the multiplication results are stored in the source-line capacitances, then they are accumulated onto a sample capacitance by charge sharing. The voltage across the sample capacitance is finally acquired by the ADC [155].
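The bit-slicing recombination mentioned earlier in this subsection can be sketched as follows (assuming, for illustration, 4-bit weights and binary activations): each bit plane is mapped to a binary array, and the partial MVMs are recombined digitally by shift-and-add:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.integers(0, 16, size=(8, 8))          # 4-bit integer weights
x = rng.integers(0, 2, size=8).astype(float)  # binary activations

y = np.zeros(8)
for b in range(4):
    W_b = (W >> b) & 1          # binary slice holding bit b of every weight
    y += (2 ** b) * (W_b @ x)   # partial analog MVM, then digital shift-add

assert np.allclose(y, W @ x)    # recombination matches full-precision MVM
```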
Fully analog accelerators can harness the full potential of IMC, thanks to the massive parallelism and the extremely high information density of multilevel weights and activations. On the other hand, accurate readout and conversion circuits are essential to fully benefit from these features, resulting in a significant overhead in area, power, and cost. Furthermore, analog computing is critically affected by parasitic effects at the device and circuit levels.

V. MEMORY NONIDEALITY AND METRIC
Memory devices and circuits rely on physical, materials-based storage concepts that are never ideal. IR drop refers to the current-induced voltage drop across the parasitic wire resistances along the rows and columns of the memory array (Fig. 12a). The wire resistance is nonnegligible in highly scaled arrays because of the small cross-section of the metal lines. Furthermore, analog accumulation in parallel IMC requires several cells to be read at the same time, thus increasing the wire current and hence the IR drop. IR drop modifies the effective cell voltage compared to the externally applied signal, resulting in a current error that is proportional to the average device conductance, to the wire resistance, and to the square of the array size [99], [169]. In practice, the error induced by IR drop is the main limitation to array-size up-scaling, thus preventing the ideal computational complexity of O(1) from being reached. Generally, IR drop is reduced by adopting low-conductance devices, differential cells [158], or small computing-tile architectures [169]. More elaborate techniques have been proposed at the architectural level [170], [171], the algorithmic level [169], [172], [173], and the training level [174].
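The quadratic scaling quoted above can be captured by a first-order estimate (an illustrative model with assumed parameter values, not a full nodal simulation):

```python
def ir_drop_error(g_avg: float, r_wire: float, n: int) -> float:
    """First-order relative MVM error from IR drop, proportional to the
    average cell conductance, the wire resistance per segment, and the
    square of the array size (illustrative scaling law only)."""
    return g_avg * r_wire * n ** 2

# Example: 10 uS average conductance, 1 ohm per wire segment, 256x256 array
# gives a relative error of order 0.65, i.e., prohibitive for large arrays.
err = ir_drop_error(10e-6, 1.0, 256)
```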
Multilevel operation allows the improvement of area efficiency [28], [175], [176]. However, NVMs have limited precision in programming the conductance, for instance, due to size variations of the conductive filament in RRAM or of the crystalline grains in PCM. The limited precision manifests as a device-to-device (D2D) variability or a cycle-to-cycle (C2C) variability within the same device [177], [178]. D2D variability is shown in Fig. 12b, reporting a multilevel RRAM device with a non-negligible spread of the conductive states. Unlike the digital domain, where binary levels can be discriminated despite a possible spread, computing in the analog domain can be critically affected even by a small variation.
Drift is generally observed in PCM, where the structural relaxation of the amorphous phase causes an increase in resistance with time [179]. Drift can also affect the polycrystalline phase in multilevel PCM devices, as a result of residual amorphous regions [167], [180]. Fig. 12c shows the temporal decay of the conductance of multilevel analog states, described by the slope ν on the bilogarithmic plot. Drift is also observed in other devices, such as RRAM and FeFET, although the physical mechanism differs from that of PCM. Drift can be mitigated by adopting reference PCM cells [140], [181], [182], [183] or differential 2T2R structures [128].
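The decay in Fig. 12c is commonly described by a power law; a minimal sketch of this model, with an assumed drift coefficient, reads:

```python
def drift_conductance(g0: float, t: float, t0: float = 1.0,
                      nu: float = 0.05) -> float:
    """Power-law drift G(t) = G0 * (t / t0) ** (-nu), where nu is the slope
    on the bilogarithmic plot (value assumed here for illustration)."""
    return g0 * (t / t0) ** (-nu)

g_1s   = drift_conductance(50e-6, 1.0)      # right after programming
g_1day = drift_conductance(50e-6, 86400.0)  # ~43% lower after one day
```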
Finally, various sources of noise and fluctuations may affect NVM devices. For instance, Fig. 12d shows the 1/f current noise of RRAM devices, which causes an increasing relative spread of the measured current [168]. In addition to 1/f noise, thermal noise and random telegraph noise (RTN) can contribute to time-dependent variations of the weights, thus affecting the accuracy of the analog MVM. Noise can be mitigated by adopting analog integration of the readout current, although at the cost of a reduced computation speed.
To properly benchmark various NVM technologies for use in mixed or fully analog DNN accelerators, it is important to set a common metric. To this purpose, Fig. 13a shows a correlation plot between the average conductance value G and its standard deviation σ_G. Data were obtained for various NVM devices, including FeFET [79], PCM [185], RRAM [166], and STT-MRAM [184]. The conductance G should be minimized to reduce the readout currents, hence the energy consumption and the IR drop. Similarly, σ_G should be minimized to improve the computing accuracy in analog/mixed circuits. The observed trend in the figure is that σ_G and G approximately follow the relationship σ_G/G ≈ 0.15, irrespective of the NVM technology and the programmed state. Fig. 13b illustrates the relative current error of an MVM operation in the presence of variations and IR drop as a function of the array size for the NVM devices in Fig. 13a. For relatively small array sizes, the error decreases as a result of variability averaging among the NVM devices. As the array size increases, the IR drop causes the error to increase steeply. The optimal array size, identified at the minimum error, is dictated by σ_G and G, which control variability and IR drop, respectively.
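A crude Monte-Carlo model reproduces the qualitative trend of Fig. 13b under assumptions stated in its comments (σ_G/G = 0.15 variability that averages out with size, plus the quadratic IR-drop term from above; all parameter values are illustrative):

```python
import numpy as np

def mvm_relative_error(n: int, g_avg: float = 10e-6, rel_sigma: float = 0.15,
                       r_wire: float = 1.0, trials: int = 200,
                       rng=np.random.default_rng(5)) -> float:
    """Crude model: programming variability averages out roughly as
    1/sqrt(n), while the IR-drop error grows as g_avg * r_wire * n**2."""
    errs = []
    for _ in range(trials):
        g = g_avg * (1.0 + rel_sigma * rng.standard_normal(n))
        errs.append(abs(g.sum() - n * g_avg) / (n * g_avg))  # variability
    return float(np.mean(errs)) + g_avg * r_wire * n ** 2    # + IR drop

sizes = [16, 64, 256, 1024]
errors = [mvm_relative_error(n) for n in sizes]  # minimum at an optimal size
```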

VI. OUTLOOK
IMC circuits are dense, fast, energy-efficient, and scalable. Several solutions and applications have already been identified and explored for both machine learning and deep learning. However, various technological and design challenges have also been identified. Further development and industrialization of IMC require addressing these challenges along two major directions.
The first direction concerns the study of device technology and materials. The IMC paradigm would greatly benefit from the adoption of precise, stable, and low-current memory devices that can be easily integrated in the BEOL of extremely scaled lithographic processes, while also being programmable to multiple conductive levels. Investigation of materials and device physics can elucidate the phenomena underlying nonidealities such as fluctuations and drift, with the aim of developing new memory devices that are immune to parasitic effects. Besides device developments, the engineering of the memory cell configuration, such as the 1S1R or 1T1R structure, could drastically reduce the operating current, with strong advantages in terms of lower energy consumption, lower IR drop, and higher area efficiency of the IMC system. In summary, developments at the device level would boost IMC performance in terms of information density, throughput, and area and energy efficiency.
The second direction to be explored is the study of computing architectures and their interplay with the workload. To maximize system performance, computing parallelism should be maximized to prevent multiplexing of the readout chain. This is usually challenging, since peripheral circuits consume the largest portion of the energy and area budget. However, these limitations could be relaxed by an accurate co-design of the hardware and the neural network. On the one hand, IMC circuits must be designed specifically for an application, thus avoiding unnecessary features or excessive precision, for instance by reducing the ADC quantization or implementing simplified activation functions. On the other hand, given a target application, the neural network can be customized to adopt the features that are suitable for IMC acceleration, such as low-level quantization or hardware-aware training procedures. Finally, an electronic design automation (EDA) toolchain is needed to bridge the gap between the end user and the hardware system, ranging from application-specific, high-level-of-abstraction design tools [186] to dedicated compilers [187], [188], [189] performing low-level core optimization in real-world implementations, similarly to existing CPU- and GPU-based computing systems.

VII. CONCLUSION
This work provides an overview of memory devices and circuit topologies for IMC-based acceleration of machine learning and deep learning. Among the various applications, a particular focus is given to the IMC acceleration of DNN inference, for which various approaches are presented and discussed, considering the circuit overheads and the parasitic effects affecting the final accuracy. IMC is a potentially disruptive paradigm shift, in terms of both architectural change and raw computing performance. Further research on memory device engineering and understanding, as well as on the hardware-network synergy, could eventually unleash the full potential of IMC.

FIGURE 1 .
FIGURE 1. Several examples of CPU-memory integration. (a) Von Neumann architecture, in which CPU and memory are separated and connected through a high-bandwidth bus, (b) near-memory computing, which features the embedding of a nonvolatile memory on the same silicon as the CPU, for increased bandwidth and reduced data transfers, (c) SRAM-based in-memory computing, in which the computation is performed directly in the SRAM memory array, and (d) eNVM-based in-memory computing, which features the integration of a high-density memory allowing both parameter storage and calculation.

FIGURE 4 .
FIGURE 4. A crosspoint memory array based on resistive memories can perform matrix-vector multiplication directly in situ, by means of Ohm's law and Kirchhoff's current law. By applying a voltage vector at the columns, the analog conductive elements produce a current that is collected at the rows, conveniently biased at 0 V. The resulting output current vector is the multiplication of the conductance matrix G with the voltage vector V.

FIGURE 5 .
FIGURE 5. Various implementations of IMC crosspoint accelerators of MVM. (a) Resistive crosspoint array (1R). (b) Array with one-transistor/one-resistor (1T1R) configuration. The transistor prevents sneak-path currents during the programming phase and allows finer current control. (c) Array with one-selector/one-resistor (1S1R) configuration. The highly non-linear selector prevents sneak-path currents while maintaining a small cell footprint. (d) Capacitive crosspoint array, composed of memory elements whose small-signal capacitance can be programmed. (e) Temporal encoding of the input vector through gate voltage pulses whose widths represent the input signals. Integration is required to collect the transient currents. (f) MVM through resistance summation. An XNOR-Multiply is performed by the 2T2R cell, which activates the path corresponding to the multiplication result. The series connection of the resistive paths inherently performs the accumulation.

FIGURE 6 .
FIGURE 6. Examples of applications that benefit from IMC matrix-vector multiplication. Depending on the update-frequency requirements and the noise sensitivity of the application, each hardware solution should combine memory devices with specific physical properties and adequate peripheral circuits. For instance, applications that rely on one-time programming of weight values after ex situ software-based training (e.g., DNN inference, CL-IMC, and DCT) can trade off the need for accurate tuning algorithms against less stringent requirements on the cycling endurance of the device itself. On the other hand, applications that demand frequent and continuous updates of the conductance matrix (e.g., DNN training) require efficient gradual programming and endurance capabilities of the adopted memory device. Image "Pillars of Creation" from the James Webb Space Telescope gallery [110].

FIGURE 9 .
FIGURE 9. Error-resilient implementation of an XNOR in a fully digital accelerator, based on a differential 2T2R RRAM cell. The activation signal A connects the cell to the sense amplifier in a straight or crossed path, allowing the sense amplifier to compare the two resistive states. Adapted with permission from [128].

FIGURE 10 .
FIGURE 10. Implementation of an XNOR with an 8T1C SRAM cell in a mixed digital-analog accelerator. Analog accumulation is performed by charge sharing on a shared bitline. Reprinted with permission from [145].

FIGURE 11 .
FIGURE 11. Possible implementations of the computing core in a fully analog approach based on the 1T1R configuration, relying on current and charge accumulation, respectively. Reprinted with permission from [155].

FIGURE 12 .
FIGURE 12. Examples of memory nonidealities. (a) Parasitic resistance along wire connections, responsible for the IR drop, (b) programming variability in multilevel programming, (c) conductance drift affecting the cells, and (d) Gaussian noise measured during the readout of an RRAM cell. Reprinted with permission from [166], [167], [168].

FIGURE 13 .
FIGURE 13. (a) Correlation between the conductance value G and its standard deviation σ_G for a given technology. (b) Simulated relative current error of the MVM product as a function of the matrix size. Device parameters were extracted from [79], [166], [184], [185].