Photonic Multiply-Accumulate Operations for Neural Networks

It has long been known that photonic communication can alleviate the data movement bottlenecks that plague conventional microelectronic processors. More recently, there has also been interest in its capabilities to implement low precision linear operations, such as matrix multiplications, quickly and efficiently. We characterize the performance of photonic and electronic hardware underlying neural network models using multiply-accumulate operations. First, we investigate the limits of analog electronic crossbar arrays and on-chip photonic linear computing systems. Photonic processors are shown to have advantages in the limit of large processor sizes (<inline-formula><tex-math notation="LaTeX">${>}\text{100}\; \mu$</tex-math></inline-formula>m), large vector sizes (<inline-formula><tex-math notation="LaTeX">$N > 500$</tex-math></inline-formula>), and low noise precision (<inline-formula><tex-math notation="LaTeX">${\leq} 4$</tex-math></inline-formula> bits). We discuss several proposed tunable photonic MAC systems, and provide a concrete comparison between deep learning and photonic hardware using several empirically-validated device and system models. We show significant potential improvements over digital electronics in energy (<inline-formula><tex-math notation="LaTeX">${>}10^2$</tex-math></inline-formula>), speed (<inline-formula><tex-math notation="LaTeX">${>}10^3$</tex-math></inline-formula>), and compute density (<inline-formula><tex-math notation="LaTeX">${>}10^2$</tex-math></inline-formula>).

and crosstalk between one another compared to their electrical counterparts.
Photonic technology has traditionally been used for long distance communication. However, modern bandwidth requirements and the standardization of silicon photonic integrated circuits (PICs) have led to the proliferation of shorter distance photonic links. For example, silicon photonic transceivers are now a pervasive component in data-centers. In addition, the efficiency of a photonic link, which is dominated by the E/O and O/E conversion costs between the electrical and photonic domains, is rapidly encroaching on the efficiency of electronic links: the cost to move data photonically between nodes at a data-center (∼1 pJ/bit [1]) is now within order unity of the cost to move data from a modern DRAM memory stack to a processor [2].
At the same time, there has been a substantial increase in the use of many-core parallel processing systems for a variety of tasks in high performance computing (HPC). Artificial intelligence (AI), in particular, is growing at an alarming pace: deep learning models have been doubling in size every 3.5 months, far outpacing Moore's law [3]. These systems have much greater communication overheads than classical von Neumann architectures such as CPUs, resulting in a dramatic increase of both the area and energy consumption of metal interconnects (see, for example, Ref. [4]). They are also bottlenecked computationally by the ability to perform matrix multiplications efficiently, which represent the most common operations in HPC.
The most computationally expensive task in current AI models is the implementation of neural networks. Current deep learning models require dense, low-precision matrix computations. Digital instantiations of matrix (or tensor) units typically suffer from high communication overheads, expensive digital operations, and high latencies. On the other hand, photonic linear operations, such as passive Fourier transforms [5] or matrix operations [6], exhibit stark advantages in bandwidth density, latency, and energy. As mentioned in [7], [8], photonic computations are passive, exhibiting favorable energy scaling costs which are potentially $O(N)$ for $O(N^2)$ fixed point operations. Photonic matrix multiplication occurs in a single step, only bottlenecked by the periphery of modulation and detection. A more surprising observation is the computational density of such an approach: despite the large sizes of photonic devices, such systems can deliver more operations per second in a given area than those in digital electronics.
This manuscript analyzes the merits of using photonics for simulating neural networks. We begin by exploring the implementation of multiply-accumulate operations (which take the form $a \leftarrow a + w \cdot x$) in various platforms in Section III, discussing the costs and benefits of digital electronics, analog electronics, and photonics. We provide a comparison of the fundamental limits of electronic crossbar arrays and photonic linear computing systems in Section IV, and analyze the performance of these models across metrics such as energy, speed, and computational density. We consider the general performance of photonic MACs along these metrics based on practical devices that are compatible with large-scale silicon photonic foundries. In the last section, we provide a concrete comparison between fully-tunable neuromorphic photonic networks, based on known photonic device models and principles, and electronic state-of-the-art deep learning chips.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. MULTIPLY-ACCUMULATE OPERATIONS
The multiply-accumulate (MAC) operation calculates the product of two numbers and adds the result to an accumulator. For a given accumulation variable $a$ and modified state $a'$, the operation takes the following form:
$$a' \leftarrow a + (w \times x) \qquad (1)$$
MACs are constituents of a number of linear mathematical operations, including dot products, matrix multiplications, Fourier transforms, and convolutions. MACs have traditionally characterized the performance of digital signal processing (DSP) applications [9], [10], but have become increasingly prominent in modern HPC. We are most interested in a specific use case: the simulation of neural network models. AI applications typically divide into training, in which models learn to understand a data set, and inference, in which trained models are deployed on new data to draw conclusions or extract information.
For a set of input variables $x_i$ and output variables $y_j$, each node $j$ (or neuron) receives signals from a large number $M$ of other nodes $i$. The inputs are combined via a weighted sum of the form $y_j = \sum_i w_{ij} x_i$. The input to the next layer, $x_j$, is obtained by passing $y_j$ through a nonlinear function:
$$x_j = f\{y_j\} \qquad (2)$$
The function $f\{x\}$ can represent any nonlinear operation (e.g., ReLUs, spiking neurons, pooling, etc.), and can be simulated in the analog or digital domains. The weighted sum can be broken down into a series of MAC operations of the form $a_i = a_{i-1} + w_i x_i$ for $i = 1 \ldots M$. Each neuron requires $M$ parallel MAC operations. Therefore, a neural network of size $N$ requires $M \times N$ MAC operations per time step, or one operation per synapse. In a fully interconnected network with $N$ nodes ($M = N$ case), the number of MAC operations required per time step $\Delta t$ (or per characteristic time constant $\tau$ in analog hardware) is $N^2$. The nonlinear function $f\{x\}$ can also consume energy, but since this operation scales with $O(N)$ rather than $O(N^2)$, it does not represent the most costly operation.
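The decomposition of a weighted sum into sequential MAC steps, and the resulting operation count, can be sketched in a few lines of Python. This is only an illustration of the bookkeeping described above; the function names `weighted_sum_via_macs` and `layer_mac_count` are hypothetical and do not come from the paper.

```python
def weighted_sum_via_macs(weights, inputs):
    """Compute sum_i w_i * x_i as a chain of MAC steps a <- a + w * x."""
    a = 0.0          # accumulator
    mac_count = 0    # number of MAC operations performed
    for w, x in zip(weights, inputs):
        a = a + w * x    # one multiply-accumulate step
        mac_count += 1
    return a, mac_count


def layer_mac_count(num_neurons, fan_in):
    """MACs per time step for num_neurons nodes, each receiving fan_in inputs."""
    return num_neurons * fan_in


y, macs = weighted_sum_via_macs([0.5, -1.0, 2.0], [1.0, 2.0, 3.0])
print(y, macs)                    # 4.5 3
print(layer_mac_count(100, 100))  # 10000, i.e. N^2 when M = N
```

For a fully interconnected network the count grows quadratically with the number of nodes, while the nonlinearity is applied only once per node.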
As the size of the network $N$ grows large, MACs become the most burdensome hardware bottlenecks in neural networks [11]. It is therefore no surprise that MACs are the most ubiquitous computations in deep learning hardware acceleration, both in training and in inference [12], [13]. Fig. 1 illustrates the signal pathway for a typical AI processor. Tensor or vector data that resides in memory is retrieved and sent to the processor, which performs MAC operations ($a \leftarrow a + (w \times x)$) and some other nonlinear functions (encompassed in $f\{x\}$) before the result is sent back to memory and stored. Although MACs constitute the majority of operations in AI, in practice, most of the energy is lost to data movement [14], [15]. Activations must be shuttled to and from various memory caches and buffers to the matrix multiplication units and back. The cost primarily comes from charging and discharging metal wires, which have a capacitance per unit length of around 100 aF/μm with charging energy proportional to $\sim CV^2$. Since the voltage $V$ is fixed by the fabrication node, conventional digital electronics must necessarily pay this cost [16] (see discussion in Section III-A).
Fig. 1. A typical signal pathway for a modern AI chip. Information is passed between memory caches and MAC processors performing $a + (w \times x)$ and nonlinear operations $f\{x\}$. Moving data (blue arrows) consumes the majority of the energy in current systems.

A. Data Movement
It is well known that photonics has the capability of greatly reducing the data movement problem that currently plagues electronic chips [17]-[20]. Optical loss is nearly negligible for intrachip distances (see Section III-C), so instead of paying an energy cost proportional to the length of each connection, photonic links pay the cost upfront, converting from the electrical domain to the photonic domain and back. Waveguides can thus beat metal wires in efficiency, provided that the cost of E/O/E conversion is less than that of charging a metal wire over the same distance.
It is not yet clear whether addressing the data movement problem alone is worthwhile: we still pay the E/O/E cost (∼0.1 pJ to 1 pJ [1]) communicating between cores, which is within order unity of the cost of each MAC operation in state-of-the-art AI chips (see Section VI). Instead, we can garner a larger advantage by using photonics for both data movement and MAC operations simultaneously, interfacing modulators and detectors in close proximity with both local memory banks and a photonic neural network processor. Photonic memory architectures have been studied in depth, having the potential for significant advantages over their electronic counterparts (see for example Refs. [21]-[23]). We focus primarily on the MAC processor in the pages that follow. A key advantage, in this case, is that the memory I/O cost is amortized over the operations performed in the processor. This can lead to significant energy savings, and ultimately, huge performance gains over digital systems.
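The break-even condition between a waveguide link and a metal wire described in this subsection can be estimated with a short calculation. The sketch below assumes a wire charging energy of roughly $CV^2$ with the ∼100 aF/μm capacitance quoted in Section II and the ∼0.1 pJ to 1 pJ E/O/E cost quoted above; the 1 V signaling voltage and the function names are illustrative assumptions, not values taken from the paper.

```python
def wire_energy_j(length_um, cap_per_um_f=100e-18, voltage_v=1.0):
    """Order-of-magnitude energy (~C V^2) to charge a wire of the given length."""
    return cap_per_um_f * length_um * voltage_v ** 2


def break_even_length_um(eoe_energy_j, cap_per_um_f=100e-18, voltage_v=1.0):
    """Wire length at which charging energy equals a fixed E/O/E conversion cost."""
    return eoe_energy_j / (cap_per_um_f * voltage_v ** 2)


# Assumed 1 V swing (hypothetical); E/O/E costs of 0.1 pJ and 1 pJ from the text.
for eoe in (0.1e-12, 1e-12):
    length_mm = break_even_length_um(eoe) / 1000.0
    print(f"E/O/E = {eoe * 1e12:.1f} pJ -> break-even near {length_mm:.0f} mm")
```

Under these assumptions the photonic link only pays off for connections on the order of millimeters or longer, which is consistent with the argument that the conversion cost should be amortized over many operations.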

B. Precision
Analog operations are far more resolution limited than standard floating-point operations. For example, representing a 16-bit value on an optical signal at minimum requires detecting $2^{32}$ photons per time step to stay above the shot noise limit, which, at typical telecommunications wavelengths ($\lambda \sim 1.55$ μm), puts us above the energy consumption of current digital processors (∼550 pJ per sample, leading to >1 pJ/MAC, see Table II). Since analog systems use physical representations of real numbers, they lack the dynamic range to represent different exponents. Their operations are equivalent to fixed point, in which the exponent is fixed during computation.
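The ∼550 pJ figure above can be reproduced with a short calculation. The sketch assumes that a shot-noise-limited signal of $n$ photons has an amplitude signal-to-noise ratio of $\sqrt{n}$, so that resolving $N_b$ bits needs $2^{2 N_b}$ photons, which matches the $2^{32}$ photons quoted for 16 bits. The function names are hypothetical.

```python
PLANCK_J_S = 6.62607015e-34    # Planck constant (J s)
LIGHT_SPEED_M_S = 2.99792458e8 # speed of light in vacuum (m/s)


def photons_for_bits(n_bits):
    """Photons per sample so that sqrt(n) reaches 2**n_bits (shot-noise assumption)."""
    return 2 ** (2 * n_bits)


def optical_energy_per_sample_j(n_bits, wavelength_m=1.55e-6):
    """Optical energy per sample: photon count times photon energy h*c/lambda."""
    photon_energy_j = PLANCK_J_S * LIGHT_SPEED_M_S / wavelength_m
    return photons_for_bits(n_bits) * photon_energy_j


print(photons_for_bits(16))                               # 4294967296
print(f"{optical_energy_per_sample_j(16) * 1e12:.0f} pJ")  # about 550 pJ
print(f"{optical_energy_per_sample_j(4) * 1e18:.0f} aJ")   # tens of attojoules
```

The exponential dependence on bit depth is the reason low-precision operation matters so much for analog photonic hardware.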
Thankfully, empirical research has shown that neural networks can operate effectively with both low precision and fixed point operations. Inference models work nearly as well with 4-8 bits of precision in both activations and weights (sometimes even down to 1-2 bits [24], [25]), and training works with roughly 8-16 bits of precision per computation [26], [27]. Training can even work with binary weight evaluations, as long as high resolution stored weights are applied stochastically during training [28]. There is also evidence that fixed point arithmetic within the matrix core is effective for both inference [27] and training [29]. This puts deep learning in range of analog photonic processing, which has been shown to exhibit a tuning accuracy of more than 4 bits [8], [30], [31]. However, many of these studies have focused on quantized precision, in which signals are resolved deterministically via a set of threshold values. Analog systems are far more stochastic, with both unbiased noise from the signal pathway and biased noise from fabrication variation. In the digital domain, there are strict conditions on the number of bit errors that systems can handle (typically, we want SNR ∼ 10 dB for a digital channel with forward error correction [32]). The degree of noise or fault tolerance can vary significantly across different neural network models [33], but interestingly, such models can be made robust via proper construction and training [34]-[37]. In some cases, unbiased noise added during training results in a more robust model, effectively acting as a form of regularization [38]. The resulting network becomes more noise-tolerant with an accuracy that is equivalent to a network trained without noise [39].
In practice, noise levels can also approach deterministic precision thresholds: for example, stochastic rounding across signals and weights has many theoretical advantages [40], which is effectively similar to setting the SNR ∼ 0 dB relative to the quantization level. In this sense, robustly constructed neural networks can operate with far more noise than standard digital links.
For the remainder of this manuscript, we characterize our precision with respect to the analog noise in each channel. We define a parameter $\mathrm{SNR} \equiv 2^{N_b}$, where $N_b$ represents the number of bits of noise precision for a given computation. We will also define a parameter, $\rho$ (see Ref. [41] for an equivalent definition), which represents the loss of precision in the analog domain from the digital domain. For $\rho = N$, we have fixed point arithmetic, in which the precision is only defined with respect to the dynamic range of the output after summation $\sum_i x_i w_i$. This leads to scaling advantages, as discussed in Sections IV and VI. For $\rho = 1$, we guarantee every output $w_i x_i$ maintains full precision $N_b$, even if the weight $w_i$ is small. We can also have $1 < \rho < N$, where the desired precision is in some way dependent on the amplitude of the signal: $\rho = \sqrt{N}$ represents an interesting case, guaranteeing that the precision of a signal in a prior layer maintains the same precision in the next layer after a $1/N$ fan-out loss. Importantly, we will consider only the fixed point case covering the full dynamic range of the output ($\rho = N$), since it leads to great efficiency in the analog domain.

C. Compute Density
Throughout this manuscript, we define a figure-of-merit that we can use to compare various architectures with one another. This metric (previously used to benchmark power-performant floating point operations in digital electronics [42]) will be referred to as compute density, and is defined as follows:
$$\text{Compute density} \equiv \frac{\text{MAC operations per second}}{\text{area}} \qquad (3)$$
Compute density is related to several other well established metrics. For example, since it is limited by the ability to communicate across each MAC unit, it is upper bounded by bandwidth density (bits/s/mm$^2$ in Ref. [18]). It is also affected by energy efficiency, since we must keep our system within a reasonable power density (<1 W/mm$^2$ [42]) to prevent thermal runaway. We analyze these limits in Section IV.
There are a number of reasons why compute density is useful, particularly when we are comparing different kinds of architectures that may multiplex signals differently or run at vastly different clock rates. When we look at crossbar arrays (such as Ref. [43]) or digital matrix configurations such as systolic arrays [12], [44], there are well defined notions of MAC area, MAC density, memory density, and speed. However, digital architectures that use time multiplexing strategies (e.g., TrueNorth [45]), or photonic strategies that could use either time or wavelength multiplexing (e.g., those described in Sections IV and V), do not necessarily have the same clear definitions, because there are many more MACs being implemented than the number of physical units. The result depends on whether we consider these "virtual" MACs as part of the density calculation or not, which can complicate our comparisons.
Defining a compute density metric remedies these ambiguities, providing a grounded way to speak about processing power that is relatively invariant to the multiplexing or channelization strategy. We will also see that, like bandwidth density, it is not necessary to define how we divide the spectrum up into independent channels in order to talk intelligently about the limits of compute density. And ultimately, we are interested in the total amount of computational power (op/s) that a given system can exhibit. Microprocessor areas are fairly invariant: they tend to occupy hundreds of mm$^2$ because of limits in cost, yield, and reticle size. From this perspective, compute density also acts as a measure of the compute power of a microprocessor that uses a given architecture, since chipsets likely occupy areas that are within order unity.
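The thermal bound mentioned at the start of this subsection can be made concrete with a one-line estimate: if every MAC dissipates a fixed energy, a power density budget caps the number of MACs per second per unit area. The sketch below assumes compute density is counted in MAC/s/mm$^2$ and uses the <1 W/mm$^2$ budget from the text together with an illustrative ∼1 pJ per MAC; the function name is hypothetical.

```python
def max_compute_density(power_density_w_per_mm2, energy_per_mac_j):
    """Upper bound on MAC/s/mm^2 when each MAC dissipates energy_per_mac_j."""
    return power_density_w_per_mm2 / energy_per_mac_j


# 1 W/mm^2 budget with an assumed ~1 pJ per MAC (illustrative value).
bound = max_compute_density(1.0, 1e-12)
print(f"{bound:.1e} MAC/s/mm^2")  # on the order of 1e12
```

Lowering the energy per operation by some factor raises this thermally limited bound by the same factor, which is why energy efficiency and compute density are discussed together.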

III. PHYSICAL IMPLEMENTATIONS OF NEURAL NETWORK HARDWARE
In order to compare electronic and photonic processing with one another, we will use the multiply-accumulate (MAC) operation, defined in Section II. Below, we explore the advantages and disadvantages of implementing these operations in digital microelectronics, analog electronics, and photonics.
A. Electronic Implementations
1) Digital Electronics: Conventional digital computers are based on the von Neumann architecture [46] (also called the Princeton architecture), and are typically implemented in silicon microelectronics. They include a memory bank that stores both data and instructions, and a central processing unit (CPU) that performs nonlinear operations. Instructions and data stored in the memory unit lie behind a shared multiplexed bus, which means that both cannot be accessed simultaneously. This leads to the well known von Neumann bottleneck [47], which fundamentally limits the performance of the system, a problem that is aggravated as processors run memory-bound algorithms. Nonetheless, this computing paradigm has dominated for over 60 years, driven in part by the continual progress dictated by Moore's law for CPU scaling (the number of transistors that can be put on a microchip doubles every 18 to 24 months [48]) and Koomey's law (the number of computations per joule of energy dissipated doubles approximately every 1.57 years [49]).
These limitations have led to the massive parallelization and specialization of hardware architectures [50]. CPUs used to be the most common choice for most applications, but in recent years, many-core architectures such as GPUs and FPGAs have expanded to encompass general purpose tasks in the high performance computing arena. Concurrently, specialized ASICs are becoming increasingly popular for the implementation of artificial intelligence algorithms, which require low precision, high density matrix computations; a notable example is Google's Tensor Processing Unit (TPU) [12]. Although parallelization can break down tasks that are highly distributable [51], the performance of this operation eventually leads to diminishing returns as a result of Amdahl's law [52]. As a separate issue, I/O latency and sequential processing capabilities cannot exceed the time resolution of the processor itself, which is ultimately bounded by its clock rate. Even MAC units need to serialize the summands to perform weighted addition (Fig. 2(a)).
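The diminishing returns attributed to Amdahl's law above can be illustrated numerically. The sketch uses the standard form of the law, speedup $= 1 / ((1 - p) + p/n)$ for a parallelizable fraction $p$ on $n$ processors; the 95% parallel fraction is an arbitrary illustrative choice, not a figure from the paper.

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Speedup predicted by Amdahl's law for a given parallelizable fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)


# With 95% of the work parallelizable (hypothetical), speedup saturates below 20x.
for n in (1, 16, 1024, 65536):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

However many processors are added, the serial fraction sets a ceiling of $1/(1 - p)$ on the achievable speedup.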
Although digital microelectronics continue to increase in performance as lower nodes are introduced, an increasing number of practical barriers are inhibiting the scaling of processing and energy densities. As an illustrative example, clock rates have saturated to around 500 MHz to 4 GHz [53], and chip designers have been forced to focus efforts on parallelism instead [54]. Attempting to drive processors faster, or with higher compute density, results in several runaway effects, including:
• Energy Consumption: The scalability of modern microprocessors is largely limited by power density, or energy dissipation per unit area (W/mm$^2$). There is a trade-off between bandwidth and energy consumption in electronic devices. Ideally, the energy lost is almost entirely due to capacitive discharging. Dynamic power scales according to:
$$P = \alpha_{0 \to 1} C V^2 f_s \qquad (4)$$
for node transition activity factor $\alpha_{0 \to 1}$, capacitance $C$, driving voltage $V$, and switching frequency $f_s$ [55]. However, at higher frequencies, secondary effects such as short circuit current and leakage become more pronounced, causing $\alpha_{0 \to 1}$ to decrease and $P$ to hit a floor value. Different material structures, device architectures, or higher driving voltages $V$ can offset these effects, but typically increase energy consumption. This, in turn, produces more heat, which can manifest in runaway thermal effects. These thermal limitations are often the dominant limiting factor for chip scalability [56]. The largest energy contribution originates from communication, which primarily involves charging and discharging many metal wires. Metal lines, like electronic devices, dissipate energy resistively (via Eq. (4)). In many processors with high communication overheads, such as FPGAs or deep learning chips, communication can easily occupy more than half the energy cost [57], [58]. As it stands, digital architectures are far from optimal: the energy cost of biological systems is estimated to be <1 aJ per MAC operation [11], which is six orders of magnitude lower than the ∼1 pJ per MAC operation of current state-of-the-art machine learning chips (see Fig. 7).
• Signal Bandwidth: Since interconnects are restricted by geometric constraints, microelectronic circuits typically rely on some form of temporal multiplexing for widespread, parallel data distribution between processors. For example, many neuromorphic architectures employ a digitization scheme called address event representation (AER) to communicate events between different neural processor cores [59], [60]. Unfortunately, electronic connections experience harsh trade-offs between bandwidth and interconnectivity. Signal bandwidth for both capacitive and inductive lines scales according to
$$B_l \propto \frac{A}{L^2}$$
for bandwidth $B_l$, cross sectional area $A$, and length $L$ [61], [62]. As a result, metal wires are typically limited to signals no faster than several gigahertz in frequency. Temporal multiplexing strategies lead to even harsher trade-offs, since multiplexing $N$ channels, each with channel bandwidth $B_c$, requires a total bandwidth of at least $B_l \geq N B_c$ per multiplexed line.
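The two scaling relations in the list above, the standard CMOS dynamic power relation $P = \alpha C V^2 f$ and the time-multiplexing requirement $B_l \geq N B_c$, can be evaluated for sample numbers. All numerical values below (activity factor, a 1 mm wire at 200 aF/μm, 1 V, 1 GHz, 100 channels at 100 MHz) are illustrative assumptions, and the function names are hypothetical.

```python
def dynamic_power_w(activity, capacitance_f, voltage_v, freq_hz):
    """Dynamic switching power P = alpha * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def tdm_line_bandwidth_hz(num_channels, channel_bandwidth_hz):
    """Minimum line bandwidth needed to time-multiplex N channels of bandwidth B_c."""
    return num_channels * channel_bandwidth_hz


# Hypothetical 1 mm wire: 1000 um * 200 aF/um = 200 fF, switched at 1 GHz.
power = dynamic_power_w(0.1, 200e-15, 1.0, 1e9)
print(f"{power * 1e6:.0f} uW per wire")

# Hypothetical 100 channels at 100 MHz each on one shared line.
print(f"{tdm_line_bandwidth_hz(100, 100e6) / 1e9:.0f} GHz required")
```

In this example the multiplexed line would need roughly 10 GHz, beyond the several-gigahertz range quoted above for metal wires.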

B. Analog Electronics: Spatial Multiplexing
One way to avoid digital bottlenecks is to use an analog networking configuration in which each connection is represented by a physical wire. Dense connections can be instantiated in space-efficient topologies such as crossbar arrays [63], [64]. Summation and multiplication can both be performed simultaneously using resistive elements together with Kirchhoff's current law (Fig. 2(b)). However, closely spaced wires also experience bandwidth-distance trade-offs. As an illustrative example, for a cluster of adjacent wires with pitch $P$, width $P/2$, thickness $T$, and length $L$, RC bandwidth scales according to [61]:
$$B \propto \frac{P\,T}{L^2}$$
This can become particularly problematic for large $L > 1$ mm, and is responsible for the enormous energy costs seen for off-chip communication in electronics. That being said, if $L$ is kept small, the bandwidth can actually be quite high and the energetic communication cost low [17]. One must be careful to confine the cores to a small area to keep the efficiency as high as possible (this point is discussed in more detail in Section IV). One of the primary difficulties of analog electronic arrays is finding a good linear and tunable resistive element: traditional transistors, optimized for digital operations, do not have the linear transconductance profiles to make this tenable. New materials or fabrication approaches are therefore a necessity in creating efficient analog electronic arrays. To this end, memristive devices have been explored quite extensively (see Refs. [43], [65], [66]) along with phase change memory (see Ref. [67] for a good review), which have yielded a number of interesting approaches for high-density storage and computing. For example, memristive memory now beats traditional flash memory in performance along many metrics, including density, reliability, speed, and endurance (see for example Ref. [68]).
Nonetheless, for tunable resistive elements to take full advantage of the possibilities that crossbar arrays have to offer (as discussed in Section IV), we need to see additional performance improvements, and there needs to be a low-cost way to integrate them into standard fabrication processes.
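The way a resistive crossbar performs multiplication and summation at once can be sketched with an idealized model: each element contributes a current $G_{ij} V_i$ by Ohm's law, and each output line sums its contributions by Kirchhoff's current law. The sketch assumes ideal conditions (output lines held at virtual ground, no wire resistance or capacitance); the function name and the conductance values are hypothetical.

```python
def crossbar_output_currents(conductances, input_voltages):
    """Ideal crossbar: I_j = sum_i G[i][j] * V[i] (Ohm's law plus current summation)."""
    num_rows = len(conductances)
    num_cols = len(conductances[0])
    currents = [0.0] * num_cols
    for i in range(num_rows):
        for j in range(num_cols):
            # Each resistive element performs one analog multiply-accumulate.
            currents[j] += conductances[i][j] * input_voltages[i]
    return currents


# Hypothetical 2x2 array: conductances in siemens, inputs in volts.
G = [[1e-6, 2e-6],
     [3e-6, 4e-6]]
V = [0.5, 0.25]
print(crossbar_output_currents(G, V))  # output currents in amperes
```

All $N^2$ products are formed in parallel by the physics of the array, so the weights are set by the programmed conductances rather than fetched from memory.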

C. Photonic Implementations
Photonic signals can support much greater bandwidth densities and consume less energy for longer distances than their electrical counterparts [16]. This has motivated the development of fiber optic technology in telecommunication networks and now, interconnections in datacenters and processors [62]. The advantages of photonics are especially relevant for systems with high communication or bandwidth overheads. There are several unique physical properties that allow optical signals to manifest these advantages:
• Bandwidth: Optical carrier waves possess different orthogonal features, including wavelength, spatial mode, and polarization, which do not interact with each other in passive devices. The total complex electric field $\mathbf{E}(x, y, z, t)$ in a waveguide or fiber optic communication channel can be described as a sum over every optical mode $m$, polarization $p$, and wavelength $n$:
$$\mathbf{E}(x, y, z, t) = \sum_{m, n, p} \hat{\mathbf{e}}_p \, A_{mp}(x, y) \, B_{mnp}(t') \, e^{i(\omega_n t - \beta_n z)}$$
for unit vector $\hat{\mathbf{e}}_p$, mode profile $A_{mp}$, time-dependent term $B_{mnp}$, angular frequency $\omega_n$, propagation vector $\beta_n$, $t' = t - z/v_g$, and group velocity $v_g$. Each term can be modulated independently via $B_{mnp}$ and, in the absence of interference, can be separated using linear photonic devices. The optical telecommunication band itself has $\Delta f \sim 5$ THz of spectral bandwidth, which provides approximately 5 Tb/s of information capacity for every mode and polarization.
• Impedance: In optical systems, one only needs to match the refractive index to prevent reflections. In addition, since electric/optical (E/O) and optical/electrical (O/E) conversion is an inherently quantum process, electric nodes which communicate using photonic edges need not be electrically impedance matched with one another [69]. This reduces many of the design constraints that typically limit microwave electronic circuits.
• Energy: Since photonic signals are not subject to Joule heating, waveguides and fibers can be designed with very low signal attenuation (e.g., <0.1 dB/cm [70] and <0.1 dB/m in some cases [71], [72]), allowing for communication costs that scale independently of distance. This allows for the propagation of higher power signals without the associated contribution to thermal runaway. In addition, communication or computations in the optical domain could be performed with minimal or theoretically even zero energy consumption, especially for linear or unitary operations.
In addition to these physical benefits, there are also practical ones. While there has been research on photonic integration for some time, in the past five years there has been a paradigm shift in photonic integration that could garner the manufacturing benefits enjoyed by digital microelectronics [73], [74], namely:
• Performance: Shrinking devices reduces their energy requirements, and allows for continuous performance scaling. Furthermore, the high yields attainable only in foundries enable the fabrication of complex photonic systems.
• Economics: The presence of large markets driving silicon photonics (e.g., data-center transceivers) enables economies of scale in production, amortizing the cost of fabrication and packaging.
• Standardization: Every foundry line has a standard library of heavily optimized device designs, through which smaller enterprises can effectively utilize the fruits of millions of dollars worth of industrial research.
Silicon photonics offers a combination of foundry compatibility, device compactness, and cost that enables the creation of scalable photonic systems on chip. Its heavy use for data-center transceivers has led to a decrease in overall packaging costs. Of course, the industry is still new, so photonic chips are not without their challenges. A prime example is that tunable photonic devices are currently energetically expensive: microring resonators and phase shifters currently use heaters for coarse tuning, which can consume significant energy. This point is discussed more in Section V-A.
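To give a sense of scale for the attenuation figures quoted in the Energy item of this subsection, the fraction of optical power that survives a given propagation distance follows directly from the definition of the decibel. The 2 cm path length below is an illustrative assumption for an on-chip route, and the function name is hypothetical.

```python
def transmitted_fraction(loss_db_per_cm, length_cm):
    """Fraction of optical power remaining after propagating length_cm."""
    total_loss_db = loss_db_per_cm * length_cm
    return 10 ** (-total_loss_db / 10)


# 0.1 dB/cm waveguide loss over an assumed 2 cm on-chip path.
fraction = transmitted_fraction(0.1, 2.0)
print(f"{fraction * 100:.1f}% of the power is transmitted")
```

At these loss levels only a few percent of the power is lost across chip-scale distances, which is why the link cost is dominated by conversion rather than by propagation.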

IV. ANALOG MATRIX MULTIPLICATION: A COMPARISON BETWEEN PHOTONICS AND ELECTRONICS
It is clear that analog computing in both the electronic and photonic domains offers many advantages over digital microelectronics. So which one will win? To get a better sense of their performance bounds, we will compare an electronic crossbar array (the most common architecture for devices in Section III-B) with a hypothetical dense photonic matrix core, in which MACs are performed using a resistive approach in electronics and a passive linear approach in photonics. Inputs for the electronic core are analog voltages and currents, whereas the inputs and outputs for the photonic core are optically multiplexed signals with analog light intensities.
We use an example of performing a single, square matrix-vector operation, consisting of $N$ input channels and $N$ output channels ($N^2$ MAC operations) with a fixed preconfigured matrix. We implicitly assume that there is a set of devices that can fully tune resistance or optical loss locally and selectively without a significant quiescent power overhead. A schematic of these models is shown in Fig. 3.

A. Bandwidth Density
We first consider how bandwidth density limits the overall compute density (see Section II-C) of each approach. A given compute core must simultaneously address both processing within the core (i.e., an efficient implementation of a MAC operation $a \leftarrow a + w \times x$) and data movement across the core (i.e., each MAC operation requires a result from a prior MAC unit in order to perform a full dot product $\sum_i w_i x_i$ at the end of each row). As we will see, the data movement constraint can bound the performance of each of the cores.
We assume that there is a tunable, resistive element at the interface between metal crossbars, and each tunable element emulates a simple resistor associated with a fixed weight $w$. Kirchhoff's current law performs the summation $\sum_i w_i x_i$, with the weights within each matrix determined by the relative resistance values along each wire. A standard formula for assessing the bandwidth of on-chip metal interconnects is
$$B_l \leq B_0 \frac{A}{L^2}$$
for a constant $B_0$ ($\sim 10^{16}$ s$^{-1}$ for on-chip RC interconnects [62]), cross sectional area $A$, and length $L$ of the wire. Extending this analysis to crossbars, we make the simple observation that the area occupied by each resistive element is approximately equivalent to the cross-sectional wiring area $A$ in two dimensions. Computing over an $N \times N$ matrix multiply array with $L = N P_l$ for crossbar line pitch $P_l$, this gives us our bandwidth-limited electronic compute density $D_E$:
$$D_E = \frac{B_0}{L^2} = \frac{B_0}{N^2 P_l^2}$$
in units of 1/s/mm$^2$.
In the optical domain, each waveguide has an intrinsic bandwidth $B_O$ upper bounded by the speed of the wave itself: for standard telecommunications wavelengths (1550 nm), this upper bound is in the range of $B_O \sim 100 \times 10^{12}$ s$^{-1}$ for multiplexed signals (from $f = 193$ THz), but more realistically ∼5 THz for WDM-multiplexed systems in the 1.3 μm or 1.55 μm wavelength bands. Photonic waveguides are limited by the evanescent field coupling overlap between adjacent modes, which is a function of the wavelength of light. We can thus derive a minimum pitch $P_\lambda$ between waveguides. This leads to a maximum bandwidth-limited photonic compute density $D_O$ of:
$$D_O = \frac{B_O}{P_\lambda^2}$$
There is a critical difference here: electronic crossbars decrease in bandwidth density as the size of the crossbar ($L^2$) grows larger, whereas photonic systems maintain their density, independent of size. For fairly reasonable values based on the gain bandwidths in typical III-V devices and preventing crosstalk between waveguides ($B_O \sim 3 \times 10^{12}$ bits/s, $P_\lambda^2 \sim 2$ μm$^2$), the crossover point at which $D_O > D_E$ occurs near $L > 100$ μm. Put another way, photonics is expected to exhibit a greater on-chip bandwidth density limit than electronics for cores that occupy more than $L^2 > 0.01$ mm$^2$ of area.
There are a number of factors that this analysis did not take into account. Channel crosstalk becomes a bigger problem for electronic systems, but it can be greatly reduced by placing an isolating ground wire between each pair of signal wires, which keeps the bandwidth density within order unity of the estimate above. Also, both optical and metal crossbar arrays can be scaled vertically using 3D stacking technology (see [75] for the optical case), and optical waveguides can also include mode multiplexing, which may shrink the effective pitch $P_\lambda$. Nonetheless, the analysis above provides a good first-principles look at the bandwidth density, and shows that both approaches are capable of enormous compute densities, with photonics winning in the large $L$ limit.
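The crossover argument above can be checked numerically with a minimal sketch. It assumes the forms $D_E = B_0/L^2$ and $D_O = B_O/P_\lambda^2$, which are our reading of the scaling described in the text rather than equations quoted from it, and it assumes an RC-line constant $B_0 \sim 10^{16}$ bit/s, a value that is not stated above. $B_O$ and $P_\lambda^2$ are the values quoted in the text.

```python
import math

# ASSUMPTION: B_0 is not given in the text; ~1e16 bit/s is an assumed
# constant for on-chip RC lines. B_O and P_lambda^2 are quoted in the text.
B_0 = 1e16             # bit/s, assumed RC interconnect constant
B_O = 3e12             # bit/s, optical bandwidth per waveguide
P_LAMBDA_SQ_UM2 = 2.0  # um^2, squared minimum waveguide pitch


def electronic_density(length_um, b0=B_0):
    """Assumed form D_E = B_0 / L^2, in ops/s/um^2."""
    return b0 / length_um ** 2


def photonic_density(b_o=B_O, p_sq_um2=P_LAMBDA_SQ_UM2):
    """Assumed form D_O = B_O / P_lambda^2, in ops/s/um^2."""
    return b_o / p_sq_um2


def crossover_length_um(b0=B_0, b_o=B_O, p_sq_um2=P_LAMBDA_SQ_UM2):
    """Core side length L at which D_O equals D_E."""
    return math.sqrt(b0 * p_sq_um2 / b_o)


L_cross = crossover_length_um()
print(f"crossover side length ~ {L_cross:.0f} um")
```

Under these assumptions the crossover falls at a side length of order 100 μm, consistent with the estimate in the text. The result scales as the square root of the assumed $B_0$, so it should be read as an order-of-magnitude figure only.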

B. Switching & Driving Energy
Here, we consider contributions from the driving energy (i.e., the amplitude of the signals required to drive any output circuitry) and the capacitive switching energy for both analog electronic and photonic cores. We will assume that the input and output voltages are compatible with transistors, restricting values to ∼0.5 V or larger to prevent thermal leakage (see discussion in Ref. [16]).
Given this voltage condition, the main way through which electronic crossbar arrays lose energy is capacitive discharge across the array. Note that it may also be possible to make the array appear purely resistive in dissipation using, for example, inductors to cancel out the capacitance at a given frequency. This case is not considered here. The energy lost per cycle depends on the capacitance $C$ of the array and the line voltage $V_l$. To arrive at a per-operation metric, we consider the contribution of charging each group of metal wires surrounding each resistive element: for a wire pitch $P_l$, this is $L = 2P_l$. Given standard capacitances of about $c_l = 200$ aF/μm [62] and discharging according to $\frac{1}{2} C_l V_r^2$ for total capacitance $C_l = 2 c_l P_l$, our energy consumption becomes $E_{MAC(E)} = c_l P_l V_r^2$ per operation. For a standard line pitch $P_l \sim 80$ nm and $V_r \sim 0.5$ V, we arrive at $E_{MAC(E)} \sim 4$ aJ. This is quite low, and may be brought lower if advanced techniques are employed to reduce this pitch (e.g., $P_l \sim 12$ nm in Ref. [76]).
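As a quick numerical check of the capacitive estimate above, the following sketch evaluates $\frac{1}{2} C_l V_r^2$ with $C_l = 2 c_l P_l$ using the quoted values. The function name is ours and purely illustrative.

```python
C_PER_LENGTH = 200e-18 / 1e-6  # F/m, i.e. 200 aF/um as quoted in the text


def electronic_mac_energy(pitch_m, v_r, c_per_length=C_PER_LENGTH):
    """Energy per MAC from discharging the wires around one element.

    C_l = 2 * c_l * P_l is the capacitance of the wire length L = 2 * P_l,
    and the energy lost is (1/2) * C_l * V_r^2.
    """
    c_l = 2.0 * c_per_length * pitch_m
    return 0.5 * c_l * v_r ** 2


e_80nm = electronic_mac_energy(80e-9, 0.5)  # standard line pitch
e_12nm = electronic_mac_energy(12e-9, 0.5)  # aggressive pitch from Ref. [76]
print(f"80 nm pitch: {e_80nm * 1e18:.2f} aJ/MAC")
print(f"12 nm pitch: {e_12nm * 1e18:.2f} aJ/MAC")
```

The 80 nm case reproduces the ∼4 aJ figure, and the energy falls linearly with pitch at fixed voltage.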
The optical case has a potential scaling advantage, because metal wires need not be charged at each junction. In particular, photonics only requires charging $N$ detectors for $N^2$ operations. However, we must generate enough light to drive each detector with sufficient charge, which can be significantly limiting [62]. This depends on the amount of light that each detector receives, which can be affected by the precision loss $\rho$. For example, in a conservative estimate, a given signal in an $N \times N$ matrix is split to $1/N$, and we must multiply our light power by $N$ to make up for the loss if we are to maintain the same input precision ($\rho = \sqrt{N}$). In a better case (i.e., fixed point arithmetic with $\rho = N$), we care less about the individual signal and more about the full dynamic range of the output. For some power $P_L$ driving a laser with efficiency $\eta_L$, some loss through the optical system $\eta_{wg}$, and detection efficiency $\eta_d$, the current we see at the detector is $I_d = \eta_L \eta_{wg} \eta_d P_L / E_{ph}$ for photon energy $E_{ph} = h\nu$. Lumping the efficiencies into a single quantum efficiency $\eta = \eta_L \eta_{wg} \eta_d$, this gives us a minimum energy that depends on the photon energy $h\nu$ and the elementary charge $e$. Note that we also have capacitive discharge from the detector, scaling according to $(1/N) \cdot \frac{1}{2} C V_r^2$ per operation, although it typically has a smaller effect on the energy consumption than the driving condition above.
If we consider a deep learning framework compatible with fixed point arithmetic ($\rho = N$), we see that, unlike in the electronic case, the capacitive charging scales with $N$ rather than $N^2$. Choosing a high performance detector with $C_d \sim 1$ fF [77] and $V_r = 0.5$ V (bringing the optical link energy to <500 aJ, see Ref. [16] for further discussion), and assuming a fairly efficient laser source ($\eta = 0.2$), we start to see a difference around $N > 500$, as shown in Fig. 4. We once again observe optical matrix multiplication cores gaining an advantage as the matrix becomes larger; in this case, we have a direct dependence on the $N \times N$ matrix size. Note that the single digit aJ/MAC bound is still a factor of $1 \times 10^5$ below current state-of-the-art technologies (which are >100 fJ/MAC, see Section VI), so it is far from the limits we are seeing in the near term. Nonetheless, it is clear that both approaches have the potential for very low energy operations, with photonics exhibiting a greater overall advantage in the large $N$ limit for capacitively-limited arrays and fixed point operations.
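The crossover near $N \sim 500$ can be reproduced with a short sketch. It assumes that, for $\rho = N$, the optical drive energy per MAC is the photon energy needed to deposit the charge $C_d V_r$ on one detector, divided by the efficiency $\eta$ and shared over $N$ operations. This form is our reading of the argument above, not an equation quoted from the text, and the function names are illustrative.

```python
H = 6.62607015e-34     # Planck constant, J*s
C = 2.99792458e8       # speed of light, m/s
Q_E = 1.602176634e-19  # elementary charge, C


def photon_energy(wavelength_m=1.55e-6):
    """Photon energy h*nu at the given wavelength."""
    return H * C / wavelength_m


def optical_drive_energy_per_mac(n, c_d=1e-15, v_r=0.5, eta=0.2,
                                 wavelength_m=1.55e-6):
    """ASSUMED form for rho = N: (h*nu / e) * C_d * V_r / (eta * N)."""
    electrons_needed = c_d * v_r / Q_E
    return photon_energy(wavelength_m) * electrons_needed / (eta * n)


def crossover_n(e_electronic=4e-18, **kwargs):
    """Vector size N at which the optical drive energy matches e_electronic."""
    return optical_drive_energy_per_mac(1, **kwargs) / e_electronic


link_energy_ideal = optical_drive_energy_per_mac(1, eta=1.0)
n_cross = crossover_n()
print(f"ideal optical link energy ~ {link_energy_ideal * 1e18:.0f} aJ")
print(f"crossover vector size N ~ {n_cross:.0f}")
```

With the quoted $C_d$, $V_r$, and $\eta$, the ideal link energy stays below 500 aJ and the crossover against the ∼4 aJ electronic figure falls near $N = 500$.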

C. Noise
Noise affects analog precision during computations and has a strong effect on the energy consumption of each analog core. Reading values from a resistive crossbar with some SNR is fundamentally limited by thermal noise [41]. Using $\rho$ and $N_b$ as defined in Section II, this yields an expression for the energy per MAC operation that is proportional to the thermal noise factor $4 k_B T$. We again consider the case of full fixed point precision, where we define the precision with respect to the total output dynamic range (as discussed in Section II) and set $\rho = N$. Our MAC energy numbers become $E_{MAC(E)} \sim 4$ aJ$/N$ for 4-bit operations, and $E_{MAC(E)} \sim 1$ fJ$/N$ for 8-bit.
In the case of the optical matrix multiplier, we need to consider the noise on the E/O and O/E interfaces at the input and output. At the detector, the fundamental limit is shot noise, resulting from photon fluctuations in the incoming wave. Considering the total quantum efficiency $\eta$, we arrive at an expression analogous to the one above, but for shot noise, with the factor $2h\nu/\eta$ in place of $4 k_B T$. Using a fixed point representation ($\rho = N$) with an efficient laser ($\eta \sim 0.2$) in the C-band, this gives us 0.33 fJ$/N$ at 4-bit and 84 fJ$/N$ at 8-bit.
Comparing these two quantities directly, the optical shot noise factor $2h\nu/\eta$ is about an order of magnitude larger than the thermal noise factor $4 k_B T$. If we let our E/O/E efficiency $\eta \to 1$ in the best case, the ratio between the energies is still $E_{MAC(O)}/E_{MAC(E)} \sim 15$, which is larger than order unity. So we see that in the limit of noise power limited operation at high precision, electrical crossbars have an advantage over photonics.
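The noise-limited figures above can be reproduced with the following sketch. It assumes that, for $\rho = N$, the energy per MAC is the noise factor ($4 k_B T$ or $2h\nu/\eta$) times $2^{2N_b}$, divided by $N$. This form is inferred from the quoted numbers rather than taken from the text, and the temperature of 300 K is also an assumption.

```python
H = 6.62607015e-34  # Planck constant, J*s
C = 2.99792458e8    # speed of light, m/s
K_B = 1.380649e-23  # Boltzmann constant, J/K


def thermal_noise_energy_per_mac(n_bits, n=1, temp_k=300.0):
    """ASSUMED form for rho = N: 4 * k_B * T * 2^(2 * N_b) / N."""
    return 4.0 * K_B * temp_k * 2 ** (2 * n_bits) / n


def shot_noise_energy_per_mac(n_bits, n=1, eta=0.2, wavelength_m=1.55e-6):
    """ASSUMED form for rho = N: (2 * h * nu / eta) * 2^(2 * N_b) / N."""
    h_nu = H * C / wavelength_m
    return 2.0 * h_nu / eta * 2 ** (2 * n_bits) / n


for bits in (4, 8):
    e_th = thermal_noise_energy_per_mac(bits)
    e_sh = shot_noise_energy_per_mac(bits)
    print(f"{bits}-bit: thermal {e_th:.2e} J * (1/N), shot {e_sh:.2e} J * (1/N)")

# Best case for optics: perfect E/O/E efficiency.
ratio_ideal = shot_noise_energy_per_mac(4, eta=1.0) / thermal_noise_energy_per_mac(4)
print(f"shot/thermal ratio at eta = 1: {ratio_ideal:.1f}")
```

Because both expressions share the same dependence on $N_b$ and $N$ under this assumption, the ratio between them depends only on $\eta$, the wavelength, and the temperature.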

D. Discussion
We have considered the bandwidth density, switching energy, and noise at the physical limits of both electronic and optical matrix multiplier cores. We see that photonic cores exhibit scaling advantages over electronics for large core areas (L > 100 μm) or large channel counts (N > 500, see Fig. 4), but perform worse, in the limit, if the system is noise-power limited.
To illustrate performance differences between the two approaches, we set a vector size of $N = 1024$, which is within an order of magnitude of current conventions [12]. We calculate the maximum compute density with both 4-bit and 8-bit operations. For a given energy $E_{MAC}$, our power density is $D_P = E_{MAC} \Delta f / P^2$ and our computational density (ops/s/mm²) is $D = \Delta f / P^2$ for pitch $P$ between MAC elements and signal bandwidth $\Delta f$. We restrict the power density below a critical threshold $D_P < 1$ W/mm² [42] to prevent anticipated thermal issues that would otherwise result. We use the following parameters: a pitch of $P_E = 80$ nm  Table I. For 4-bit operations, switching energy largely dominates over noise energy for both photonics and electronics. Optical systems exhibit an advantage here: electronic cores hit the thermal density limit, but photonic cores are able to saturate their full bandwidth density limit before that point. In the 8-bit case, we see noise energy becoming significantly larger. There is a large jump in the photonic energy consumption as we move to higher precision, owing to a quadratic dependence on the relative noise power of each signal. In cases in which high precision is necessary, operating in a noise power limited regime results in electronic crossbars performing better.
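The interplay between the power density cap and the compute density can be sketched directly from the two definitions above. The example values (an 80 nm pitch and the ∼4 aJ switching energy from Section IV-B) are used purely for illustration and do not reproduce the full parameter set behind Table I.

```python
def max_signal_bandwidth(e_mac_j, pitch_mm, dp_max_w_mm2=1.0):
    """Largest delta_f (Hz) keeping D_P = E_MAC * delta_f / P^2 under the cap."""
    return dp_max_w_mm2 * pitch_mm ** 2 / e_mac_j


def compute_density(delta_f_hz, pitch_mm):
    """D = delta_f / P^2, in ops/s/mm^2."""
    return delta_f_hz / pitch_mm ** 2


PITCH_E_MM = 80e-6  # 80 nm expressed in mm
E_MAC_E = 4e-18     # J, electronic switching energy from Section IV-B

df_max = max_signal_bandwidth(E_MAC_E, PITCH_E_MM)
d_thermal = compute_density(df_max, PITCH_E_MM)
print(f"thermally limited bandwidth ~ {df_max / 1e9:.1f} GHz")
print(f"thermally limited density ~ {d_thermal / 1e15:.0f} PMAC/s/mm^2")
```

In the thermally limited regime the pitch cancels and the density reduces to the power cap divided by $E_{MAC}$, which for these inputs lands in the hundreds of PMACs/s/mm².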
Note that although electrical crossbars are less noise-bound than photonic cores, it is unclear if this increased precision capacity is important for artificial intelligence. Refs. [27], [28] have shown that the forward compute step does not need high precision even during training, as long as the underlying weight storage and gradient rules maintain granularity. Also, since shot and thermal noise are unbiased, batching can be used to average the noise over a given set of training data, so that the effective precision improves with the number of samples $M$ in the batch. The limits discussed here are far from current technology: compute densities in the range of hundreds of PMACs/s/mm² are a factor of $>1 \times 10^5$ beyond the current state-of-the-art, as discussed in Section VI. This shows that both electronic and photonic arrays have immense computational capacity, and what may ultimately differentiate them may be short term technological developments (e.g., cheap, high endurance, and tunable weight elements) or the efficiency of the nonlinear periphery surrounding each matrix core. An interesting note is that optical systems are limited in size by $P_\lambda$, whereas electrical crossbars can have much smaller pitches (<100 nm). This means that, in the limit, photonic devices will be much larger but run at much higher speeds. This can actually be a significant practical advantage: larger photonic devices may not be as sensitive to device variations or yield in a given fabrication process. We shall see that this size difference also occurs in nearer term systems, explored more closely in the sections that follow.

V. PHOTONIC MULTIPLY-ACCUMULATE OPERATIONS
Here, we consider the practical performance of photonic MACs based on existing photonic devices. There are a variety of methods for implementing photonic multiply-accumulate operations using tunable photonic elements [8], [78], [80], [81] and also in many fixed network implementations in reservoir computing approaches [82]- [85]. We will distinguish between two primary mechanisms for implementing linear summations: coherent or incoherent, as defined in Ref. [7]. The former uses interferometry to implement linear operations via constructive and destructive interference, changing the relative power levels of a coherent beam. The second utilizes excited carriers to perform summations or nonlinear operations, and can potentially accept multiple wavelengths or modes.
Coherent approaches can implement linear, unitary operations while only consuming energy resulting from passive loss. However, operations must be performed within a single wavelength and mode for a given matrix-or else constructive and destructive interference would not occur between interacting lightwaves-and all-optical nonlinearities are generally challenging to implement at low optical signal intensities. Systems that fall under the interference-mediated approach include the passive reservoir [85] and the interference-based processor described in Ref. [8].
Incoherent photonic MAC units are capable of operating across different wavelengths, modes, or polarizations. For dot product functionality, filter banks (described in [78], [79]) can apply weights via the partial transmission of signals to one (or more) detectors. This can greatly increase the information density on-chip, since many independent channels can coexist in a single waveguide. Performing a MAC is also passive in the incoherent approach: for a fixed filter topology, the computations are performed as lightwaves flow to their respective destinations. Unlike in the coherent approach, semiconductor devices (and therefore, O/E conversions) are required at each nonlinear processing stage. Systems that occupy this category include those described in Refs. [78], [82], [86], [87]. A more detailed discussion of these relationships is also provided in Ref. [7].
For both approaches, we will speak broadly about photonic MAC operations in the context of an $N \times N$ matrix operation. We consider the energy per MAC, speed (signal bandwidth and latency), and computation density (i.e., MACs/s/mm²).

A. Energy
Photonic devices, much like their resistive electronic counterparts, implement matrix operations passively and linearly. This leads to a number of advantages. In particular, for an $N \times N$ matrix, many of the most expensive energy costs scale with the size of the vector, $O(N)$, rather than the size of the matrix, $O(N^2)$. Below, we outline a general framework for understanding energy consumption in passive $N \times N$ photonic arrays, and provide some analysis on the trade-offs between various tunable devices.
First, we consider the cost of driving the system with a light source. An unavoidable, fundamental contribution is from shot noise, as explored in Section IV-C. We can also have relative intensity noise (RIN) on each laser input, which can affect our precision $N_b$. However, this is typically close to the shot noise level for sufficiently high modulation frequencies. In addition, we must drive the capacitor of the detector with enough light to switch it (see Section IV-B). The main point to consider is whether these energies scale with $O(N)$, $O(N^2)$, or something worse, which depends on the precision loss $\rho$. As mentioned in Section II, it is likely that deep learning algorithms work well in fixed point arithmetic, allowing us to recover an $O(N)$ scaling law for our light input with $\rho = N$. Therefore, we potentially have a favorable scaling law for our light source, depending on the nature of the computations being performed.
Secondly, we consider costs that scale only with $O(N)$ rather than $O(N^2)$, which are those involving the periphery around the $N \times N$ matrix. Since we must first retrieve data from memory, modulate $N$ signals on the input, and detect $N$ such signals at the output to place back into memory, we must consider the intrinsic costs associated with the driving and receiving circuitry, the modulators, the detectors, and memory I/O. These energies are similar to those in digital photonic links (see Refs. [23], [88]), which include both driving and tuning the modulating device and the amplification and recovery circuitry in the electronic receiver. Energy per sample can reach the hundreds of fJ for co-optimized photonic platforms [88], [89].
Lastly, we consider what can be the largest contribution to energy: costs that scale with $O(N^2)$, i.e., with every photonic device. Although fixed systems can implement a pre-defined weight matrix $W$ passively with low loss, tunable systems require a way of modifying each weight $w$. Photonic devices currently use heaters for coarse tuning, which consume a significant amount of power. Phase shifters in coherent approaches typically consume 10 mW to 20 mW per unit for thermal shifting [90], while microring heaters can consume ∼1 mW [91]. However, given the nature of passive photonic systems, these limits are not inherent. There are a variety of device modifications that promise to alleviate these problems and that could see integration into foundries very soon. For example, phase shifters can be greatly enhanced with slow light cavities [92], and microresonators can be trimmed to the desired value using foundry-compatible techniques, negating the need for a heater [93]-[95].
Considering all these factors, our full energy per MAC has three terms. The first term accounts for the optical power supplied to the system, which may either be noise limited or swing limited. The second term accounts for the capacitive switching and driving circuitry for the modulators ($E_{mod}$), the detectors ($E_{rec}$), and the memory retrieval cost ($E_{mem}$), which includes DAC/ADC conversion if digital memory is used and which can be made quite efficient [97]. $M$ refers to the number of compute cycles that occur before data is passed back to memory: for example, in a hardware neural network processor with a feedforward topology, $M$ is equivalent to the number of network layers that have been fabricated on-chip. The last term is the quiescent power use $P_q$ for each element, which includes the power of coarse tuned heaters and the leakage power across diode junctions. We operate our system over some characteristic sampling time window $\Delta\tau$ with some effective sampling rate $1/\Delta\tau$.
In practice, for heater-tuned resonators and phase shifters, the primary source of energy consumption is from tuning each element. If we operate the system at 10 GS/s (see Ref. [88] for various photonic link speeds), this puts the energy squarely in the range of 150 fJ/MAC to 1.5 pJ/MAC, for resonators (on the low end) and phase shifters (on the high end). If we use techniques to remedy this cost as discussed above, our next primary contributions are the link energy $E_L = E_{mod} + E_{rec} + E_{mem}/M$, which is typically in the hundreds of fJ range, and the capacitive charge of the detector, which consumes several fJ, even with conservative assumptions on precision ($\rho = \sqrt{N}$). The former quantity is divided by $N$, so with channel counts in the hundreds, we are quickly brought to the low fJ/MAC range. This means that with $N > 100$ and the eradication of power hungry heaters, the single digit fJ/MAC range becomes tenable, a $>10^2$ improvement over the current state-of-the-art in energy efficiency.
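The two contributions discussed above can be sketched as follows. The static tuning power is paid once per element per sample, while the link energy is shared over the $N$ MACs that each input participates in. The specific link energy and channel count in the example are illustrative assumptions, and the function names are ours.

```python
def tuning_energy_per_mac(static_power_w, sample_rate_hz):
    """Quiescent energy per MAC for one tuned element: P_q * delta_tau."""
    return static_power_w / sample_rate_hz


def link_energy_per_mac(e_link_j, n):
    """Periphery (link) energy per sample, amortized over N MACs."""
    return e_link_j / n


SAMPLE_RATE = 10e9  # 10 GS/s

e_ring = tuning_energy_per_mac(1e-3, SAMPLE_RATE)     # ~1 mW microring heater
e_shift = tuning_energy_per_mac(15e-3, SAMPLE_RATE)   # 10-20 mW phase shifter
e_link = link_energy_per_mac(400e-15, 200)            # assumed 400 fJ, N = 200

print(f"microring heater: {e_ring * 1e15:.0f} fJ/MAC")
print(f"phase shifter:    {e_shift * 1e12:.1f} pJ/MAC")
print(f"amortized link:   {e_link * 1e15:.1f} fJ/MAC")
```

The heater terms land in the 100 fJ to pJ range, while the amortized link term reaches the low fJ range once $N$ is in the hundreds, which is why removing heaters dominates the potential savings.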
To go beyond this into the ∼aJ range that we explored at the analog limits (Section IV), we rely on (I) the creation of very low energy optoelectronic devices to reduce $E_L$ significantly, as discussed in Ref. [16], (II) fixed point operations with $\rho = N$ to reduce the energy cost of the light source, which reduces both the shot noise contribution and the light required to drive each detector, and (III) a reduction of the memory I/O cost via either efficient photonic links or many-layer physical neural networks. We explore an architecture aimed at reaching aJ/MAC efficiencies in Section VI.

B. Speed
Photonic MACs can be done at very high speeds, limited only by the optoelectronic devices that encode and decode the signals on the input and output. An $N \times N$ matrix only requires one time step to compute the result. We can divide speed into two primary components: signal bandwidth and latency. If the system is bandwidth-limited by multiple parts of the signal pathway with time constants $\tau_1, \tau_2, \tau_3, \ldots$, the total bandwidth can be approximated from the combination of these time constants. The delay for each component is about half its time constant, i.e., $\tau_1/2, \tau_2/2, \ldots$, and the total latency is the sum of all the delays.
Fig. 6. Schematic of the neuromorphic photonic models under comparison with photonically-connected memory [21]-[23]. The abstract neuron model (above) can be represented using: (A) a hybrid spiking laser neuron, investigated in Refs. [7], [103]; (B) a co-integrated silicon modulator neuron, based on the system in Refs. [80], [104]; (C) a sub-λ photonic crystal neuron, running close to fundamental photonic limits. *Note that A does not require the off-chip laser source since it generates its own light.
Several properties of photonic devices lead to their operation at much higher speeds compared to digital and analog electronic devices: (1) they do not suffer from data movement and clock distribution costs along metal wires, reducing the thermal barrier and allowing for higher clock rates, (2) a small number of photonic devices are required to perform the same MAC operations, greatly reducing latency, (3) photonic devices have a larger footprint than analog electrical devices and thus run faster to saturate the available bandwidth density, and (4) photonic arrays do not suffer from the clock jitter problems that plague metal wires and cause inconsistent signal arrival times. With typical bandwidths of >20 GHz per photonic device and only several photonic devices in a signal pathway for a given $N \times N$ matrix operation, the signal bandwidth of each input can readily exceed 10 GS/s. Similarly, a <50 ps delay time for most photonic components and only several devices per pathway (see, for example, Fig. 6) results in a delay that is <100 ps. In other words, the entire matrix is effectively computed in less than a single digital electronic clock cycle. This contrasts quite sharply with the ∼μs latencies and >1 ns cycle times seen in current electronic approaches [12]. We thus see a stark $>10^3$ decrease in latency, meaning that any practical system will be limited more by the periphery circuitry than by the neural network core itself.
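The speed estimates in this subsection can be sketched for a short chain of devices. The latency follows the rule stated in the text (each component contributes about half its time constant). The bandwidth rule used here, a root-sum-square combination of the time constants, is an assumption on our part, since the text's own expression is not reproduced; the three 50 ps time constants are also illustrative.

```python
import math


def total_latency(time_constants_s):
    """Sum of per-component delays, each taken as tau_i / 2."""
    return sum(tau / 2.0 for tau in time_constants_s)


def total_bandwidth(time_constants_s):
    """ASSUMPTION: combine time constants in quadrature (root-sum-square)."""
    return 1.0 / math.sqrt(sum(tau ** 2 for tau in time_constants_s))


# Illustrative pathway: three devices, each ~20 GHz (tau = 50 ps).
pathway = [50e-12, 50e-12, 50e-12]

latency = total_latency(pathway)
bandwidth = total_bandwidth(pathway)
print(f"latency   ~ {latency * 1e12:.0f} ps")
print(f"bandwidth ~ {bandwidth / 1e9:.1f} GHz")
```

For this illustrative pathway the delay stays under 100 ps and the combined bandwidth remains above 10 GHz, in line with the figures quoted above.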

C. Compute Density
We use the same compute density metric defined in Section II-C: the number of operations (MACs) performed in a given area (mm²) per unit of time (seconds). The underlying density of a photonic compute core can be quite high using standard photonic components, which we will illustrate with a simple example: suppose we took the 512 × 512 AWG prototyped in Ref. [96] and used it to apply $N^2$ linear operations over a vectorized set of input light intensities. Suppose that there were multiple sets of these signals at different wavelengths, such that they were multiplexed across the entire ∼5 THz wavelength band. If we took the number of operations and divided by the area of the chip, we would get the rather large compute density of 6.8 PMAC/s/mm², exceeding the state-of-the-art in digital electronics by $>10^4$. This gives a picture of the capacity of photonics: the large value stems largely from the ability to multiplex both signals and connections, a technique exploited quite often by optical reservoir computing approaches (see, for example, Refs. [82], [85]).
However, making matrices with adjustable weight values $w_{ij}$ can be more challenging: tunable photonic systems typically require $N^2$ photonic devices, since there must be a device for every weight $w_{ij}$. As discussed in Section V-A, there are a couple of tunable approaches that have received significant attention: the coherent and incoherent approaches, which require $2N^2$ Mach-Zehnder interferometers (MZIs) or $N^2$ resonators, respectively. The former currently loses on compute density, since each MZI requires significantly more area (∼10000 μm² in Ref. [8]) compared to microresonators (∼250 μm² or much smaller). Miniaturizing each MZI relies on some complex modifications, such as slow-light enhanced structures [92] or perhaps inverse design [98], [99], whereas resonators can increase in performance as they are shrunk in size [100].
To get a better sense of what $N^2$ photonic devices can achieve, we can look towards prototyped devices that are compatible with silicon photonic foundry models. Standard microrings of size 50 μm × 50 μm operating at a sampling speed of 10 GS/s result in a computational density of 10 TMACs/s/mm². This is a major improvement over current digital electronic densities, which are around 580 GMACs/s/mm² (see Table II). A key point is that even though photonic devices are much larger than individual transistors, a single MAC unit in digital electronics is actually composed of many hundreds or thousands of transistors, occupying >100 μm² in area [101], comparable to the one (or several) elements that can accomplish the same operation in analog photonics. With a higher energy efficiency, photonic elements can be clocked much faster without hitting energy density limits, leading to the overall larger compute density seen here.
What compute densities will photonics be able to attain in the near future? This is considered in the last part of Section VI, in which photonic crystal defect states [102] that occupy close to 2 μm² per resonator are closely packed together. As shown in Table II, this can lead to an enormous photonic compute density (5 PMACs/s/mm²). In conclusion, we can expect that photonic devices will exceed current digital electronic systems by $>10^2$ in compute density with miniaturized resonator components. In the future, more exotic structures (such as PhCs) could reach $>10^3$ as photonic devices reach their fundamental limits in size.
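The footprint-limited densities in this subsection follow from dividing the sampling rate by the area of one weight element. A minimal sketch, assuming one MAC per element per sample and ignoring routing and periphery area (both simplifications of ours):

```python
UM2_PER_MM2 = 1e6  # 1 mm^2 = 1e6 um^2


def footprint_limited_density(sample_rate_hz, footprint_um2):
    """MACs/s/mm^2 for one MAC per element per sample."""
    return sample_rate_hz * UM2_PER_MM2 / footprint_um2


SAMPLE_RATE = 10e9  # 10 GS/s

d_phc = footprint_limited_density(SAMPLE_RATE, 2.0)           # PhC defect state
d_ring = footprint_limited_density(SAMPLE_RATE, 50.0 * 50.0)  # 50 um x 50 um ring
d_mzi = footprint_limited_density(SAMPLE_RATE, 10000.0)       # MZI area, Ref. [8]

print(f"PhC resonator: {d_phc / 1e15:.1f} PMAC/s/mm^2")
print(f"microring:     {d_ring / 1e12:.1f} TMAC/s/mm^2")
print(f"MZI:           {d_mzi / 1e12:.1f} TMAC/s/mm^2")
print(f"PhC / microring ratio: {d_phc / d_ring:.0f}")
```

The 2 μm² case reproduces the 5 PMACs/s/mm² figure. The 50 μm × 50 μm footprint evaluates to a few TMACs/s/mm² under this simple rule, the same order as the microring figure quoted above, and the ratio between the two is of order $10^3$.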

VI. NEURAL NETWORK HARDWARE COMPARISON
This section provides comparisons between neuromorphic photonic processing models and digital electronic processing systems. For concreteness, we focus specifically on Broadcast-and-Weight architectures [78], [103], which have been developed enough for a comparison to be possible, in particular through the empirical validation of both tunable weight systems [30], [105]-[107] and nonlinear processors that have a direct functional correspondence with neuron models [7], [104]. Nonetheless, given that photonic architectures are bound by the same physical constraints and underlying devices, this comparison provides some insights into the performance of neuromorphic photonic systems in the more general case. For the photonic platforms, we choose three models with distinct characteristics: 1) a laser neural network based on an instantiation in a hybrid spiking III-V/silicon platform [108], [109], 2) a silicon photonics platform with tight co-integration with digital electronic drivers, controllers, and amplifiers [80], and 3) a nanophotonic platform operating close to fundamental noise limits. These hardware platforms are depicted in Fig. 6. A list of computed values is included in Table II, and a graph depicting the compute density and energy efficiency, along with some of the analog limits discussed in Section IV, is shown in Fig. 7.
Fig. 7. Comparison of deep learning hardware accelerators with photonic platforms discussed in Section VI, modified from Ref. [7]. Photonic systems can support high bandwidth densities on-chip while consuming minimal energy both transporting data and performing computations. Metrics for digital electronic architectures taken from various sources [12], [124]-[127]. Also included are the analog limits for photonic and electronic matrix cores with $N = 1024$ and 4 bits of precision, from Table I.
1) Hybrid Laser Neural Network: This model, which is largely the focus of Ref. [103], uses currently available silicon photonic technology together with integrated III-V lasers to emulate biological spiking behavior. It has been proposed together with the Broadcast-and-Weight networking framework [78], and has also received considerable experimental validation, both in the tunable weight units [105] and the nonlinear processors that communicate using such units [7], [111], [112]. These systems are limited by two primary sources of energy consumption: the quiescent power of the laser and amplifier units (which can be as large as 200 mW), and the static power consumption of the heaters used within each filter bank (which can be as large as 2 mW each). For the comparison, we assume an all-to-all network with a channel number of $N = 56$, based on limits discussed in Ref. [105]. The precision is based on experimentally-validated measurements of microring filters [107]. We assume that, for excitable operation, lasers are biased close to threshold. We also consider a semiconductor optical amplifier (SOA) on the output port to generate enough output power for the next stage. For an $N \times N$ fully-connected network, the energy consumption per MAC operation can be expressed in terms of the following quantities: $P_{\lambda(th)} = I_{th} V_L$ represents the laser power consumption at threshold current $I_{th}$, $P_{SOA} = I_{SOA} V_{SOA}$ is the power consumption of each output SOA, $P_h = I_h^2 R_h$ is the average power dissipation of each microring heater, and $P_l = I_l V_{MRR}$ is the leakage power from the current $I_l$ across the junction biased at $V_{MRR}$. $\tau_s$ represents the effective sampling period, determined by the bandwidth of the real-time signal pathway and I/O (i.e., ∼10 GHz [109]) during operation. We distinguish between power use at each node (which scales with $O(N)$ for $N^2$ operations) and power use at each edge (which scales with $O(N^2)$ for a MAC performed at each network edge), and omit the memory I/O cost, since it is not dominant here.
In this system, energy efficiency is primarily bottlenecked by the quiescent power consumption of the optical amplifier and that of the heaters. In practice, the remaining contributions (the laser threshold power and leakage terms) are negligible in comparison. In particular, the amplifier must provide enough energy to drive the next stage, meeting cascadability conditions as discussed in Ref. [7]. With our assumed channel density $N = 56$, and other parameters based on current photonic devices ($\tau_s \sim 100$ ps), we arrive at the 0.22 pJ shown in Table II. This system is comparable to deep learning chips and neuromorphic electronic systems in energy consumption, fan-in, and compute density. In the following section, we will explore the improvements that can manifest in systems better optimized for higher energy efficiency.
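The node-versus-edge accounting described for this platform can be sketched as follows. The form $E_{MAC} = \tau_s (P_{node}/N + P_{edge})$ is our reading of the scaling argument (node power shared over $N$ MACs, edge power paid per MAC) and is not quoted from the text. The inputs are the upper bounds mentioned above, with the laser threshold and leakage terms neglected, so the result is an upper-end estimate rather than the Table II value, whose exact inputs are not reproduced here.

```python
def hybrid_laser_energy_per_mac(n, tau_s, p_node_w, p_edge_w):
    """ASSUMED form: E_MAC = tau_s * (P_node / N + P_edge).

    p_node_w: power spent once per node (laser plus output amplifier).
    p_edge_w: power spent once per edge (microring heater).
    """
    return tau_s * (p_node_w / n + p_edge_w)


N_CHANNELS = 56      # channel number assumed in the text
TAU_S = 100e-12      # s, effective sampling period
P_NODE_MAX = 200e-3  # W, quoted upper bound for laser and amplifier units
P_EDGE_MAX = 2e-3    # W, quoted upper bound per heater

e_upper = hybrid_laser_energy_per_mac(N_CHANNELS, TAU_S, P_NODE_MAX, P_EDGE_MAX)
node_share = TAU_S * P_NODE_MAX / N_CHANNELS
edge_share = TAU_S * P_EDGE_MAX
print(f"upper-end estimate: {e_upper * 1e12:.2f} pJ/MAC")
print(f"  node share: {node_share * 1e12:.2f} pJ, edge share: {edge_share * 1e12:.2f} pJ")
```

Even at $N = 56$, the node and edge shares are of comparable size under these inputs, which illustrates why both the amplifier and the heaters are identified as the bottlenecks.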
2) Co-Integrated Neuromorphic Silicon Photonic Network: This platform (first discussed in Refs. [80], [104]) uses continuous models and can vastly reduce the energy consumption via a close interface between digital electronic and photonic systems. This interface allows easy E/O and O/E conversions between electrical nonlinearities and photonic linear computation elements. This system also uses silicon photonic technologies that are currently available in foundries, but its performance depends critically on several new developments and insights, including: (I) the use of active electronic amplification to sidestep the gain-bandwidth trade-off in each nonlinear processing unit, and (II) the reduction of static power in microring filters by minimizing the use of heaters. For the remainder of this analysis, we also assume a close proximity, low capacitance interface between electronics and photonics (i.e., TOVs with <50 fF [113]), and low-node electronics (i.e., FinFETs [114]).
One of the first challenges is minimizing the quiescent power usage that results from each filter (scaling with $O(N^2)$) requiring a power hungry heater. Note that this is not a problem inherent in photonic elements, since a pre-fabricated fixed photonic network performs the same computations without consuming power. To avoid the immense cost of tuning across the fabrication variation that occurs across microresonators, we assume that each element is trimmed to avoid the use of heaters, as discussed in Section V-A. Integrating these approaches into the fabrication process would allow for a tremendous reduction ($P_h \to 0$) in energy consumption.
Next, we consider the limitations imposed by amplitude cascadability. In a passive neuron configuration in which a detector directly drives a modulator with no intermediate circuitry (e.g., Ref. [104]), each nonlinear element must replenish the energy lost from the previous layer. In an all-to-all $N$-node network with $N^2$ connections, we must ensure that the small-signal gain from layer to layer is greater than unity (i.e., $g = dP_{out}/dP_{in} > 1$). This puts a lower bound on the energy consumption per MAC operation, set by the switching charge required for gain cascadability [115] [Eq. (15)], where $\eta = \eta_L \eta_{wg} \eta_d$ is the product of the laser efficiency, photonic link efficiency, and photodetector efficiency, respectively; $V_s$ is the inverse slope of the modulator's voltage-to-transmission curve $T(V)$; and $C_{mod}$, $C_{PD}$ are the capacitances of the modulator and photodetector, which add at the joint node. In a typical foundry model where $V_s (C_{mod} + C_{PD}) \sim 70$ fC and $\eta \sim 0.06$ (which includes the passive losses through the weight banks, which can be made quite small [116]), even with $\rho = N$ in fixed point systems, we arrive at a floor of approximately $E_{MAC} \geq 30$ fJ/MAC.
Going beyond this barrier requires the use of an active transimpedance amplifier (TIA) placed between the detector and modulator, which can be instantiated using digitally-compatible circuitry in a number of different configurations. This serves several functions: it can separate the capacitive contributions of the photodetector and modulator, and it also reduces the impedances associated with each stage. In a low-node electronics platform with TOVs, the energy consumption per sample can be quite low (<100 fJ) for a good TIA; see, for example, the analysis in Ref. [88] or Refs. [89], [117]. Given that the signal-to-noise ratio must exceed the given bit precision $N_b$ (i.e., SNR $= I_p/\sigma_i > 2^{N_b}$ for received current $I_p$ and RMS shot noise current $\sigma_i$ at each detector), we arrive at a new energy-per-MAC metric. Here, $E_{link}$ includes contributions from the active TIA, the modulator switching energy per unit of time $\tau_s$ (which is typically expressed in J/bit), and the energy associated with memory I/O. We conservatively assume just one neural network layer [$M = 1$ from Eq. (13)]. Note that we neglect the effect of nonlinearity on noise reduction, which can have positive effects on the resulting precision, as discussed in Ref. [115]. With fixed-point-like precision ($\rho = N$), power dissipation is dominated by the E/O and O/E interfaces together with the digital circuitry. Given the similarity between each modulator neuron and the E/O/E interface in a standard photonic link (requiring the same electrical interfaces, amplification, and driver circuitry), we can use $E_{link}$ estimates from digital links [88], [118] and those from photonically connected DRAM memory [21] to arrive at 400 fJ/sample as a relatively accurate proxy for the link energy.
We arrive at an energy consumption of 2.7 fJ, as shown in Table II. We assume an improvement in areal density and channel density from shrinking the resonators to ∼10 μm in diameter [100] and from using high-fidelity two-pole photonic filters, as described in [107].
3) Sub-λ Nanophotonics: Here, we consider the performance of photonic devices as they begin to hit their physical limits in the B&W architecture. The basic principle of operation of each unit is similar to the co-integrated silicon photonic network. The platform is assumed to include both low-node electronics and photonics on the same platform (i.e., a variant of [119]) to avoid additional capacitances at the interfaces. Additionally, we assume that there are a significant number of layers in the network (M ∼ 100) before the information is passed back to memory, further amortizing the energy cost (400 fJ/sample) of the photonic memory link by the factor 1/M [as described in Eq. (13)]. Each sub-λ neuron (I) uses a nanophotonic photodetector such as [77] with <1 fF of capacitance, (II) operates in the "near-receiverless" regime discussed in [16], i.e., with a minimal gain stage, if any, between the detector and modulator, such as a single inverter amplifier (see Refs. [117], [120]), and (III) instantiates its filters and modulators efficiently using more exotic enhancement techniques [121], [122]. We utilize devices that have been empirically prototyped, but not yet scaled in foundries. Our metrics are based on several insights:
• Compute Density: Photonic devices can be shrunk significantly compared to their current sizes. The smallest known resonators are photonic crystal defect states [102], which can occupy small footprints: if packed very tightly, they can be as small as ∼2 μm^2. A single defect state can potentially perform a weight multiplication. This has significant ramifications for compute density (∼10^3) compared to microring filter banks, even if the effective sampling rate is kept constant.
• Channel Number: The number of channels is limited by the total bandwidth available in the optical spectrum. At 10 GS/s, we can fit about 300 channels in a 30 nm spectral gain curve. Although the channel number can be extended further through the use of heterogeneous laser sources or frequency combs, this goes beyond the scope of this work. We also assume low precision, fixed point operations (ρ = N).
• Energy Consumption: There are many vectors for improvement in Eq. (16). We will assume the reverse-biased filter leakage can be brought down from microamperes [100] to nanoamperes with better manufacturing. The O/E/O switching energy E_samp, which shares many properties with digital links, can be improved significantly using a variety of techniques to reach the ∼1 fJ level [16]. Modulators, for example, can reach the ∼100 aJ per bit range [123]. We also assume that optical losses through the system are small, which can be optimized via passive device engineering. With this in place, the system is now bottlenecked by the cost of the I/O to memory and by shot noise at the detector, which limits precision for a given input power. With more efficient laser sources, the total quantum efficiency could be as high as η ∼ 20%. All together, this leads to E_MAC = 17.3 aJ + 13.3 aJ (memory) = 30.6 aJ.
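Two of the figures quoted above can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the 1550 nm center wavelength is an assumption (the text gives only the 30 nm span), the fan-in is taken equal to the ∼300 channels so that the per-sample memory cost is shared among N MACs, and the function names are hypothetical. The 17.3 aJ term is taken directly from the text rather than derived.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def optical_bandwidth_hz(span_nm: float, center_nm: float) -> float:
    """Convert a wavelength span to a frequency span: d_nu ~ c * d_lambda / lambda^2."""
    return SPEED_OF_LIGHT_M_S * (span_nm * 1e-9) / (center_nm * 1e-9) ** 2


def channel_spacing_hz(span_nm: float, center_nm: float, n_channels: int) -> float:
    """Frequency spacing if n_channels share the span evenly."""
    return optical_bandwidth_hz(span_nm, center_nm) / n_channels


def amortized_memory_energy_j(link_energy_j: float, n_layers: int, fan_in: int) -> float:
    """Memory-link energy per MAC, shared over n_layers and fan_in MACs per sample.

    Assumption: each sample feeds fan_in MAC operations per layer.
    """
    return link_energy_j / (n_layers * fan_in)


if __name__ == "__main__":
    # Assumed center wavelength of 1550 nm; the span and channel count are from the text.
    spacing = channel_spacing_hz(span_nm=30.0, center_nm=1550.0, n_channels=300)
    print(f"channel spacing: {spacing / 1e9:.1f} GHz")

    # 400 fJ/sample link, M = 100 layers, N = 300 (assumed equal to channel count).
    e_mem = amortized_memory_energy_j(400e-15, n_layers=100, fan_in=300)
    e_compute = 17.3e-18  # taken from the text, not derived here
    print(f"memory term: {e_mem / 1e-18:.1f} aJ")
    print(f"total E_MAC: {(e_compute + e_mem) / 1e-18:.1f} aJ")
```

With these assumptions the channel spacing comes out slightly above the 10 GS/s sampling rate, and the amortized memory term reproduces the 13.3 aJ quoted above.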

VII. SUMMARY AND CONCLUDING REMARKS
Historically, both electronic neuromorphic systems and electronic emulations of neural networks have been constrained by the inherent scaling laws of digital systems and metal interconnects. In particular, energy scales as O(N^2), where N is the number of neurons, and for systems with large numbers of neurons, this becomes untenable for modern applications. Photonics provides a solution, alleviating the energy consumption of both data movement across metal wires and of multiply-accumulate (MAC) computation itself, both of which are major bottlenecks in neural computing.
We have extensively compared the limits of electronic crossbar arrays with photonic linear compute cores, and have shown that photonics exhibits advantages for large processor sizes (>100 μm), large vector sizes (N > 500), and low precision (≤4 bits). We have discussed the advantages that photonic multiply-accumulate (MAC) operations possess over their digital electronic counterparts in energy (>10^2), speed (>10^3), and compute density (>10^2). We have analyzed how they can manifest in practical models, based on empirically validated, foundry-compatible photonic devices. Although we considered resonator-based methods for networking and linear operations, the advantages of photonic MACs remain relevant for many architectures beyond those presented in this work.
Although photonics has traditionally been studied for its role in communication, there is great potential to address new and emerging bottlenecks in computing. Artificial intelligence has brought unique challenges to processor architectures: modern GPUs and machine learning ASICs now implement high volume, high density, low precision matrix operations with specialized compute cores. These processors are subject to trade-offs that are significantly more communication-bottlenecked than traditional von Neumann architectures. Many challenges remain before functional analog computing systems are realized: for example, one must consider the cost of the periphery, the cost of reprogramming weights during training, the cost of A/D and D/A conversion, and the higher-level communication protocols between multiple neuromorphic cores. Nonetheless, photonics has the potential to address the major bottlenecks present in AI hardware, providing a means to simultaneously move data across a chip and perform matrix multiplication with little cost.
ACKNOWLEDGMENT
I would like to thank both Michael Gao and Marcus Gomez for their stimulating discussions regarding artificial intelligence algorithms and hardware.