11 Tera-FLOP/s photonic convolutional accelerator and deep learning optical neural networks

Convolutional neural networks (CNNs), inspired by biological visual cortex systems, are a powerful category of artificial neural networks that can extract the hierarchical features of raw data to greatly reduce the network parametric complexity and enhance the prediction accuracy. They are of significant interest for machine learning tasks such as computer vision, speech recognition, playing board games and medical diagnosis [1-7]. Optical neural networks offer the promise of dramatically accelerating computing speed to overcome the inherent bandwidth bottleneck of electronics. Here, we demonstrate a universal optical vector convolutional accelerator operating beyond 10 Tera-FLOPS (floating point operations per second), generating convolutions of images of 250,000 pixels with 8-bit resolution for 10 kernels simultaneously, enough for facial image recognition. We then use the same hardware to sequentially form a deep optical CNN with ten output neurons, achieving successful recognition of all ten digits with 900-pixel handwritten digit images with 88% accuracy. Our results are based on simultaneously interleaving temporal, wavelength and spatial dimensions enabled by an integrated microcomb source. This approach is scalable and trainable to much more complex networks for demanding applications such as unmanned vehicles and real-time video recognition.


Introduction
Artificial neural networks (ANNs) are collections of nodes with weighted connections that, with proper feedback to adjust the network parameters, can "learn" and perform complex operations for face recognition, speech translation, playing board games and medical diagnosis [1][2][3][4]. While classic fully connected feedforward networks face challenges in processing extremely high-dimensional data, convolutional neural networks (CNNs), inspired by the (biological) behavior of the visual cortex system, can abstract the representations of input data in their raw form, and then predict their properties with both unprecedented accuracy and greatly reduced parametric complexity [5]. CNNs have been widely applied to computer vision, natural language processing and other areas [6,7].
The capability of neural networks is dictated by the computing power of the underlying neuromorphic hardware. Optical neural networks (ONNs) [8][9][10][11][12] are promising candidates for next-generation neuromorphic computation, since they have the potential to overcome the bandwidth bottleneck of their electrical counterparts [6,13-16] and achieve ultra-high computing speeds enabled by the >10 THz wide optical telecommunications band [8]. Operating in analog frameworks, they avoid the limitations imposed by the energy and time consumed during reading and storing data back and forth, known as the von Neumann bottleneck [13]. Significant progress has been made in highly parallel, high-speed and trainable ONNs [8][9][10][11][12][17][18][19][20][21], including approaches that have the potential for full integration on a single photonic chip [8,12], in turn offering an ultra-high computational density. However, there remain opportunities for significant improvements in ONNs. Processing large-scale data, as needed for practical real-life computer vision tasks, remains challenging for ONNs because they are primarily fully connected structures where their input scale is determined solely by hardware parallelism. This leads to tradeoffs between the network scale and footprint. Moreover, ONNs have not achieved the extreme computing speeds that analog photonics is capable of, given the very wide optical bandwidths that they can exploit.
Here, we demonstrate an optical convolution accelerator that operates beyond 10 Tera-FLOPS (floating point operations per second) and use it to process and compress large-scale data. Through interleaving wavelength, temporal, and spatial dimensions using an integrated Kerr frequency comb (or "microcomb"), we achieve a vector computing speed as high as 11.322 Tera-FLOPS. We then use it to process images with a length of 250,000 pixels with ten convolution kernels at 3.8 Tera-FLOPS. Our convolution accelerator is fully and dynamically reconfigurable, as well as scalable, so that it can serve both as a convolutional accelerator front-end that generates convolutions with multiple simultaneous parallel kernels and as the basis of an optical deep CNN with fully connected neurons, without any change in hardware. We use the deep CNN to achieve successful recognition of the full ten digits (0-9) for handwritten images, achieving an accuracy of 88%. Our optical neural network represents a major step towards realizing monolithically integrated ONNs and is enabled by our use of an integrated microcomb chip. Moreover, our accelerator scheme is stand-alone and universal, being fully compatible with either electrical or optical interfaces. Hence, it can serve as a universal ultrahigh bandwidth data compressing front end for any neuromorphic hardware, whether optical or electronic, bringing massive-data machine learning for both real-time and ultrahigh bandwidth data within reach.

Principle Of Operation
Figure 1 shows the principle of operation for the photonic vector convolutional accelerator (VCA), which features high-speed electrical signal ports for data input and output. The input data vector X is encoded as the intensity of temporal symbols in a serial electrical waveform at a symbol rate 1/τ (baud), where τ is the symbol period. The convolution kernel is similarly represented by a weight vector W of length R that is then encoded in the optical power of the microcomb lines through spectral shaping performed by a Waveshaper. The temporal waveform X is then multi-cast onto the kernel wavelength channels via electro-optical modulation, thus generating the replicas weighted by W. Next, the optical waveform is transmitted through a dispersive delay with a delay step (between adjacent wavelength channels) equal to the symbol duration of X, effectively achieving time and wavelength interleaving. Finally, the delayed and weighted replicas are summed via high speed photodetection so that each time slot yields a convolution between X and W for a given convolution window, or receptive field. As such, the convolution window effectively slides at the modulation speed matching the baud rate of X. Each output symbol is the result of R multiply-and-accumulate operations, with the computing speed given by 2R/τ FLOPS. Since the speed of this process scales with both the baud rate and number of wavelengths, it can be dramatically boosted into the Tera-FLOP regime by using the massively parallel wavelength channels of the microcomb source. Moreover, the length of the input data X is theoretically unlimited so that the convolution accelerator can process data with an arbitrarily large scale, the only practical limitation being the capability of the external electronics.
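The weight, delay and sum sequence described above can be mimicked numerically. The following Python sketch is only an idealized, noise-free model (the function names `vca_convolve` and `vector_speed_flops` are illustrative and not from the experiment): each wavelength channel carries a copy of X scaled by one weight and delayed by one additional symbol period, and photodetection is modelled as a plain sum.

```python
import numpy as np

def vca_convolve(x, w):
    """Sum R weighted replicas of x, each delayed by one more symbol period."""
    L, R = len(x), len(w)
    out = np.zeros(L + R - 1)
    for k in range(R):            # one wavelength channel per kernel weight
        out[k:k + L] += w[k] * x  # replica scaled by w[k], delayed by k symbols
    return out

def vector_speed_flops(R, tau):
    """Vector computing speed 2R/tau, with tau in seconds."""
    return 2 * R / tau

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input symbols (illustrative values)
w = np.array([0.5, 1.0, 2.0])            # kernel weights as optical powers (illustrative)
y = vca_convolve(x, w)

# The delayed-and-summed waveform equals the discrete convolution of x and w.
print(np.allclose(y, np.convolve(x, w)))
print(vector_speed_flops(9, 15.9e-12))   # per-kernel speed for R = 9, tau = 15.9 ps
```

In this model the output has L+R−1 symbols, of which the central L−R+1 are the slots where all R replicas overlap.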
We achieve the simultaneous convolution of multiple kernels in parallel simply by adding additional subbands of R wavelengths for each additional kernel. Following multicasting and dispersive delay, the subbands (kernels) are demultiplexed and detected separately with high speed photodetectors, generating a separate electronic waveform for each kernel. The VCA is fully reconfigurable and scalable: the number and length of the kernels are arbitrary, limited only by the total number of wavelengths.
While the core convolutional accelerator system typically processes vectors, it can easily be adapted to operate on matrices for image processing. For optical processing of matrix operations, the matrix must first be flattened into a vector, and the precise way that this is performed determines both the sliding convolution window's stride and the equivalent matrix computing speed. Our flattening method sets the receptive field (convolution slot) to slide with a horizontal stride of unity (i.e., every matrix input element has a corresponding convolution output) and a vertical stride that scales with the size of the convolutional kernel. The larger vertical stride effectively results in sub-sampling across the vertical direction of the raw input matrix, equivalent to a partial pooling function [68] in addition to convolution. This results in an effective reduction (or overhead) in matrix computing speed that scales inversely with the size of the kernel (e.g., a 3×3 kernel yields a speed reduction of a factor of 3). While this can be alleviated by various means to produce convolutions with a symmetric stride and no speed overhead, this is not necessary for most applications.
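The exact flattening order is not spelled out in this section, so the following Python sketch is an assumption rather than the authors' procedure. It cuts the image into bands of k rows, reads each band column by column, and keeps every k-th output symbol. This reproduces a horizontal stride of 1 and a vertical stride of k, and yields a 166×498 feature map for a 500×500 image with 3×3 kernels. The names `flatten_bands` and `conv_via_vector` are hypothetical.

```python
import numpy as np

def flatten_bands(img, k):
    """Cut img into bands of k rows and read each band column by column."""
    rows, cols = img.shape
    n_bands = rows // k                   # leftover rows are dropped in this sketch
    bands = img[:n_bands * k].reshape(n_bands, k, cols)
    # Transpose so that each column of k pixels is contiguous in the vector.
    return bands.transpose(0, 2, 1).reshape(-1), n_bands

def conv_via_vector(img, kernel):
    """Matrix convolution (CNN convention) computed with one sliding 1-D window."""
    k = kernel.shape[0]
    cols = img.shape[1]
    x, n_bands = flatten_bands(img, k)
    w = kernel.T.reshape(-1)              # same column-by-column order as the data
    full = np.correlate(x, w, mode="valid")  # one sliding dot product per symbol
    fmap = np.empty((n_bands, cols - k + 1))
    for b in range(n_bands):
        start = b * k * cols
        # Only every k-th symbol lines up with a complete k-by-k window.
        fmap[b] = full[start:start + k * (cols - k + 1):k]
    return fmap

rng = np.random.default_rng(0)
img = rng.random((9, 7))
ker = rng.random((3, 3))
print(conv_via_vector(img, ker).shape)    # 3 bands, 5 horizontal positions
```

Because only one output symbol in k is kept, the effective matrix speed in this scheme is the vector speed divided by k, in line with the overhead described above.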
Finally, this approach is highly flexible and reconfigurable without any change in hardware: we use the same system for the convolutional accelerator for image processing as well as to form an optical deep learning CNN, which we use to perform a separate series of experiments. The convolutional accelerator hardware forms both the input processing stage as well as the fully connected neuron layer of the CNN (see below).
The system can achieve matrix multiplication by simply sampling one time slot of the output waveform, since the vector dot product is equivalent to the special convolution case where the two input vectors X and W have the same length.
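As a minimal numerical illustration of this special case (an idealized sketch; the helper name `dot_via_convolution` and the ordering of the weights across the delayed channels are assumptions), the dot product of two equal-length vectors appears in the single time slot where all delayed replicas overlap:

```python
import numpy as np

def dot_via_convolution(x, w):
    """Read the dot product x.w from one time slot of the delayed-and-summed output."""
    assert len(x) == len(w)
    # Weights are assigned to the delayed channels in reverse order (assumed),
    # so that the fully overlapped slot aligns x[m] with w[m].
    full = np.convolve(x, w[::-1])
    return full[len(x) - 1]        # the only slot where all replicas overlap

x = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
print(dot_via_convolution(x, w), np.dot(x, w))
```

A matrix-vector product then follows by assigning one such weight vector (one row of the matrix) to each output port.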

Experiment Matrix Convolutional Accelerator
Figure 2 shows the experimental setup for the full matrix convolutional accelerator that we use to process a classic 500×500 face image. The system performs 10 simultaneous convolutions with ten 3×3 kernels to achieve distinctive image processing functions. The weight matrices for all kernels were flattened into a composite kernel vector W containing all 90 weights (10 kernels with 3×3 = 9 weights each), which were then encoded onto the optical power of 90 microcomb lines by an optical spectral shaper (Waveshaper), each kernel occupying its own frequency band of 9 wavelengths. The wavelength channels were supplied by a coherent soliton crystal microcomb (Fig. 3) via optical parametric oscillation in a single micro-ring resonator (MRR, Fig. 3b) with a radius of 592 μm [22,23], corresponding to a spacing of ~48.9 GHz [31] with an optical bandwidth of ~36 nm for the 90 wavelengths across the telecommunications C-band (1540-1570 nm) (see Methods) [30].
The raw 500×500 input face image was flattened electronically into a vector X and encoded as the intensities of 250,000 temporal symbols with a resolution of 8 bits/symbol (limited by the electronic arbitrary waveform generator (AWG)), to form the electrical input waveform via a high-speed electrical digital-to-analog converter, at a data rate of 62.9 Giga Baud (time-slot τ = 15.9 ps) (Fig. 4b). The waveform duration was 3.975 µs for each image, corresponding to a processing rate for all ten kernels of over 1/3.975 µs, equivalent to 0.25 million of these ultra-large-scale images per second.
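The timing figures quoted above follow directly from the symbol count and the symbol period. The short Python check below uses only numbers stated in the text; the variable names are illustrative.

```python
n_symbols = 500 * 500            # one symbol per pixel of the flattened image
tau = 15.9e-12                   # symbol period in seconds

duration = n_symbols * tau       # waveform duration per image
images_per_second = 1.0 / duration
baud = 1.0 / tau                 # symbol rate implied by tau

print(f"duration = {duration * 1e6:.3f} us")
print(f"rate     = {images_per_second / 1e6:.3f} million images/s")
print(f"baud     = {baud / 1e9:.1f} GBaud")
```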
The input waveform X was then multi-cast onto the 90 shaped comb lines via electro-optical modulation, yielding replicas weighted by the kernel vector W. Following this, the waveform was transmitted through a ~2.2 km length of standard single mode fibre having a dispersion of ~17 ps/nm/km. The fibre length was carefully chosen to induce a relative temporal shift in the weighted replicas with a progressive delay step of 15.9 ps between adjacent wavelength channels. This delay exactly matched the duration of each input data symbol τ, which effectively resulted in time and wavelength interleaving for all ten kernels.
The 90 wavelengths were then de-multiplexed into 10 sub-bands of 9 wavelengths, each sub-band corresponding to a kernel, and separately detected by 10 high speed photodetectors. The detection process effectively summed the aligned symbols of the replicas (the electrical output waveform of one of the kernels (kernel 4) is shown in Fig. 4c). The 10 electrical waveforms were converted into digital signals via ADCs and resampled so that each time slot of each of the waveforms corresponded to the dot product between one of the convolutional kernel matrices and the input image within a sliding window (i.e., receptive field). This effectively achieved convolutions between the 10 kernels and the raw input image. The resulting waveforms thus yielded the 10 feature maps (convolutional matrix outputs) containing the extracted hierarchical features of the input image (Fig. 4d).
The convolutional vector accelerator makes full use of time, wavelength, and spatial multiplexing, where the convolution window effectively slides across the input vector X at a speed equal to the modulation baud rate of 62.9 Giga Symbols/s. Each output symbol is the result of 9 (the length of each kernel) multiply-and-accumulate operations, thus the core vector computing speed (i.e., throughput) of each kernel is 2×9×62.9 = 1.13 Tera-FLOPS. For ten kernels computed in parallel the overall computing speed of the VCA is therefore 1.13×10 = 11.3 Tera-FLOPS, or 11.321×8 = 90.568 tera-bits per second (Tb/s) (reduced slightly by the optical signal to noise ratio (OSNR)). This speed is over 500 times higher than the fastest speed of ONNs reported to date.
For the image processing matrix application demonstrated here, the convolution window had a vertical sliding stride of 3 (resulting from the 3×3 kernels), and so the effective matrix computing speed was 11.3/3 = 3.8 Tera-FLOPS. Homogeneous strides operating at the full vector speed can be readily achieved by duplicating the system with parallel weight-and-delay paths, although we found that this was unnecessary. While the length of the input data processed here was 250,000 pixels, the convolution accelerator can process data with an arbitrarily large scale, the only practical limitation being the capability of the external electronics.
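The speed figures above can be recomputed from the kernel length, the number of kernels and the baud rate. The Python sketch below uses the rounded 62.9 GBaud value, so the last digit may differ slightly from the quoted numbers; the variable names are illustrative.

```python
baud = 62.9e9        # symbols per second
R = 9                # weights per 3x3 kernel
n_kernels = 10
bits = 8             # resolution per symbol
stride = 3           # vertical stride for 3x3 kernels

per_kernel = 2 * R * baud          # one multiply and one add per weight per symbol
total = n_kernels * per_kernel     # vector computing speed of the VCA
bit_rate = bits * total            # equivalent rate in bits per second
matrix_speed = total / stride      # effective speed after the flattening overhead

print(f"per kernel  : {per_kernel / 1e12:.3f} Tera-FLOPS")
print(f"ten kernels : {total / 1e12:.3f} Tera-FLOPS")
print(f"bit rate    : {bit_rate / 1e12:.2f} Tb/s")
print(f"matrix speed: {matrix_speed / 1e12:.2f} Tera-FLOPS")
```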

Deep Learning Optical Convolutional Neural Network
The convolutional accelerator architecture presented here is fully and dynamically reconfigurable and scalable with the same hardware system. We were thus able to use the accelerator to sequentially form both a front-end convolution processor as well as a fully connected layer, together yielding an optical deep CNN. We applied the CNN to the recognition of all ten (0-9) handwritten digits.
Figure 5 shows the overall principle of the optical deep CNN while Figure 6 shows the detailed experimental configuration. The convolutional layer performs the heaviest computing duty of the entire network, generally taking 55% to 90% of the total computing power, and operated as described in the previous section. The digit images (30×30 matrices of grey-scale values with 8-bit resolution) were flattened into vectors and multiplexed in the time domain at 11.9 Giga Baud (time-slot τ = 84 ps). Three 5×5 kernels were used, requiring 75 microcomb lines (Fig. 7) and resulting in a vertical stride of 5.
The dispersive delay was achieved with ~13 km of standard SMF to match the data baud rate. The wavelengths were de-multiplexed into the three kernels which were detected by high speed photodetectors and then sampled and nonlinearly scaled with digital electronics to recover the extracted hierarchical feature maps of the input images. The feature maps were then pooled electronically and flattened into a vector X_FC (72×1 = 6×4×3) per image that formed the input data to the fully connected layer.
The fully connected layer had ten neurons, each corresponding to one of the ten categories of handwritten digits from 0 to 9, with the synaptic weights represented by a 72×10 weight matrix W_FC (i.e., ten 72×1 column vectors W_FC^(l), one for the lth neuron, l ∈ [1, 10]), with the number of comb lines (72) matching the length of the flattened feature map vector X_FC. The shaped optical spectrum at the lth port had an optical power distribution proportional to the weight vector W_FC^(l), thus serving as the equivalent optical input of the lth neuron. After being multicast onto the 72 wavelengths and progressively delayed, the optical signal was weighted and demultiplexed with a single Waveshaper into 10 spatial output ports, each corresponding to a neuron. Since this part of the network involved linear processing, the kernel wavelength weighting could be implemented either before the EO modulation or at a later stage just before photodetection. The advantage of the latter configuration is that both the demultiplexing and weighting can then be achieved with a single Waveshaper. Finally, the different node/neuron outputs were obtained by sampling the 73rd symbol of the convolved results. The final output of the optical CNN was represented by the intensities of the output neurons (Fig. 8), where the highest intensity for each tested image corresponded to the predicted category. The peripheral systems, including signal sampling, nonlinear function and pooling, were implemented electronically with digital signal processing hardware, although some of these functions (e.g., pooling) can in principle be performed in the optical domain with the VCA. Supervised network training was performed offline electronically.
We experimentally tested fifty 8-bit 30×30 resolution images of the handwritten digit dataset with the deep optical CNN. The confusion matrix (Figure 8) shows an accuracy of 88% for the generated predictions, in contrast to 90% for the numerical results calculated on an electrical digital computer. The computing speed of the VCA component of the deep optical CNN was 2×75×11.9 = 1.785 Tera-FLOPS, or 14.3 Terabits/s. For the application to process the image matrices with 5×5 kernels, the convolutional layer had a matrix flattening overhead of 5, yielding an image computing speed of 1.785/5 = 357 Giga-FLOPS. The computing speed of the fully connected layer was 119.8 Giga-FLOPS. The waveform duration was 30×30×84 ps = 75.6 ns for each image, and so the convolutional layer processed images at the rate of 1/75.6 ns = 13.2 million handwritten digit images per second.
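The convolutional-layer figures above can be reproduced with the same arithmetic as for the image accelerator. The Python sketch below uses only values stated in the text (75 comb lines, 11.9 GBaud, 84 ps symbols, overhead factor 5); the variable names are illustrative, and the fully connected layer speed is not recomputed here.

```python
baud = 11.9e9            # symbols per second
n_lines = 75             # three 5x5 kernels
bits = 8
overhead = 5             # vertical stride for 5x5 kernels
tau = 84e-12             # symbol period in seconds
n_pixels = 30 * 30

vca_speed = 2 * n_lines * baud        # vector computing speed
bit_rate = bits * vca_speed
matrix_speed = vca_speed / overhead   # effective image computing speed
duration = n_pixels * tau             # waveform duration per image
images_per_second = 1.0 / duration

print(f"VCA speed    : {vca_speed / 1e12:.3f} Tera-FLOPS")
print(f"bit rate     : {bit_rate / 1e12:.1f} Tb/s")
print(f"matrix speed : {matrix_speed / 1e9:.0f} Giga-FLOPS")
print(f"duration     : {duration * 1e9:.1f} ns")
print(f"image rate   : {images_per_second / 1e6:.1f} million/s")
```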
We note that handwritten digit recognition, although widely employed as a benchmark test in digital hardware, is still (for full 10 digit (0-9) recognition) beyond the capability of existing analog reconfigurable ONNs. Digit recognition requires a large number of physical parallel paths for fully connected networks (e.g., a hidden layer with 10 neurons requires 9000 physical paths), which poses a huge challenge for current nanofabrication techniques. Our CNN represents the first reconfigurable and integrable ONN capable not only of performing high level complex tasks such as full handwritten digit recognition, but of doing so at ultrahigh Tera-FLOP speeds.

Discussion
This approach can be readily scaled in performance in terms of input data size, as well as network size and speed. The data size is limited in practice only by the memory of the electrical digital-to-analog converters, and so in principle it is possible to process 4K-resolution (4096×2160) images. By integrating 100 photonic convolution accelerator layers (still much less than the 65536 processors integrated in the Google TPU [15]), the optical CNN would be capable of solving much more difficult image recognition tasks at a vector computing speed of 100 × 11.3 = 1.130 Peta-FLOPS. Further, the optical CNN presented here supports online training, since the optical spectral shaper used to establish the synapses can be dynamically reconfigured with a response time of < 500 ms, or even faster with integrated optical spectral shapers [69].
Although the current embodiment presented here had a non-trivial optical latency of 0.11 µs introduced by the propagation delay of the dispersive fibre spool, this did not affect the operational speed. Moreover, the latency of the delay function can be virtually eliminated (to < 200 ps) by using integrated highly dispersive devices such as photonic crystals or customized chirped Bragg gratings [70].
Finally, current nanofabrication techniques can enable significantly higher levels of integration of the convolutional accelerator. The micro-comb source itself is based on a CMOS compatible platform that is intrinsically designed for large-scale integration. Other components such as the optical spectral shaper, modulator, dispersive media, de-multiplexer and photodetector have all been realized in integrated (albeit simpler) forms [71][72][73][74][75].

Conclusion
We demonstrate a universal optical convolutional accelerator operating at 11.3 Tera-FLOPS for vector processing, and use a matrix processing version to perform convolutions on face images with 250,000 8-bit resolution pixels. We then use it to sequentially form an optical deep learning CNN to achieve successful recognition of handwritten digit images. Our network is capable of recognizing and processing large-scale data and images at ultra-high computing speeds for real-time massive-data machine learning tasks, such as identifying faces in cameras or pathology identification in clinical scanning applications [76][77][78][79].
In this work we use a particular class of microcomb termed soliton crystals. They were so named because of their crystal-like profile in the angular domain of tightly packed self-localized pulses within micro-ring resonators [30]. They are naturally formed in micro-cavities with appropriate mode crossings, without the need for complex dynamic pumping and stabilization schemes (described by the Lugiato-Lefever equation [22]). They are characterized by distinctive 'fingerprint' optical spectra (Fig. 2f) which arise from spectral interference between the tightly packed solitons circulating along the ring cavity. This category of soliton micro-comb features deterministic soliton formation originating from the mode crossing-induced background wave and the high intra-cavity power (the mode crossing is measured as in Fig. 2c). This in turn enables simple and reliable initiation via adiabatic pump wavelength sweeping [29] that can be achieved with manual detuning (the intracavity power during the pump sweeping is shown in Fig. 2d). The key to the ability to adiabatically sweep the pump lies in the fact that the intra-cavity power is over thirty times higher than single-soliton states (DKS), and very close to that of spatiotemporal chaotic states [22]. Thus, the soliton crystal displays much less detuning or instability resulting from the 'soliton step' that makes resonant pumping of DKS states more challenging [22]. It is this combination of ease of generation and overall conversion efficiency that makes soliton crystals highly suited for demanding applications such as ONNs.
The coherent soliton crystal microcomb (Fig. 2) was generated by optical parametric oscillation in a single integrated micro-ring resonator (MRR). The MRR (Fig. 2b) was fabricated on a CMOS-compatible doped silica platform [22,23], featuring a high Q factor of over 1.5 million and a radius of 592 μm, which corresponds to a low free spectral range of ~48.9 GHz [31]. The pump laser (Yenista Tunics-100S-HP) was boosted by an optical amplifier (Pritel PMFA-37) to initiate the parametric oscillation. The soliton crystal microcomb provided over 90 channels over the telecommunications C-band (1540-1570 nm), offering adiabatically generated low-noise frequency comb lines with a small footprint of < 1 mm² and potentially low power consumption (>100 mW using the technique in [30]).

Evaluation of the computing
Since there are no common standards for classifying and quantifying the computing speed and processing power of ONNs, we explicitly outline the performance definitions that we use in characterizing our performance. We follow the approach that is widely used to evaluate electronic microprocessors. The computing power of the convolution accelerator (closely related to the operation bandwidth) is denoted as the throughput, which is the number of operations performed within a certain period. Considering that in our system the input data and weight vectors originate from different paths and are interleaved in different dimensions (time, wavelength, and space), we use the temporal sequence at the electrical output port to define the throughput in a more straightforward manner.
At the electrical output port, the output waveform has L+R−1 symbols in total (L and R are the lengths of the input data vector and the kernel weight vector, respectively), among which L−R+1 symbols are the convolution results. Further, each output symbol is the calculated outcome of R multiply-and-accumulate operations, or 2R floating point operations, with a symbol duration τ given by that of the input waveform symbols. Thus, considering that L is generally much larger than R in practical convolutional neural networks, the term (L−R+1)/(L+R−1) would not affect the vector computing speed, or throughput, which (in FLOPS) is given by
$$\text{Throughput} = \frac{2R}{\tau}\cdot\frac{L-R+1}{L+R-1} \approx \frac{2R}{\tau}.$$
We note that when processing data in the form of vectors, such as audio speech, the effective computing speed of the accelerator would be the same as the vector computing speed 2R/τ. Yet when processing data in the form of matrices, as for images, we must account for the overhead on the effective computing speed brought about by the matrix-to-vector flattening process. The overhead is directly related to the width of the convolutional kernels. For example, with 3-by-3 kernels the effective computing speed would be ~(1/3)·2R/τ, which we note is still in the ultrafast (Tera-FLOP) regime due to the high parallelism brought about by the time-wavelength interleaving technique.
For the accelerator, the output waveform of each kernel (with a length of L−R+1 = 250,000−9+1 = 249,992) contains 166×498 = 82,668 useful symbols that are sampled out to form the feature map, while the rest of the symbols are discarded. As such, the effective matrix convolution speed for the experimentally performed task is slower than the vector computing speed of the convolution accelerator by the overhead factor of 3, and so the net speed then becomes 11.321×82,668/249,992 = 11.321×33.07% = 3.7437 Tera-FLOPS.
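The useful-symbol count and the resulting net speed can be checked numerically. In the Python sketch below the variable names are illustrative, and the 166×498 feature-map size is derived by assuming a vertical stride of 3 and a horizontal stride of 1 on a 500×500 image.

```python
rows, cols, k = 500, 500, 3
L = rows * cols                  # input symbols
R = k * k                        # kernel weights
vector_speed = 11.321e12         # FLOPS, as quoted in the text

valid = L - R + 1                # convolution results in the output waveform
fmap_rows = rows // k            # vertical stride of k
fmap_cols = cols - k + 1         # horizontal stride of 1
useful = fmap_rows * fmap_cols   # symbols kept for the feature map

fraction = useful / valid
net_speed = vector_speed * fraction

print(valid, fmap_rows, fmap_cols, useful)
print(f"useful fraction = {fraction * 100:.2f} %")
print(f"net speed       = {net_speed / 1e12:.4f} Tera-FLOPS")
```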
In addition, the intensity resolution (i.e., the bit-resolution for digital systems) for analog ONNs is mainly limited by the signal-to-noise ratio (SNR). To achieve 8-bit resolution, the SNR of the system needs to reach over 20·log10(2^8) = 48 dB. This is within the capability of our accelerator, and so our system speed in Terabits/s is simply our speed in FLOPS times 8, i.e., not reduced by our OSNR.
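The SNR requirement follows from the expression above. The small Python helper below (the name `required_snr_db` is illustrative) evaluates it for any bit resolution.

```python
import math

def required_snr_db(bits):
    """SNR in dB needed to resolve 2**bits intensity levels: 20*log10(2**bits)."""
    return 20 * math.log10(2 ** bits)

snr_8bit = required_snr_db(8)
print(f"{snr_8bit:.2f} dB")   # about 6.02 dB per bit
```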

Experiment
To achieve the designed kernel weights, the generated microcomb was shaped in power using liquid crystal on silicon based spectral shapers (Finisar WaveShaper 4000S). We used two WaveShapers in the experiments: the first was used to flatten the microcomb spectrum, while the precise comb power shaping required to imprint the kernel weights was performed by the second, located just before the photodetection. A feedback loop was employed to improve the accuracy of comb shaping, where the error signal was generated by first measuring the impulse response of the system with a Gaussian pulse input and comparing it with the ideal channel weights (Figures S6 and S7 show the shaped impulse response for the convolutional layer and the fully connected layer of the CNN).
The electrical input data was temporally encoded by an arbitrary waveform generator (Keysight M8195A) and then multicast onto the wavelength channels via a 40 GHz intensity modulator (iXblue). For the 500×500 image processing, we used sample points at a rate of 62.9 Giga Samples/s to form the input symbols. We then employed a 2.2 km length of dispersive fibre that provided a progressive delay of 15.9 ps/channel, precisely matched to the input baud rate. For the convolutional layer of the CNN, we used 5 sample points at 59.421642 Giga Samples/s to form each single symbol of the input waveform, which also matched the progressive time delay (84 ps) of the 13 km dispersive fibre (the generated electronic waveforms for 50 images are shown in Figs. S8 and S9, which served as the electrical input signal for the convolutional and fully connected layers, respectively).
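The relationship between the AWG sample rate, the number of samples per symbol and the symbol period for the CNN experiment can be verified as follows. This Python sketch is illustrative: holding each symbol constant over its 5 samples is an assumption about the waveform shape, and `to_waveform` is a hypothetical helper.

```python
import numpy as np

sample_rate = 59.421642e9        # AWG samples per second
samples_per_symbol = 5

tau = samples_per_symbol / sample_rate   # symbol period, to match the fibre delay step
baud = 1.0 / tau

def to_waveform(symbols, n=samples_per_symbol):
    """Hold each symbol for n consecutive samples (assumed pulse shape)."""
    return np.repeat(np.asarray(symbols, dtype=float), n)

wave = to_waveform([0.2, 0.9, 0.5])
print(f"tau  = {tau * 1e12:.2f} ps")
print(f"baud = {baud / 1e9:.3f} GBaud")
print(len(wave))
```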
For the convolutional accelerator in both experiments (the 500×500 image processing experiment and the convolutional layer of the CNN), the second Waveshaper simultaneously shaped and de-multiplexed the wavelength channels into separate spatial ports according to the configuration of the convolutional kernels. As for the fully connected layer, the second Waveshaper simultaneously performed the shaping and power splitting (instead of de-multiplexing) for the ten output neurons. Here, we note that the de-multiplexed or power-split spatial ports were sequentially detected and measured. However, these two functions could readily be achieved in parallel with a commercially available 20-port optical spectral shaper (WaveShaper 16000S, Finisar) and multiple photodetectors.
The negative channel weights were achieved using two methods. For the 500×500 image processing experiment and the convolutional layer of the CNN, the wavelength channels of each kernel were separated into two spatial outputs by the WaveShaper according to the signs of the kernel weights, and then detected by a balanced photodetector (Finisar XPDV2020). Conversely, for the fully connected layer the weights were encoded in the symbols of the input electrical waveform during the electrical digital processing stage. Incidentally, we demonstrate the possibility of using different methods to impart negative weights, both of which worked in the experiments.
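The balanced-detection method for signed weights can be modelled numerically. In the idealized Python sketch below (the function name `balanced_convolve` is illustrative, and noise and loss are ignored), channels with positive and negative weights are summed on two separate ports using non-negative optical powers, and the balanced photodetector is modelled as the difference of the two sums.

```python
import numpy as np

def balanced_convolve(x, w):
    """Signed-weight convolution using two non-negative optical weight sets."""
    w_pos = np.clip(w, 0, None)     # powers of channels routed to the '+' port
    w_neg = np.clip(-w, 0, None)    # powers of channels routed to the '-' port
    i_pos = np.convolve(x, w_pos)   # photocurrent of the '+' port
    i_neg = np.convolve(x, w_neg)   # photocurrent of the '-' port
    return i_pos - i_neg            # balanced detection subtracts the two

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0, -1.0])      # an edge-detection-like kernel (illustrative)
y = balanced_convolve(x, w)
print(np.allclose(y, np.convolve(x, w)))
```

Because convolution is linear in the weights, the difference of the two port outputs equals the convolution with the signed kernel.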
Finally, the electrical output waveform was sampled and digitized by a high-speed oscilloscope (Keysight DSOZ504A, 80 Giga Symbols/s) to extract the final convolved output.
In the CNN, the extracted outputs of the convolution accelerator were further processed digitally, including rescaling to exclude the loss of the photonic link via a reference bit, and then mapped onto a certain range using a nonlinear tanh function.The pooling layer's functions were also implemented digitally, following the algorithm introduced in the network model.
The residual discrepancy or inaccuracy in our work for both the recognition and convolving functions, as compared to the numerical calculations, was due to the deterioration of the input waveform caused by intrinsic limitations in the performance of the electrical arbitrary waveform generator.Addressing this would readily lead to a higher degree of accuracy (i.e., closer agreement with the numerical calculations).

Figures
The architecture of the optical CNN, including a convolutional layer, a pooling layer, and a fully connected layer.
Experimental schematic of the optical CNN. The left side is the input front-end convolutional accelerator while the right side is the fully connected layer, both of which form the deep learning optical CNN. The microcomb source supplies the wavelengths for both the tera-FLOPS photonic convolution accelerator as well as the fully connected layer systems. The electronic digital signal processing (DSP) module used for sampling and pooling etc. is external to this structure.
