Design of optical neural networks with component imprecisions

For the benefit of designing scalable, fault resistant optical neural networks (ONNs), we investigate the effects architectural designs have on the ONNs' robustness to imprecise components. We train two ONNs -- one with a more tunable design (GridNet) and one with better fault tolerance (FFTNet) -- to classify handwritten digits. When simulated without any imperfections, GridNet yields a better accuracy (~98%) than FFTNet (~95%). However, under a small amount of error in their photonic components, the more fault tolerant FFTNet overtakes GridNet. We further provide thorough quantitative and qualitative analyses of ONNs' sensitivity to varying levels and types of imprecisions. Our results offer guidelines for the principled design of fault-tolerant ONNs as well as a foundation for further research.


Introduction
Motivated by the increasing capability of artificial neural networks in solving a large class of problems, optical neural networks (ONNs) have been suggested as a low power, low latency alternative to digitally implemented neural networks. A diverse set of designs have been proposed, including Hopfield networks with LED arrays [1], optoelectronic implementation of reservoir computing [2,3], spiking recurrent networks with microring resonators [4,5], convolutional networks through diffractive optics [6], and fully connected, feedforward networks using Mach-Zehnder interferometers (MZIs) [7].
We will focus on the last class of neural networks, which consist of alternating layers of modules performing linear operations and element-wise nonlinearities [8]. The N -dimensional complex-valued inputs to this network are represented as coherent optical signals on N single-mode waveguides. Recent research into configurable linear optical networks [9,10,11,12,13] enables the efficient implementation of linear operations with photonic devices. These linear multipliers, layered with optical nonlinearities form the basis of the physical design of ONNs. In Sec. 2, we provide a detailed description of two specific architectures -GridNet and FFTNet -both built from MZIs.
While linear operations are made much more efficient with ONNs in both power and speed, a major challenge to the utility of ONNs lies in their susceptibility to fabrication errors and other types of imprecisions in their photonic components. Therefore, realistic considerations of ONNs require that these imprecisions be taken into account. Previous analyses of the effects of fabrication errors on photonic networks were in the context of post-fabrication optimization of unitary networks [14,15,16]. Our study differs in three main areas.
First, In the previous work, unitary optical networks were optimized to simulate randomly sampled unitary matrices. We, instead, train optical neural networks to classify structured data. ONNs, in addition to unitary optical multipliers, include nonlinearities, which add to its complexity.

Physical design of optical neural networks
The ONN consists of multiple layers of programmable optical linear multipliers with intervening optical nonlinearities (Fig. 2). The linear multipliers are implemented with two unitary multipliers and a diagonal layer in the manner of a singular-value decomposition (SVD). These are, in turn, comprised of arrays of configurable MZIs, which each consist of two phaseshifters and two beamsplitters ( Fig. 1(a)).
Complex-valued N −dimensional input vectors are encoded as coherent signals on N waveguides. Unitary mixing between the channels is effected by MZIs and forms the basis of computation for ONNs. A single MZI consists of two beamsplitters and two phaseshifters (PS) ( Fig. 1(a) inset). While the fixed 50:50 beamsplitters are not configurable, the two phaseshifters, parameterized by θ and φ, are to be learned during training. Each MZI is characterized by the following transfer matrix (see App. A for details): U M Z (θ, φ) = U BS U P S (θ)U BS U P S (φ) = ie iθ/2 e iφ sin θ (1) Early work has shown that universal optical unitary multipliers can be built with a triangular mesh of MZIs [9]. These multipliers enabled the implementation of arbitrary unitary operations and were incorporated into the ONN design by Shen et al. [7]. Its asymmetry prompted the development of a symmetric gridlike network with more balanced loss [10]. By relaxing the requirement on universality, a more compact design, inspired by the Cooley-Tukey FFT algorithm [20], has been proposed [11]. It can be shown that FFT transforms, and therefore convolutions, can be achieved with specific phase configurations (see appendix H). We allow the phase configurations to be learned for implementation of a greater class of transformations.
In this section, we focus on the last two designs, referring to them as GridUnitary ( Fig. 1(a)) and FFTUnitary ( Fig. 1(b)), respectively. GridUnitary can implement unitary matrices directly by setting the phaseshifters using an algorithm by Clements et al. [10]. Despite being non-universal and lacking a decomposition algorithm, FFTUnitary can be used to reduce the depth of the unitary multipliers from N to log 2 (N ). Reducing the number of MZIs leads to lower overall noise and loss in the network. However, due to the FFT-like design, waveguide crossings are necessary. To overcome this challenge, low-loss crossings [21] or 3D layered waveguides [22,23] could be utilized.
MZIs can also be used to attenuate each channel separately without mixing. This way, a diagonal multiplier can be built. Because signals can only be attenuated by MZIs, subsequent global optical amplification [24] is needed to emulate arbitrary diagonal matrices. Through SVD, a universal linear multiplier can be created from two unitary multipliers and a diagonal multiplier ( Fig. 1(a)). Formally, a linear transformation represented by matrix M can be decomposed as Here both U and V † are unitary transfer matrices of GridUnitary multipliers while Σ represents a diagonal layer with eigenvalues no greater than one. β is a compensating scaling factor. Along with linear multipliers, nonlinear layers are required for artificial neural networks. In fact, the presence of nonlinearties sets the study of ONNs apart from earlier research in linear photonic networks [25]. One possible implementation is by saturable absorbers such as monolayer graphene [26]. This is has the advantage of being easily approximated with a Softplus function (see Sec. 3 for details on implementation). However, it has been demonstrated that Softplus underperforms, in many regards, when compared to rectified linear units (ReLU) [27]. Indeed, a complex extension of ReLU, ModReLU, has been proposed [28]. While it is physically unrealistic to implement ModReLU, the nonoptimality of Softplus functions still motivates the exploration of other optical nonlinearities, such as optical bistability in microring resonators [29], and two-photon absorption [30,31] as alternatives.

Neural network architecture and software implementation
We considered a standard deep learning task of MNIST handwritten digit classification [32]. Fully connected feedforward networks with two hidden layers of 256 complex-valued neurons each were implemented with GridNet and FFTNet architectures (Fig. 2) and simulated in PyTorch [33]. The 28 2 = 784 dimensional real-valued input was converted into 392 = 784/2 dimensional complex-valued vectors by taking the top and bottom half of the image as the real and imaginary part. This was done to ensure the data is distributed evenly throughout the complex plane rather than just along the real number line. Each network consists of linear multipliers followed by nonlinearities. The linear layers of GridNet and FFTNet were described in the previous section and illustrated in Fig. 1. The response curve of the saturable absorption is approximated by the Softplus function [34] (App. C), a commonly used nonlinearity available in most deep learning libraries such as PyTorch. The nonlinearity is applied to the modulus of the complex numbers. A modulus squared nonlinearity modeling an intensity measurement is then applied. The final SoftMax layer allows the (now real) output to be interpreted as a probability distribution. A cross-entropy [35] loss function is used to evaluate the output distribution against the ground truth.
An efficient implementation of GridNet requires representing matrix-vector multiplications as elementwise vector multiplications [36]. Nevertheless, training the phaseshifters directly was still time consuming. Instead, a complex-valued neural network [37] was first trained. An SVD (Eq. (2)) was then performed on each complex matrix. Finally, phaseshifters were set to produce the unitary (U, V † ) and diagonal (Σ) multipliers through a decomposition scheme by Clements et al. [10].
However, note that SVD is ambiguous up to permutations (Π) of the singular values and the columns of U and V .
Conventionally, the ambiguity is resolved through ordering the singular values from largest to smallest. In Sec. 4.3 we show that randomizing the singular values increases the error tolerance of GridNet. FFTNet is trained directly and its singular values are naturally unordered. For a fair comparison, we randomly permute the singular values of GridNet. After 10 training epochs with standard stochastic gradient descent [38], classification accuracies of 97.8% (GridNet) and 94.8% (FFTNet) were achieved. Better accuracies can be achieved through convolutional layers [39], Dropout regularization [40], better training methods, etc. However, we omitted these in order to focus purely on the effects of architecture.
The networks were trained assuming ideal components represented with double-precision floating point values. Under realistic conditions, due to imprecision in fabrication, calibration, etc., the realizable accuracy could be much lower. During inference, we modeled these imprecisions by adding independent zero-mean Gaussian noise of standard deviation σ P S and σ BS to the phases (θ, φ) of the phaseshifters and the transmittance T of the beamsplitters, respectively. Reasonable values for such imprecisions can be taken to be approximately σ P S ≈ 0.01rad and σ BS ≈ 1% = 0.01 [41,42]. Note that the dynamical variation due to laser phase noise can be modeled by σ P S as well. However, we show in App. B that typical values would be well below 0.01 rad.

Degradation of network accuracy
Visualizing the degradation of ONN outputs, FFTNet is seen to be much more robust than GridNet. Identical input is fed through GridNet (a, b) and FFTNet (c, d), simulated with ideal components (a, c) and imprecise components (b, d) with σ BS = 0.01 and σ P S = 0.01rad. Imprecise networks are simulated 100 times and their mean output is represented by bar plots. Error bars represent the 20th to 80th percentile range.
To investigate the degradation of the networks due to imprecisions, we started by simulating 100 instances of imprecise networks with σ BS = 1% and σ P S = 0.01rad. Identical inputs of a digit "4" (Fig. 3(a) inset) are fed through each network. The mean and spread of the output of the ensemble is plotted and compared against the output from the ideal network (Fig. 3).
The degradation of classification output is significant for GridNet. Without imprecisions in the photonic components, the digit is correctly classified with near 100% confidence ( Fig. 3(a)). When imprecisions are simulated, we see a large decrease in classification confidence ( Fig. 3(b)). In particular, the image is often misclassified when the prediction probability for class "9" is greater than that for class "4". Repeating these experiments on FFTNet demonstrated that they were much more resistant to imprecisions ( Fig. 3(c), 3(d)). In Appendix D, we show confusion matrices of both networks with increasing error to further support this conclusion.
Evaluating the two networks on overall classification accuracy confirms the superior robustness to imprecisions of FFTNet. GridNet and FFTNet were tested at levels of imprecisions with of imprecisions with σ P S /rad and σ BS ranging from 0 to 0.02 with a step size of 0.001. At each level of imprecision, 20, instances of each network were created and tested. The mean accuracies are plotted in Fig. 4(a), 4(b). A direct comparison between the two networks along the diagonal (i.e., σ P S = σ BS cut line, taking 1% = 0.01 rad) is shown in Fig. 4(c).
Starting at roughly 98% with ideal components, the accuracy of GridNet rapidly drops with increasing σ P S and σ BS . By comparison, very little change in accuracy is seen for FFTNet despite starting with a lower ideal accuracy. Also of note are the qualitatively different levels of sensitivity of the different components to imprecision. In particular, FFTNet is much more resistant to phaseshifter error compared to beamsplitter error.
The experiments described in this section confirm the significant effect component imprecisions have on the overall performance of ONNs, as well as the importance of architecture in determining the network's robustness of the network to these imprecisions. Despite having a better classification accuracy in the absence of imprecisions, GridNet is surpassed by FFTNet when a small amount of error (σ P S = 0.01, σ BS = 1%rad) is present. In Appendix E, we demonstrate that FFTNet is also more robust to quantization error that GridNet.

Stacked FFTUnitary and truncated GridUnitary
One obvious reason why FFTNet would be more robust than GridNet is its much lower number of MZI layers. Their respective, constituent unitary multipliers, FFTUnitary and GridUnitary contains log 2 (N ) and N layers respectively. For N = 2 8 = 256, GridUnitary is 32 times deeper than FFTUnitary which contains only 8 layers.
To demonstrate that FFTUnitary is more robust due architectural reasons beyond its shallow depth, in this section, we introduce two unitary multipliers -StackedFFT ( Fig. 5(a)) and TruncGrid ( Fig. 5(b)).   Unitary multipliers by themselves are not ONNs and cannot be trained for classification tasks. Instead, after introducing imprecisions to the each multiplier, we evaluated the fidelity F (U 0 , U ) between the original, error-free transfer matrix U 0 and the imprecise transfer matrix U . The fidelity, a measure of "closeness" between two unitary matrices, is defined as [43] F Ranging from 0 to 1, F (U 0 , U ) = 1 only when U = U 0 . Using this metric of fidelity, we show that StackedFFT is more robust to error than GridUnitary ( Fig. 6(a)) and TruncGrid more than FFTUnitary (Fig. 6(b)). Both comparisons are between multipliers with the same number of MZI layers. Yet, the FFT-like architectures are still more robust to their grid-like counterparts.
One possible explanation could be the better mixing facilitated by FFTUnitary. GridUnitary and thus TruncGrid, at each MZI layer, only mixes neighboring waveguides. After P layers, each waveguide is connected to, at most, to its 2P nearest neighbors. In comparison, after P layers, FFTUnitary connects N = 2 P . Here, we have compared the robustness of different unitary multipliers in isolation. We stress that the overall robustness of neural networks is a much more complex and involved problem. A rough understanding can be formulated as follows. A trained neural network defines a decision boundary throughout the input space. Introduction of errors perturbs the decision boundary which can lead to misclassification. To reduce this effect, we can make the decision boundary of ONNs more robust to errors. However, it is also important to consider the robustness of misclassification due to perturbations of decision boundaries. Indeed, it has been shown that robustness of neural networks are dependent on the geometry of the boundary [44].
A complete analysis of the robustness of neural networks to various forms of perturbations is outside the scope of this paper. Nonetheless, it is important to understand the dependence of ONNs on both architectural and algorithmic design.  Fig. 1(a)). The transmissivity of each waveguide through the diagonal layer Σ 2 is also plotted (center panel).

Localized imprecisions
To better understand the degradation of network accuracy, we mapped out the sensitivity of GridNet to specific groups of MZIs. A relatively large amount of imprecision (σ P S = 0.1rad) was introduced to 8 × 8 blocks of MZIs in layer 2 (Fig. 2) of an otherwise error-free GridNet. The resulting change in classification accuracy is plotted as a function of the position of the MZI block (Fig. 7). We see no strong correlation between the change in accuracy and the spatial location of the introduced error. In fact, error in many locations led to small increases in accuracy, suggesting that much of the effect is due to chance.
This result seems to contradict previous studies on the spatial tolerance of MZIs in a GridUnitary multiplier [14,15,16]. It was discovered that the central MZIs of the multiplier had a much lower tolerance than those near the edges. When learning randomly sampled unitary matrices, the central MZIs needed to have phase shift values very close to 0 (π, following the convention used in this paper). This would only be achievable with MZIs with extremely high extinction ratios and thus low fabrication error.
Empirically, this distribution of phases was observed in GridUnitary multipliers of trained ONNs (See app. F). However, the idea of tolerance of a MZI to beamsplitter fabrication imprecision, while related, is not the same as the network sensitivity to localized imprecisions. To elaborate, tolerance is implicitly defined, in references [14,15,16], as roughly the allowable beamsplitter imperfection (deviation from 50:50) that still permits post-fabrication optimization of phaseshifter towards arbitrary unitary matrices. In our prefabrication optimization approach, we take sensitivity to be the deviation from ideal classification accuracy when imprecision is introduced to the MZI with no further reconfiguration. See App. G for this difference further illustrated by experiments with another architecture.  Fig. 7, except GridNet has its singular values ordered. Therefore, the transmissivity is also ordered (center panel).
Recall that the singular values Σ of the GridNet's linear layers could be permuted together with columns and rows of U and V † respectively without changing the final transfer matrix (Eq. (3)). The singular values were randomized to provide a fair comparison with FFTNet. We then performed the same experiment on GridNet where the singular values of each layer were not randomized but ordered from largest to smallest. Therefore, the transmissivity T = | sin(θ/2)| 2 of the diagonal multiplier Σ is also ordered (Fig. 8). In this case, there is a significant, visible pattern because most of the signal travels through the top few waveguides of Σ 2 due to the ordering of transmissivities. Only MZIs connected to those waveguides have a strong effect on the network. In fact, the network is especially sensitive to imprecisions in MZIs closest to this bottleneck (Fig. 8, top-right of V † 2 and top-left of U 2 ). It is important to note that this bottleneck only exist due to the locality of connections in GridNet where only neighboring waveguides are connected by MZIs. In FFTNet, due to crossing waveguides, no such locality exist.  9. The degradation of accuracies with increasing σ P S = σ BS compared between two GridNets one with ordered and another with randomized (but fixed) singular values.
In addition to, and likely due to the spatial non-uniformity in error sensitivity, GridNet with ordered singular values is more susceptible to uniform imprecisions (Fig. 9). The same GridNet architecture, could be made more resistant by shuffling its singular values. This difference between two identical architectures implementing identical linear and non-linear transformations demonstrates that the resistance to error in ONNs is effected by more than architecture.

Conclusion
Having argued that pre-fabrication, software optimization of ONNs is much more scalable than postfabrication, on-chip optimization, we compared two types of networks-GridNet and FFTNet in their robustness to error. These two networks were selected to showcase the trade-off between expressivity and robustness. We demonstrated in Sec. 4.1 that the output of GridNet is much more sensitive to errors than FFTNet. We have illustrated the robustness of FFTNet by a providing a thorough evaluation of both networks operating with imprecisions ranging between 0 ≤ σ BS , σ P S ≤ 0.02. With ideal accuracies of 97.8% and 94.8% for GridNet and FFTNet respectively, GridNet accuracy dropped rapidly to below 50% while FFTNet maintained near constant performance. Under conservative assumptions of errors associated with the beamsplitter (σ BS > 1%) and phaseshifter (σ P S > 0.01 rad), a more robust network (FFTNet) can be favorable over one with greater expressivity (GridNet).
We then demonstrate, in Sec. 4.2, through modified unitary multipliers, TruncGrid and StackedFFT, that controlling for MZI layer depth, FFT-like designs are inherently more robust than grid-like ones.
To gain a better understanding of GridNet's sensitivity to imprecision, in Sec. 4.3, we probed the response of the network to localized imprecisions by introducing error to small groups of MZIs at various locations. The sensitivity to imprecisions was found to be less affected by the MZIs' physical position within the grid and more so by the flow of the optical signal. We then demonstrated that beyond architectural designs, small procedural changes to the configuration of an ONN, such as shuffling the singular values, can change affect the its robustness.
Our results, presented in this paper, provide clear guidelines for the architectural design of efficient, fault-resistant ONNs. In looking forward, it would be important to investigate algorithmic and training strategies as well. A central problem in deep learning is to design neural networks complex enough to model the data while being regularized to prevent over-fitting of noise in the training set [8]. To this end, a wide variety of regularization techniques such as Dropout [40], Dropconnect [45], data augmentation, etc. have been developed. This problem parallels the trade-off between an ONN's expressivity and its robustness to imprecisions presented here. Indeed, an important conclusion in Sec. 4.3 is that in addition to architecture, even minor changes in the configuration of ONNs also have a great effect on the network's robustness to faulty components.
The robustness of neural networks to perturbations [44] is a well studied and open problem that is outside of the scope of this article on architectural design. Nevertheless, a complete analysis of ONNs with imprecise components requires an understanding of robustness due to architectural design as well as due to software training, possibly under a unifying framework. A natural direction for further exploration is to consider analogies to regularization in the context of imprecise photonic components and to focus on the development of algorithms and training strategies for error-resistant optical neural networks.

Code repository and results
The code repository, results and scripts used to generate figures in this paper are freely available at https://github.com/mike-fang/imprecise optical neural network where t ≡ √ 1 − r 2 and With the construction of PS-BS-PS-BS ( Fig. 1(a), inset), the MZI transfer matrix is the following matrix product: Assuming that the beamsplitter ratios are 50:50, we can take and therefore, In our convention, the transmission and reflection coefficient is respectively. In particular, the MZI is in the bar state (T = 0) when θ = π and in the cross state (T = 1) when θ = 0. However, in other conventions, the beamsplitter is often taken to be the Hardamard gate.
Note in this convention the internal phase shift is now θ + π and thus the bar and cross states are now at θ = 0 and θ = π respectively.
Here, τ is the time of integration and δf the linewidth of the laser. For an order or magnitude calculation, we ignore the refractive index and take τ = L/c where L is the distance between two subsequent phaseshifters on an MZI. Again, as an order of magnitude estimate, we take L = 100µm = 10 −4 m and thus τ ≈ 3 × 10 −13 . We wish to solve for the linewidth required for σ φ = 0.01rad: A linewidth of 50 MHz is easily achieved by modern lasers. For example, Bragg reflector lasers have been shown to achieve a linewidth of 300 kHz [47]. Thus, the contribution to phase noise from the laser is roughly two orders of magnitude smaller than that from MZIs.  Saturable absorption can be modeled by the relation [48] u 0 = 1 2

C Approximating saturable absorption
where T = u/u 0 and u = στ s I and u 0 = στ s I 0 . I 0 , I are the incidental and transmitted intensities, respectively. The above equation can be solved to be Where W is the product log function or Lambert W function. However, since W is not readily available in most deep learning libraries and difficult to implement, we wish to approximate the above by the shifted and biased Softplus non-linearity of the form The bias of −β −1 log(1 + e −βu0 ) was chosen to ensure that σ(0) = f (0) = 0. We now choose β and u 0 to ensure that Requiring that it equals to f (0) = T 0 allows us to solve for Next, in the large u limit, the biased Softplus converges to Solving for equality with f (u) → u + 1 2 log T 0 gives Going back to Eq. (26), we obtain Fig. 10 plots the saturable absorption response curve compared to the Softplus approximation derived above.

D Confusion matrices
To investigate the degradation of the networks due to imprecisions, we produce confusion matrices for both networks in the ideal case, with no imprecisions, and with different levels of error. σ BS = 1%, σ P S = 0.01rad and σ BS = 2%, σ P S = 0.02rad (Fig. 11). The imprecisions were simulated 10 times and the mean of the output was used in generating the confusion matrices.

E Quantization error
In this section, we explore the quantization error introduced by thermo-optic phaseshifters. Assuming a linear relationship between refractive index and temperature and quadratic relationship between temperature and voltage, we have We have taken V 2π to be the voltage required for a 2π phaseshift and defined the dimensionless voltage u = V /V 2π . Assuming that the voltage can be set with B-bit precisions, u must take on values of The quantization procedure then takes To evaluate the sensitivity to quantization, we quantized GridNet and FFTNet with varying levels of precision. Since quantization is deterministic, we trained 10 instances of both networks with randomized initialization and thus different configuration but similar ideal accuracies (∼ 95% and ∼ 98%). The networks were then quantized at varying levels -from 4 to 10 bits. Their classification accuracy at each level is shown in Fig. 12. Similar to results with simulated Gaussian noise, FFTNet is more robust than GridNet. Note that in this case, the quantization was applied after training has finished. Neural networks in which quantization happens as part of the training procedure has been demonstrated to have accuracies very near their full precision counterpart, down to even binary weights [49,50]. Analyses has been done on the distribution of the internal phase shift (θ) of MZIs of GridUnitary multipliers when used to implement randomly sampled unitary matrices [15,16,14]. It was shown that the phases are not uniformly distributed spatially. To be more concrete, We denote d the waveguide number and l the layer number (see Fig. 1(a)). The distribution of the MZI reflectivity (r = sin(θ/2)) is [15] r d,l ∼ Beta(1, β d,l ).
For large dimensions N , β decreases from N at the center of the grid layout to 0 at the edge of the grid. For large β (i.e. near the center), the mean and variance of r d,l are approximately Consequently, the reflectivity, and therefore the internal phases, of MZIs near the center of Gird Unitary multipliers are distributed very close to 0, with low variance. This effect is magnified with larger dimensions N . This result was derived with the assumption of Haar-random unitary matrices. Such a distribution is not guaranteed and not expected for layers of trained neural networks. (Fig. 13(a)) shows the spatial distribution of phases in the GridUnitary multiplier U 2 (see Fig. 2). While the empirical histogram ( Fig. 13(b)) does not match the theoretical distribution (Eq. (33)), the general trend of lower variance near the center of GridUnitary multipliers is evident. This is claimed to translates to a lower tolerance for error [14].
A similar analysis was conducted for FFTNet. Immediately we notice that the distribution of phase shifts is mostly uniform across the MZIs (Fig. 14(a)). This can be attributed to the non-local connectivity of FFTUnitary multipliers. Histograms constructed from an ensemble of 100 trained FFTNets with random initial weights (Fig. 14(b)) confirms this observation. The histogram for the region near the center (red) is nearly identical to the top (green).
We reiterate the distinction, made in Section 4.3, between pre-fabrication error tolerance and the sensitivity of error introduced post-fabrication. Pertinent to the first concept is how well the network can be optimized after a known set of imperfections are introduced to the network. The latter concept, which is relevant for our discussion, describes the sensitivity of the network with no further reconfiguration to unknown errors. In contrast to pre-fabrication error tolerance, our analysis in 4.3 does not show significant spatial dependence for post-fabrication error sensitivity.

G BlockFFTNet
We introduce a network with similar depth as GridUnitary but with non-local, crossing waveguides in between as those seen in FFTUnitary ( Fig. 15(a)). This is similar to the coarse-grained rectangular design mesh in [14] which was motivated to produce a spatially uniform distribution of phase and thus better tolerance for post-fabrication optimization. We also empirically observe that when incorporated as part of a ONN (BlockFFTNet), the distribution of phases are also uniformly distributed ( Fig. 15(b)). We directly demonstrate that better tolerance for post-fabrication optimization does not directly to better error-resistance for a network optimized pre-fabrication. The accuracy loss due to increasing imprecision is shown in Fig.  16.

H FFT algorithm and convolution
We show that the actual Cooley-Tukey FFT algorithm can be implemented with appropriate configurations of the phases of FFTUnitary multiplier.
If we denote the input as x n ∈ C N , its Fourier transform is x n e − 2πi N nk .
The FFT algorithm, in short, is to rewrite the above as Here, we have defined O k and E k to be the Fourier transform on the odd and even elements of x n respectively. The calculation of E k and O k are done recursively. For N = 2 K , a total of K iterations are needed. It is well known that if x n is in bit-reversed order, the calculations can be done in place. Furthermore, in matrix form, From Eq. (1), we note that U k = U M Z (θ = π/2, φ = 2πk/N ), up to some global phase. Therefore, if x n is in bit-reversed order, and passed through a FFTUnitary multiplier where the kth layer is configured with θ = π/2, φ = 2πk/N , FFT can be performed.
Going further, a convolution can be easily performed through multiplication of the Fourier transformed signal by the Fourier transformed convolutional kernel, followed by a inverse Fourier transform.