Neural Schrödinger Equation: Physical Law as Neural Network

We present a new family of neural networks based on the Schrödinger equation (SE-NET). In this analogy, the trainable weights of the neural network correspond to the physical quantities of the Schrödinger equation. These physical quantities can be trained using the complex-valued adjoint method. Since the propagation of the SE-NET is described by the evolution of physical systems, its outputs can be computed with a physical solver. As a demonstration, we implemented the SE-NET using the finite difference method. The trained network is transferable to actual optical systems. Based on this concept, we show a numerical demonstration of end-to-end machine learning with an optical frontend. Our results extend the application field of machine learning to hybrid physical-digital optimizations.

structure is physically transferable. We can train the physical structure as a black box on a machine learning (ML) platform such as PyTorch. A physical neural network provides the chance to access such natural data.
End-to-end hybrid physical-digital machine learning: In general, much information is lost during measurement: e.g., camera images lose depth (phase), polarization, and wavelength information. By using our framework, we can jointly optimize the physical structure and digital DNNs under the same cost function. All gradients can be computed by the chain rule and the adjoint method. This enables end-to-end ML with a physical structure that manipulates the natural information. As an example, we demonstrate a data-driven end-to-end design of an ultracompact optical spectrometer.
In the following section, we demonstrate the above-mentioned features of the SE-NET. Note that our discussion basically assumes classical (not quantum) systems. Thus, the discussion does not pertain to quantum ML systems, but there are several similarities between our learning scheme and quantum systems as described in Section 6.

Theory
Neural ODEs: First, we describe the continuous representation of neural networks. The hidden state x(L) ∈ R of a ResNet is represented by x(L + 1) = x(L) + f[x(L), θ], where L ∈ N denotes the layer number, f denotes a nonlinear function, and θ ∈ R is a trainable weight. It has been shown that a residual block can be interpreted as a discrete approximation of an ODE with the discretization step set to one. When the discretization step approaches zero, it yields a family of neural networks called neural ODEs [9]. A block diagram of the processing in an ODE-Net is shown in Fig. 1(b). Formally, in a neural ODE, the relation between the input and output is characterized by

dx(t)/dt = f[x(t), θ(t), t]. (1)

From the given input x(t_0), the output x(t_1) can be computed by solving the ODE in (1). The backward propagation of neural ODE-Nets can also be computed with a standard ODE solver by considering their adjoint.
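As a minimal illustration (not the paper's code; the toy layer function f, its tanh nonlinearity, and all parameter values below are assumptions of this sketch), a neural ODE can be integrated with the forward Euler method, where each step is exactly a residual update x ← x + dt·f(x, θ):

```python
import math

# Hypothetical toy "layer": a 2x2 linear map followed by tanh.
def f(x, theta):
    y0 = math.tanh(theta[0] * x[0] + theta[1] * x[1])
    y1 = math.tanh(theta[2] * x[0] + theta[3] * x[1])
    return [y0, y1]

def odeint_euler(x0, theta, t0=0.0, t1=1.0, steps=100):
    """Solve dx/dt = f(x, theta) from t0 to t1; each Euler step is a residual block."""
    dt = (t1 - t0) / steps
    x = list(x0)
    for _ in range(steps):
        dx = f(x, theta)
        x = [x[i] + dt * dx[i] for i in range(len(x))]  # x <- x + dt * f(x)
    return x
```

With steps = 1 and dt = 1, the update reduces to the ResNet block x(L + 1) = x(L) + f[x(L), θ]; increasing steps approaches the continuous ODE solution.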
Neural Schrödinger equation: Next, we consider the time-dependent one-dimensional Schrödinger equation in the neural ODE framework. A block diagram of the processing in the SE-NET is shown in Fig. 1(c). The Schrödinger equation is a complex-valued partial differential equation (first order in time and second order in space) described as

jℏ ∂ψ(x, t)/∂t = Hψ(x, t), (2)
where ψ(x, t) ∈ C is a wave function, t is time, x ∈ R is a position, j is the imaginary unit, ℏ is Dirac's constant, and H is the Hamiltonian of the considered system. For example, the Hamiltonian for a particle under the potential field V(x, t) ∈ C is described as follows:

H = −(ℏ²/2m) ∂²/∂x² + V(x, t), (3)
where m denotes the mass of the particle. As discussed in the appendix and [35], the one-dimensional kernel filter K(θ) in a convolutional neural network (CNN) acts as the differential operator

K(θ) = α₁(θ) + α₂(θ) ∂/∂x + α₃(θ) ∂²/∂x². (4)

By comparing Eqs. (3) and (4), we see that the linear Schrödinger equation [Eq. (2)] corresponds to the complex-valued evolution of the ODE-Net in Eq. (1) without a nonlinear function. In this analogy, tuning the potential field corresponds to changing the weights in the kernel filter. The above discussion can be extended to the three-dimensional case, in which the dimensions of the corresponding kernel filter expand accordingly. The Schrödinger equation with a linear Hamiltonian does not include a nonlinear conversion. Therefore, we also consider a nonlinear Hamiltonian such as H_NL = H_L + g|ψ(x, t)|², where g is a nonlinear constant and |·| denotes the absolute value. This equation is commonly used for the analysis of nonlinear optical phenomena and quantum chaos [10]. Note that the linear Hamiltonian is also valuable when considering actual implementations of a physical SE-NET.
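The correspondence between Eqs. (3) and (4) can be made concrete on a grid. The sketch below (with ℏ = 1, periodic boundaries, and an explicit Euler step, all assumptions of this illustration only) advances the discretized equation by one time step: the Laplacian acts as the fixed 3-tap stencil [1, −2, 1], while the potential V(x) plays the role of a trainable "diagonal" weight at each grid point:

```python
def schrodinger_step(psi, V, dx, dt, m=1.0):
    """One explicit Euler step of j dpsi/dt = -(1/2m) d2psi/dx2 + V*psi (hbar = 1)."""
    n = len(psi)
    out = [0j] * n
    c = -1.0 / (2.0 * m * dx * dx)    # off-diagonal "tap" coming from the Laplacian
    for p in range(n):
        # fixed 3-tap stencil [1, -2, 1] with periodic boundary conditions
        lap = c * (psi[(p - 1) % n] - 2.0 * psi[p] + psi[(p + 1) % n])
        out[p] = psi[p] - 1j * dt * (lap + V[p] * psi[p])   # dpsi = -j * H psi * dt
    return out
```

Training the SE-NET changes only V, i.e., the per-point diagonal weights; the stencil itself is set by physics.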
As discussed above, the time evolution of the Schrödinger equation can be considered a special type of neural network, which we call the SE-NET. In this study, we simply deal with the Schrödinger equation as a classical complex-valued wave equation; that is, ψ(x, t) represents the complex-valued amplitude of a transmitted wave (not the probability of the existence of a particle).
Training by the adjoint method: As the building block of the SE-NET is no longer a simple weight matrix, it seems difficult to apply the standard back-propagation method. However, as pointed out in [9] and described previously [27,18], the backpropagation of an ODE-based continuous network is equivalent to an adjoint sensitivity analysis. The adjoint method is typically used to optimize physical systems, a practice known as inverse design or topology optimization [3,21].
Here, we consider optimizing a loss function L whose input is the SE-NET output. To optimize L, we consider how the gradient of the loss depends on the hidden state ψ(x, t). This quantity is called the adjoint, defined as

ψ_a(x, t) ≡ ∂L/∂ψ(x, t). (5)

Its dynamics are given by

∂ψ_a(x, t)/∂t = −ψ_a(x, t) ∂f/∂ψ(x, t), with f ≡ Hψ/(jℏ), (6)

which can be thought of as the instantaneous analog of the chain rule. From (2) and (6), we derive the following gradients for V:

∂L/∂V_real(x_o, t_o) = Re[(j/ℏ) ψ_a(x_o, t_o) ψ(x_o, t_o)], (7)
∂L/∂V_imag(x_o, t_o) = Im[(j/ℏ) ψ_a(x_o, t_o) ψ(x_o, t_o)], (8)

where V_real(x_o, t_o) and V_imag(x_o, t_o) are the real and imaginary parts of the potential field at each (x_o, t_o). As shown in Eqs. (7) and (8), the SE-NET can be trained by optimizing the local potential field V(x, t).
The gradients can be computed by evaluating the forward-propagated field ψ(x, t) and the backpropagated adjoint field ψ_a(x, t). For a linear Hamiltonian [(∂H/∂ψ) = 0] in particular, the backpropagation equation mirrors the forward propagation one, so the backpropagated wave can be solved with almost the same propagation rules. The same training rule can be derived by considering the calculus of variations, as described in [14].
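A minimal sanity check of the adjoint idea (a toy single-mode system, not the paper's solver; the step size, potential V, and loss below are assumptions of this sketch): we evolve ψ_{k+1} = ψ_k − j·dt·V·ψ_k, take L = |ψ_N|², accumulate dL/dV_imag by sweeping the adjoint backward through the same steps, and compare against a finite difference:

```python
def forward(V, psi0=1.0 + 0j, dt=0.01, steps=100):
    """Explicit Euler trajectory of the toy equation j dpsi/dt = V * psi."""
    psi = psi0
    traj = [psi]
    for _ in range(steps):
        psi = psi - 1j * dt * V * psi
        traj.append(psi)
    return traj

def loss(V):
    return abs(forward(V)[-1]) ** 2          # L = |psi_N|^2

def adjoint_grad_Vimag(V, dt=0.01, steps=100):
    """Reverse (adjoint) sweep for dL/dV_imag; only Im(V) changes |psi|'s growth."""
    traj = forward(V, dt=dt, steps=steps)
    a = traj[-1].conjugate()                 # a_N = dL/dpsi_N, so dL = 2 Re[a dpsi]
    g = 0.0
    for k in range(steps - 1, -1, -1):
        # direct dependence of step k on V_imag: d psi_{k+1} = dt * psi_k * dV_imag
        g += 2.0 * (a * dt * traj[k]).real
        a = a * (1.0 - 1j * dt * V)          # pull the adjoint back through the step
    return g
```

Because the backward sweep reuses the forward propagation rule, the comparison illustrates the claim that backpropagating a continuous network is an adjoint sensitivity analysis.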
Physical simulator as SE-NET solver: Several techniques for efficiently solving Eq. (2) have been widely studied [33,7], and we can use these methods as SE-NET solvers. An algorithm based on the fast Fourier transform (FFT) is simple but requires O(n_x log(n_x)) complexity, where n_x is the number of grid points along the x-axis. Here, we solve the equation using a finite-difference method (FDM), which requires only O(n_x) complexity. There are various ways to implement the FDM. As also discussed in the original ODE-Net paper, the choice of solver is important because the numerical stability differs between solvers. Explicit methods such as the Euler method are simple and easy to implement, but their solutions become unstable when ∆t/∆x² > 1/2, where ∆x and ∆t are the grid spacings along the x- and t-axes. In addition, explicit methods need fine computational meshes to obtain accurate solutions because their numerical error is relatively large. Here, we employ the Crank-Nicolson method because its solutions are stable and its numerical error is small. The details of the Crank-Nicolson method are described in the appendix.

Application to optics
Schrödinger equation in optics: It is known that lightwave propagation in an optical waveguide under the paraxial approximation is described by the Schrödinger equation. Therefore, we can consider the SE-NET as a form of optical beam transmission. In this analogy, training the SE-NET means "designing" the waveguide, and the trained network is thus transferable to actual optical systems. Here, we consider light propagation in a medium with a refractive index distribution for the emulation and training of the SE-NET. For optical systems, we can derive the Schrödinger equation as follows by considering time-independent two-dimensional (x-z) systems with the slowly varying envelope approximation [15]:

j ∂ψ(x, z)/∂z = Hψ(x, z), (9)

where ψ(x, z) is an electric field. For a standard linear optical system, the Hamiltonian H_L is described as

H_L = −(1/2k) ∂²/∂x² − k∆n(x, z), (10)

where k is the wavenumber and ∆n is defined as ∆n ≡ n − n_r, where n_r is a reference refractive index and n is a complex-valued local refractive index. The real part of the refractive index, n_real, produces the phase shift of the wave, and the imaginary part n_imag contributes to the loss or gain of the system. By comparing with the Schrödinger equation in Eq. (2), it is clear that the first term of Eq. (10) corresponds to the momentum and the second term corresponds to the local potential. Nonlinear conversions can be introduced by the approach described in Section 2.2 (see Fig. 2). Thus, the z-axis evolution of Eq. (9) is also considered forward propagation in a neural network.
By considering the adjoint ψ_a(x, z) ≡ ∂L/∂ψ(x, z), we can derive the update rule from Eqs. (9) and (10) in the same manner as above. This means that the SE-NET can be trained by optimizing the local refractive index n(x, z). We can solve both the forward propagation and the backpropagation by using a simulation method for optical systems.
Phase-only training: We have two parameters at each position (x_o, z_o). However, n_imag represents the gain or loss of the transmitting medium. ODE-based dynamical systems are known to become unstable when the gain of the weight matrix becomes large, which corresponds to gradient explosion. In addition, system loss significantly degrades the signal-to-noise ratio of the output signal, which degrades system performance. To avoid these issues, we consider a phase-only update rule for the local refractive index n, which was developed for the inverse design of optical devices [14,36] and is called the wave-front-matching (WFM) method. In the WFM method, n_imag and ∂L/∂n_imag are fixed at zero; we only update n_real, which means that the update matches the wavefronts of the forward-propagating and backpropagating waves. With this method, the system automatically maintains unitarity: the weight matrix derived from the local refractive index is Hermitian, so model stability is maintained. The system is thus always stable, and the law of conservation of energy is satisfied. Therefore, the physical system has no principal loss or energy consumption when solving SE-NETs. This framework can be considered a specific expression of a unitary neural network [1].
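The unitarity argument can be seen in one line. Over a short step dz, a local potential acts on the field as ψ → exp(−jV dz)ψ; a purely real V, which is all the phase-only rule ever updates, only rotates the phase and conserves |ψ|, while any imaginary part produces gain or loss. A minimal sketch (step size and sign convention are assumptions of this illustration):

```python
import cmath

def apply_potential(psi, V, dz=0.1):
    """One short propagation step through a local potential V (complex allowed)."""
    return psi * cmath.exp(-1j * V * dz)
```

For real V, |exp(−jV dz)| = 1, so the field amplitude is conserved at every point; with the sign convention used here, a positive imaginary part of V acts as gain, which is exactly the degree of freedom the WFM rule pins to zero.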

Manipulating lightwave information:
Lightwaves have inherent parallelism with respect to phase, space, and wavelength. These data are lost when the intensity is detected by a photodetector (PD). The optical SE-NET can process these data via n. The phase and space information can be directly tuned by n_real(x, z). Since the amount of phase modulation produced by n differs for each wavelength, the wavelength information can also be processed by the standard SE-NET [Eqs. (9) and (10)]. Thus, we can train the physical structure with the wavelength taken into account.

Optics on PyTorch:
We implemented the optical SE-NET using the finite-difference beam propagation method (FD-BPM) on PyTorch. Fig. 2(a) shows a training example, the design of an optical 1:2 splitter. A training example of a wavelength splitter is also shown in Fig. 2(b). In our framework, we treat the wavelength information as the data in each minibatch. Although this framework is almost the same as topology optimization or inverse design [14], we can optimize the physical structure as a black box with little prior physical knowledge on PyTorch. Additional information about the experiments is given in the appendix.
End-to-end hybrid physical-digital ML: Thanks to the PyTorch implementation, we can jointly optimize the physical structure and digital DNNs under the same cost function. All gradients can be computed by the chain rule and the adjoint method via PyTorch's autograd function. This enables end-to-end ML with a physical structure. The algorithm is shown in the appendix.

Experimental results
To examine the above-described training framework, we validated the SE-NET using several datasets. To compute the z-axis evolution of the SE-NET numerically, we implemented the FD-BPM solver with the Crank-Nicolson method on the PyTorch framework. To train the weights, we used a standard Adam optimizer with a learning rate of 0.001. For the training, we set the size of the calculation grid along the x- and z-axes to 1.0 µm and the reference refractive index n_r to 1.45. These hyperparameter values for the SE-NET are based on standard silica-based optical circuits for telecom applications.
Image classification: First, we confirmed the performance of the SE-NET itself, neglecting the physical implementation. We trained SE-NETs to classify the MNIST dataset of handwritten digits [26]. The dataset consists of 60,000 labeled digital images of size 28 × 28 showing handwritten digits from 0 to 9. For the training, the optimizer steps were computed using mini-batches of 128 examples randomly chosen from the 60,000 training data. The input wavelength λ = 2π/k was set to 1.55 µm. The trained models for our experiment are shown in Figs. 3(a) and (b). The architecture in Fig. 3(a), named CNN-SE-NET, has a CNN-based down-sampling layer containing three two-dimensional convolution units, which is the same as the pre-filtering method used in the previously reported ODE-Net model [9]. To neglect the effect of the CNN layer, we also trained a model with SE-NET-based down-sampling, named SE2-NET. The calculation regions are also shown in Figs. 3(a) and (b). We consider two types of nonlinear conversion in this task, as shown in Figs. 3(c) and (d). The first uses the nonlinear Hamiltonian with g = 0.1; we determined this value using a simpler task (see Appendix). The other employs a nonlinear activation after SE-NET propagation, as shown in Fig. 3(d); this system can be physically realized by inserting nonlinear material at discrete positions. At the input edge of the SE-NETs, the input data are aligned along the x-axis and mapped to the real part of ψ(x, z_0). At the output edge, the real and imaginary parts of ψ(x, z = z_1) are treated independently and fed into the fully connected layer. The experimental results for the test error are shown in Table I. As shown there, the performance of the SE-NETs is of the same order of magnitude as the best performance found in the literature.
Although the SE2-NET achieved a 1.33% test error, the CNN-SE-NET model showed much better performance, suggesting that CNN-based down-sampling is superior to SE-NET-based down-sampling. We think this difference is due to the one-dimensional nature of the examined model, as described in Eq. (4). Thus, the performance should improve when we consider higher-dimensional SE-NET models that include time or y-axis transmission.
The experiment on a larger dataset is shown in the appendix.
End-to-end compressed sensing spectroscopy: Next, we demonstrate a possible physical implementation with the joint end-to-end machine learning scheme: the design of an ultra-compact optical spectrometer. As photodetectors lose the wavelength information in measurement, a standard spectrometer is designed to receive each wavelength independently, as shown in Fig. 4(a). This requires bulk gratings and long optical paths to separate the wavelengths spatially. For miniaturization, compressive sensing (CS) spectroscopy is attracting much attention [2,29,22,23,17]. In this method, spectra y ∈ R^{N×1} are reconstructed from a few detector signals x ∈ R^{M×1} with a known relationship x = Ty, where T ∈ R^{M×N} is the detection matrix of the optical structure. Each row of T represents the spectral transmission to one detector. It has been reported that deep learning is a powerful tool for the reconstruction [22,23,17]. In general, the optical implementation of T requires complex optics and specialized design knowledge.
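A toy version of this detection model (the sizes and matrix values are invented for the sketch; the paper replaces the linear inverse below with a trained DNN): three detectors observe x = Ty for a two-bin spectrum y, and a least-squares solve recovers y from the compressed measurements:

```python
def matvec(T, y):
    """Detection model x = T y: one row of T per detector."""
    return [sum(T[m][n] * y[n] for n in range(len(y))) for m in range(len(T))]

def lstsq_2(T, x):
    """Least-squares reconstruction for N = 2 unknowns via the normal equations."""
    g = [[sum(T[m][i] * T[m][j] for m in range(len(T))) for j in range(2)]
         for i in range(2)]                       # g = T^T T (2x2)
    b = [sum(T[m][i] * x[m] for m in range(len(T))) for i in range(2)]  # b = T^T x
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [(g[1][1] * b[0] - g[0][1] * b[1]) / det,
            (g[0][0] * b[1] - g[1][0] * b[0]) / det]
```

In the end-to-end scheme, T itself is not hand-designed but realized and optimized by the SE-NET.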
Here, we consider the system shown in Fig. 4(b). We implement T using the SE-NET, which enables the simultaneous optimization of T and the DNN for reconstruction. To evaluate the performance, we used 12,000 synthetic spectral datasets with wavelengths of 1000-1500 nm generated from Gaussian distribution functions. The data were randomly divided into 10,000 training examples and 2,000 validation examples. The input data are represented as a complex-valued Gaussian beam with a beam width of 100 µm. The analyzed region was set to 500 × 500 µm². The 25 detectors were set on the output edge of the waveguide. We used a linear Hamiltonian in consideration of easy transfer to optics. For the post-processing, we used a residual layer with a one-dimensional convolution filter, which has almost the same structure as the filter in [23] (see the appendix for more details). Figure 4(f) shows the evolution of the refractive index during training. As can be seen, the structure is optimized from a flat initial state, suggesting that we can optimize the physical structure as a black box with little prior physical knowledge. These results suggest the effectiveness of hybrid physical-digital ML. Our work could thus be extended to the complex-valued propagation of, for instance, acoustic waves. As is well known, the classical limit (i.e., ℏ → 0 in quantum systems or the paraxial approximation in optics) corresponds to the Hamilton-Jacobi equation. Therefore, our discussion can be extended to members of its family, such as the Eikonal equation, by considering such approximations. The application of Hamilton-Jacobi dynamics to ML is discussed in [8].
Possible applications: One application of the SE-NET is a design method for functional optical waveguides and lenses that enables end-to-end ML with a physical structure, e.g., compressed sensing, computational imaging [39,11,6], and optical communication [19]. Another application is an optical processor for ultrafast and energy-efficient inference engines [37], because the transferred network operates at the speed of light and requires no principal energy consumption.
Scalability of the physical SE-NET: Our SE-NET-based optical device requires less than 1 µm per pixelized weight, which is much smaller than in previous on-chip optical neural networks [37,20]. This means that we can integrate over 100 million weights into a 1-cm² optical chip, which is compatible with state-of-the-art application-specific integrated circuits (ASICs). In addition, we have not yet extracted the full potential of the SE-NET because there are many dimensions available for augmentation, including time, the y-axis, and polarization.

From the viewpoint of ResNet: Our model is inspired by the ResNet and its ODE representations.
Although the ResNet has a very deep structure, its essence is an ensemble of relatively shallow neural networks, which was confirmed by layer-dropping experiments on ResNets [42]. The SE-NET is formed by connecting each point as a node in the time direction, and its propagation can be seen as a superposition of many paths. The change in a path at a given point is obtained by multiplying the path by exp(−jV(x, t)dt) ≈ 1 − jV(x, t)dt, assuming that dt is very small. In this case, the first term on the right-hand side gives the unscattered path and the second term gives the scattered path. Since dt is very small, the probability that a single path is scattered at many points is very small, and the majority of paths are scattered only a small number of times. Their superposition can therefore be interpreted as an ensemble of relatively shallow networks, which is consistent with the analysis of ResNets. This feature provides robustness against weight errors, which is effective for the actual analog implementation described in Section 3. These investigations remain as future work. Several extensions of neural ODEs have also been proposed [12]: a higher-order approximation of the states in a neural ODE was proposed in [44], Y. Rubanova et al. proposed an ODE-based RNN structure [34], and the robustness of ODEs is discussed in [43].

Related works
Neural network with optics: The crossover between neural networks and optics is a recent hot topic [40]. On the application side, it has been reported that the joint optimization of the lens structure is effective for computational imaging applications [39,11,6]. However, the optimization is limited to a single-layer Fresnel lens, which restricts the functionality of the optics. Furthermore, since the optimization target is a point spread function (PSF), the PSF has to be converted to an actual lens structure. By augmenting our method with a three-dimensional equation, we can optimize the local refractive index directly to enhance system performance.
On the basic research side, optical neuromorphic accelerators are being intensively investigated as candidates for DNN processors. They typically use a Mach-Zehnder interferometer [37] or a discrete diffractive optical element [30,5,3] as a weight element, and adjoint optimization of these elements has already been shown in [30,3,20]. In general, increasing the number of nodes per layer improves the characteristics in proportion to a polynomial of the number of nodes, whereas increasing the number of layers is said to improve the characteristics exponentially, so it is better to have more layers. From this point of view, we think that the SE-NET is more advantageous than diffractive neural networks as a configuration for increasing the number of layers. In addition, the SE-NET does not have a waveguide structure; instead, each point becomes a node, and the nodes are connected through the Hamiltonian. In contrast, optical waveguide circuits are difficult to scale up because the elements that optically connect them must be arranged in proportion to the square of the number of waveguides. Quantum circuit learning [31,16] is also related to our work. With this technique, the angles of unitary quantum gates are trained to decrease the cost function through a backpropagation method, which is similar to our proposed WFM-based learning. However, our approach differs from quantum circuit learning: we assume classical optical systems (not quantum systems), which are easier to fabricate and apply than quantum ones, although there is no quantum acceleration. In addition, we directly train the local potential field (not a gate device), allowing scalable system implementation.

Conclusion
We investigated the potential of the Schrödinger equation as a building block of neural networks. We showed that the SE-NET can be trained by optimizing the local potential field of the Schrödinger equation. As a demonstration, we trained the SE-NET using the finite difference method and found that its performance is comparable to that of standard deep neural networks. The trained network is transferable to actual optical systems. Our work extends the application field of machine learning from digital-only optimization to hybrid physical-digital optimization.

Broader impact
Our research bridges the gap between physical laws and neural networks. By augmenting the physical neural network concept, we believe that our work has the following impact.
Optimizing data acquisition: Recent machine learning (ML) techniques require complex processing and enormous datasets to achieve state-of-the-art performance. On the other hand, the input data are still captured so as to be easily understood by humans. It is questionable whether traditional data acquisition is the optimal approach for neural post-processing. For instance, it is difficult to obtain depth, spectral, and polarization information from standard commercial camera data. However, when we introduce special-purpose lenses or gratings in front of the sensors, we can easily reconstruct those data with simple processing. This implies that an optimized physical structure can simplify data processing and improve its performance. Our method provides a way to jointly optimize the physical structure and deep neural networks. Although the method has only been explored for simple tasks so far, it has the potential to integrate sensing and information processing in the future.
Physical structure as a computational resource: The computational resources needed to execute deep learning keep increasing. Thus, alternative methods of computation are being widely explored to reduce energy consumption and processing latency. Based on our concept, we can outsource part of the digital processing to a passive physical structure, which will reduce the energy consumption of the whole processing operation.
Physics-inspired ML: The crossover of ML and physics is a topic of great interest for, for instance, materials informatics and physics-inspired networks such as neural ODEs. Our paper shows the relationship between the manipulation of a physical quantity and a neural network, connecting ML and physics (details are described in the appendix). According to our idea, physicists can treat physical equations as neural networks, and ML researchers can refer to physics for the design of DNNs.

A Correspondence between physics and neural network
We summarize the correspondence between physical phenomena and neural networks in Figure A.1. Neural ODEs are the continuous representation of the ResNet or RNN [9]. As discussed in the main article, the Schrödinger equation is considered a continuous expression of a complex-valued residual layer. In particular, it is a special type of unitary neural network [1] when we use the phase-only update rule described in the main article. Our discussion connects directly to waveguide optics and quantum mechanics because both are described by the Schrödinger equation.
The real-valued representation of the Schrödinger equation corresponds to the diffusion equation [32]. The diffusion equation is typically used for modeling thermal conduction and diffusion-reaction systems. The discussion in Section A.3 applies to the real-valued case, so the diffusion equation can also be considered a special type of neural network. As is well known, the classical limit (ℏ → 0) in quantum systems, or the paraxial approximation (λ → 0) in wave optics, corresponds to the Hamilton-Jacobi equation or the Eikonal equation, so our work can be extended to analytical mechanics or geometric optics. Note that neural networks based on such second-order ODEs have already been introduced in [4]. The Schrödinger equation is derived from the scalar wave equation, so our work could also be extended to the complex-valued propagation of, for instance, electromagnetic and acoustic waves. The training of such physical neural networks can be executed by the adjoint method (reverse-mode ODEs or PDEs) in the same manner as for the neural SE, which is essentially equivalent to the backpropagation method used in standard neural networks.

B Crank-Nicolson Finite Difference Method
We describe the Crank-Nicolson finite difference method (FDM) for solving the Schrödinger equation numerically [33]. In the FDM framework, the equation to solve is discretized on a linear spatial grid as shown in Fig. A.2(a). Then, by considering the first-order approximation ∂ψ/∂z ≈ (ψ^{q+1}_p − ψ^q_p)/∆z, Eq. (9) can be expressed as the difference equation

j (ψ^{q+1}_p − ψ^q_p)/∆z = (1 − γ)[Hψ]^q_p + γ[Hψ]^{q+1}_p, (A.1)

where ψ^q_p denotes the electric field at each grid position (x, z) = (p∆x, q∆z), [Hψ]^q_p denotes the discretized Hamiltonian term at that position, and the parameter γ indicates the contribution of the forward (explicit) and backward (implicit) differences. For example, Eq. (A.1) with γ = 0 is known as an explicit method, and its calculation diagram is shown in Fig. A.2(b). This scheme is simple and easy to implement, but the solution becomes unstable when ∆z/∆x² > 1/2. In addition, fine computational meshes are needed to obtain accurate solutions because the numerical error is relatively large [O(∆x² + ∆z)]. In the case of γ = 1, Eq. (A.1) becomes an implicit method, as shown in Fig. A.2(c). This scheme is numerically stable, but its accuracy is the same as that of the explicit method. On the other hand, Eq. (A.1) with γ = 1/2 is known as the Crank-Nicolson method [Fig. A.2(d)]. This scheme is numerically stable and its numerical error is small [O(∆x² + ∆z²)]. By considering the definitions, the γ = 1/2 scheme can be rewritten as a tridiagonal linear system

Dψ^{q+1} = v, (A.5)

where ∆n^q_p = n(p∆x, q∆z) − n_r (A.6), v is the known right-hand side constructed from ψ^q, and D is a tridiagonal matrix. Thus, we can compute the electric field by solving Eq. (A.5) at each calculation step with a linear solver. When we simply use a standard linear solver (e.g., Gaussian elimination or the numpy.linalg module), O(n³_x) computational resources are typically required. Here, we use a sparse solver based on the Thomas algorithm, which decreases the computational cost to O(n_x). The backward transmission can be solved in the same way. We implemented the above algorithm on the PyTorch framework, so we can train the SE-NET as a standard building block of deep neural networks.
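A compact sketch of one Crank-Nicolson FD-BPM step (the grid sizes, wavenumber, and zero-field boundary condition are assumptions of this illustration): the update (I + j∆z H/2)ψ^{q+1} = (I − j∆z H/2)ψ^q is solved with the O(n_x) Thomas algorithm for the tridiagonal system:

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system; a: sub-, b: main, c: super-diagonal, d: rhs."""
    n = len(b)
    cp, dp = [0j] * n, [0j] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m              # cp[n-1] is never used
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0j] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def cn_step(psi, dn, dx, dz, k):
    """One Crank-Nicolson step of j dpsi/dz = -(1/2k) d2psi/dx2 - k*dn*psi."""
    n = len(psi)
    off = -1.0 / (2.0 * k * dx * dx)                      # off-diagonal of H
    diag = [-2.0 * off - k * dn[p] for p in range(n)]     # diagonal of H
    s = 1j * dz / 2.0
    # Right-hand side (I - s*H) psi, with psi = 0 outside the grid.
    d = []
    for p in range(n):
        h = off * ((psi[p - 1] if p > 0 else 0j) +
                   (psi[p + 1] if p < n - 1 else 0j)) + diag[p] * psi[p]
        d.append(psi[p] - s * h)
    # Left-hand side matrix (I + s*H) is tridiagonal: solve with Thomas.
    a = [s * off] * n                 # a[0] unused
    b = [1.0 + s * diag[p] for p in range(n)]
    c = [s * off] * n                 # c[-1] unused
    return thomas(a, b, c, d)
```

With a real ∆n, the step is a Cayley transform of a Hermitian operator, so the discrete field norm is conserved to machine precision, which is the unitarity property exploited by the phase-only training rule.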

C Convolutional neural network and PDE
We show that a convolution layer can be considered a partial differential equation (PDE); the same discussion is given in [35]. Here we assume a one-dimensional input y ∈ C^{n_x}, representing a one-dimensional grid function sampled at the cell centers with a mesh size h = 1/n_x. A convolution layer with a one-dimensional kernel filter is represented as K ∈ C^{n_x×n_x}, parameterized by the convolution stencil θ ∈ C³. Applying a coordinate change, the convolution operation can be described as follows:

θ * y = α₁ y + α₂ ∂y/∂x + α₃ ∂²y/∂x², (A.10)

where * denotes the convolution operation, the derivatives are understood as their central finite-difference approximations on the grid, and the α_i are given by the unique solution of

α₁ = θ₁ + θ₂ + θ₃, α₂ = h(θ₃ − θ₁), α₃ = h²(θ₁ + θ₃)/2. (A.11)
In the limit h → 0, the kernel filter thus acts as a differential operator. This argument extends to higher spatial dimensions, where it acts as a high-dimensional kernel filter.
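The coordinate change can be checked numerically at a single grid point (the stencil values, grid data, and mesh size below are arbitrary assumptions): the 3-tap convolution equals the α-weighted combination of the identity and the central first- and second-difference operators exactly, for any θ and h:

```python
def conv3(theta, y, p):
    """3-tap convolution t1*y[p-1] + t2*y[p] + t3*y[p+1] at interior point p."""
    t1, t2, t3 = theta
    return t1 * y[p - 1] + t2 * y[p] + t3 * y[p + 1]

def as_diff_ops(theta, y, p, h):
    """Same stencil written as a1*I + a2*D1 + a3*D2 with central differences."""
    t1, t2, t3 = theta
    a1, a2, a3 = t1 + t2 + t3, h * (t3 - t1), h * h * (t1 + t3) / 2.0
    d1 = (y[p + 1] - y[p - 1]) / (2.0 * h)          # central first difference
    d2 = (y[p + 1] - 2.0 * y[p] + y[p - 1]) / (h * h)  # central second difference
    return a1 * y[p] + a2 * d1 + a3 * d2
```

The identity is exact on the grid; the differential-operator reading of Eq. (A.10) emerges as h → 0.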
D End-to-end learning method

As an example of end-to-end ML, we consider the spectroscopy system shown in Fig. 4 of the main article. The detailed processing scheme is shown in Fig. A.3. It consists of two-dimensional optical circuits that act as SE-NETs, photodetectors (PDs) that receive the SE-NET outputs, and digital electric circuits that act as a standard neural network. The input power spectra u(λ) are represented as a complex-valued Gaussian beam, described as follows:
ψ(x, z_0, λ) = u(λ)Φ(x), (A.13)

where Φ represents the input waveguide field, a Gaussian profile with mode radius ω_i at the input edge of the optimized region. The data are processed by the SE-NETs [Eq. (9)] and detected by the PDs. The detection efficiency η_i of each PD for each wavelength λ is given by the modal overlap between the arriving field and the detection field, where i denotes the detector number and Φ_i represents the detection field of the i-th PD, typically described as a Gaussian profile,
where ω_o is the mode radius of the output waveguide and x_p is the PD channel spacing. The total output power at each detector follows by accumulating η_i(λ)|u(λ)|² over the wavelengths. The detected data are acquired by digital circuits and become the inputs of the digital DNNs. The DNN output Y is converted to a loss L by the cost function. Then, the backward pass of L is computed by the standard backward process, which yields the adjoint boundary condition at the output edge as a sum over the detectors, where M is the number of PDs. As the backpropagated field ψ_a(x, z) in the analog photonics is solved from this condition, we can design the optimal structure through end-to-end hybrid machine learning.
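A discretized sketch of the detection model (the Gaussian mode shape, grid, and normalization are assumptions of this illustration): the efficiency of a PD is the squared overlap between the arriving field and the PD's detection mode, so a detector aligned with the beam collects more power than an offset one:

```python
import math

def gaussian_mode(x, x0, w):
    """Assumed Gaussian detection field centered at x0 with mode radius w."""
    return math.exp(-((x - x0) ** 2) / (w ** 2))

def detection_efficiency(psi, xs, x0, w, dx):
    """eta = |sum_p psi[p] * Phi(x_p) * dx|^2, a discretized overlap integral
    (Phi is real here, so no conjugation is needed)."""
    overlap = sum(psi[p] * gaussian_mode(xs[p], x0, w)
                  for p in range(len(xs))) * dx
    return abs(overlap) ** 2
```

Evaluating this overlap for each of the M detectors and each wavelength yields one row of the detection matrix T used in the compressed-sensing scheme.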
For the experiment in Fig. 4 of the main article, we set the parameters as follows. The first layer is the SE-NET using the optical waveguide. ω_i and ω_o are set to 100 and 6 µm, respectively. The analyzed region was set to 500 × 500 µm². The 25 detectors were set on the output edge of the waveguide (M = 25) with x_p = 12 µm. We used a linear Hamiltonian in consideration of easy transfer to optics. For the post-processing, we used a residual layer with a one-dimensional convolution filter, which has almost the same structure as the filter in [23]. It consists of an up-conversion layer, a one-dimensional convolution layer, two fully connected (fc) layers, and a residual connection. The kernel size and padding size of the convolution filter were set to 16 and 8, respectively. For the up-conversion, we use the pseudo-inverse of the transmission matrix T.

E Additional Experiments
Binary classification task: We consider a small-scale two-dimensional test problem called ellipses and moons. The experimental setup is shown in Fig. A.4(a). The test data consist of 1,200 points that are evenly divided into two groups, as shown in the subplots. For the nonlinear conversion, we examined the nonlinear Hamiltonian and the cascade-connected network [Fig. 3(d)]. For the activation function of the cascade-connected network, the tanh function was employed, which can be physically implemented using saturable absorbers. The input data are represented as a complex-valued Gaussian beam with a beam width of 6 µm. The analyzed region was set to 60 × 60 µm². The nonlinearity of the networks was scanned by changing the value of g (for the nonlinear-Hamiltonian case) and the number of divided layers N (for the cascaded-network case). At the output edge, the intensity was detected by two arrayed PDs with ω_o = 6 µm. We show the results in Figs. A.4(b) and (c). As the examined task obviously requires nonlinear conversion, the linear SE-NETs [g = 0 in Fig. A.4(b), N = 0 in Fig. A.4(c)] could not classify the data. When the network has nonlinearity, the validation accuracy increases up to 100%. When g increases too much, the accuracy is degraded. Thus, we set g = 0.1 for the other experiments.

CIFAR10:
We trained SE-NETs to classify the CIFAR10 image dataset, which consists of 60,000 labeled digital images of size 32 × 32. For the training, the optimizer steps were computed using mini-batches of 128 examples randomly chosen from the 60,000 training data. The input wavelength λ = 2π/k was set to 1.55 µm. The trained model is almost the same as the CNN-SE-NET in Fig. 3(a). The differences are as follows: (i) the input dimension of the convolution filter was set to three to match the data dimension of CIFAR10, and (ii) the region of the SE-NET was set to 1150 × 400 µm². The model has a CNN-based down-sampling layer containing three two-dimensional convolution units, which is the same as the pre-filtering method used in the previously reported ODE-Net model [9]. We consider the nonlinear activation model shown in Fig. 3(d), which can be physically realized by inserting nonlinear material at discrete positions. At the input edge of the SE-NETs, the input data are aligned along the x-axis and mapped to the real part of ψ(x, z_0). At the output edge, the real and imaginary parts of ψ(x, z = z_1) are treated independently and fed into the fully connected layer. The test accuracy was 71.4% after 30 epochs of training.