Architecture agnostic algorithm for reconfigurable optical interferometer programming

We develop the learning algorithm to build the architecture agnostic model of the reconfigurable optical interferometer. Programming the unitary transformation on the optical modes of the interferometer either follows the analytical expression yielding the unitary matrix given the set of phaseshifts or requires the optimization routine if the analytic decomposition does not exist. Our algorithm adopts the supervised learning strategy which matches the model of the interferometer to the training set populated by the samples produced by the device under study. The simple optimization routine uses the trained model to output the phaseshifts of the interferometer with the given architecture corresponding to the desired unitary transformation. Our result provides the recipe for efficient tuning of the interferometers even without rigorous analytical description which opens opportunity to explore new architectures of the interferometric circuits.


I. INTRODUCTION
Linear optical interferometers are rapidly becoming an indispensable tool in quantum optics [1] and optical information processing [2]. The interest to linear optics grows due to broader availability of the integrated photonic fabrication technology to the scientific community.
The key feature of the state-of-the-art integrated linear interferometer is the reconfigurability enabling the device to change its effect on the input optical mode upon the demand. This possibility has made the linear optical circuits particularly appealing for information processing challenges. In particular reconfigurable interferometers are the main ingredients of the contemporary linear optical quantum computing experiments [3][4][5] and are considered as the hardware accelerators for deep learning applications [6,7]. Furthermore, the fabrication quality and the improved scalability of the reconfigurable photonic circuits led to the emergence of the field-programmable photonic array concept -a multipurpose photonic circuit which can serve many possible applications by means of the low-level programming of the device [8].
The unitary transformation matrix U completely describes the operation of the linear optical interferometer. The matrix U couples the input optical modes of the device to the output ones a . The architecture of the interferometer parametrizes the transformation U = U ({ϕ}) on the tunable parameters {ϕ} which are typically the phase shifters controlling the relative phases between the arms of the interferometer. The architecture is labeled as universal if it allows reaching any arbitrary N × N unitary matrix by appropriately setting the phase shifts {ϕ}. The device programming is then essentially boiled down to establishing the correspondence between the desired matrix U 0 and the appropriate parameter set {ϕ 0 }. Hurwitz analytical * dyakonov@quantum.msu.ru decomposition of the N × N unitary matrix [9] is the well-known example of the universal architecture. It implies straightforward implementation using simple optical components [10,11] -the two-port Mach-Zender interferometers (MZI) with two controllable phase shifters. The interferometer architecture based on this decomposition is the mesh layout of the MZI which is very easy to program -the efficient inverse algorithm returns the values for each phase shifter given the unitary matrix U 0 . The simplicity of this architecture comes at the cost of the extremely stringent fabrication tolerance. The universal operation is achieved if and only if the beamsplitters in the MZI blocks are perfectly balanced which is never the case in the real device. Numerical optimization methods have been adopted to mitigate the effect of the imperfections [12,13] but the simple programming flow is deprived.
The challenge to overcome the effect of the fabrication defects have also led to development of more sophisticated architectures [14,15] which have no simple analytical description and can only be programmed using the optimization routines. Running the optimization routine to set up the real physical device transformation requires the experimental execution of the resource-intensive procedure of the transformation matrix reconstruction [16] at each iteration of the optimization algorithm of choice. From the end user perspective the necessity to optimize the device each time when the transformation needs to be changed is unacceptable. Firstly, the optimization in the high-dimensional parameter space is itself a timeconsuming procedure requiring sophisticated tuning and what's more there is no guarantee that the global minimum will be reached. Secondly, the algorithms providing fast convergence in multiparameter optimization problems are typically gradient-based and the precision of the gradient estimation of the objective function implemented by the physical device is limited by the measurement noise. Lastly, even though the number of switching cycles of the phase shifters is not strictly limited spending the device resource during tedious optimization proce-dures severely degrades the lifetime of the programmable circuit.
In this work, we develop the efficient algorithm for programming a linear optical inteferometer with complex architecture. We employ one of the main methods of machine learning -supervised learning of a numerical model, widely applied to the neural networks training [17][18][19]. The model of the interferometer is learnt using the set of samples of transformations corresponding to different phase shifts. The trained model is used to quickly find the necessary phase shifts for a given unitary transformation using optimization routine applied to the model and not to the physical device. Our learning algorithm is divided into two stages: the training stage -find the model of the interferometer using the training set of sample transformations, and the programming stage -determine the phase shifts of the interferometer model corresponding to the required transformation.

II. FORMULATION OF THE PROBLEM
We devised our algorithm to solve the problem of programming the multimode interferometer consisting of alternating phaseshifting and mode mixing layers. This architecture has been proven to deliver close to universal performance and has no straightforward connection linking the elements of the matrix to the phase shifts of the interferometer [14]. This architecture serves as the perfect example to demonstrate the gist of our algorithm. The circuit topology is outlined in Fig. 1a). The unitary matrix U is expressed as where Φ = diag(e iϕ 1 , . . . , e iϕ N ), = 1, . . . , N + 1. We call U the basis matrices because they completely define the interferometer operation. If the U are available a simple numerical optimization routine finds the corresponding phase shifts ϕ k thus completing the task to program the device. It is worth noting that the generalized form of the expansion of the type (1) given in [14] is valid for any linear-optical interferometer design. Indeed it is easy to confirm that every optical interferometer comprised of independent ingredients -fixed beamsplitting elements of any topology and phase modulators -can be unfolded into the sequence of unitary transformations coupling the modes of the circuit and the phase shifters. The only information required is the number of mode mixing layers and the phase shifting layers. The fact that the inner structure of the beamsplitting elements can be arbitrary and there is no restriction on the length, order or number of mode mixers and phase shifters gives us the strong ground to call our algorithm architecture agnostic.
The problem underpinning the difficulty of programming the required unitary in the multimode architecture is that the basis matrices U of the fabricated device do not match the ones implied by the circuit optical design.
The interferometer universality is not degraded but efficient evaluation of the phase shifts ϕ k becomes impossible since U are not know anymore. This brings us to the first step of our learning algorithm -the reconstruction of the parameters of the basis matrices utilizing the information from the training set T gathered from the device of interest (see Fi.1b)). The set T includes M pairs (Ū (i) ,Φ (i) ) obtained by seeding the device with random phase shiftsΦ (i) and applying the unitary reconstruction procedure of choice [16,21] to get theŪ (i) . The basis matrices U are then determined as the solution of the optimization problem where the figure of merit is averaged over the training set T . Once the basis matrices are determined we move to the second step of the algorithm (see Fig.1c)) -finding the appropriate phase shifts ϕ k which will adjust the interferometer model M to match the unitary matrix U 0 / ∈ T .

III. THE LEARNING ALGORITHM
In this section, we present the algorithm which learns the interferometer model M based on the initial data contained in the training set T . We reduce the learning problem to the multiparameter optimization of the nonlinear functional J. In this section we present the mathematical framework of the learning algorithm and exemplify its performance on the multimode interferometer.

A. The figure of merit
The figure of merit J to be studied in our work is the Frobenius norm J F R : It is invariant only under the identity transformation, that is, J F R (U,Ū ) = 0 only if the the magnitude and the phase of U andŪ matrix elements are identical. The expression 3 can be rewritten using Hadamard's product (A B) i,j = (A) i,j · (B) i,j operation and takes the following form: The gradient of the J F R with respect to the parameter set {ᾱ} is given by:

B. Computing the gradients of J
The gradient-based optimization algorithm substantially benefit from the analytical gradient expressions of the optimized functions. It turns out that the multimode interferometer expansion 1 admits simple analytic forms of the gradients over the u ij elements of the basis matrices U and over the phase shifts ϕ k . We will derive the analytical expressions of the gradients ∂ and ∂ ϕ k J F R required during learning and tuning stages of the algorithm respectively. The Eq. 5 stems that the We will first focus on the ∂ x ( ) ij J F R and ∂ y ( ) ij J F R gradients. In order to simplify the computation we introduce N auxiliary matrices A and another N auxiliary matrices B as the partial products taken from the expansion Eq. 1: where A and B can be calculated iteratively: Next, given that x where ∆ (ij) are the matrices in which all elements are zeros, except ∆ (ij) ij = 1. The Appendix A provides the detailed derivation of the Eq. 8.
Once the basis matrices of the model M are learnt we can use them to calculate the gradients ∂ ϕ k J F R . The derivation of the ∂ ϕ k J F R also requires to introduce The N + 1 auxiliary matrices C and N + 1 matrices D representing the partial products from the general expansion Eq. 1. The iterative formula establishes C and D for each index : Once the C and D are computed, the gradients ∂ ϕ k U are given by where all ∆ (kk) elements are zeros, except the ∆ (kk) kk = 1. The details of the multimode interferometer architecture tuning is described in [14]. The output of the ϕ k consludes the workflow of the algorithm.

IV. NUMERICAL EXPERIMENT
In this section we provide key performance metrics of the interferometer model learning algorithm. We test the algorithm scaling properties with respect to the training set M size and the number of interferometer modes N .
To certify the quality of the model M we employ the cross-validation methodology -the quality tests use another set of examples which wasn't included in T .
The simulation of the learning algorithm follows a series of steps. We first generate the training set (Ū (i) ,Φ (i) ) using the multimode interferometer expansion Eq. 1, while we choose the phases randomly from a uniform distribution from 0 to 2π. The basis matrices U are sampled randomly from the Haar-uniform distribution using the QR decomposition [22]. In the real-life setting the elements of T are the outcomesŪ (i) rec of the unitary reconstruction [16,21] algorithms applied to the reconfigurable interferometer programmed with the phasesΦ (i) . The subtleties of gathering the appropriate training set experimentally are discussed in Sec. V. The proper interferometer model must accurately predict the unitary matrix of the real device with the certain set of phases Φ applied. The cross-validation purpose is to estimate the predictive strength of the model M . For cross-validation we generate the test set (Û (i) ,Φ (i) ) comprised of randomly selected phasesΦ (i) and the corre-spondingÛ (i) . The cross-validation test uses each sample from the test set to verify whether the interferometer model with phasesΦ (i) outputs the unitaryÛ (i) . The model U is considered to pass the cross-validation if J(U,Û ) ≤ 10 (−2) . The criteria has been derived empirically by analyzing the behaviour of J F R convergence on the test set. If the model passes cross-validation the J F R (U,Û ) experiences rapid decrease down to the values less than 10 −2 .
The model is initialized with basis matrices selected either randomly or with a priori knowledge available based on the design of the physical elements realizing the basis matrices. We will study both cases and refer to the random initialization as the black box model. At each epoch of the learning process we estimate the average gradient over the collection of examples from T and update the basis matrices according to the optimization algorithm (stochastic L-BFGS-B algorithm, SciPy package). Instead of averaging the gradient over full training set we randomly draw a subset of m = 5 pairs (Ū ,Φ) each epoch and use this subset for averaging. The value m = 5 has been determined empirically as it was providing substantial computational speed-up while still keeping high training accuracy. We don't use the unitary parametrization of the basis matrices during the learning procedure and these matrices simply as the complex-valued square matrices. Since the parameters of the model are then the real x ( ) ij and the imaginary y ( ) ij parts of each basis matrix U the updated basis matrices don't belong to the unitary space. We use the polar decomposition A = HV , where H is a hermitian and V is the unitary matrix, to project updated complex basis matrix C onto the closest unitary U at each step of the optimization [23]. This method helps to avoid local minimum problem which may arise due to sophisticated unitary matrix parametrization.
The simulation code is written in Python employing Numpy and Scipy packages. The code is publicly available on GitLab [24].

A. Model -black box
We start first from considering the black box model. This scenario implies no a priori information about the basis matrices U is available and the interferometer is represented by a black box type system with N 2 variable phase parameters ϕ k . Therefore, the model should be initialized with completely random guess. The initial basis matrices are sampled from the Haar-random unitary distribution [22]. Fig. 2 illustrates the convergence of the average value of the functional J F R during the learning for different values of interferometer dimension N . The model cross-validation testing (see Fig. 3) determines the size M of the training set for each value of N .
It should be noted that as N increases a plateau appears, which heavily impacts the learning convergence. The plateau becomes significant already at N = 6 ( Fig.  2c). For N > 6, we failed to observe learning in the black box -the average value of the figure of merit remains at a plateau all the time. The work [25] suggests that the reason for the plateau in high-dimensional optimization problems is the presence of the large number of saddle points in the optimized function landscape rather than local minima. Several algorithms exploiting the idea of the adaptive gradient have been developed to tackle the problem of escaping the plateau [26,27].
Until this moment the elements of T included the ma-  same set using the real device means that the reconstruction of theŪ (i) exp matrices must be performed with absolute precision, which is never the case in the experiment. The learning algorithm has to be designed to tolerate the certain amount of discrepancy between the idealŪ ideal and the reconstructedŪ exp matrices. These deviations are the inevitable consequence of imperfections of measurement tools used during the reconstruction process. We have modeled the behaviour of the learning algorithm seeded with a training set including the phase shiftsΦ (i) and the unitariesŪ (i) exp slightly deviated from their theoretical counterpartŪ The deviation is introduced betweenŪ exp andŪ ideal as the polar projection [23] of the perturbedŪ ideal onto the unitary space: where X and Y are the random real-valued matrices of size N × N , which elements are sampled from the normal distribution N (0, 1). The degree of deviation is controlled by the real-valued parameter α. The calibration curves provided in the Appendix C juxtapose the more common matrix distance measure the fidelity F calculated as the Hilbert-Schmidt scalar product to the deviation values F (α). These curves should develop the intuition to interpret the J F R values. The Fig. 4 illustrates the the convergence of the model of the simplest case N = 2 supplied with the training set sampled with the given deviation α. The α ≈ 0.04 indicates the threshold at which the model fails to pass the cross-validation criteria J F R ≤ 10 −2 . The black box model results expounded in the sec. IV A evidence that the optimization complexity of the model with arbitrary initial basis matrices grows rapidly in the training set volume M . In this section we study the choice of the initial approximation for the basis matrices U which enables learning for the larger dimension N . The case when the basis matrices U are completely unknown does not adequately reflect the real situation. In practice the optical circuits with well-defined geometry and optical properties implements the basis matrices. The prototypes of these circuit can be tested beforehand to verify the performance of the circuit including the mode transformation that it generates. Contemporary integrated photonics fabrication technologies guarantee the reproducibility up to the certain level of precision specific to each technological line. Summing up the basis matrix unitary transformation U est can be estimated in advance. This estimate serves as the initial guess for the optimization algorithm at the model training stage. In this section we will demonstrate how this knowledge substantially simplifies the optimization and enables learning the models of the interferometers with N up to at least few tens of modes.
We use estimated matrices U est as the initial guess for our optimization routine. These matrices are chosen to be close to the ideal basis matrices U used for training set generation. We get the initial guess using the procedure described be Eq. 12. The Fig. 5 shows the convergence of the learning algorithm employing the knowledge of the basis matrix unitary up to a certain precision. The a priori information about the basis matrices enabled learning the model of the interferometer up to N = 20 using the same computational hardware. The larger the interferometer dimension N the higher the precision of the U est estimation must be. The Fig. 5b) illustrates the regions of the α value depending on the size of the interferometer where the algorithm still accurately learns the model M.

V. DISCUSSION
The demonstrated approach to interferometer programming stands out with several major advantages. First and foremost the method is agnostic of the architecture of the interferometer. Any universal interferometers reported in literature [10,11,14,15] admit some form of the expansion Eq. 1 -the optical mode mixing elements interleaved with phase shifters. This means that both the gist of the algorithm and the mathematical framework fit any architecture of choice. This assumptions remains valid unless the mode mixers and the phase shifters are considered as the independent elements. Next, the output of the learning algorithm is the complete interferometer model taking into account the transformation of the mode mixing elements in the fabricated device. The model answers the question how close the required unitary U 0 can be approximated by the specific device under study and can pinpoint ares of unitary space inaccessible for the device due to design restrictions or the fabrication flaw. This feature has to be compared with typical calibration data used for programming the interferometer based on MZI blocks. The phase shifters are calibrated but no knowledge is available about the real transformation of the beamsplitters comprising the MZI. This fact leads to the necessity of running optimization procedure to improve the fidelity of the implemented unitary if some of the beamsplitters don't meet the designed transmission coefficients. Lastly the presented algorithm is essentially the reconstruction routine for the inner fixed optical elements of the complex interferometric circuit. Hence it can be adopted for probing the quality of the optical subcircuits located inside the larger optical scheme.
The bottlenecks of the proposed algorithm are related to the experimental issues. The J F R Frobenius metric requires exact measurement of the unitary elements' modulus and phase. Several reconstruction methods have been proposed and verified [16,21,[28][29][30]. Some of them [21,28] provide only partial information about the transformation matrix of the interferometer omitting phases which are impossible to reconstruct using the methodspecific dataset. Any method will inevitably suffer from the path-dependent optical loss which is impossible to distinguish and attribute to the particular path inside the circuit. Another issue which is not covered by our algorithm arises from the crosstalks between the phase shifters. Our framework assumes that the phases in different paths are enabled independently which is not the case due to the crosstalks between different phase modulating elements. Luckily the integrated photonic modulator implementations typically exhibit extremely low crosstalks [31,32].
We believe that our results will enable opportunities to employ new programmable optical interferometer architectures for both classical and quantum applications.
According to the expressions (1) and (7), U = A U B . Hence, we have: We calculate the derivative: Let the unitary elements be u We convert the obtained expressions to the matrix form. Using the auxiliary matrices ∆ (mn) , which elements are zeros, except ∆ (mn) mn = 1, we can transform the formula (A2) to take the following form: We get the expression for ∂u ij ∂ϕ k : We will now express B3 in matrix form.We introduce auxiliary matrices ∆ (kk) , which elements are all zeros, except ∆ (kk) kk = 1. Then the formula (B3) transforms to: Then we have:  6. Dependence of the mean fidelity (Averaging over 1000 unitary matrices) with which the trained model can produce unitary matrices for the given phases (blue) and the mean fidelity (averaging over 10000 unitary matrices) with which the matrices from the training set (green) are known on the α parameter for size matrix N = 2.