Perfect optics with imperfect components

Many advanced optical functions, including spatial mode converters, linear optics quantum computing gates, and arbitrary linear optical processors for communications and other applications could be implemented using meshes of Mach–Zehnder interferometers in technologies such as silicon photonics, but performance is limited by beam splitters that deviate from the ideal 50∶50 split. We propose a new architecture and a novel self-adjustment approach that automatically compensate for imperfect fabricated split ratios anywhere from 85∶15 to 15∶85. The entire mesh can be both optimized and programmed after initial fabrication, with progressive algorithms, without calculations or calibration, and even using only sources and detectors external to the mesh. Hence, one universal field-programmable linear array optical element could be mass fabricated, with broad process tolerances, and then configured automatically for a wide range of complex and precise linear optical functions. © 2015


Alternate Mach-Zehnder block configurations
In the implementations of Mach-Zehnder interferometers (MZIs), in general two phase shifters are required somewhere so that both the split ratio and the phase of at least the "Right" output can be controlled. For the automatic algorithms, as long as both of these degrees of freedom exist, various configurations are possible for the MZI blocks. Additional possible configurations are shown in Fig. S1. The configuration (a) of Fig. S1 shows a conceptual implementation based on a conventional beamsplitter with a controllable reflector and a phase shifter with phase delay  in the path from the "Right" port. The waveguide MZI version (b) of Fig. S1 is functionally equivalent to configuration (a). The split ratio of the MZI in version (b) -the ratio between the "Right" and "Bottom" output powers for a power incident on the "Top" port -is the equivalent of the reflectivity of the controllable reflector in version (a). In version (b), this is controlled by two phase shifters driven differentially, which give a controlled phase difference of  between the two interferometer arms. The configuration (c) is particularly compact and symmetric, using differential drive of the phase shifters to control the split ratio and common mode drive to vary the output phase.
These different configurations differ functionally in that they can give rise to different phase shifts of the beam in the "Bottom" output port. For the automatic alignment algorithms, it is of no consequence whether the phase of that "Bottom" output is changed or not. That "Bottom" output is fed into the next "Channel row" of the linear network, and that next row is configured later in the algorithms; that next row can configure itself to take any phase of inputs to the "Top" of its MZIs from the "Bottom" of the preceding row. Though the automatic algorithm does not care about which implementation of the MZI blocks is used, of course the calculated values of the settings of the  phase shifters will be different for different MZI block configurations. In the illustrative calculated designs below in Section 5, we presume the configurations of (a) and (b) of Fig. S1. It would be straightforward to repeat the calculations for other configurations, however.

Discussion of automatic beamsplitter adjustment "No relative phase change" condition
In the main text, we presumed for simplicity a "no relative phase change" condition in our analysis, by which we mean that, if we shine a beam in the "top" port T in Fig. 2 (a) of the main text, the relative phase of the beams transmitted into arms C and D is not changed as we vary the magnitude of the reflectivity rL of the left beamsplitter. That is the behavior that the Double Mach-Zehnder interferometer (DMZI) configuration ( Fig. 2(b) of the main text) will show. This is not quite a necessary condition in general for a loss-less beamsplitter -an arbitrary phase shift could be added just before point C in Fig. 2(b) of the main text, for example, without violating unitarity. If there were such a relative phase change, in the algorithms below we would need to re-optimize the phase  after each adjustment of rL, though that step would be simple to add to the algorithms.

Graph of P Rmin
The explicit graph of  

Algorithm for setting the beamsplitter ratios in the mesh of MZI blocks
Before introducing the various algorithms formally, we can informally introduce the algorithm for setting up all the MZI blocks in a unitary mesh as in Fig. 1 of the main text so that all the blocks have their (effective) beamsplitters set to 50:50. We will call this the "mesh 50:50 setup algorithm" or MFSA for short. For the moment, we presume we have either the DR or DB (or possibly both) (mostly-transparent) detectors or power sampling points in the blocks, as shown Fig. 2 of the main text. For this setup, we presume we have sources that we can shine into inputs WI1 to WI3 in Fig. 1 of the main text, one by one. Here, we will describe the algorithm for these 3 input and 3 output unitary meshes, but the extensions to larger meshes are straightforward. We give formal general versions of all the algorithms below in section 4. BFSA below is the "beamsplitter 50:50 setup algorithm" as in the main text. For this example case, then, the mesh 50:50 setup algorithm (MFSA) is as follows.
Shine power into WI1 only.
Run BFSA for B11. After completing BFSA, arrange that some power emerges from the lower ("bottom") port of B11 (e.g., by adjusting  if necessary).
Run BFSA for B21, similarly leaving some power emerging from its "bottom" port. Run BFSA for B31 (if needed -this block need only be a phase shifter) Shine power into WI2 only.
Run BFSA for B12, leaving some power emerging from its "bottom" port. Run BFSA for B22 (if needed -this block need only be a phase shifter) Shine power into WI3 only.
Run BFSA for B13 (if needed -this block need only be a phase shifter) This MFSA algorithm therefore allows us to set all the (effective) beamsplitters in all the MZI blocks to be 50:50.
Thus far, we have described how to set up the (effective) beamsplitters using detectors at one or both outputs of each MZI block. We already know it is possible to train the mesh for its ultimate function without detectors in the mesh (Appendix B of [1]). We can also run a slightly amended version of the MFSA algorithm without detectors inside each MZI block; specifically, we can set up the beamsplitter ratios in the entire mesh using only detectors D1 -D3, external to the mesh, on the outputs WC1 -WC3 respectively, effectively using them instead of the DR detectors inside the blocks. (Again, these need only sample a small amount of the power emerging from these waveguides.) We do this basically by working progressively first through all the MZI blocks on a given input row. A key point is that, after setting up the 50:50 ratios in all the MZI blocks in a row (e.g., B11 -B31) in this way, we add an algorithm, working backwards up through the row to set all those MZI blocks to the "bar" state -equivalent to perfect "reflection" from "top" to "right" and from "left" to "bottom"; effectively, this makes any such "bar" state MZI block appear as if it were not there at all. The details of this algorithm are given below in Section 4.
By this additional process of setting all the MZI blocks to the bar state, we leave the mesh in the starting state required for training the mesh for its ultimate function using only the external detectors D1 -D3 to run the "Self-configuring linear component algorithm" (SLCA) [1,2] for the final training of the mesh.
If we are using the mostly transparent detectors DB within the mesh, it is not necessary to set the MZI blocks to the bar state before training. Indeed, one major advantage of having the (mostly-transparent) DB detectors in the blocks is that we can be continually retraining the mesh if necessary as the problem changes, without having to reset all the blocks to the "bar" state before any such retraining; hence, having the DB detectors in the blocks allows a more incremental retraining to proceed all the time, as in tracking moving physical sources, for example.
With this informal introduction to the algorithms, in the next section we present the formal algorithms for configuring both the 50:50 split ratios and for training the ultimate function of the now-"perfect" set of MZIs.

General formal versions of the alignment algorithms
We presume there are MC channel rows and MI input rows; these numbers need not be equal in general for such processors (though Fig. 1 of the main text is drawn for MC = MI) -we might be mapping MC orthogonal MI-dimensional vectors to MC single-mode output waveguides (or channels), for example, where MC < MI.
For these algorithms, we use the terminology as in Figs. 1 and 2 of the main text, though for greater clarity of notation we will write B(m, n) instead of Bmn in labelling the blocks, WI(n) and WC(m) instead of WIn and WCm respectively, and D(m) instead of Dm.
Several of these algorithms run over all the blocks B(m, n) in the mesh. For simplicity in stating the algorithms, such algorithms can be written to run over n = 1 to MI and m = 1 to MC. Because we only actually have a "triangle" of blocks at most in a given unitary mesh (see, e.g., Fig. 1 of the main text) , or fewer if MC < MI, we test to see if the block exists using an "If block B(m, n) exists" statement. Formally, a block exists if m  MI -n + 1, which means it fits within the "triangle", and if m  MC, which means it is in one of the channel rows actually implemented in the device.
There are two versions of many of these algorithms depending on whether (i) we embed mostly-transparent detectors or power sampling points inside the mesh or (ii) we use only detectors or power sampling points D(1) -D(MC) at the channel "outputs" WC(1) -WC(MC). We presume the MZIs are all loss-less or that they all have the same loss. If the MZIs have loss, then we should add dummy blocks as discussed at the end of this section, which means that all paths from input to output go through the same number of MZI blocks. In such a case, the output is then simply multiplied by a constant corresponding to the loss in MI successive blocks, though the device otherwise performs the desired linear operations. The algorithms are given in pseudo-code here, with a syntax that is self-evident and similar to BASIC. Non-executable commenting statements are given in italics, starting with "Comment:".
These algorithms are only representative and are meant to indicate that there is at least one reasonable way of implementing all of these configurations and adjustments. There are variations possible, some of which are mentioned here, and alternative approaches that could be taken.
Algorithm S1 -Beamsplitter 50:50 setup algorithm (BFSA) Note, as discussed by example above in Section 3 and explicitly below, that this algorithm will be run as part of Algorithm S2 (MFSA) below, which will ensure powering of appropriate optical inputs to run this BFSA algorithm. to detect minimum and maximum powers as required Arrange that some (possibly all) power emerges from the lower ("bottom") port of B(m, n) by adjusting  in block B(m, n) so D(n) power is reduced at least somewhat from its maximum (possibly to zero) Comment: this puts the block in a partial or complete "cross" state. This gives some power into the "top" port of the block in the next channel row (next m) so we will be able to run BFSA on it next. Next m Comment: this next part of the algorithm works back up through the line of blocks to set them all to the "bar" state. Though this part of the corresponding algorithm above (for the case with embedded detectors) was optional, here it is needed so that input row of blocks we have just set now appears as if it is essentially not there as we set up the blocks in the next input row. For m = MI -n + 1 to 1 step -1 Comment: it may not be necessary to run this for m = MI -n + 1 since that block may just be a phase shifter This version has both set up all the (effective) beamsplitters in the MZIs to 50:50 and set all the MZI blocks are in their bar states (i.e., the mesh is implementing an identity matrix).

Algorithm S3 -Self-configuring linear component algorithm (SLCA)
This algorithm is described in Ref. [1], and we give a formal version here for completeness. This algorithm sets up a unitary processor, e.g., as in Fig. 1(a) of the main text, so that the total power in each specific orthogonal input "training" (column) vector Um  of (complex) field amplitudes at inputs WI(1) -WI(MI) is mapped to the single channel output WC(m), for all m from 1 to MC.
For this algorithm to work perfectly, the (effective) beamsplitters in the Mach-Zehnder blocks are all presumed to be 50:50, so we presume we have already run Algorithm S2 (MFSA).

Version with no embedded detectors in the blocks
We presume we have already also run the version of Algorithm S2 (MFSA) without embedded detectors so we have also set all the MZI blocks to their "bar" state as the starting condition. If we are retraining the mesh to implement a new unitary operation, then we can run Algorithm S5, the "Bar-State Setup Algorithm" (BSSA) below before this retraining. Note: This algorithm is somewhat different from the algorithm in Appendix B of [1]. That algorithm works by maximizing output progressively in only one output detector for each input vector, on the presumption that all the MZIs are set initially to their "cross" state (i.e., "top" directly through to "bottom" and "left" directly through to "right", equivalent to the beamsplitter in Fig. 2(a) having no reflectivity at all). The algorithm here starts with the MZIs in their bar state, and works by successively minimizing power in various different output detectors. We could construct a version of Algorithm S2 (MFSA) that finished by setting all the MZIs to their cross states (or we could run Algorithm S4, the "Cross-State Setup Algorithm" (CSSA) below), and then use the previous algorithm in Appendix B of [1] here instead.

Algorithm S4 -Cross-State Setup Algorithm (CSSA)
This auxiliary algorithm can be used to set up all the blocks (except the lowest row of blocks, which are operating only as phase shifters and hence are always intended to be in their "bar" state if they even contain MZIs) in their "cross" state, where "top" is transmitted completely to "bottom" and "left" is transmitted completely to "right" in each block. This algorithm should be run only after setting all the internal (effective) beamsplitters in the blocks to 50:50, i.e., after running Algorithm S2 (MFSA). The blocks can otherwise be in any internal state, i.e., with any starting values of  and  in each block.
For n = 1 to MI -1 Comment: the upper limit of n = MI -1 is sufficient since the only block with n = MI is in the lowest row, and we do not want to set any of the blocks in the lowest row to the cross state since they are only operating as phase shifters.
Shine We could also construct a version of this algorithm based on minimizing power in DR or in D(m).

Algorithm S5 -Bar-State Setup Algorithm (BSSA)
This auxiliary algorithm can be used to set up all the blocks (including the lowest row of blocks, which are operating only as phase shifters and hence are always intended to be in their "bar" state if they even contain MZIs) in their "bar" state, where "top" is transmitted completely to "right" and "left" is transmitted completely to "bottom" in each block. This algorithm should be run only after setting all the internal (effective) beamsplitters in the blocks to 50:50, i.e., after running Algorithm S2 (MFSA). The blocks can otherwise be in any internal state, i.e., with any starting values of  and  in each block. we are working back up each "Input" row of blocks, from largest m to smallest m in each row, changing them from "cross" to "bar" states Next m Next n Note that the function of this algorithm is built into the version of Algorithm S2 (MFSA) for the case without embedded detectors, but is only optional in the version with embedded detectors.

Note on requirements on phase shift elements
All the algorithms presume that we have some way of changing and holding the values of the phases , , L, and R that we may adjust for each block while running the algorithms. We could be controlling the phases through heaters or through voltages on phase shift elements, in which case we would need (e.g., electronic) memories to hold the necessary drive voltages so as to retain these phases between algorithm steps and after we are finished with the algorithms. Alternatively, we might be physically trimming phase shifting elements by adding or subtracting material or physically permanently or semi-permanently changing refractive index.
To run the beamsplitter 50:50 setup algorithm, Algorithm S1 (BFSA), we have to be able to change  by approximately  (i.e., 180°) multiple times, so it may be particularly useful to make that phase shifter voltage-controlled in some fashion, at least during the initial training phase. Otherwise, if we only want a "set-it-andforget-it" device that we configure only once, the phase shifts could be set by physical trimming. Also, having both the  and  phase shifters voltage-controlled allows the overall function of the device to be reprogrammed.

Use of dummy blocks to equalize loss
The mesh as shown explicitly in Fig. 1 of the main text has all the necessary blocks for arbitrary configuration of the network, but has different numbers of blocks in different possible "paths" through the network. For example, the path from WI1 to WC1 only passes through one block, whereas the path from WI3 to WC3 passes through three blocks. If we want to equalize background loss or overall delay on all paths, we can add dummy blocks 1,2 , ultimately to be set in their "bar" state (i.e., complete transmission from "top" to "right" and from "left" to "bottom"). Such a configuration for a 4-input, 4-output mesh is shown in Fig. S3. Fig. S3. 4-input, 4-output mesh with added blocks (with dashed outlines) so all paths from inputs WI1 -WI4 to outputs WC1 -WC4 through the network encounter the same number of blocks. Ultimately, all of the additional blocks here will be set to their "bar" state so they perform no mathematical function.
To achieve the equal numbers of blocks per path, we only need to add the dummy blocks that lie horizontally on the rows connected to the actual signal paths (i.e., on the horizontal lines from WI3 to WC3, from WI2 to WC2 and from WI1 to WC1 in Fig.  S3), which in this example is the set of blocks B03, B02, B01, B10, B20, B30, B-12, and B2-1. To allow training of all of these additional dummy blocks so that their internal (effective) beamsplitters are 50:50 and so that they are all set to the "bar" state, we add the rest of the dummy blocks (here B-10, B-1-1, B0-1, B1-1, and B00). We can essentially then use all the same algorithms as we would use for the simpler mesh without the dummy blocks, merely extending Algorithm S2, the mesh 50:50 setup algorithm (MFSA), to also include these dummy blocks. The simplest way to view that extension is as if we had "completed the triangle" of blocks in Fig. S3, imagining that we extended the input rows and channel rows to the bottom of the figure. In the case of Fig. S3, this would involve hypothetically adding blocks B-13, B-14, B-15 to channel row -1, B04 to channel row 0, and similar hypothetical extensions for input rows 0 and -1. Then we imagine running MFSA for this larger mesh, just as before, though of course there are no actual settings to set in these hypothetical blocks. The additional inputs (e.g., WI0, WI-1, WI-2, and WI-3 in Fig. S3) and outputs (e.g., the waveguides WC0, WC-1, WC-2, and WC-3 and detectors D0, D-1, D-2, and D-3) are used during the MFSA algorithm (which should also be run to leave all the blocks in the "bar" state at the end). Otherwise, they are not used; the main functional training Algorithm S3 (SLCA) uses only the "active" inputs (WI1, WI2, WI3, and WI4 in Fig. S3) and outputs (WC1, WC2, WC3, and WC4 and detectors D1, D2, D3, and D4 in Fig. S3).
Note now that every pair of beams that interferes at a given block has passed through the same number of blocks. So, if there is equal loss in every block, the pair of beams has experienced the same attenuation. Hence, the settings the alignment algorithm gives to a particular block are the same as if there was no loss in the blocks. This in turn leads to a conservation of the orthogonality properties of the unitary transform even if the blocks are lossy. The resulting mesh of blocks will behave the same as the lossless set of blocks except for multiplication by a factor that corresponds to the loss in passing through one complete path from input to output. Of course, different loss in different blocks does not lead to this kind of behavior. This does mean, however, that losses like waveguide loss and beamsplitter loss do not matter to the function of the system, within an overall loss factor, as long as the waveguide loss and beamsplitter loss are sufficiently uniform for all waveguides and beamsplitters.

Example designs Calculating device settings
The overall method we are proposing in this work does not require we calculate anything for the settings inside the mesh because it works based on feedback loops. To see what the method will actually do in a given design, we can calculate what the various device settings will be after training. This also enables us to compare with previous designs.
To do this calculation, we use the method in Appendix A of Ref. [2]. We give a simplified and condensed version of that method here in the current notation. For definiteness, we choose the structures as in Fig. S1 (a) and (b), which are equivalent. In these, changing the reflectivity (e.g., by changing ) makes no change in the phases of either the "right" or "bottom" exiting beams, and the phase change  only acts on the "right" exiting beam; these choices keep the mathematics slightly simpler. It is straightforward to map the results onto the alternative versions in Fig. 2 of the main text and in Fig. S1 (c). Also, we choose the phase delay for any "reflected" path (i.e., from "left" to "bottom", "bottom" to "left", from "top" to "right" or "right" to "top"), not including passing through any  phase shifter, to be 0 (or some multiple of 2radians), and we choose the phase delay for any "transmitted" path (not including passing through any  phase shifter) -i.e., from "left" to "right", "left" to "right", "top" to "bottom", or "bottom" to "top" -to be /2 (possibly plus some multiple of radians). (To satisfy unitarity the sum of these "top" to "bottom" and "left" to "right" phase shifts has to differ by , within an additive multiple of , from the sum of the "left" to "bottom" and "top" to "right" phase shifts.) With these choices, we can write the field reflectivity of the beamsplitter or MZI in block Bmn (not including passing through any  phase shifter) as a positive real number rmn and the field transmission "through" such a beamsplitter or MZI (not including passing through any  phase shifter), from "left" to "right", "right" to "left", "top" to "bottom", or "bottom" to "top" as and we write the phase delay setting  in any given block as mn.
Mathematically, we imagine that we shine light backwards into the "output" channel waveguides WCm, starting with WC1. Progressively, we calculate the device settings so that the field emerging from the "input" ports is the phase conjugate (or Hermitian adjoint, Um  ) of the training vector Um  (written as a vector of unit mathematical magnitude) we would have shone into the "input" ports. This is straightforward for the first channel "row".
We write the elements of the (column) vector   and with the understanding that, for n = 1, the summation term will be zero and the product term will be 1, then  (3), the "sum" term gives the sum of all the phases of the phase shifters in the preceding blocks in the channel row m, and the "product" term is the product of all the (field) transmissions of the preceding blocks in the channel row m. Note that the last block in a given row, which will be block B(m, MI -m + 1), will always have a reflectivity of 1; this will follow from the unitarity or losslessness of the optics and the normalization of the input training vectors, and in the calculations, we will explicitly always set this to 1. Note that if any To complete the calculation method, we need to calculate each successive Dm  from its corresponding training vector Um  using the known values of the rmn and mn in the preceding channel rows. Generally, this mathematical transformation from the "input" into the "top" ports of the mth row to the "input" into the "top" ports of the (m + 1)th can be accomplished using an is understood physically as the (complex) factor relating the input field on the "top" of block Bmj to its contribution to the input field on the "top" of block   1, B m s  . Here, as we can see from the matrix, 1 j s   . For the cases where 1 j s   , the product term is understood to be 1 (there is no transmission term involved because the light reflects from one block into the next where it also reflects).
So, once we have calculated the settings in a given row m from its training vector Um  , we calculate the matrix C (m) . Then, from the next training vector , which we use to calculate all the ( 1) In this way, we can calculate all the settings in the mesh for a given set of orthogonal, normalized training vectors, and hence for any desired unitary or loss-less transform between the inputs and the channel outputs.

Three-way beamsplitter ("tritter")
A three-way beamsplitter or "tritter" splits each of its three inputs so that the input power in any one input is divided equally among its three output (see, e.g., Ref. [3] for a recent discussion). This is a good example to illustrate the method here; it is just complicated enough to be a non-trivial example, it has been demonstrated in real systems (e.g., [3]), and is of specific interest for quantum applications. We will work through this example in detail to clarify definitions, notations, and the overall process.
One unitary matrix D for such a device is 3 Other matrices could also describe such a device 4 . Columns or rows in this matrix could be interchanged, which would correspond to a re-ordering of the inputs or outputs, respectively; any column could be multiplied by an arbitrary unit complex number, which corresponds to changing the phase delay from a given input to the whole set of outputs, and similarly any row could be multiplied by a unit complex number, which would change the relative phase of a given output compared to the other outputs. Fig. S4. Example calculated settings to implement a tritter. Settings for intensity reflectivity R, as set in a MZI with  radians of phase lead in one arm compared to the other, and phase lead  radians in the phase shifter (using configurations as in Fig. 2 (a) or (b)) are given for each block for a configuration as in Fig. 1  The input mathematical vectors correspond to sets of field amplitudes in the input waveguides WI1 -WI3. Obviously, such a device takes the set of input vectors corresponding to illuminating the waveguides WI1 -WI3 one at a time, and turns them, one by one, into the output vectors of sets of field amplitudes in the channel waveguides WC1 -WC3 2 /3 4 /3 1 2 3 4 /3 8 /3 In our method, we train the FPLA to perform this function using input vectors that are the columns Um  of the matrix † T U = D (or, equivalently, that are the Hermitian adjoints of the rows of T D ). Specifically, for T D we train with the vector of input amplitudes 1 1 to give output vector (Note: though we show normalized training vectors here, the actual power in the training vector makes no difference to the ultimate settings of the devices.) With this incident beam, we run Algorithm S3, the SLCA algorithm along the first channel row of blocks so all the power emerges from channel waveguide WC1 to give the desired output vector (within some overall phase delay for propagating through the device). Similarly, we train with 3 /4 2 2 3 /2 to give output vec Here we can use the convention that a positive phase  (as in the factor   exp i ) corresponds to a phase lead and a negative one corresponds to a phase lag or delay. So, the training vector here has relative phase lags of 3 / 4  and 3 / 2  radians in the inputs to waveguides WI2 and WI3 respectively. Similarly, we train with Though our method does not require any calculations to set it up, we can use the calculation method above to calculate what the settings of the phase shifters and reflectivities would be. The results are shown in Fig. S4.

Linear optics quantum computing CNOT gate
The linear optics quantum computing CNOT gate has been investigated extensively following its original proposal [5]. The version proposed in Ref. [6] has been demonstrated in discrete [7] and integrated [7,8] optical versions, including a generalized interferometer mesh with fixed couplings of two different kinds [8]. Using the relations between the input and output photon annihilation operators for the waveguide modes (Eq. (2) of Ref.
[6]), we can directly write down the unitary device matrix for such a gate as where the field variables for the input column vector, in the notation of Ref. [6], are in the order, from top to bottom, cH, cV, tH, tV, vc, and vt, and similarly for the resulting output column vector, with variables in the order, from top to bottom, cHO, cVO, tHO, tVO, vcO, and vtO. For the physical mesh, the training vectors are the columns of the matrix † G G  U D .
Using this matrix and our algorithms here, we can automatically generate the design shown in Fig. S5 for a 6-input, 6-output mesh. The result is has both similarities to and differences from the layouts deduced in the original proposal [5]. (By "essentially identical" here we mean within phase shifts of various multiples of / 2  that can arise because of differences of definitions and choices of phases associated with beamsplitter operation.) First, the present mesh obviously contains more beamsplitter blocks than the original, quite economical and symmetrical design [5]; many of additional blocks are performing only simple "bar" or "cross" routing through this network, though (in some cases with  phase shifts on the "upper" output arm). Blocks that in either the "bar" or "cross" state are shown with dashed outlines The kind of mesh of Fig. S5 is capable of much more complicated operations with more fully populated matrices, which is why these additional blocks are present. Nonetheless, we can see that this automated design has separated out the 2x2 block between the vc and cH inputs and the vcO and cHO outputs and the 4x4 block connecting the other inputs and outputs. The arrangement for the 2x2 block is essentially identical to that of the original proposal [5]. The setup for handling the lower, 4x4 block is similar to the original proposal [5] but employs 5 rather than 4 beamsplitters. Note, though, that, unlike the original proposal, all the paths through this network pass through the same number of blocks, which equalizes loss (hence preserving orthogonality in the presence of finite but equal loss in the blocks) and the optical path length for all paths is nominally the same, which allows the device to function identically over some range of wavelengths. Furthermore, the present mesh approach can work just as well for any rearrangement of the inputs or outputs (equivalent to swapping rows or columns, respectively, in the matrix G D ), and is not restricted to the particular ordering of the original proposal [5].
We can "prefactor" the network into a 2x2 unit and a 4x4 unit, then our automated algorithm run on each of these separately, using the device matrix to construct the training vectors to shine into the top two waveguides, WI1 and WI2, and the matrix 44 1 1 1 0 to construct the training vectors for the bottom four waveguides, WI3 -WI6. This process produces results shown in Fig. S6. We omit showing blocks that would be performing no function, i.e., operating in the "bar" (zero-reflection) state blocks with no additional phase shift. These additional blocks are not trained in this process, and should be preset in the bar state (by, for example, running Algorithm S5 (BSSA) before training.) The results now are essentially identical with the original proposal [5].

Hadamard transform
The Hadamard transform is an example of a unitary linear transform with a broad range of applications in signal processing, encryption and data compression, and Hadamard transforms also have applications in quantum computing. The mth Hadamard transform Hm transforms an input vector of 2 m elements to an output vector of the same dimension. For example, the operator H3 would be described by a device matrix The training vectors are the complex conjugates of the rows of 3 H D ; since these rows are real then the rows themselves are the training vectors, which are known as the Walsh functions. Note this matrix is symmetric (and Hermitian). The resulting calculated design is shown in Fig. S7. In this case, all the blocks (except the bottom row) have non-trivial reflectivities. The concept of making Hadamard transforms using sets of MZIs and phase shifters has been known in principle for some time [9] Fourier transform The concept of performing discrete Fourier transforms using combinations of Mach-Zehnder interferometers and phase shifts has also been known for some time [9]. An example form of discrete Fourier transform for a dimensionality 8 N  is given by where the matrix elements for such an N-dimensional transform are given by the expression Using our FPLA architecture, the resulting calculated design would be as shown in Fig. S8.

Lens
With this FPLA approach, discrete Fourier transforms are not restricted to dimensionalities that are powers of 2 (as is the case for fast Fourier transforms, for example), and we can recenter them as desired. Another example of a Fourier transform is a device configured to emulate the action of a lens in forming a Fourier transform in its back focal plane of the field in its front focal plane. We specifically want a uniform input to result in a "spot" in the middle of the output. We can correspondingly center the transform by using the expression Choosing N to be odd allows an output in the middle of the set of output waveguides for a uniform input. For . This leads to the calculated design as in Fig.  S9.

Non-unitary linear operations
Many common and useful linear operations are non-unitary, and so in an FPLA architecture can be implemented using the configuration discussed in Ref. [2]. Such an approach formally decomposes the device matrix D by singular value decomposition into †  DD that we will calculate numerically when performing the singular value decomposition of D are themselves also ambiguous within unit complex factors and different, valid eigenvector algorithms might generate different resulting eigenvectors. In constructing the full singular value decomposition, we will have to add some process to make sure we have chosen these unit complex numbers consistently. . With this set of choices, the product † diag VD U will correctly reconstruct the original device matrix D. This is actually only a mathematical point because the absolute phase of the individual training vectors has no meaning physically. In the end with this approach for a physical device we are going to have to choose the phase relationship between an input training vector and its corresponding output vector manually, for example by setting the phase of the modulators in the middle, by some other process, if we care about the relative phases or signs of these different outputs.

Differentiation
One generally useful non-unitary problem is differentiation or, at least, its approximation by finite differences. We can work through a particular example here to show the method of tackling nonunitary problems with the FPLA architecture. We could imagine that the set of amplitudes at the input waveguides corresponds to a set of amplitudes of some function at a set of equally spaced points; then we could construct an approximation to the spatial first derivative of that function based on the differences in the function values at adjacent points (i.e., the amplitudes in adjacent waveguides). We can consider a system with 5 input waveguides, which will have a device matrix we could write as Note, first, that this matrix is not square; there are only four differences between five adjacent values, so here we obtain a 4 5  matrix (and we will correspondingly only have 4 output waveguides). Note also that the rows of this matrix are not orthogonal. Hence, no matter what we choose for the prefactor in front of this matrix (here 1/ 8 ), D D is not unitary (and it is already not technically unitary because it is not square). The choice of the prefactor here is to some extent arbitrary because it depends Fig. S8. Results of automatic design for the 8 N  discrete Fourier transform, using the same notation as in Fig.S6.
on what we regard as the "distance" between adjacent points in our derivative. The choice of 1/ 8 here conveniently makes the sum of the squares of the matrix elements equal to 1, which also ensures that none of the singular values is greater than 1 (and hence can be implemented by a modulator element without gain), but other values could be chosen instead.
In the usual fashion for singular value decomposition, the columns of the matrix U are the eigenvectors of