I/O-efficient iterative matrix inversion with photonic integrated circuits

Photonic integrated circuits have been extensively explored for optical processing with the aim of breaking the speed and energy efficiency bottlenecks of digital electronics. However, the input/output (IO) bottleneck remains one of the key barriers. Here we report a photonic iterative processor (PIP) for matrix-inversion-intensive applications. The direct reuse of inputted data in the optical domain unlocks the potential to break the IO bottleneck. We demonstrate notable IO advantages with a lossless PIP for real-valued matrix inversion and integral-differential equation solving, as well as a coherent PIP with optical loops integrated on-chip, enabling complex-valued computation and a net inversion time of 1.2 ns. Furthermore, we estimate at least an order of magnitude enhancement in IO efficiency of a PIP over photonic single-pass processors and the state-of-the-art electronic processors for reservoir training tasks and multiple-input and multiple-output (MIMO) precoding tasks, indicating the huge potential of PIP technology in practical applications.


Introduction
The input/output (IO) data movement in a processor typically has limited bandwidth and consumes considerable energy, leading to performance bottlenecks in processing systems that receive input data and store output data through an attached host with memory [1][2][3][4]. The IO issue becomes particularly severe when the problem size exceeds the capacity of the processor, a situation that is often encountered in practice. Thus, enhancing the computation-to-IO (C-to-IO) ratio, i.e. increasing the number of implemented operations for a given input/output data size, becomes a key focus. In computation-intensive tasks such as matrix multiplications and inversions, where multiple operations are performed on each piece of input data, the C-to-IO ratio can be improved by reusing the input data. The tensor processing unit (TPU) exemplifies this approach in accelerating neural network inference tasks. The TPU utilizes a systolic architecture to compute matrix multiplications in parallel by reusing the input data, demonstrating significant speed and energy efficiency improvements over central processing units (CPUs) and graphics processing units (GPUs) [2]. Nevertheless, the ever-growing demands of compute-intensive applications continue to drive the search for faster and more energy-efficient solutions.
Photonic processors have long been considered a promising alternative for algebraic computations, owing to their inherent high parallelism and low energy consumption [4][5][6][7]. They have excelled at matrix multiplication, with notable demonstrations in voice recognition [8], image classification [9,10], and optical communication [11]. However, little research has been done on using photonic processors to accelerate matrix inversion, a more computationally expensive operation that is fundamental to many scientific and engineering problems such as numerical computations [12][13][14][15][16], statistics [12,17], wireless communication systems [18][19][20][21][22] and neural network training [23,24]. In most practical applications, the matrices to be inverted are diagonally dominant or sparse, and approximate inversion results are often sufficient for subsequent processing stages. Iterative inversion methods generally outperform direct methods in these scenarios, since one can easily balance speed and accuracy by terminating the iterative process after several iterations, whereas interrupting a direct method midway yields incorrect results [12]. Classical iterative matrix inversion algorithms essentially perform matrix multiplications and additions iteratively. Implementing such algorithms on a widely reported photonic single-pass processor (PSP) [8,9,25,26] necessitates repeatedly inputting and outputting computation results in each iteration, thus generating heavy IO traffic.
In this paper, we report a novel photonic iterative processor (PIP) based on reconfigurable photonic integrated circuits for speeding up matrix inversion tasks. The inclusion of optical loopback enables iterative computations for direct matrix inversion with an enhanced IO efficiency. The processor core is a matrix-vector multiplier comprising MZI units, which reduces the computation overhead by directly encoding the matrix elements on MZI arrays. Table 1 compares the C-to-IO ratios for inverting an $N \times N$ matrix using the iterative Richardson method [12] on four different platforms: the CPU, TPU, PSP, and our proposed PIP. The Richardson method is assumed to converge after $k + 1$ iterations. According to the table, the PIP exhibits the highest C-to-IO ratio of  3 /(2 2 + ) among the four computing platforms, followed by the PSP, the TPU, and the CPU. In most application scenarios,  is much larger than , indicating an improvement of at least  times in the C-to-IO ratio for PIPs compared to other platforms (see Supplementary 1 for more details).
We demonstrate, to the best of our knowledge, the first lossless reconfigurable PIP system that is capable of directly inverting matrices and solving integral and differential equations. This PIP system computes 4×4 real-valued matrix inversions with an accuracy of >97%, a C-to-IO ratio improvement of >7 times, a core energy efficiency of 4.6 MOPS/W (mega operations per second per watt), and a net inversion time of 2.6 µs, which is bounded solely by the length of the fibre-based optical loops. The lossless PIP is then reconfigured to numerically solve real-valued integral and differential equations, reaching a mean absolute error of <0.02 and a C-to-IO ratio improvement of up to 1.8 times. The first coherent PIP with on-chip optical loops is also demonstrated to break the loop-length limitation on the net inversion time. The coherent PIP performs 2×2 complex-valued matrix inversions with an accuracy of >98%, a core energy efficiency of 3.6 GOPS/W, and a net inversion time of 1.2 ns. Its enhancement in C-to-IO ratio reaches up to 2.8 times. By emulating MIMO precoding tasks and reservoir training tasks, we show that the much-reduced IO demand allows the proposed PIP to reach at least an order of magnitude IO efficiency enhancement compared with a single-pass optical processor and the state-of-the-art electronic processors. Our results indicate a promising route towards more powerful optical processors that could surpass IO limits.

Photonic iterative processor architecture
The proposed PIP is tailored for matrix inversion problems that cannot be easily solved by traditional single-pass photonic processors. Solving integral and differential equations can be reduced to basic matrix computations including addition, subtraction, multiplication and inversion, which can all be solved optically by iterative algorithms using the PIP. Matrix inversion is computed through the Richardson method:
$$X^{(k+1)} = (I_N - \omega A)\,X^{(k)} + X^{(0)},$$
where $A$ is an $N \times N$ matrix operand to be inverted, $I_N$ is the $N \times N$ identity matrix, $\omega$ is a parameter used to adjust the convergence of the inversion algorithm, and the matrix operand is encoded in the weight bank via $I_N - \omega A$. $X^{(k+1)}$ and $X^{(k)}$ are the output matrices after $k+1$ and $k$ iterations, and $X^{(0)}$ is the initial input matrix that initiates the computation. The PIP generates matrix computation results one column at a time. Full-matrix inversion can be realised by using $N$ PIPs or by a multiplexing technique based on a different architecture we proposed [27].
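The Richardson iteration described above can be emulated numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the test matrix, the value of `omega`, and the function names are hypothetical, and the injected unit vector is assumed to be scaled by `omega` so that the fixed point is exactly one column of the inverse.

```python
def matvec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def richardson_column(A, j, omega, iterations):
    """Return column j of the approximate inverse of A.

    Iterates x <- (I - omega*A) x + x0 with x0 = omega*e_j.
    Scaling the injected unit vector by omega is an assumption made here
    so that the fixed point is column j of the inverse of A.
    """
    n = len(A)
    # Weight bank: W = I - omega*A, loaded once and reused in every iteration.
    W = [[(1.0 if r == c else 0.0) - omega * A[r][c] for c in range(n)]
         for r in range(n)]
    x0 = [omega if r == j else 0.0 for r in range(n)]  # injected input pulse
    x = list(x0)
    for _ in range(iterations):
        wx = matvec(W, x)                     # one pass through the MVM core
        x = [a + b for a, b in zip(wx, x0)]   # summed with the input pulse
    return x


def richardson_inverse(A, omega, iterations):
    """Assemble the inverse column by column, as N separate runs would."""
    n = len(A)
    cols = [richardson_column(A, j, omega, iterations) for j in range(n)]
    # cols[c][r] is element (r, c); rearrange into row-major form.
    return [[cols[c][r] for c in range(n)] for r in range(n)]


# Hypothetical diagonally dominant test matrix (not taken from the paper).
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
A_inv = richardson_inverse(A, omega=0.25, iterations=60)
```

The iteration converges only when the spectral radius of $I_N - \omega A$ is below one, which is why diagonally dominant matrices and a suitable `omega` are assumed in this sketch.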
Fig. 1 depicts the architecture of the proposed PIP based on the Richardson method, with grey arrows indicating signal flows. As highlighted by red arrows, the input pulse (representing the $j$th column of the initial input matrix, $X^{(0)}$) is generated by modulating the continuous wave (CW) laser in modulators. Upon entering the loop, part of the pulse is split and sent to detectors for output readout (one column of $X^{(k+1)}$). The remaining part first passes an $N \times N$ weight bank with each MZI encoding one element of the $N \times N$ matrix, $I_N - \omega A$. Subsequently, the weighted pulses are summed by waveguide couplers to implement matrix-vector multiplication (MVM). The MVM results (one column of $(I_N - \omega A)X^{(k)}$) are then amplified to compensate for any loop loss. Optical filters are attached to remove excess amplified spontaneous emission (ASE) noise. The "clean" pulses have now completed the operations of one iteration and are either dropped or retained for recursive computing, depending on the configuration of the optical switches. When the switches are "On", the "clean" pulses are sent for summation with the input pulse (between one column of $(I_N - \omega A)X^{(k)}$ and one column of $X^{(0)}$). Part of the summed signal is then split out and detected for output. The remaining part enters the next circulation for iterative computation. This recursive operation is terminated either when the optical switches are set to "Off" or when the input pulse ends. The PIP is also capable of computing matrix addition and multiplication in a single iteration by proper configuration. Detailed descriptions can be found in Supplementary 5.2.

Figure 1 | Conceptual figure of the photonic iterative processor (PIP).
Architecture of the proposed iterative photonic processor. The PIP serves as a photonic accelerator for matrix inversion, which is widely used in equation solving, communication systems, robotics trajectory control, etc. Matrix inversion is solved by the Richardson method, whose results are obtained through multiple iterations of the light signal in the PIP. Matrix addition, subtraction, and multiplication results can also be computed by a single iteration of the light signal in the PIP. For the lossless PIP system, the 4×4 reconfigurable weighting bank, adders and splitters are integrated on-chip. For the coherent PIP system, the 2×2 reconfigurable weighting bank, adders, splitters, optical switches, coherent detectors, and optical loopback are integrated on-chip. Monolithic integration is possible and is discussed separately in the section titled "Photonic integration techniques for a fully integrated PIP".

Real-valued matrix inversions
A 4×4 chip is taped out on the silicon nitride (SiN) platform and used as the processor core of a lossless PIP system built with off-the-shelf components, as shown in Fig. 2a. The SiN chip integrates 4 adders, 4 splitters and a 4×4 MZI weighting bank, whose enlarged views are shown in Fig. 2b-d, respectively. Light is coupled in and out of the chip via edge couplers. Complete optical loopback paths are formed by fibre components. The continuous-wave (CW) laser and optical modulator (Mod) correspond to the laser and modulator blocks in Fig. 1, generating an input vector that is coupled into the 4×4 chip. One column of the inverse is computed at a time by launching an optical input pulse into one input port, and full matrix inversion is realised by sweeping the input ports (I1-I4). The input pulse is first split into four copies on chip and then imprinted with the set of weights. The weighted signals are subsequently summed by 4 adders to perform an MVM; the results are coupled out of the chip and amplified by erbium-doped fibre amplifiers (EDFAs), followed by bandpass filters (BPFs) to suppress ASE noise. The 1×2 splitters direct part of the optical signals to photodetectors (PDs) and transimpedance amplifiers (TIAs) for output readout and capture by the oscilloscope (OSC), while the remaining part is sent back to the chip via the optical loops. Polarisation controllers (PCs) are used to align the polarisation state of the optical inputs to the chip. Electrical control is used for loading matrix weights, setting the modulator and operating the output readout.
We use the set-up shown in Fig. 2a to demonstrate two real-valued matrix inversion examples (detailed matrix values can be found in Supplementary 4.1 and 4.2). As shown in Fig. 3a

Solving real-valued integral and differential equations with matrix inversions
Integral and differential equations offer a powerful tool for quantifying the dynamics of systems that change over time or space, making them widely used in scientific research and engineering. Real-world problems are usually solved numerically rather than analytically, which essentially reduces them to matrix inversion (see Methods and Supplementary 3 and 5). Using the lossless PIP system, which operates iteratively, to solve integral and differential equations provides a novel computing paradigm that significantly reduces the demand for data movement. We adopt the system to solve an integral equation (IE, Fredholm integral equation of the second kind, Eq. (3a)) and a second-order ordinary differential equation (2nd-order ODE, Eq. (3b)), both using an 8-point discretization, and a partial differential equation (PDE, Poisson equation, Eq. (3c)) using a 4-point discretization.
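To illustrate how an integral equation turns into the kind of iteration the loop performs, the sketch below discretizes a Fredholm equation of the second kind with a midpoint rectangular rule and iterates it to convergence. It is a hedged example only: the kernel, the source term, `lam` and the choice of midpoints are hypothetical and are not those of Eq. (3a).

```python
def solve_fredholm(kernel, f, lam, n, iterations):
    """Solve u(x) = f(x) + lam * integral_0^1 K(x, t) u(t) dt by iteration.

    The integral is replaced by n rectangles of width h evaluated at their
    midpoints (one possible rectangular rule, assumed here).
    """
    h = 1.0 / n
    xs = [(i + 0.5) * h for i in range(n)]
    # The discrete problem is (I - lam*h*K) u = f. The loop matrix
    # W = lam*h*K therefore plays the role of I - omega*A with omega = 1.
    W = [[lam * h * kernel(x, t) for t in xs] for x in xs]
    b = [f(x) for x in xs]
    u = list(b)
    for _ in range(iterations):
        # One circulation: matrix-vector product, then add the input again.
        u = [sum(w * v for w, v in zip(row, u)) + bi for row, bi in zip(W, b)]
    return xs, u, W, b


# Hypothetical test problem: K(x, t) = x*t, f(x) = x, lam = 1,
# whose analytical solution is u(x) = 1.5 * x.
xs, u, W, b = solve_fredholm(lambda x, t: x * t, lambda x: x,
                             lam=1.0, n=8, iterations=60)
max_err = max(abs(ui - 1.5 * x) for ui, x in zip(u, xs))
```

With 8 points the remaining error in `max_err` comes from the discretization rather than from the iteration, which mirrors the distinction drawn in the text between the chosen and a finer discretization.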
An $n$-point discretization corresponds to $n \times n$ matrix inversions for the IE and ODE, while it corresponds to $n^2 \times n^2$ matrix inversions for the PDE. Block matrix computation techniques are employed to bridge the gap between the problem size and our chip size, a gap that could be readily closed by integrating a larger-scale processor on chip. Although using block matrix inversion techniques on a PIP increases the memory access count, the total memory access count, and hence the processing time, is still much lower than that of traditional electronic processors and photonic single-pass processors (see Supplementary 6.4 for a detailed analysis). Figure 4a-c showcase the measured solutions based on the chosen discretization resolution, ideal solutions (from a conventional 64-bit digital computer) based on the chosen discretization resolution, and ideal solutions based on a finer discretization
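One standard way to invert a matrix that is larger than the processor is the Schur-complement block inverse, sketched below. This is an assumption about what "block matrix computation techniques" could look like, not the decomposition analysed in Supplementary 6.4; `small_inverse` is a hypothetical stand-in for a half-size inversion on the hardware, and the test matrix is invented.

```python
def matmul(X, Y):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y, sign=1.0):
    """Element-wise X + sign*Y."""
    return [[a + sign * b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def small_inverse(A, iterations=100):
    """Stand-in for one half-size inversion (Richardson iteration)."""
    n = len(A)
    omega = 1.0 / max(sum(abs(v) for v in row) for row in A)
    eye = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    W = add(eye, [[omega * v for v in row] for row in A], sign=-1.0)
    X0 = [[omega * v for v in row] for row in eye]
    X = X0
    for _ in range(iterations):
        X = add(matmul(W, X), X0)
    return X


def block_inverse(M):
    """Invert a 2m x 2m matrix using only two m x m inversions."""
    m = len(M) // 2
    A = [row[:m] for row in M[:m]]
    B = [row[m:] for row in M[:m]]
    C = [row[:m] for row in M[m:]]
    D = [row[m:] for row in M[m:]]
    A_inv = small_inverse(A)
    AiB = matmul(A_inv, B)
    CAi = matmul(C, A_inv)
    # Schur complement S = D - C A^-1 B, inverted on the same small core.
    S_inv = small_inverse(add(D, matmul(C, AiB), sign=-1.0))
    top_left = add(A_inv, matmul(AiB, matmul(S_inv, CAi)))
    top_right = [[-v for v in row] for row in matmul(AiB, S_inv)]
    bottom_left = [[-v for v in row] for row in matmul(S_inv, CAi)]
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, S_inv)]
    return top + bottom


# Hypothetical symmetric, diagonally dominant 4x4 matrix split into 2x2 blocks.
M = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
M_inv = block_inverse(M)
```

The extra multiplications around the two small inversions are the source of the additional memory accesses mentioned above.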

Complex-valued matrix inversions
One of the major applications of complex-valued matrix inversion is in MIMO decoding and precoding [19][20][21], which are fundamental to 5G/6G wireless communication systems. These algorithms require the inversion of complex-valued channel matrices, whose dimensions increase significantly owing to growing data demands from an increased number of end-users. Large-scale complex-valued matrix inversion is computationally expensive and faces speed and power efficiency limitations in digital electronic computers. Unlike traditional methods that store a complex number as separate real and imaginary parts and process each part individually, photonic processors manipulate the amplitude and phase of optical signals simultaneously, enabling truly complex-valued computations with enhanced efficiency. Additionally, the PIP system can further improve the C-to-IO ratio and increase the inversion speed through its iterative optical loopback, which significantly reduces IO access.
Here we present a coherent PIP system, shown in Fig. 5a, for complex-valued matrix inversions. With optical loopback paths integrated on-chip, stable phase control can be achieved together with ultrafast processing time. Optical switches are integrated on-chip to facilitate device characterisation. Coherent outcomes are read out by off-chip balanced photodetectors (BPDs) and captured by the OSC. EDFAs are used to compensate for coupling loss only, and BPFs are used to remove excess ASE noise. We use the coherent PIP system to demonstrate three complex-valued matrix inversion examples (detailed matrix values can be found in Supplementary 4.3). The phase changes of each element of the inverse matrices are shown in Fig. 5b, 5d, and 5f, respectively, indicating good agreement with the ideal process. Figure 5c, 5e, and 5g show the evolution of the inversion accuracy of the three matrices over multiple iterations. The inversion accuracy reaches 98.8%, 98.3% and 98.9% for the three matrices, respectively.
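The same iteration carries over to complex-valued operands, where each element is tracked as an amplitude and a phase. The sketch below uses Python's built-in complex numbers; the 2×2 matrix and `omega` are hypothetical values chosen so that the iteration converges, and are not the matrices of Supplementary 4.3.

```python
import cmath


def complex_richardson(A, omega, iterations):
    """Iterate X <- (I - omega*A) X + omega*I for a complex matrix A.

    Returns the final estimate and, for every iteration, the
    (amplitude, phase) pair of each element.
    """
    n = len(A)
    W = [[(1 if r == c else 0) - omega * A[r][c] for c in range(n)]
         for r in range(n)]
    X0 = [[omega if r == c else 0j for c in range(n)] for r in range(n)]
    X = [row[:] for row in X0]
    history = []
    for _ in range(iterations):
        X = [[sum(W[r][k] * X[k][c] for k in range(n)) + X0[r][c]
              for c in range(n)] for r in range(n)]
        # cmath.polar returns (modulus, phase in radians).
        history.append([[cmath.polar(v) for v in row] for row in X])
    return X, history


# Hypothetical complex test matrix close to the identity (not from the paper).
A = [[1.0 + 0.2j, 0.1 - 0.1j],
     [0.0 + 0.2j, 0.9 - 0.3j]]
X, history = complex_richardson(A, omega=1.0, iterations=80)
amplitude_00, phase_00 = history[-1][0][0]
```

Plotting the entries of `history` against the iteration index would give an amplitude and phase evolution analogous in spirit to the curves described for Fig. 5.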

IO advantages of the PIPs
We use saved processing time, saved energy consumption, improvement in total processing time, and improvement in C-to-IO ratio to quantify the demonstrated IO advantages of the lossless PIP and the coherent PIP. The results are shown in Table 2. The parameters in the table are the number of times a matrix is decomposed before computing on a PIP, the PIP size $N$, the single-iteration processing time, the number of iterations for convergence, and the processing time and energy consumption of a single IO access. Details about calculating each specification can be found in Supplementary 6. According to the table, >7 times and >2 times improvements in the C-to-IO ratio are achieved for 4×4 real-valued matrix inversions and 2×2 complex-valued matrix inversions, respectively. For solving the integral equation and the ordinary differential equation, 8×8 matrix inversions are performed on the 4×4 lossless PIP through one decomposition, still resulting in a ~1.8 times improvement in IO efficiency. For solving the partial differential equation, 16×16 matrix inversions are performed on the 4×4 lossless PIP through two decompositions, yet still resulting in a ~1.2 times improvement in IO efficiency. These results verify the IO advantages of the PIP for matrix inversions.

Processing time (latency) of an 𝑵𝑵 × 𝑵𝑵 PIP
The processing time of a PIP includes both the core processing time and the IO access time. The core processing time refers to the duration from when a signal is launched into the processor until the computation results are ready for acquisition, and is determined by the single-iteration processing time, $\tau$, and the number of iterations to reach convergence, $k$. The PIP system shown in Fig. 1 generates the matrix inversion results one column at a time, corresponding to a core processing time of $T_{\mathrm{core}} = \tau \cdot N \cdot k$. Using wavelength multiplexing techniques can reduce the core processing time to $T_{\mathrm{core,WDM}} = \tau \cdot k$. According to Table 2, the best demonstrated net inversion times are 2.6 µs and 1.2 ns for the 4×4 and 2×2 cases, respectively. The loop length of the lossless PIP is mainly limited by the long fibre (~35 m) inside the EDFA, which can be removed by using semiconductor optical amplifiers (SOAs) or on-chip gain components. It is worth noting that, for the coherent PIP device, additional delay lines (~3.9 cm) are integrated as part of the loopback paths to ease the electrical readout; these can be shortened for a faster computation speed. To explore the future scaling of the PIP, we estimate the loop length and the single-iteration processing time of a fully integrated PIP with sizes ranging from 2×2 to 256×256, as shown in Fig. 6a-b. Both the loop length and the single-iteration processing time scale approximately linearly with PIP size.
The IO access time is defined as $T_{\mathrm{IO}} = M b / R$, where $M$ is the memory access count shown in Table 1, $b$ is the bit resolution of the data, and $R$ is the data transfer rate. $T_{\mathrm{IO}}$ includes the weight matrix loading time, input data fetching time, and output data storing time. It is a heavy burden for digital electronic processors and single-pass photonic processors, while it contributes much less to the total inversion time for the PIP. The total processing (inversion) time of a PIP and a PSP can be estimated according to $T_{\mathrm{PIP}} = T_{\mathrm{core,PIP}} + T_{\mathrm{IO,PIP}}$ and $T_{\mathrm{PSP}} = T_{\mathrm{core,PSP}} + T_{\mathrm{IO,PSP}}$, respectively. More details can be found in Supplementary 7.2-7.3.
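The latency model of this section can be written as a few small functions. This is a rough sketch under stated assumptions: the IO access time is taken to be the access count times the bit resolution divided by the data rate, the function and parameter names are hypothetical, and the numbers in the example are illustrative rather than values from Table 1 or Table 2.

```python
def core_time(t_iter, n, iterations, wavelength_multiplexed=False):
    """Core processing time in seconds.

    Without multiplexing the N columns are computed one after another;
    with wavelength multiplexing they are assumed to run in parallel.
    """
    columns = 1 if wavelength_multiplexed else n
    return t_iter * columns * iterations


def io_time(access_count, bits, rate_bps):
    """IO access time: number of accesses * bits per access / bit rate."""
    return access_count * bits / rate_bps


def total_time(t_core, t_io):
    """Total processing time is the sum of core and IO contributions."""
    return t_core + t_io


# Illustrative numbers only (not measured values from the paper).
t_core = core_time(t_iter=1e-9, n=4, iterations=20)
t_core_wdm = core_time(t_iter=1e-9, n=4, iterations=20,
                       wavelength_multiplexed=True)
t_io = io_time(access_count=36, bits=8, rate_bps=1e9)
t_total = total_time(t_core, t_io)
```

In this toy setting the IO term is larger than the core term, which is the regime where reducing the access count matters most.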

Inversion accuracy of an 𝑵𝑵 × 𝑵𝑵 PIP
Inversion accuracy is defined in terms of a matrix norm of the inversion error, i.e. the difference between the measured or simulated inversion results and the ideal or theoretical inversion results calculated on a 64-bit traditional digital electronic computer. The three main error sources when performing matrix inversions on a fully integrated PIP are: 1) quantization error, 2) ASE noise introduced during amplification, and 3) thermal and shot noise introduced during detection. The estimated inversion accuracy of a PIP with sizes ranging from 2×2 to 256×256 is shown in Fig. 6c, indicating relatively good inversion accuracy for processor sizes of up to 256×256. More details can be found in Supplementary 7.4.
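A norm-based accuracy of this kind could be computed as sketched below. The specific choice made here, one minus the Frobenius norm of the error relative to the Frobenius norm of the ideal inverse, is an assumption for illustration and may differ from the exact definition used in the paper; the matrices are invented.

```python
import math


def frobenius(M):
    """Frobenius norm of a real or complex matrix given as a list of rows."""
    return math.sqrt(sum(abs(v) ** 2 for row in M for v in row))


def inversion_accuracy(measured, ideal):
    """Return 1 - ||measured - ideal||_F / ||ideal||_F (assumed definition)."""
    diff = [[a - b for a, b in zip(rm, ri)] for rm, ri in zip(measured, ideal)]
    error = frobenius(diff) / frobenius(ideal)
    return 1.0 - error


# Hypothetical ideal and measured inverses (illustrative values only).
ideal = [[1.0, 0.0],
         [0.0, 1.0]]
measured = [[0.99, 0.01],
            [0.00, 1.02]]
accuracy = inversion_accuracy(measured, ideal)
```

Because `abs` is used inside the norm, the same function applies unchanged to complex-valued results.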

Energy efficiency of an 𝑵𝑵 × 𝑵𝑵 PIP
Energy efficiency of the PIP core is defined as $\eta = R_{\mathrm{ops}} / P_{\mathrm{core}}$, i.e. the number of operations per second, $R_{\mathrm{ops}}$, divided by the power consumption of the PIP core, $P_{\mathrm{core}}$. The components consuming energy during the matrix inversion process include lasers, modulators, semiconductor optical amplifiers (SOAs), photodetectors (PDs), the MZI weight bank, and optical switches. The energy efficiency of the lossless PIP and the coherent PIP, together with the estimated energy efficiency of the PIP (with and without wavelength multiplexing techniques) and that of the state-of-the-art electronic and photonic processors, are shown in Fig. 6d, indicating that the PIP structure is an energy-efficient architecture for matrix inversion tasks. More details can be found in Supplementary 7.5.

Scalability of the PIP
In the above analyses, the PIP size is limited to 256×256. This is sufficient for most real applications, since no more than 5 decompositions (see Supplementary 6 for more details) preserve the IO advantage for problem sizes up to 8192×8192. Nevertheless, larger-scale PIPs are still preferable for tackling even more complex problems with more significant IO advantages.
The main constraints for further scaling up the PIP are similar to those for any large-scale photonic integrated circuit (PIC), i.e., the chip insertion loss and the available reticle size of the lithography. As shown in Fig. 6c, for PIPs larger than 128×128, good accuracy can still be achieved but requires high input signal powers, indicating that the round-trip loop loss is not sustainable for higher-radix crossbar array-based MVM cores. A singular value decomposition (SVD)-based MVM core can be used to reduce signal splitting and addition losses; however, the accumulated insertion loss of the increased number of building blocks remains a challenge. Figure 6c also shows the estimated inversion accuracy of a 256×256 PIP with an SVD-based MVM core, indicating good scalability. Nevertheless, the computation overhead for implementing SVD is non-negligible. Several approaches have been reported to effectively reduce the PIC insertion loss, including the use of a microelectromechanical-system (MEMS) MZI structure [30] and high-index doped silica glass waveguides [31]. Wafer-scale photonic integrated circuits have also been demonstrated via inter-reticle waveguides [30], paving the way for very large-scale PIP systems.

Envisioned IO advantages of the PIP for matrix-inversion-intensive applications
We further evaluate the IO advantages of our PIP in matrix-inversion-intensive problems, i.e., the MIMO precoding task and the reservoir training task, which is essentially a ridge regression task. The MIMO precoding task solves  � = (   + ) −   , while the reservoir training task solves   =      (     + ) −1 . The matrices to be inverted in these applications are typically diagonally dominant, and the feasibility of inverting them on a PIP is verified in the experiments. To predict the advantages of using a PIP for matrix inversions in these two applications, we offload the matrix inversion task to a hypothetical PIP, based on the previous analyses of the PIP's processing time, accuracy, and energy consumption, and implement the main model on electronic processors. The estimated IO advantages for these two tasks are summarised in Table 3. More details about these two tasks can be found in Supplementary 8. For a normal MIMO system or the current massive MIMO system, the size of the channel matrix is typically 4×4 and up to 8×8. For the future massive MIMO system, the channel matrix will be scaled to 128×128 [32,33] or even larger. At least an order of magnitude IO improvement is predicted for PIP sizes larger than 8×8. In practice, the MIMO channel matrices are time-varying and continuous matrix inversions are required. The updating period of a MIMO channel ranges from several minutes in a quasi-static environment to several microseconds in high-speed scenarios. This means ~10^3 to ~10^10 matrices need to be inverted per day, resulting in a substantial amount of saved processing time and power consumption by using the PIP. For the reservoir training of the deep learning datasets, an 8.5 times improvement in IO efficiency is predicted for a 10×10 PIP. Similarly, the total saved processing time and energy consumption become significant, since a total of 60000 matrices need to be inverted for the training task. The IO advantages of the PIP are more significant for datasets with more features.
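The offloading idea, where only the inversion runs on the iterative processor and the rest of the model stays on the electronic host, can be sketched for ridge regression. The closed form used below, readout weights equal to Y Xᵀ (X Xᵀ + λI)⁻¹, is the standard textbook ridge solution and is assumed here; the data, `lam`, and the helper names are hypothetical, and `offloaded_inverse` merely emulates the processor with a Richardson iteration.

```python
def matmul(X, Y):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def offloaded_inverse(G, iterations=100):
    """Emulates the inversion step that would be sent to the processor."""
    n = len(G)
    omega = 1.0 / max(sum(abs(v) for v in row) for row in G)
    W = [[(1.0 if r == c else 0.0) - omega * G[r][c] for c in range(n)]
         for r in range(n)]
    X0 = [[omega if r == c else 0.0 for c in range(n)] for r in range(n)]
    X = X0
    for _ in range(iterations):
        WX = matmul(W, X)
        X = [[a + b for a, b in zip(rw, r0)] for rw, r0 in zip(WX, X0)]
    return X


def ridge_readout(X, Y, lam):
    """Host-side ridge regression with the inverse computed separately."""
    Xt = transpose(X)
    G = matmul(X, Xt)                 # feature Gram matrix (host)
    for i in range(len(G)):
        G[i][i] += lam                # regularisation keeps G well conditioned
    G_inv = offloaded_inverse(G)      # the only step offloaded in this sketch
    return matmul(matmul(Y, Xt), G_inv), G


# Hypothetical data: 2 features x 4 samples, 1 output x 4 samples.
X = [[1.0, 0.5, -0.5, 1.0],
     [0.0, 1.0, 0.5, -1.0]]
Y = [[1.0, 0.0, -1.0, 0.5]]
W_out, G = ridge_readout(X, Y, lam=0.1)
```

Only the regularised Gram matrix crosses the host-processor boundary in this sketch, so its size, set by the number of features, determines the IO volume.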

For both examples, wavelength multiplexing techniques can be used to enhance the IO advantages, as indicated by the numbers in the brackets in Table 3, leading to up to two orders of magnitude improvement in IO efficiency for MIMO precoding tasks and at least an order of magnitude improvement for the MNIST training task.

Photonic integration techniques for a fully integrated PIP
According to Table 2, even if a PIP is not fully integrated on a single chip and has a limited scale, we can still gain notable benefits in speed and power consumption thanks to the reduced IO access counts. Moreover, as shown in Table 3, a fully integrated PIP as shown in Fig. 1, or a large-scale PIP, makes the IO advantage more significant. Having optical gain in the loop supports a significantly increased number of iterations, making the PIP a more appealing technology for breaking the IO bottlenecks in practical matrix-inversion-intensive tasks. However, the integration of a light source or an amplifier on a silicon chip remains an active research area, as silicon is an indirect bandgap material that precludes the possibility of a silicon laser. A number of hybrid and heterogeneous integration methods have been investigated to combine the power of III-V with the full capability of SOI. These include flip-chip bonding [34], die/wafer bonding [35,36], micro-transfer printing [37], direct epitaxial growth [38,39], optical wire bonding [40,41], and the membrane technology [42,43]. There is an increasing number of in-house demonstrations of the full integration of active and passive components, and once these become publicly available, a fully integrated PIP can be realised readily.

Conclusion
In this paper, we propose a PIP with reconfigurable photonic circuits that is capable of handling signals in the optical domain recursively, achieving a much higher computation-to-input/output ratio than traditional single-pass optical processors and digital electronic processors. This heralds a new photonic computing paradigm that significantly reduces the data shuttling cost. We showcase the first lossless PIP and the first coherent PIP with on-chip optical loops to demonstrate its power. High-fidelity optical computations including real-valued matrix inversions, real-valued integral and differential equation solving, and complex-valued matrix inversions are performed. By emulating the MIMO precoding tasks and the reservoir training tasks, the proposed PIP is shown to be capable of reaching at least an order of magnitude IO efficiency enhancement compared with a single-pass optical processor, benefiting from the much-reduced IO demand. Our work paves the way towards next-generation photonic processors that significantly enhance IO efficiency and processing speed.

Methods
Chip fabrication. The photonic chip, with a footprint of 2.8×6.6 mm², is fabricated on a SiN platform provided by a CORNERSTONE multi-project wafer run using standard deep ultraviolet lithography with a feature size of 250 nm. The platform comprises a 3 µm buried oxide layer, a 2 µm silicon dioxide top cladding, and a 300 nm thick LPCVD SiN layer, which provides a propagation loss of <1 dB/cm. Basic building blocks including the strip waveguide, the 1×2 MMI coupler, the 90° bend, the waveguide crossing and the edge coupler are customized using a commercial Lumerical FDTD simulator. The waveguide crossing has an extinction ratio of >30 dB. The edge coupler is based on a reverse taper structure, with a mode diameter of around 3.5 µm and a coupling loss of ~2.5 dB per facet.
Chip characterisation (lossless PIP). The 16 MZI weights are characterised independently by launching a laser into the chip and measuring the output light intensity while sweeping the voltage applied to the heaters on the MZI arms. The 16 transmission-voltage (T-V) or transmission-power (T-P) curves are recorded and fitted to form lookup tables for loading matrix weights. The effective weight of each matrix element is a combination of the attenuation in the weighting bank, the loss in the loop including the MZI unit, and the gain in that loop. It is measured by creating an optical loop between the input and output waveguides of a single MZI, launching an optical pulse into the loop, and taking the ratio of subsequent output pulses. Note that there is an EDFA within the loop whose purpose is to compensate for optical losses in the loop. Finally, the attenuation in the MZI unit is adjusted to provide the required attenuation for each matrix element.

Chip characterisation (coherent PIP).
In order to characterise the 4 MZI weights, two loop switches, one input switch, and two coherent detectors need to be characterised first. The "On/Off (Bar/Cross)" states of the two loop switches are characterised by measuring the transmission of a light signal through two edge-coupled ports of the switch while sweeping the voltage applied to the heaters on the MZI arms. Configuring the loop switches to the "On/Off (Bar/Cross)" state corresponds to maintaining/terminating the iterative process. The "Bar/Cross" state of the input switch, which corresponds to injecting unit vectors into the two input ports, is characterised with both loop switches configured to the "Off" state. Then the two coherent detectors, each comprising an MZI, are characterised in turn by setting the input switch to either the "Bar" or the "Cross" state and measuring the photo-detected electrical signals with both loop switches in the "Off" state. The effective attenuation and phase shift of each MZI weight unit are the total attenuation and phase shift within the loop, respectively, which are determined by launching optical pulses into the two input ports in turn and measuring the ratio of the first two pulses.
Experimental setup. The light source (Thorlabs TLX1 tunable laser) is set to a wavelength of 1550 nm and an output power of 5 dBm. Manual polarisation controllers are used to align the light polarisation to the chip. The optical switch is a Thorlabs LNA6213 intensity modulator. The amplifiers are EDFAs from Connect Laser, which can provide >35 dB gain. The BPFs are filters from WL Photonics with a 3 dB bandwidth of 0.1 nm. Outputs are recorded with a 4-channel Keysight DSO-S 404A oscilloscope. The system requires electrical control. A customized Matlab program is used to characterise the chip. The switch is controlled by a Tektronix AFG3102C function generator. Outputs from the oscilloscope are sent back to an electronic computer for analysis.
Numerical methods. Discretization is needed for numerically solving equations (see Supplementary 2). The rectangular integration technique is used for solving integral equations by approximating the integral as the sum of a series of rectangular partitions under the curve. Fredholm integral equations of the second kind can be written in the standard form
$$u(x) = f(x) + \lambda \int_a^b K(x,t)\,u(t)\,\mathrm{d}t.$$
The finite difference method is used to solve differential equations by using finite difference formulas at evenly spaced grid points to approximate the differential equations. There are three types of difference formulas: central, forward and backward differences. Here we use the central difference to approximate the equations. The first- and second-order derivatives of ODEs (1D system) can be written as
$$u'(x_i) \approx \frac{u_{i+1} - u_{i-1}}{2h}, \qquad u''(x_i) \approx \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2},$$
where $i$ is the index of the desired grid point, $i-1$ and $i+1$ are the indices of the neighbouring points, and $h$ is the grid size. A fixed grid size is used for simplicity. For PDEs (2D system), the second-order partial derivatives with respect to variable $x$ and the gradient relationship are approximated with the same central-difference formula applied along each grid direction, e.g. $\partial^2 u/\partial x^2 \approx (u_{i+1,j} - 2u_{i,j} + u_{i-1,j})/h^2$. The discretized equations (6)-(10) are then mapped into coefficient matrices, which describe the relation between a point and the other points in the grid. Solving differential equations is thereby converted to solving matrix inversions.
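As a concrete instance of the finite difference mapping, the sketch below discretizes a second-order boundary value problem with central differences and solves the resulting linear system with a Richardson iteration. The test equation, boundary conditions, grid, and `omega` are hypothetical choices for illustration and are not Eq. (3b).

```python
import math


def solve_ode_dirichlet(f, n, iterations, omega=0.5):
    """Solve u''(x) = f(x) on (0, 1) with u(0) = u(1) = 0.

    Uses n interior grid points and the central difference
    (u[i-1] - 2*u[i] + u[i+1]) / h**2 = f(x_i).
    """
    h = 1.0 / (n + 1)
    xs = [(i + 1) * h for i in range(n)]
    # Multiply the difference equation by -h^2 so that the coefficient
    # matrix A = tridiag(-1, 2, -1) is positive definite.
    A = [[2.0 if r == c else (-1.0 if abs(r - c) == 1 else 0.0)
          for c in range(n)] for r in range(n)]
    b = [-h * h * f(x) for x in xs]
    # Eigenvalues of A lie in (0, 4), so omega = 0.5 keeps the spectral
    # radius of I - omega*A below one.
    u = [omega * v for v in b]
    for _ in range(iterations):
        Au = [sum(a * v for a, v in zip(row, u)) for row in A]
        # u <- (I - omega*A) u + omega*b, written as a residual update.
        u = [ui + omega * (bi - aui) for ui, bi, aui in zip(u, b, Au)]
    return xs, u


# Hypothetical test problem with known solution u(x) = sin(pi * x).
xs, u = solve_ode_dirichlet(lambda x: -math.pi ** 2 * math.sin(math.pi * x),
                            n=8, iterations=2000)
max_err = max(abs(ui - math.sin(math.pi * x)) for ui, x in zip(u, xs))
```

The tridiagonal matrix here is only weakly diagonally dominant, so many more iterations are needed than for the strongly diagonally dominant examples; this illustrates why conditioning matters for the iteration count.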
and Fig. 3d, the two matrices are loaded into the weighting bank in the encoded form $I_4 - \omega A$ (see Methods and Supplementary for matrix weights characterisation). By injecting different unit vectors into the input ports, different columns of the inverted matrices are obtained. Fig. 3b and Fig. 3e indicate very good agreement between the ideal and measured inverse results for the two inversion examples, respectively. The evolutions of the inversion accuracy of the two matrices during convergence are traced and shown in Fig. 3c and Fig. 3f, reaching an inversion accuracy of 97.5% and 97.2%, respectively.

Figure 2 | Lossless PIP system with a SiN chip core. (a) Experimental set-up of the lossless PIP system. (b) Enlarged view of two adders comprising two stages of cascaded 2×1 multimode interferometers (MMIs). (c) Enlarged view of a splitter consisting of two stages of cascaded 1×2 MMIs. (d) Enlarged view of a weight unit comprising a 1×1 thermo-optic (TO) MZI.

Figure 4 | Solving real-valued integral and differential equations. (a) Solutions to a Fredholm integral equation of the second kind. (b) Solutions to the 2nd-order ordinary differential equation. (c) Solutions to the partial differential equation (Poisson equation). Mean absolute errors (MAEs) are indicated at the top of each subfigure.

Figure 5 | Demonstrated coherent PIP system and complex-valued matrix inversions. (a) Experimental setup of the coherent PIP system. (b) Ideal and measured inversion process of two diagonal elements of the first matrix. (c) Evolution of inversion accuracy of the first matrix during convergence. (d) Ideal and measured inversion process of the second matrix. (e) Evolution of inversion accuracy of the second matrix during convergence. (f) Ideal and measured inversion process of the third matrix. (g) Evolution of inversion accuracy of the third matrix during convergence.
in C-to-IO ratio, __  __  to quantify the demonstrated IO advantages of the lossless PIP and the coherent PIP.The results are shown in Table

Figure 6 | Performance analyses of the PIP. (a) Estimated loop length of the PIP with sizes ranging from 2×2 to 256×256. The length of each component is based on the best reported values. (b) Estimated single-iteration processing time with sizes ranging from 2×2 to 256×256. (c) Estimated inversion accuracies of the PIP with sizes ranging from 2×2 to 256×256. Note that the 256×256 PIP with an SVD core only considers the 6 dB splitting and adding loss. (d) Estimated energy efficiency of the PIP core with sizes ranging from 2×2 to 256×256.