Accelerating recurrent Ising machines in photonic integrated circuits

Conventional computing architectures have no known efﬁcient algorithms for combinatorial optimization tasks such as the Ising problem


INTRODUCTION
Combinatorial optimization is critical for a broad array of tasks, including artificial intelligence, bioinformatics, cryptography, scheduling, trajectory planning, and circuit design [1][2][3]. However, combinatorial problems typically fall into the nondeterministic-polynomial hard (NP-hard) problem class, becoming computationally intractable at large scale for traditional algorithms. This challenge motivates the search for alternatives to conventional (von Neumann) computing architectures that can efficiently solve such problems. The Ising problem, which consists of finding the ground state spin configuration of a quadratic Hamiltonian defined by a symmetric matrix K and spins of unit amplitude σ 1≤i≤N ∈ {−1, 1}, has garnered significant attention, as many other combinatorial problems can be polynomially reduced to an Ising problem [4,5]. Therefore, any technique for finding the ground state of arbitrary Ising problems, which is an NP-hard computational task, may extend to a wide range of other computationally intensive optimization problems. There is currently no known efficient classical algorithm to find the exact ground state of an arbitrary Ising graph, so heuristic and meta-heuristic algorithms are often implemented as a means of quickly obtaining approximate solutions [6]. Various physical systems have been proposed as Ising machines, as the evolution of many natural systems (ferromagnets [7], lipid membranes [8], random photonic networks [9], etc.) can be described by Hamiltonians similar to the one in Eq. (1). Parallel machines provide additional advantages by reducing the correlation between consecutive samples and preventing premature trapping in local minima by applying many local modifications simultaneously [10] and can also efficiently explore the phase space with many independent searches running in parallel [1,11,12].
This observation motivates the development of (re)configurable parallel analog machines, such as programmable nanophotonic processors [13][14][15][16][17]. The computational speedup with these machines is still polynomial (since analog machines also suffer from the P = N P paradigm [18]). However, when operating at a fast clock rate (GHz), afforded by a photonic implementation, these optical architectures could enable orders-of-magnitude speedups against conventional solvers [17].
A number of photonic Ising machines have been developed in the past few years, based on various physical systems including degenerate optical parametric oscillators (OPOs) [19,20], fiber networks [21,22], injection-locked laser networks [23,24], and spatial multiplexing [25]. Some of these (classical) Ising machines even demonstrated performance superior to that of certain quantum annealers on dense graphs [26]. Experimental OPO demonstrations [20,[27][28][29] and recent theoretical proposals [30] have shown great potential for finding optimal cuts of MAX-CUT problems with large numbers of spins. However, all-optical OPO machines face scaling limitations due to dispersion and decoherence of the time-division-multiplexed pulses in increasingly large resonators [28]. Additionally, some hybrid systems implement couplings via matrix multiplication in a field-programmable gate array (FPGA), and thus face scaling issues similar to those found in traditional computing for large-scale multiplication [20]. While massive spatial free-space multiplexing approaches [25] achieve a very large number of spins (N ∼ 10 4 ), the spin couplings are limited to N degrees of freedom [instead of O(N 2 ) for arbitrary Ising problems], and their clock rate is inherently slow (∼kHz). These Ising machines also differ in the way they compute: while the vast majority of them belongs to the class of heuristic solvers, the OPO-based approach has been shown to operate as a negative-temperature simulated annealing machine [31], and the approach presented in Refs. [17,25] and in the present work belong to the class of (positive temperature) Markov chain Monte Carlo algorithms [1].
Here, we experimentally demonstrate a photonic recurrent Ising sampler for probabilistically finding the ground state of an arbitrary Ising problem. Using a programmable silicon-oninsulator nanophotonic processor [14,15,17,[32][33][34], we leverage schemes for decomposing any unitary matrix into a mesh of linear optical components [35,36] to enable sampling of arbitrary Ising graphs. Furthermore, recent work on parallel photonic circuits for optical neural networks has demonstrated the capability for O(N) time-scaling for vector-matrix multiplication with an array of tunable Mach-Zehnder interferometers (MZIs) [14]. While digital electronics such as application-specific integrated circuits (ASICs) or FPGAs can also exploit parallelization to achieve the same time scaling, the architecture we demonstrate is additionally compatible with quasi-passive photonic processing. As a result, the time per iteration can be reduced from 5-100 ns in parallelized digital hardware to 0.1-1 ns in passive integrated photonics [17].

A. Theory
The proof-of-principle integrated nanophotonic recurrent Ising sampler (INPRIS) is based on a recently proposed parallel photonic recurrent network designed to find the ground state energies of Ising problems [17]. The conceptual structure of a single algorithm step of the N-spin photonic recurrent network is shown in Fig. 1(b). At time step t, the spin state vector is encoded in the amplitudes of N coherent optical signals at the input. During each algorithm step, the optically encoded spin state vector propagates through an array of optical components that encodes an arbitrarily reconfigurable N × N optical matrix [14,15,35], C, that is fixed and dependent on the problem-specific Ising coupling matrix, K. The output of the matrix multiplication is noisy, being perturbed by a Gaussian noise source with standard deviation φ. This Gaussian perturbation to the signal could be realized with electrical or optical modulation of the refractive index, or via quantum detection noise [27]. The output of the noisy matrix multiplication is then fed to an optoelectronic threshold operation, f θ th , where θ th is a vector of threshold values, that converts the continuous output vector back into a binary spin state, S (t+1) ∈ {0, 1} N . In combination, the sampler calculates the input spins for the subsequent algorithm step using the following expression: where N (µ, φ) is the normal distribution with mean, µ, and standard deviation, φ. After many algorithm steps, the output spin state probability distribution converges to an Ising Gibbs distribution (independent of the initial spin state) [17]: where C = 2J is the matrix implemented by the optical circuit, and β = 1/(kφ) (k being a constant that depends only on the noise distribution present in the system). The effective Hamiltonian of the system H L can be approximated by an Ising Hamiltonian with weights J 2 for small β (corresponding to large noise, φ) and symmetric J. Therefore, to probe a symmetric Ising coupling matrix, K = UDU † , with UDU † being the eigenvalue decomposition of K, we program the optical matrix to implement the square root of K. The square root motivates us to express the optically applied diagonal matrix in the form D α = Re( The term α is a diagonal offset matrix with > 0, and α is a scalar dropout parameter that allows us to tune the dimensionality of the ground state search by selectively dropping out lower negative eigenvalues (the choice of parameters is described in Ref. [17]). In the following, we will refer to α = 1 as no dropout (none of the eigenvalues is dropped out, as D + α is positive semidefinite) and α = 0 as dropout (the negative eigenvalues of D + α are set to zero by taking the real part). In our algorithm, the eigenvalue dropout parameter is a fixed transformation on the coupling matrix, unlike dropout encountered in machine learning, which is a dynamic technique to prevent overfitting when training neural networks [37].
To illustrate an example ground state search of the parallel photonic network architecture, we simulate a two-dimensional K, where K is the coupling matrix of the desired Ising graph. The output signal is noisy, with a distribution that is Gaussian with standard deviation φ, and goes through an analog nonlinear unit before being fed back to the chip input. Considering as an example a nine-spin 2D antiferromagnetic graph, with coupling and ground state shown in (c), the simulated energy evolution as a function of time is shown in (d). (e) Simulated energy distribution of the optical output, which converges in probability to the Gibbs distribution of the associated Ising problem [Eq. (3)], for which the ground state (c) is exponentially more likely than higher-energy states at low temperatures.
antiferromagnet with periodic boundary conditions and identical nearest-neighbor couplings, K ij = −1, shown in Fig. 1(c). After several algorithm steps, the optical output converges to a probability distribution that matches the Gibbs distribution with energy described in Eq. (3). Thus, the ground state with energy H min [every spin having opposite direction to its nearest neighbors, shown in Fig. 1(c)] is most likely, which can be readily observed in the simulated optical output [ Fig. 1(d)]. The expected Gibbs-like energy distribution is also readily observable on the simulated energy output distribution [ Fig. 1(e)].

B. Experimental Demonstration
The experimental INPRIS demonstration presented in this work finds the ground state of various four-spin Ising problems with arbitrary graph couplings, using noise as a resource to enable convergence to the ground state. The optical matrix multiplication update step, C S (t) [from Eq. (2)], is broken down into three stages, motivated by the eigenvalue decomposition of the Ising weight matrix K. The two unitary matrix multiplication steps, U † and U, are implemented in a programmable nanophotonic processor (PNP) [15,16] comprising a unitary mesh of tunable MZIs [ Fig. 2(b)]. The diagonal matrix D α is currently performed on an external CPU but could also be implemented optically with an electro-optic attenuator, a modulator, or a single MZI [14,[38][39][40][41]. In future implementations, the optical diagonal matrix should also be tunable, as D α is a function of the eigenvalue dropout hyperparameter, α. For this demonstration, we consider only two values α = {0, 1}, but a more in-depth discussion of a proposed future implementation is presented in Supplement 1, Section 3.
Our experimental setup (Fig. 2) consists of a 1550 nm laser feeding one port of the PNP, then routed [red MZIs in Fig. 2(b)] to a 5 × 5 unitary processor comprising 13 individually thermally tunable MZIs [42] [green MZIs in Fig. 2(b)], controlled by a microcontroller. The output intensities of each spatial waveguide in the PNP are measured by an InGaAs photodiode array and subjected to the set of nonlinear transformations outlined in Eq. (2) and Fig. 2(c) (phase-intensity reconstruction, addition of extrinsic Gaussian noise, and nonlinear threshold unit), currently implemented in an external CPU. The output of these electronic transformations then determines the PNP phase setting at the next algorithm step, and so forth.
To maximize the number of spins that can be handled by a single PNP, we broke down the algorithm step shown in Fig. 1(b) into multiple runs on the PNP, such that each run required only one matrix to be encoded on the chip [shown in Fig. 2(c)]. During each run, the PNP implements a unitary matrix of the form h i VR[S (τ ) ] [i ∈ 1, 2, Fig. 2(c)], where R[S (τ ) ] rotates the PNP input (1, 0, 0, 0, 0) T into vector S (τ ) , τ is the integer (resp. halfinteger) number of algorithm steps, and S (τ ) is the current spin state (resp. an intermediate result after multiplication by DU † ). V corresponds to one of the two unitary matrices involved in the eigenvalue decomposition of K (V ∈ {U, U † }), and h 1 (resp. h 2 ) is a homodyne matrix interfering PNP outputs 1 with 2 and 3 with 4 (resp. 2 with 3 and 4 with 5). We use the PNP output 1 as an idler signal of known amplitude and phase to reconstruct the  [15,16] and encodes a circuit consisting of input routing (red) and a U(5) unitary matrix (green). (c) Each algorithm step, shown conceptually in Fig. 1(b), requires four passes through the PNP chip. Each use of the chip performs a unitary matrix product of a state-preparation rotation matrix R, the desired unitary (U or U † ), and one of two homodyne detection matrices (h 1 or h 2 ). Phase-intensity reconstruction, a diagonal matrix multiplication, Gaussian noise addition, and a nonlinear threshold unit are applied in the electronic domain.
phase and amplitude of every other output in a cascaded manner (more details on the experimental setup is given in Supplement 1, Section 1).
Consequently, a single algorithm step of the INPRIS, shown in Fig. 1(b), is performed with four runs on the PNP in our proofof-concept experiment (two unitary matrices, each requiring two homodyne measurements to reconstruct the amplitude and phase of the output vector). The required reconfiguration of the PNP phases for each of these runs, due to the state-dependent R[S (τ ) ] and homodyne matrices h i , introduces an intrinsic noise on the measured outputs of the photonic system with standard deviation φ int , which depends on the single-shot fidelity of the encoded PNP transformations (the measured single-shot fidelity of the PNP was 91.6%-Supplement 1, Section 2). We emphasize that, unlike our experimental demonstration of optical neural networks [14], a deterministic PNP crosstalk correction step is performed just once before each PNP run, and that additional phase shifter optimization using a detection feedback loop is not required to optimize unitary fidelity. This high single-shot fidelity allows us to achieve low enough noise levels to probe Gibbs distribution in their ordered phase, while circumventing the increased complexity required to perform fidelity optimization at each iteration. This single-shot scheme mimics a passive or low-bandwidth photonic system, which motivates our treatment of non-idealities in the detected PNP outputs due to MZI phase settings as a separate intrinsic noise at every algorithm step. We also perturb the magnitude of each detected PNP output at the end of the algorithm step with an extrinsic zero-mean Gaussian noise with standard deviation φ ext via the CPU [see Gaussian noise unit in Fig. 2(c)]. Assuming the intrinsic contribution to the noise is also Gaussian, these two independent sources of noise add to yield a total noise level φ = φ 2 int + φ 2 ext .

Figures 3(b)-3(f ) show our main experimental results
: the noisedependent ground state population for various four-spin Ising models. Each data point is averaged over 10 solver instances with randomized initial spin states and 100 algorithm steps per instance. By comparing simulation results (green curves correspond to ideal phase and amplitude reconstruction, while blue incorporate our experimental homodyne scheme) with the larger number of iterations in Figs. 3(b)-3(f ), we observe that 100 algorithm steps are enough to converge to the ground state with high probability. In addition, we observe that our samples within the first 100 iterations show increased ground state probability over a random search [righthand plots in Figs. 3(b)-3(f )], with probabilities that approach a Gibbs distribution. While we do observe that the ground state is the most probable state for many of these graphs after 100 algorithm steps, there is a graph-dependent burn-in time required for the sampler to converge to steady-state dynamics [17]. The samples generated during the burn-in time are included in the histograms in Figs. 3(b)-3(f ) and are responsible for some decrease in ground state probability. Comparisons between the measured energy state probabilities and a Gibbs distribution for all sampled graphs and dropout levels are presented in Supplement 1, Section 5.
Since we can adjust the total noise in the system by tuning the extrinsic noise, φ ext , we can characterize the intrinsic noise level, φ int , by measuring the noise-dependent evolution of physical observables. Conventional physical observables describing spin glasses (energy, magnetization, heat capacity, susceptibility, etc.) can be calculated by using the thresholded PNP outputs as samples of the Gibbs distribution [Eq. (3)]. Typically, these samples would be simulated using a heuristic algorithm [1,3,43], for instance, with conventional simulations run on a computer. The measured absolute magnetization of a four-spin ferromagnet with dropout, measured by our PNP proof of concept, is shown in Fig. 3(a), which fits its theoretical value with a resulting intrinsic noise φ int = 0.6457. The simulated values (green and blue plots) in Figs. 3(b)-3(f ) are calculated using the total noise level with the experimentally fitted φ int . This noise level naturally present in the photonic system can be leveraged to drive the ground state search, as can be seen in Figs. 3(b)-3(f ), which show the ground state population for a variety of graphs (couplings shown in inset) as a function of the extrinsic noise standard deviation φ ext applied in the CPU. We notice that large ground state probabilities are attained for φ ext = 0, meaning that the regime of optimal total noise level is attained with little to no further addition of extrinsic noise in the system. This observation suggests that the intrinsic noise present in the hardware is already very close to the optimal total noise level. Decreasing the intrinsic dynamic noise within the system would enable sampling in regimes with lower total noise, where an optimum for the added extrinsic noise can be found. This decrease in precision can be addressed by using current sources to modulate photonic active components instead of voltage drivers. In addition, dedicating separate photonic circuits for passive and active tasks will also reduce overall system infidelity. While noise is usually considered a nuisance in most physical (computing) systems, our experimental demonstration relies on physical noise, coming from finite single-shot fidelity of the PNP. This sheds new light on noise as a computational resource in physical computing systems. When injecting extrinsic (CPU-applied) noise to the outputs at each algorithm step, we observe that the ground state probability remains large and agrees with simulations of the PNP that account for homodyne detection and phase-intensity reconstruction. The ideal performance of the PNP is shown in green for various numbers of algorithm steps per sampler instance, thus demonstrating the effect of our cascaded homodyne detection scheme on the ground state convergence probability of the INPRIS. Overall, the ground state probability is ≥80% for most graphs at some dropout level (α = 0 or α = 1) and low extrinsic noise levels, largely outperforming random search algorithms. We also observe that experiments with dropout usually outperform those without dropout, as expected from theory [17]. Extended experimental results on ground state probabilities for all generated graphs and dropout levels are available in Supplement 1, Section 5.

B. Phase Space Exploration
We further investigated the impact of the homodyne reconstruction by probing the noise-dependent phase space of the two-dimensional Ising ferromagnetic problem, shown in Figure 4.
The phase space population clearly shows a skew towards one of the ground states σ = (−1, −1, −1, −1) (results are averaged over 10 random initial states). We confirm this skew originates from the cascaded homodyne detection by comparing the phase space population to simulations with (blue) and without (green) homodyne detections for the ferromagnetic graph (see Supplement 1, Section 6). Since this skew breaks the two-fold degeneracy of the two-dimensional Ising problem, we presume that part of the phase information was lost through the reconstruction. To bypass this skew, we suggest scalable homodyne detection schemes that require more than one idler channel, in Supplement 1, Section 1. We observe that the skew does not seem to prevent the algorithm from converging to ground states but may complexify the evaluation of some physical observables. Interestingly, frozen transient stateswhich can be eliminated by reducing the pump amplitude-have also been observed in the OPO Ising machine [44]. As additional extrinsic noise is injected into the system, we observe that the phase space recovers its symmetry, thus providing an estimate for the amplitude of the skew in our photonic system.
Fabrication imperfections in photonic networks could be a significant bottleneck in scaling integrated nanophotonic coherent Ising samplers to large (N > 100) graph orders. For instance, a static skew on the phase setting or the split ratio of beam splitters will result in a static error on the effective coupling between spins, thus reshaping the Hamiltonian landscape, which could impact the algorithm efficiency [17]. Thus, the reduction of imperfections is of paramount importance in the realization of the INPRIS on large-scale static photonic networks. However, several calibration techniques have been developed to achieve precise linear optical functions with broad process tolerances [45,46]. The bandwidth and energy cost of thermal phase shifters is another bottleneck in increasing the size of photonic networks. A significant advantage of the INPRIS over learning-based methods that require reconfiguration of the PNP phases at every algorithm step [17] is that the phase reconfiguration bandwidth is irrelevant in characterizing the algorithm efficiency, since it translates only in a constant computational cost. Thus, a larger-scale INPRIS could be enabled by different architectures that do not suffer from this bottleneck and could operate at very low power levels. Examples include 3D-printed free-space optical networks [47] and silicon photonic networks combined with non-volatile phase-change materials [48].

DISCUSSION
To conclude, we introduce the INPRIS, a photonic circuit able to probe the Gibbs distribution of arbitrary Ising problems. We demonstrate a proof-of-concept experiment on a PNP that exhibits high success probability on a variety of four-spin graphs. We show that noise coming from the PNP-limited single-shot fidelity allows the system to operate close to its optimal noise regime, required to find ground states of Ising problems. Conversely, the influence of sources of noise of various origins (extrinsic versus intrinsic) and natures (static skew versus dynamic) is readily seen on physical observables such as magnetization, ground state population, and phase space distribution. Additionally, we observe agreement between the measured ground state probabilities and those predicted in simulation, suggesting the possibility of using simulated scaling predictions for this architecture, such as those presented in Supplement 1, Fig. 10 in [17], to motivate performance predictions of future large-scale systems with passive photonic components. We include a treatment of timing for a large-scale system in Supplement 1, Section 4.A. In fully passive networks, other sources of noise should be leveraged. For instance, recent work on large-scale photoelectric networks naturally demonstrates Gaussian noise on their outputs, arising at the quantum limit (and thus suggesting the operation of a very large-scale INPRIS that reaches the thermodynamic limit, relevant for simulations in statistical mechanics with N ∼ 10 6 , that can reach attojoule energy consumption per algorithm unit step) [27]. Furthermore, the photonic circuit program for this architecture is fixed for any given coupling configuration. This feature enables the potential for quasi-passive photonic ASICs, such as non-volatile photonic ASICs [48][49][50], that could deliver speed and energy savings over other physical Ising machines with active components.
Following an analysis similar to the one presented in Ref. [14], we can compare the energy consumption for a future INPRIS with that of graphics processing unit (GPU) hardware. Each iteration of the algorithm requires a matrix multiplication consisting of 2N 2 floating-point operations (FLOPs). For a proposed passive photonic system operating at a realistic clock rate of 1 GHz, one would need to perform 2N 2 * 10 9 FLOP/s. As detailed in Supplement 1, Section 4.B: we estimate the power consumption of the photonic system to be (18N) mW. These values correspond to an energy consumption per FLOP of ∼(9/N) pJ/FLOP. At the time of writing, a state-of-the-art NVIDIA V100 GPU operating at a comparable precision of 16-bit consumes 2.2 pJ/FLOP [51], suggesting an energy scaling advantage for a passive photonic architecture as the graph size increases.
This work also paves the way to larger-scale INPRISs by identifying key tradeoffs in their design. While homodyne detection allows the reconstruction of the spin state phase and amplitude, an increased number of idler signals will increase the PNP footprint for a given number of spins and determine the overall accuracy of this operation. Other tradeoffs, such as optimal values of eigenvalue dropout [17], should be taken into account when designing quasi-passive photonic GHz INPRISs on ASICs which could outperform current optical and electronic Ising machines by several orders of magnitude [20,26,28].