Performance evaluation of coherent Ising machines against classical neural networks

The coherent Ising machine is expected to find a near-optimal solution in various combinatorial optimization problems, which has been experimentally confirmed with optical parametric oscillators (OPOs) and a field programmable gate array (FPGA) circuit. The similar mathematical models were proposed three decades ago by J. J. Hopfield, et al. in the context of classical neural networks. In this article, we compare the computational performance of both models.


Introduction
In recent trends in semiconductor technologies, the Moore's law is slowing down mainly due to the limitation of micro-fabrication heat dissipation and communication bottleneck problems on a chip [1,2]. Many efforts to boost the processor performance have been made for parallelized architectures including GPU, other multi/many-core processors, and neuromorphic hardwares [3]. An optics-based special purpose computer, which is named the coherent Ising machine (CIM), has been proposed to exploit a rapid physical convergence time for accelerating the solution search in hard optimization problems [4].
One of the well-known examples of combinatorial optimization problems is a maximum cut problem (MAX-CUT) on a graph, which is essentially equivalent to the Ising model in statistical mechanics [5]. It is a problem to find the largest cut in a given graph G = (V, E), where the number of edges at the boundary of a partition of vertices into two subsets is maximized. The size of the cut is defined as the total weight of edges separated by the cut, i.e., edges which have each endpoints in the different sides of the cut. This objective function can be written as where the graph order n = |V | is the number of vertices, w ij is the weight of the edge (i, j) ∈ E and x i = ±1 is a binary value indicating which side of the cut the vertex i ∈ V belongs to. To implement the above problem on a physical system, the injection-locked lasers [4] and the degenerate optical parametric oscillators (DOPOs) [6] were proposed to use. With a series of experimental challenges, the implementation of the optical delay line based [7,8] and measurement feedback based [9,10] coherent Ising machines has been realized.
As a metaheuristic algorithm, the CIM can be interpreted as a mathematical model to solve combinatorial optimization problems using recurrently updated neurons with nonlinear activation function. From this point of view, there have been related and interesting approaches by mathematical models of the neurons (e.g., [11,12]) and their networks (e.g., [13]). Hopfield developed the optimization algorithm by using such neural networks [14]. Then Hopfield and Tank extended it to the continuous valued model to improve the performance and applied it to the combinatorial optimization problems [15,16]. Simulated annealing (SA) is proposed in the same period [19].
Here, we try to clarify the relative performance of our CIM against a family of classical neural network approaches and SA for combinatorial optimization problems especially for MAX-CUT. This paper is organized as follows. In section 2.1, we describe the basic concept and experimental configuration of CIM. In section 2.2, we describe models of neural network algorithm for the combinatorial optimization problems followed by section 2.3 to describe suitable hardware implementation. Then the numerical experiments are performed in section 3. We discuss the results and other possibilities of implementations in section 4. Finally, we conclude the paper in section 5.

Coherent Ising machine (CIM)
We intend to solve combinatorial optimization problems by mapping the cost function (1) to the energy of an Ising spin system. CIM is initially proposed as an injectionlocked laser system [4], followed by the proposal using a degenerate optical parametric oscillator (DOPO) system [6]. So far, several experimental machines are demonstrated with n = 4, 16, 100, 2048-pulse systems [7,8,9,10]. Since the original MAX-CUT has binary variables, we use a bistable optical device, DOPO at the output stage of computation, while an analog optical device, degenerate optical parametric amplifier (DOPA), at the solution search stage of computation. Figure 1 depicts the schematic of the measurement feedback based CIM [9,10]. Here we describe the typical experimental configurations in [10]. The DOPO part consists of a 1 km optical fiber with an externally pumped periodically poled lithium niobate (PPLN) waveguide. The pulsed pump laser, at the 1 GHz repetition rate of 5000 times as the cavity circulation frequency, generates 5000 individual DOPO pulses in a single fiber ring cavity. A segment of them (2000 pulses) is used as the signal pulses for computation and the remaining portion (3000 pulses) is used for the cavity stabilization.
The feedback circuit stores the interaction strength for each pair of DOPO pulses. A portion of optical pulse is picked-off by a beamsplitter (numbered as 1 in the figure 1) and measured by balanced homodyne detectors. The measured values of DOPO pulse amplitudes are fed into an analog-digital converter (ADC), followed by FPGAs. Here, 1 GHz repetition rate of signal pulses is downclocked to 125 MHz (8 parallel) and the measured amplitudesc i are sliced into the digital signals of 5 bits. Then 2 FPGAs sum up the coupling effect from the other vertices (in the given topology) j J ijcj for the ith pulse. The feedback pulse train is modulated in intensity and phase by this output electrical signal after a digital-analog converter (DAC). The feedback pulse is injected to the signal DOPO pulse running through the main fiber ring cavity via a beamsplitter #2.
The DOPO is operated near the oscillation threshold by crossing the pump rate from below to above the threshold in the case of [10]. In the beginning, the DOPO is biased at below the threshold in which all phase configuration is established so as a superposition state and the quantum parallel search is implemented [18]. Then, the external pump rate (or the feedback) strength is gradually increased, and once the whole system reaches the oscillation threshold, it selects a particular phase configuration which corresponds to the near-optimal solution of the original optimization problem.
The dynamics of the CIM can be simulated by the quantum master equation. Instead of numerically integrating the master equation for the DOPO density operator, we can expand the density operator by the quasi-probability function in the phase space. One quasi-probability function need for this purpose is the positive P (α, β) representation in terms of the off-diagonal coherent state expansion, |α β|. The Fokker-Planck equation for P (α, β) is derived from the master equation and then the c-number stochastic differential equations for α and β are obtained using the Ito calculus (see [17] for detail). Another quasi-probability function used for this purpose is the truncated Wigner representation W (α) in terms of the Gaussian states. The corresponding cnumber stochastic differential equations are derived in [18].

Classical neural networks
We describe in this section the classical neural network models to solve the same combinatorial optimization problems, three of which are summarized in the table 1.   Figure 1. Experimental schematic of a coherent Ising machine implemented on a fiber DOPO with an FPGA measurement feedback circuit.  [14], which is referred to the Hopfield network (HN). The neuron in this model has the discrete output values x i = ±1 with a simple majority voting update rule: which will execute asynchronously. The spin index i is selected randomly in the original paper but we derandomized it to enhance the speed, i.e., the spin indices from i = 1 to i = n are updated sequentially. Simultaneous updates introduce the instability or periodic solution into the system. Since the update is local and deterministic, the system will converge into the nearest local minimum, which is determined by the initial state. Note that the model is originally proposed with {0, 1}-binary neurons, but for comparison, we use equivalent {+1, −1}-valued neurons.

Simulated Annealing (SA)
While the HN will often get stacked at poor local minima, Kirkpatrick, et al. introduced a stochastic spin update strategy in simulated annealing (SA) algorithm to mimic thermal annealing [19]. The probability of stochastic spin flip is governed by the Boltzmann factor in the Metropolis-Hastings procedure as follows: even if the energy difference to flip the ith spin makes the total energy increased, namely ∆E i > 0. The spin index i is selected randomly while temperature T is gradually decreased.

Hopfield-Tank Neural Network (HTNN)
Hopfield and Tank proposed another neural network approach using an analog valued neuron x i ∈ [−1, 1], which is referred to the Hopfield-Tank neural network (HTNN) [16]. The time evolution of the HTNN is described by ordinary differential equations (ODE): where f (x) is a nonlinear sigmoid function. In this study, tanh(x) is used as f (x). In the extremely high linear gain limit, i.e., when the slope of the sigmoid function around 0 is steep, this HTNN model becomes close to the HN model described above. The parameters in later section are optimized as the neuron decay rate α = 6 and the synaptic connection strength β = 0.1 to achieve the best performance for the given MAX-CUT problems. The numerical integration of (5) is performed by the Euler method with the discrete time step ∆t = 0.01.

Hardware used for Implementation of classical neural networks
Here we describe the hardware configuration needed to implement the classical neural networks, which will be used in the benchmark section. Note that all codes are implemented with C++ ‡.

MIMD many core processor PEZY-SC (for HTNN)
Since HTNN is based on ordinary differential equations (a continuous-valued continuous-time system) and requires floating-point arithmetic, it is better to parallelize by accelerators. We used a MIMD many core processor PEZY-SC @ 733 MHz with 1024 cores and 8192 threads on a chip (the architecture is shown in figure 2), which is set in Shoubu supercomputer at Riken (Japan). We parallelized matrix-vector multiplication and neuron updates in 8192-thread parallel. The coupling matrix is efficiently stored as a 1-bit matrix (since J ij = ±1 has no empty entry) and neuronal states as floating points (32-bit float). Note that it was 1.4 times faster than storing matrix values in 32 bits. The benchmark of the hardware itself is shown in Appendix A.1.

Results
We compared the performance of HN, SA, HTNN, and CIM by solving the MAX-CUT problems on a dense graph. The particular problem instance is a complete graph, in which all pair of N = 2000 vertices are connected and edges are weighted by {+1, −1} in uniform distribution (the identical instance as in Ref. [10]). Figure 3 shows the performance on the complete graph, while the detailed computation time to target and the hardware configurations are summarized in table 2.  We ran 100 different trials for the same problem instance (except for CIM, which consists of 26 trials). Each solid line in the figure 3 indicates the ensemble average of all trials, while the lower and upper shaded lines indicate the best and worst case envelopes, respectively. Here, parameters for SA and HTNN are optimized to achieve the shortest computation time to the target which is obtained by the SDP relaxation algorithm [20].
The computation time to the SDP-produced target is shorter in the order of CIM, HN, SA, HTNN on the instance. The data from CIM are noisy due to experimental noise, but it can find better solutions than the target in all 26 trials. HN is faster than SA since HN can be regarded as a derandomized version of SA. Note that in the worst case, HN cannot reach the target (it fails 3 times in 100 trials as it can be seen partly in the worst case in figure 3). It can be understand that HTNN performs much slower than HN/SA since it solves ODEs which deals with the analog variables. Note that HTNN achieves lower energy than SA in figure 3 but the performance of SA heavily depends on temperature scheduling. We optimized to reach the target shorter but slower scheduling ends up lower energy generally.

Discussion
In this section, we will add the two discussions to justify the above conclusions, • Validity for the hardware selection.
• Optimization for PEZY-SC implementation for HTNN. HTNN is apparently efficient on PEZY-SC than on CPU, which is shown in Appendix A.1. On the other hand, we did not use any accelerator for HN and SA in this study. This is because we do not expect significant speed-up by naive implementation since they are already parallelized by SIMD operations in CPU and the cache hit rate is so high as 98.8% (measured by perf command in Linux). Generally, asynchronous update in HN/SA seems to be not suitable for parallel implementation.

Optimization for PEZY-SC implementation for HTNN
We tried to optimize the implementation by storing matrix data efficiently. Since the given adjacency matrix has only the 1-bit entry (J ij = ±1), we packed each value in 1 bit. This contributes the 1.4 times speed-up than having a 32-bit float matrix. But putting the data in local memory does not contribute to significant speed-up since its bottleneck in computation is not in memory transfer. There is a possibility of speed-up if we replace the multiplication by the selector. Rather, it is possible to scale out for parallel distributed processing by using multiple PEZY-SC chips in Soubu supercomputer, especially when the problem size is larger.

Conclusion
In this paper we compared the performance of the CIM implemented on DOPOs and FPGAs against the family of classical neural-network-based algorithms: HN, SA and HTNN. To accelerate the performance of the classical neural networks, HN and SA are implemented on CPU with bit operations and HTNN is implemented on a many core processor PEZY-SC. It is shown that the CIM can achieve faster computational time than HN (13.0 times for the best case and 6.97 times for the average), SA (29.6 times Table A1. Accelerator configurations for parallel implementation of the c-number stochastic differential equations in figure A1.