Reinforcement Learning Enhanced Quantum-inspired Algorithm for Combinatorial Optimization

Quantum hardware and quantum-inspired algorithms are becoming increasingly popular for combinatorial optimization. However, these algorithms may require careful hyperparameter tuning for each problem instance. We use a reinforcement learning agent in conjunction with a quantum-inspired algorithm to solve the Ising energy minimization problem, which is equivalent to the Maximum Cut problem. The agent controls the algorithm by tuning one of its parameters with the goal of improving recently seen solutions. We propose a new Rescaled Ranked Reward (R3) method that enables stable single-player version of self-play training that helps the agent to escape local optima. The training on any problem instance can be accelerated by applying transfer learning from an agent trained on randomly generated problems. Our approach allows sampling high-quality solutions to the Ising problem with high probability and outperforms both baseline heuristics and a black-box hyperparameter optimization approach.


Introduction
Many important real-world combinatorial problems can be mapped to the Ising model, ranging from portfolio optimization (Venturelli & Kondratyev, 2019;Marzec, 2016) to protein folding (Perdomo-Ortiz et al., 2012). The Ising model describes the pairwise interaction of binary particles and assigns some cost function (energy) to each particle configuration. The Ising problem consists in finding binary strings that minimize the energy. It is a quadratic unconstrained optimization task over the discrete {±1} n domain and equivalent to the Max-Cut problem from graph theory.
There are multiple methods for solving the Ising and Max-Cut problems. Classic algorithms include heuristics performing local search in the solution space, like breakout local search (Benlic & Hao, 2013) and simulated annealing (Kirkpatrick et al., 1983). For many combinatorial problems, commercial solvers are available, including Gurobi (gur, 2019) and CPLEX (cpl, 2019).
An entirely different approach is to use a quantum physical system with its energy function similar to the optimization objective, and then anneal this system towards its ground state -the lowest energy state. Devices utilizing this method include the coherent Ising machine (CIM) McMahon et al., 2016) and the quantum superconducting annealer manufactured by D-Wave Systems (McGeoch et al., 2019). For example, in CIM, pulses of light circulate in a lossy optical fiber loop containing a parametric amplifier. In each round trip, a classical controller modulates these pulses according to the parameters of the Ising problem and the measured amplitudes of other pulses.
Quantum technology does not yet compete with classical computation systems in terms of both problem size and solution quality. However, it has inspired a family of new classical optimisation algorithms that perform well in comparison with existing ones (King et al., 2018;Leleu et al., 2019). An example is a simulation of CIM, known as the SimCIM algorithm . SimCIM reformulates the Ising model as a continuous constrained optimization problem and solves it with iterative gradient-based optimization, with each iteration corresponding to a roundtrip of the optical pulses through the fiber loop. SimCIM was implemented on computers equipped with consumer GPUs and outperformed CIM in both solution quality and computation time . It has since been applied as a Boltzmann sampler to train general Boltzmann machines and for applications in statistical physics . However, both SimCIM and CIM require parameter tuning for each problem instance to obtain the best results. One of the SimCIM parameters, the combined coefficients of linear gain and loss (which can be interpreted as a dynamic regularization coefficient), needs to be varied as a function of the iteration number. As a result, the use of classic hyper-arXiv:2002.04676v2 [cs.LG] 14 Feb 2020 parameter optimization approaches (Feurer & Hutter, 2018) is limited, since most methods assume a small number of continuous or discrete parameters.
To automate parameter tuning in a flexible way, we use a reinforcement learning agent to control the regularization (gain-loss) function of SimCIM during the optimization process. An important feature of the Ising problem is the presence of multiple local optima whose energy is only slightly higher than the global minimum, but the associated bit configuration is significantly different. To address this issue, we propose Rescaled Ranked Reward (R3), a modification of Ranked Reward (R2) (Laterre et al., 2018). In this approach, we assign reward to the agent depending on how its current score compares to scores obtained in recent trials, and thus enable self-play training for a single-player environment. Rescaled Ranked Reward ensures the agent is motivated to keep discovering better solutions, without destabilizing the training process.
We demonstrate that the convergence speed noticeably improves if we apply policy transfer from an agent pre-trained on randomly generated problems to the unseen target problem. This transfer learning is facilitated by feature-wise linear modulation (FiLM) (Dumoulin et al., 2016) with the features extracted from the general parameters of the problem at hand.
Our approach allows us to find the best solutions with higher probability than SimCIM with a regularization function that changes linearly or according to a hyperbolic tangent function with manually tuned parameters (which is our benchmark for the human level performance). It also outperforms CMA-ES (Hansen et al., 2003), one of the most powerful black-box algorithms for hyperparameter optimization. 1

Combinatorial optimization
The Ising problem is a discrete energy minimization problem: Here x is a vector of n binary variables that can have values ±1, and J is a symmetric problem matrix that describes pairwise interactions between them. This problem is NPhard (Barahona, 1982). It is equivalent to the Max-cut problem: This problem can be interpreted as a task of dividing a set of n nodes of a weighted graph into two subsets, such that the sum of edge weights connecting these subsets C(J, x) is maximized. In this interpretation, the problem matrix J is the adjacency matrix of the graph, and binary variables x denote the choice of the subset for each node. The optimization objective C(J, x) is called the cut value (higher is better); in this paper we use it to evaluate our algorithm and compare it to benchmarks.

SimCIM algorithm
SimCIM  solves the discrete problem (1) by replacing x with a vector c of n continuous variables bounded in [−1, 1]: In addition to the Ising term, the SimCIM loss function contains a regularization term parametrized by the real scalar p, which corresponds to the combined gain and linear loss coefficients in the CIM ; its value is allowed to change during the optimization process. This change does not have to be smooth or even continuous, however p should generally decrease with time to ensure that the loss function (3) approximates the continuous Ising loss function by the end of the iterations.
The problem (3) is solved via gradient descent. We initialize the vector c of problem variables according to c 1 = {0} n . At each iteration t, we compute the noisy anti-gradient of the loss function where µ is the learning rate, p t is the regularization coefficient, ε t is a vector of random samples from the standard normal distribution and σ is the noise amplitude. Then we calculate the update vector m t = ηm t−1 + (1 − η)g t by applying momentum µ to the anti-gradient (the update vector is initialized according to m 0 = {0} n ) and update the problem variable vector as follows: where I denotes the indicator function and element-wise product. In other words, an update is applied to each element of c t only if it does not cause it to to exceed the boundary of [−1, 1].
These iterations are repeated N times. Subsequently, the solution of the original discrete problem (1) is calculated as its elementwise sign. SimCIM is reminiscent to the Hopfield-Tank simulated annealer (Hopfield & Tank, 1986), but differs from it in the shape of the activation function.
The hyperparameters µ, η, σ are scalar values and relatively easy to tune. In contrast, p t is a discretized function of time, which poses a challenge to common hyperparameter optimization techniques due to the large dimensionality.
Eigenvalue decomposition Since the matrix J is real and symmetric, we can construct an eigenvalue decomposition J = QΛQ T , where Q is an orthogonal matrix with the eigenvectors of matrix J as its columns, and Λ is a diagonal matrix with the eigenvalues of J as diagonal elements Λ ii .
With some simplifications (η = σ = 0, c t 1) the dynamics (4) of the system can be described by the equation c t+1 = c t + µ(Jc t − p t c t ). By performing eigenvalue decomposition and the change of variable e = Q T c, the update equation simplifies to e t+1,i = e t,i +µ(Λ ii e t,i −p t e t,i ) -i.e. the update is applied to individual elements of the vector e. Thus, when p t is greater than the highest eigenvalue of J, both e and c = Qe exponentially decay. Also, setting p t lower than all Λ ii will lead to exponential growth of all amplitudes e i , and subsequent poor conversion of the iterations. Using these observations, we reparameterize the regularization function by introducing a normalized regularization functionp t , which, as a rule, is restricted to the interval [0, 1]: Choosing a learning rate To select the learning rate µ for each problem instance, we use an automatic procedure similar to the learning rate range test proposed in (Smith, 2017). One optimization cycle of SimCIM (4) with momentum η = 0 is performed with an exponentially decaying µ, starting from some high value. The learning rate is chosen at an iteration where the l 1 norm of gradient g t 1 starts to converge.

Reinforcement learning
In reinforcement learning (RL) setting, an agent at each step t interacts with some environment E by observing its state s t , performing an action a t sampled from its policy π(a), and obtains a reward r t (s t , a t , s t+1 ). One interaction session, called an episode, usually lasts until the agent reaches a terminal state or until the limit on the number of steps T is reached. The goal of the agent is to maximize the expected sum of discounted rewards during the episode: T t=0 γ t r t (s t , a t , s t+1 ), where τ (π) is a trajectory generated by the agent in the environment, and γ ∈ (0, 1] is the discount factor. In actor-critic learning, an agent consists of two components. The actor, using observations from the environment, predicts the agent's policy π(a). The actor's parameters are updated in the direction of improvement, which is estimated using sampled trajectories. The critic predicts the value of each observation, which is then used to reduce the variance of actor's gradient. In the case of deep reinforcement learning, both actor and critic are usually implemented as deep neural networks.

Our approach
We use a neural network based RL agent to control the scaled regularization functionp t . Every m iterations of Sim-CIM, the agent observes the state of the optimization process and modifiesp t . Observations, actions and rewards from each SimCIM optimization cycle constitute one agent rollout: in an episode of N SimCIM steps the agent performs N/m actions.
Actions The agent has a discrete action space: it is allowed to increase or decrease the regularization function, which is initialized atp 0 = 1.0, by one of the values {±p ∆ , 0}. In addition to that,p t is decreased by m N at each agent step to ensure that a random agent yields a decreasing regularization function. Between the agent steps,p t is interpolated linearly;p t is also clipped to the interval [0, 1.05] to limit the exploration area.
Observations The agent observes the current state of optimization variables in the eigenbasis of the problem matrix J, i.e. it is supplied with the set of amplitudes e t,i (listed in the order of decreasing corresponding eigenvalues Λ ii ), as well as the elapsed time t/N and the regularization function p t−1 from the previous step. The benefit of using e t , rather than the actual amplitudes c t , as the state component, is that the former have a natural ordering according to the corresponding eigenvalues of J, while the components of c t can be arbitrary permuted along with the rows and columns of J. This representation of the state therefore facilitates the transferability of the agent across problems.
To provide the agent with the information about the current problem instance for the purpose of transfer learning, we calculate problem features as φ j = 1 n n i=1 |Q ij |. This means that φ j are scaled l 1 norms of the problem matrix eigenvectors. These features are "static" observations that are fixed during the entire episode. Features φ j are provided to the agent at each step as a part of the observation.

Rewards
In the case of combinatorial optimization, we are interested in finding solutions with the best quality (e.g. cut value) for each instance, while the path in which it has been reached is less important. Also, solutions with slightly different cut values may correspond to completely different bit configurations x. Thus the current cut value or its difference between steps is not the best choice for the reward.
To address this issue, the Ranked Reward (R2) method was proposed in (Laterre et al., 2018). In R2, the environment maintains a list of discovered cut values C j for the last P episodes (a "leaderboard"), the q-th percentile C q is calculated over this list, and the new solution with the cut value C is rewarded at the last step only according to the rule where q and P are hyperparameters. This kind of reward ensures that the agent constantly improves its performance in search of better solutions. In the language of of self-play (Silver et al., 2017), the agent is rewarded for beating most of its last results in a single-player game (being at the top of the leaderboard) and punished otherwise.
We propose a modification of this method that we dub Rescaled Ranked Rewards (R3) to account for imbalanced reward distribution: wherer is calculated in such a way that the average reward over the last P episodes is equal to zero. This modification ensures that negative and positive rewards are balanced. It also ensures that solutions with C > C q are clearly distinguishable from those with C = C q , and hence discourages the agent from getting stuck in a local optimum.
Transfer learning The approach we propose requires training the agent for each problem instance separately, however it is possible to accelerate this process significantly by pre-training the agent on randomly generated problem instances.
We pre-train the agent on random adjacency matrices from the Erds-Rnyi distribution (Erdős & Rényi, 1960) with a fixed connection probability of 0.06. We select this value so that the pre-training distribution is close to that for the target set of problems. However, we observe that transfer works reliably for matrices with different structure, too.
At each step of the pre-training process, the environment samples a new matrix J, and the agent uses it to generate a batch of episodes and perform a gradient update. This is repeated a fixed number of times. Note that this procedure does not require any costly data labeling or using previously known solutions.
Once the training is complete, the agent is fine-funed in application to the specific problem of interest. This finetuning is performed in a similar manner: at each step the agent generates a batch of episodes using the matrix J of the problem and performs a gradient update.
Implementation details The agent is implemented as two separate fully-connected networks (actor and critic) with two hidden layers of size 256 and tanh activation functions. These two networks take environment observation as input and produce policy and value function respectively.
The static features of the problem matrix φ j are not included in the network inputs; instead, they are used to calculate a set of parameters to perform feature-wise linear modulation (FiLM) (Dumoulin et al., 2016) of the last hidden layer in the actor network. The FiLM module is a linear layer that predicts a set of weights and biases that are used to scale and shift the activations of the actor's hidden layer element-wise.
We train the agent using the PPO (Schulman et al., 2017) algorithm with 4 epochs. The discount factor γ is equal to 1.0. SimCIM performs N = 1000 iterations per episode, and the agent acts every m = 10 iterations, corresponding to 100 steps per episode. The SimCIM algorithm allows efficient parallel implementation on a GPU, so we train the agent in batches of size 256 (both for pre-training and finetuning). We use q = 0.99 to calculate rewards in R2 and R3 methods; the leaderboard size P is equal to 5 batch sizes for fine-tuning and one batch size for pre-training (since each problem instance is used to generate only one batch of episodes). The pre-training is performed for 30000 problem instances.
The SimCIM hyperparameters are chosen as follows. The momentum is set to η = 0.9 and noise level to σ = 0.03. The learning rate µ is tuned automatically for each problem instance, including the random instances used for pretraining. The regularization function increment p ∆ is equal to 0.04. In (Laterre et al., 2018), a permutation-invariant network was used as a reinforcement learning agent to solve the bin packing problem. This work introduced Ranked Reward to automatically control the learning curriculum of the agent.

Related Work
Combining RL with heuristics was explored in (Xinyun & Yuandong, 2018): one agent was used to select a subset of problem components, and another selected an heuristic algorithm to process them.
In (Khairy et al., 2019), a reinforcement learning agent was used to tune the parameters of a simulated quantum approximate optimization algorithm (QAOA) (Farhi et al., 2014) to solve the Max-Cut problem and showed strong advantage over black-box parameter optimization methods on graphs with up to 22 nodes. QAOA was designed with near-term noisy quantum hardware in mind, however, at the current state of technology, the problem size is limited both in hardware and simulation.
To the best of our knowledge, combining quantum-inspired algorithms with RL for combinatorial optimization in the context of practically significant problem sizes was not explored before.

Experiments
To evaluate our method, we use problem instances from Gset (Ye, 2003), which is a set of graphs (represented by adjacency matrices J) that is commonly used to benchmark Max-Cut solvers. Gset contains problems of practically significant sizes, from hundreds to thousands of variables from several different distributions.
We concentrate on graphs G1-10. Of these, G1-G5 appear to belong to the Erds-Rnyi (Erdős & Rényi, 1960) model with the connection probability approximately equal to 0.06, while G6-G10 are weighted graphs with the same adjacency structure, but with approximately half of the edges having weights equal to −1. All of these graphs have 800 nodes.
For all our experiments, we use a single machine with a GeForce RTX 2060 GPU.

Performance
The agent, pre-trained and fine-tuned as described in Section 3, is used to generate a batch of solutions, for which we calculate the maximum and median cut value. We also report the fraction of solved instances: the problem is considered solved if the maximum cut over the batch is equal to the best known value reported in (Benlic & Hao, 2013).
The results are presented in Table 1. The obtained maximum and median are normalized by this best known value; the normalized values are further averaged over instances G1-G10 and over three random seeds for each instance (for each random seed we pre-train a new agent). Note that problem instances G6-G10 belong to a distribution never seen by the agent during the pre-training.
We compare our method to two baseline approaches to tuning the regularization function of SimCIM. In the first approach (labelled "Linear"), the scaled regularization functionp t is decaying linearly from 1 to 0 during the N Sim-CIM iterations; in our reinforcement learning setting, this is equivalent to the agent that always chooses zero increment as the action. In the second approach (labelled "Manual"), which has been used in the original SimCIM paper , the regularization function is a parameterized hyperbolic tangent function: where J m = max i j |J ij |; t/N is a normalized iteration number and O, S, D are the scale and shift parameters. These parameters are tuned manually for all instances G1-G10 at once. If manually tuned in this fashion, SimCIM solves 8 of G1-G10 instances, however the result is stochastic and the probability of solving each instance is different . We evaluate the baselines by sampling 30 batches of solutions (batch size 256) for each instance and averaging the statistics (maximum, median, fraction of solved) over all batches of all instances.
We also compare our approach to a well-known evolutionary algorithm CMA-ES (Hansen et al., 2003) (population size 10). We parameterize the regularization function for iteration t according to Eq. (8), and CMA-ES is used to tune D ∈ [−3, 3] and O, S ∈ [0.01, 10] (exponential scale) for at most 500 SimCIM evaluations in batches of size 256 each. We maximize C max + q max , where C max is the maximum cut over the batch, and q max is the fraction of values in the batch equal to C max . Since all elements of J are integer, so is the cut value, while 0 < q max ≤ 1. As a result, this objective orders batches first by the maximum cut value, and then by the probability to obtain it. After the optimization is finished, the best parameters are selected, and a new batch of solutions is sampled with these parameters. We report results from the batches obtained in this manner, averaged over three random seeds and over all instances.
Though the pre-trained agent without fine-tuning (Agent-0) is even worse than the baselines, fine-tuning rapidly improves the performance of the agent. The fine-tuned agent does not solve all instances in G1-G10, however it discovers high-quality solutions more reliably than the benchmarks.
CMA-ES is capable of solving each of G1-G10 instances: we observed that the best known value appeared at least once for each instance during several trials with different seeds. Table 1. Performance on Gset: maximum and median normalized cut values are averaged over the instances (G1-G10); Agent-K denotes an agent fine-tuned for K episodes, Agent-0 is not fine-tuned. Standard deviation over three random seeds is reported in brackets for each value. However, for some instances this result is not reproducible due to the stochastic nature of SimCIM: a new batch of solutions generated with the best parameters found by CMA-ES may yield a lower maximum cut. In this sense, the results for CMA-ES are worse than for the manually tuned baseline.  Figure 1 demonstrates the dynamics of the maximum and median cut values for the G2 instance during the process of fine-tuning. The median value continues to improve, even after the agent has found the best known value, and eventually surpasses the manually tuned baseline. This means that the agent still finds new ways to reach solutions with the best known cut. A further advantage of our agent is that it adaptively optimizes the regularization hyperparameter during the test run by taking the current trajectories c t into account.
The exact maximum cut values after fine-tuning and best know solutions for specific instances G1-G10 are presented in Table 2. We see that the agent stably finds the best known solutions for G1-G8 and closely lying solutions for G9-G10. The reason it fails to solve G9 and G10 is that the policy found by the agent corresponds to a deep local optimum that the agent is unable to escape by gradient descent. In contrast, CMA-ES does not use gradient descent and is focused on exploratory search in a broad range of parameters, and hence is sometimes able to solve these graphs. However, even with CMA-ES, the solution probability is vanishingly small: 1.3 × 10 −5 for G9 and 9.8 × 10 −5 for G10.
We also note the difference in the numbers of samples used by the automatic methods -our agent and CMA-ES -as compared to the manual hyperparameter tuning and the linear variation of the hyperparameter. In the former case, the total number of samples consumed including both training (fine-tuning) and at test equalled ∼ 256 × 500 = 128000.
On the other hand, the manual tuning required much fewer samples (tens of thousands), while the linear setting did not involve any tuning at all. Hence it is fair to say that the linear and manual methods are much more sample-efficient.

Ablation study
We study the effect of the three main components of our approach: transfer learning from random problems, Rescaled Ranked Rewards (R3) scheme, and feature-wise linear modulation (FiLM) of the actor network with the problem features.
• To study the effect of the policy transfer, we train pairs of agents with the same hyperparameters, architecture and reward type, but with and without pre-training on randomly sampled problems. In the latter case, the parameters of the agent are initialized randomly.
• We compare our R3 method with the original R2 method both with and without pre-training.
• We study the effect of FiLM by removing the static observations extracted from the problem matrix J from the observation and the FiLM layer from the agent.
We report the fraction of solved problems, averaged over instances G1-G10 and over three random seeds for each instance. The results are presented in Table 3 and Fig. 2.
According to the results, all of the above listed features are essential for the agent's performance. We see, in particular, that the pre-trained agent with both FiLM and R3 rewards experiences a slightly slower start, but eventually finds better optima faster than ablated agents. . Ablation study: averaged fraction of solved problem instances versus the number of episodes of fine-tuning for each instance (smoothed with Savitzky-Golay filter). Standard deviation is calculated over three random seeds. "Transfer" and "From scratch" are used to denote the agent with and without pre-training, respectively.

Rescaled ranked rewards
The analysis of specific problem instances helps to demonstrate the advantage of the R3 method. We analyze the behavior of the 99-th percentile of the solution cut values (the one used to distribute rewards in R2 and R3) on the G2 instance from Gset in Fig. 3. G2 has several local optima with the same cut value 11617, which are relatively easy to reach. When the agent is stuck in a local optimum, many solutions generated by the agent are likely to have their cut values equal to the percentile, while solutions with higher cut values may appear infrequently.
In the R2 scheme (6)  for local-optimum solutions and +1 for better ones. Thus infrequent solutions with higher cut values become almost indistinguishable from the local-optimum solutions. Furthermore, the fraction of episodes with local-optimum solutions increases, which results in a large fraction of random rewards, thereby preventing the efficient training of the critic network. This is evident from the monotonic growth of the value loss function in Fig. 3.
In our R3 scheme (7), in contrast, the rewards for the localoptimum solutions are deterministic and dependent on the frequency of such solutions. The more often the agent reaches them, the lower the reward, while the reward for solutions with higher cut values is fixed. Eventually, better solutions outweigh sub-optimal ones, and the agent escapes the local optimum. This moment is indicated by a significant increase of the value loss: the agent starts exploring new, more promising states.

Discussion and future work
One of the benefits of our approach is the lightweight architecture of our agent, which allows efficient GPU implementation along with the SimCIM algorithm itself. This allows us to rapidly fine-tune the agent for each problem instance. However, the fully-connected architecture makes it harder to apply our pre-trained agent to problems of various sizes, since the size of the network input layer depends on the problem size. Hence it would be interesting to explore using size-agnostic architectures for the agent, like graph neural networks.
Another future research direction is to train the agent to vary more SimCIM hyperparameters, such as the scaling of the adjacency matrix or the noise level. Additionally, it would be interesting to explore using meta-learning at the pre-training step to accelerate the fine-tuning process.
Lastly, with our approach, each novel instance requires a new run of fine-tuning, leading to a large number of required samples compared with simple instance-agnostic heuristics.
In order to make our approach viable from a practical point of view, we hope to address generalization across different, novel, problem instances more efficiently.

Conclusion
In this work we proposed an RL-based approach to tuning the regularization function of SimCIM, a quantum-inspired algorithm, to robustly solve the Ising problem. Our hybrid approach shows strong advantage over heuristics and a black-box approach, and allows us to sample high-quality solutions with high probability.
We proposed an improvement over the Ranked Reward (R2) scheme, called Rescaled Ranked Reward (R3), which allows the agent to constantly improve the current solution while avoiding local optima. We also demonstrated that our algorithm may be accelerated significantly by pre-training the agent on randomly generated problem instances, while being able to generalize to out-of-distribution problems.
Importantly, our approach is not limited to SimCIM or even the Ising problem, but can be readily generalised to any algorithm based on continuous relaxation of discrete optimisation.