Natural Evolutionary Strategies for Variational Quantum Computation

Natural evolutionary strategies (NES) are a family of gradient-free, black-box optimization algorithms. This study illustrates their use for the optimization of randomly initialized parametrized quantum circuits (PQCs) in the region of vanishing gradients. We show that the NES gradient estimator alleviates the exponential decrease in the variance of the gradient. We implement two specific approaches, the exponential and the separable natural evolutionary strategies, for the parameter optimization of PQCs and compare them against standard gradient descent. We apply them to two different problems: ground state energy estimation using the variational quantum eigensolver (VQE), and state preparation with circuits of varying depth and width. We also introduce batch optimization for circuits of larger depth to extend the use of evolutionary strategies to a larger number of parameters. In all the above cases we achieve accuracy comparable to state-of-the-art optimization techniques with a lower number of circuit evaluations. Our empirical results indicate that NES can be used as a hybrid tool in tandem with gradient-based methods for the optimization of deep quantum circuits in regions with vanishing gradients.


I. INTRODUCTION
Rapid developments in quantum hardware [1][2][3] have led to a wide variety of algorithms designed to perform useful tasks on the available devices. The limited coherence times and qubit counts of these devices place restrictions on the circuits that can be reliably executed. A class of algorithms well suited for these noisy intermediate-scale quantum (NISQ) [4] devices are the variational quantum algorithms (VQAs) [5]. These algorithms use classical and quantum computers in tandem: the quantum computer performs state preparation with parametrized unitaries and measurement, and the parameters of the unitaries are then optimized on a classical computer in an iterative manner.
VQAs have been used for a variety of tasks such as simulating molecules and many-body systems [6][7][8], compression of quantum states [9][10][11], classification tasks [12][13][14], generative modelling [15][16][17][18][19][20], and solving combinatorial optimization problems [21], among others [22,23]. However, these algorithms suffer in cases where there is little or no knowledge about a good initial point. For such problems, the optimization of the unitaries using gradient-based methods becomes impossible for deep quantum circuits, as the average and variance of the gradient of the objective function decay exponentially with the number of qubits and layers in the circuit [24]. More recently it was reported [25][26][27] that barren plateaus are not only present in deep circuits, but also depend on the cost function used for optimization, and that optimization of problems with a global cost function is not possible for randomly parameterized shallow circuits either. Furthermore, another study [28] showed that the variance of the gradient is bounded by the width of the causal cone of the circuit corresponding to the various Pauli terms in the decomposition of the cost function.
Recently, different studies have tried to address the barren plateau problem. One approach circumvents the problem by formulating better initial guesses for the parameter values [29]. A different study [30] investigated ways to boost the gradients for certain problems by using correlated parameters. In this study we use an approach based on natural evolution strategies and batch optimization to avoid barren plateaus in the optimization landscape.
Evolutionary strategies (ES) [31,32] have been applied extensively in classical machine learning as a viable black-box optimization tool for high-dimensional problems. These strategies are promising for the optimization of quantum circuits with respect to both the number of function evaluations and the scaling with the size of the circuit. The gradient is estimated with a number of function evaluations that does not increase with the number of parameters. Moreover, these function evaluations are fully independent and could be executed in parallel. This makes these methods suitable for high-dimensional problems. They can be further improved by using estimates of natural search gradients, fitness shaping, and adaptive sampling [33,34], for both highly correlated (considering the full covariance matrix) and separable (considering only a diagonal covariance matrix) problems. In a recent work [35], the authors illustrated the use of natural evolution strategies (NES) for approximate combinatorial optimization problems and showed that they can achieve state-of-the-art performance. More recently, NES was used to investigate the problem of manipulating Majorana bound states in a topological superconductor [36], where the authors showed that these strategies can be used efficiently to study quantum mechanical many-body systems. In this study we use two modified strategies, separable NES and exponential NES [34,37], for optimizing randomly initialized PQCs by considering the circuit parameters as multivariate normal distributions. We show that the optimization of randomly initialized parameterized quantum circuits can be performed with significantly fewer circuit evaluations using NES, while achieving accuracy comparable to gradient-based methods. We also show that they can be used to amplify gradients and improve parameter initialization for gradient-based approaches.
The rest of the paper is organised as follows: we provide the theory behind NES in Section II, the details of the numerical experiments in Section III, the results from the optimization of randomly parameterized quantum circuits in Section IV, and finally present some concluding remarks in Section V.

II. NATURAL EVOLUTION STRATEGIES
Evolution strategies were first introduced by Ingo Rechenberg [31] and Hans-Paul Schwefel [32] to cope with high-dimensional continuous-valued domains. The field has remained active and has been developed extensively over the years, for instance to include the full covariance matrix [38] for efficient optimization. In what follows, we describe a specific subclass of evolution strategies, the natural evolution strategy (NES), which iteratively updates a search distribution by ascending along estimated search gradients. The reader is referred to Refs. [33,37] for a detailed description of NES.

A. Canonical Natural Evolution Strategies
The NES uses search gradients to iteratively update the distribution of a stochastic variable z. The search gradient can be defined for any distribution for which the derivatives of the log-density can be calculated. If we consider a distribution with density π(z|θ), with θ as the parameters and f(z) as the fitness of a sample z drawn from the distribution, then we can write the expected fitness as

J(θ) = E_θ[f(z)] = ∫ f(z) π(z|θ) dz.  (1)

Using the 'log-likelihood trick' we can then approximate the search gradient from samples z_1, ..., z_k as

∇_θ J(θ) ≈ (1/k) Σ_{n=1}^{k} f(z_n) ∇_θ log π(z_n|θ),  (2)

where k is the population size. This estimator has the benefit that it does not require derivatives of the fitness function. One can estimate the gradient for different commonly used search distributions, including Gaussian and Cauchy distributions, and use standard gradient descent to iteratively update the parameters of the distribution,

θ ← θ + η ∇_θ J(θ),  (3)

where η is the learning rate. For a Gaussian prior distribution on the perturbation [39], the procedure is shown in Algorithm 1.

Algorithm 1: Canonical NES
input: f, µ, σ_init
while stopping criterion is not met do
  for n = 1..k do
    draw sample s_n ∼ N(0, I)
    z_n = µ + σ_init s_n
    evaluate the fitness f(z_n)
  end
  compute gradient: ∇_µ J ≈ (1 / (k σ_init)) Σ_{n=1}^{k} f(z_n) s_n
  update mean: µ ← µ + η ∇_µ J
end

Here f is the fitness function, η is the learning rate, σ_init is the fixed standard deviation, and s_n is a Gaussian perturbation sampled from the normal distribution N(0, I). This procedure replaces the exact gradient with the gradient of the objective function blurred with Gaussian noise, sampled stochastically at the perturbed parameters. For a single Gaussian perturbation with symmetric forward and backward evaluations, it reduces to the simultaneous perturbation stochastic approximation (SPSA) [40]. As this method does not follow the exact gradient of the cost function but a stochastic estimator, it does not necessarily get stuck on the same barren plateaus as the analytical gradient. We will show in the numerical section that this is the case in general.
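As a concrete illustration, the estimator of Algorithm 1 can be sketched in a few lines of NumPy. The learning rate, population size, and the mean-fitness baseline (a common variance-reduction trick that is not part of Algorithm 1) are illustrative choices of ours, and the toy quadratic stands in for a circuit fitness function.

```python
import numpy as np

def nes_gradient(f, mu, sigma_init, k, rng):
    """Monte-Carlo estimate of the search gradient w.r.t. the mean mu
    (Algorithm 1), with a mean-fitness baseline added to tame noise."""
    S = rng.standard_normal((k, len(mu)))              # s_n ~ N(0, I)
    fitnesses = np.array([f(mu + sigma_init * s) for s in S])
    baseline = fitnesses.mean()                        # not part of Algorithm 1
    return (fitnesses - baseline) @ S / (k * sigma_init)

def nes_minimize(f, mu0, sigma_init=0.1, eta=0.05, k=16, iters=300, seed=0):
    """Iteratively descend along the estimated search gradient."""
    rng = np.random.default_rng(seed)
    mu = np.array(mu0, dtype=float)
    for _ in range(iters):
        mu -= eta * nes_gradient(f, mu, sigma_init, k, rng)
    return mu
```

For minimization we descend rather than ascend the estimated gradient; on a simple quadratic the mean converges to the optimum despite the stochastic estimator.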
Instead of using the standard gradients for updates, modern evolutionary strategy implementations use natural gradients because of their numerous advantages [41,42]. The natural gradient can be calculated using the Fisher information matrix F, which can be estimated from the search-gradient samples as

F ≈ (1/k) Σ_{n=1}^{k} ∇_θ log π(z_n|θ) ∇_θ log π(z_n|θ)^T,  (4)

and the distribution is now updated as

θ ← θ + η F^{-1} ∇_θ J(θ).  (5)

The canonical NES can be further improved by fitness shaping and adaptive sampling, which are used in exponential natural evolution strategies (xNES) and separable natural evolution strategies (sNES) [33,34,37]. Before we describe xNES and sNES in detail, we explain how evolutionary strategies are used for the optimization of parameterized quantum circuits employed in different quantum algorithms. A quantum algorithm generally consists of finding the optimal parameters θ* of the parameterized quantum circuit U(θ) that minimize the associated loss function. The parameters of the PQC are used as the centers of a search distribution π(z|θ), from which NES samples different points z to estimate the gradient with respect to the distribution parameters. The distribution parameters θ are then optimized iteratively using the estimated gradient to find the optimal parameters θ* of the parameterized quantum circuit.

B. Exponential Natural Evolution Strategies
The multivariate Gaussian distribution is one of the most prominent search distributions used for evolutionary strategies. Historically this distribution has received the most attention because of its analytical properties and the resulting elegance of the equations. Based on these techniques, an algorithm known as exponential natural evolution strategies (xNES) can be formulated, for which we provide the outline in Algorithm 2, as presented in Refs. [34,37]. It goes beyond the standard NES by introducing an exponential parametrization of the covariance matrix, which makes the natural gradient easier to compute.
Algorithm 2: xNES
input: f, µ_init, Σ_init = A^T A
initialize: σ ← |det(A)|^{1/d}, B ← A / σ
while stopping criterion is not met do
  for n = 1..k do
    draw sample s_n ∼ N(0, I)
    z_n = µ + σ B^T s_n
    evaluate the fitness f(z_n)
  end
  sort {(s_n, z_n)} with respect to f(z_n) and compute utilities u_n
  compute gradients:
    ∇_δ J ← Σ_n u_n s_n
    ∇_M J ← Σ_n u_n (s_n s_n^T − I)
    ∇_σ J ← tr(∇_M J) / d
    ∇_B J ← ∇_M J − ∇_σ J · I
  update parameters:
    µ ← µ + η_µ σ B ∇_δ J
    σ ← σ · exp(η_s / 2 · ∇_σ J)
    B ← B · exp(η_B / 2 · ∇_B J)
end

Here f is the fitness function, d is the dimension of the problem, A is the covariance factor, µ is the mean, k is the number of walkers, s_n is the sample in local coordinates, z_n is the sample in the task coordinates, N(0, I) is a normal distribution, η_B is the covariance learning rate, η_µ is the mean learning rate, η_s is the scale learning rate, and u_n is the utility function used for fitness shaping, which is defined as

u_n = max(0, log(k/2 + 1) − log(n)) / Σ_{j=1}^{k} max(0, log(k/2 + 1) − log(j)) − 1/k.  (6)

The stopping criterion for the xNES algorithm is met when the maximum value of the covariance matrix falls below a threshold of 10^{-8}. The xNES strategy works well across a wide variety of optimization problems [37]; however, the computational cost increases significantly with the dimension of the problem.
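A minimal NumPy sketch of the xNES loop above, assuming an isotropic initialization A = σ0·I; the learning-rate defaults follow the standard recommendations of Ref. [37], and the correlated quadratic used below is a toy stand-in for a PQC cost function.

```python
import numpy as np

def xnes_utilities(k):
    # rank-based fitness shaping, equation (6): the best sample (rank 1)
    # gets the largest utility; the utilities sum to zero
    n = np.arange(1, k + 1)
    u = np.maximum(0.0, np.log(k / 2 + 1) - np.log(n))
    return u / u.sum() - 1.0 / k

def sym_expm(M):
    # matrix exponential of a symmetric matrix via eigendecomposition
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

def xnes_minimize(f, mu0, sigma0=0.5, k=16, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    d = len(mu0)
    mu = np.array(mu0, dtype=float)
    sigma, B = sigma0, np.eye(d)          # A = sigma * B, Sigma = A A^T
    eta_mu = 1.0
    eta_s = eta_B = 0.6 * (3 + np.log(d)) / (d * np.sqrt(d))  # Ref. [37] defaults
    u = xnes_utilities(k)
    I = np.eye(d)
    for _ in range(iters):
        S = rng.standard_normal((k, d))             # local coordinates s_n
        Z = mu + sigma * S @ B.T                    # task coordinates z_n
        S = S[np.argsort([f(z) for z in Z])]        # best (lowest f) first
        G_delta = u @ S                             # gradient w.r.t. the mean
        G_M = np.einsum('n,ni,nj->ij', u, S, S) - u.sum() * I
        G_sigma = np.trace(G_M) / d
        G_B = G_M - G_sigma * I
        mu = mu + eta_mu * sigma * B @ G_delta
        sigma = sigma * np.exp(0.5 * eta_s * G_sigma)
        B = B @ sym_expm(0.5 * eta_B * G_B)
    return mu
```

Sorting the samples in ascending order of fitness makes the largest utilities weight the best samples, turning the scheme into a minimizer.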

C. Separable Natural Evolution Strategies
For high-dimensional problems, the more suitable variant is the separable NES (sNES). Here, the correlations between the distribution variables are not considered (the covariance matrix A is forced to be diagonal), implying that the parameter distributions of the PQCs are treated as independent of each other. The sNES algorithm is expected to perform at least as well as xNES on separable problems, to perform worse than xNES on highly correlated problems, and to outperform xNES on very high-dimensional problems [37]. The sNES algorithm is outlined in Algorithm 3, as in Ref. [37].
Algorithm 3: sNES
input: f, µ_init, σ_init
while stopping criterion is not met do
  for n = 1..k do
    draw sample s_n ∼ N(0, I)
    z_n = µ + σ ⊙ s_n
    evaluate the fitness f(z_n)
  end
  sort {(s_n, z_n)} with respect to f(z_n) and compute utilities u_n
  compute gradients:
    ∇_µ J ← Σ_n u_n s_n
    ∇_σ J ← Σ_n u_n (s_n² − 1)
  update parameters:
    µ ← µ + η_µ σ ⊙ ∇_µ J
    σ ← σ ⊙ exp(η_σ / 2 · ∇_σ J)
end

Here f is the fitness function, µ is the mean, σ is the vector of standard deviations, ⊙ denotes element-wise multiplication, k is the number of walkers, s_n is the sample in local coordinates, z_n is the sample in the task coordinates, N(0, I) is a normal distribution, η_µ is the mean learning rate, η_σ is the deviation learning rate, and u_n is the utility function used for fitness shaping as defined in equation 6. The stopping criterion for the sNES algorithm is met when the maximum value of the standard deviation vector falls below a threshold of 10^{-8}.
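The diagonal restriction makes sNES considerably simpler than xNES; a NumPy sketch of Algorithm 3 follows. The learning-rate default mirrors the recommendation of Ref. [37], and the separable quadratic below is an illustrative test function, not one of the paper's circuit problems.

```python
import numpy as np

def snes_minimize(f, mu0, sigma0=0.5, k=16, iters=800, seed=0):
    rng = np.random.default_rng(seed)
    d = len(mu0)
    mu = np.array(mu0, dtype=float)
    sigma = np.full(d, sigma0)               # diagonal covariance only
    eta_mu = 1.0
    eta_sigma = (3 + np.log(d)) / (5 * np.sqrt(d))  # Ref. [37] default
    n = np.arange(1, k + 1)                  # utilities of equation (6)
    u = np.maximum(0.0, np.log(k / 2 + 1) - np.log(n))
    u = u / u.sum() - 1.0 / k
    for _ in range(iters):
        S = rng.standard_normal((k, d))
        Z = mu + sigma * S                   # element-wise scaling
        S = S[np.argsort([f(z) for z in Z])] # best (lowest f) first
        grad_mu = u @ S
        grad_sigma = u @ (S**2 - 1.0)
        mu = mu + eta_mu * sigma * grad_mu
        sigma = sigma * np.exp(0.5 * eta_sigma * grad_sigma)
    return mu
```

Each iteration costs only k fitness evaluations regardless of d, which is the property the main text exploits for deep circuits.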

III. APPLICATION TO QUANTUM CIRCUITS
In this section, we illustrate the use of NES for the optimization of variational quantum algorithms (VQAs). We consider two different ansatzes for the problem of state preparation. The first ansatz is a randomly parameterized quantum circuit (RPQC) with repeating parameterized and entangling layers,

U(θ) = Π_{l=1}^{L} W_l V(θ_l),  (7)

where L is the number of layers in the circuit, V(θ_l) = ⊗_{j=1}^{Q} R_lj(θ_lj) is a tensor product of randomly chosen Pauli rotation gates R_lj, Q is the number of qubits in the circuit, and W_l is a layer of entangling CZ gates between adjacent qubits. The second ansatz is an alternate layered parameterized quantum circuit (ALPQC), which consists of two layers of CZ gates between alternate nearest-neighbour qubits with layers of R_Y rotation gates between the entangling layers,

U(θ) = Π_{l=1}^{L} W′_l V(θ′_l) W_l V(θ_l),  (8)

where L is the number of layers in the circuit, V(θ_l) = ⊗_{j=1}^{Q} R_Y(θ_lj) represents a layer of the rotation gates, and W_l and W′_l represent the two layers of entangling CZ gates between alternating pairs of adjacent qubits.
The circuits are rotated out of the computational basis by adding a layer of R_Y(π/4) gates, and are shown in Figure 1. The objective Hamiltonian operator H is chosen to be the projector |00...0⟩⟨00...0| on the vacuum state, and the fitness function is chosen as f(θ) = (1 − ⟨ψ|H|ψ⟩)², where |ψ⟩ = U(θ)|00...0⟩ is the state after the application of the parameterized unitary to the vacuum state.
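The fitness function above can be made concrete with a small state-vector simulation. The two-qubit gate layout below (an initial R_Y(π/4) layer followed by an R_Y layer and a CZ gate per layer) is a simplified toy analogue of the circuits in Figure 1, not the exact RPQC/ALPQC ansatz.

```python
import numpy as np

def ry(theta):
    # single-qubit R_Y rotation matrix
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

CZ = np.diag([1.0, 1.0, 1.0, -1.0])   # controlled-Z on two qubits

def circuit_state(theta):
    """|psi> = U(theta)|00> for a toy 2-qubit ansatz: an R_Y(pi/4) layer
    to rotate out of the computational basis, then per layer an R_Y(theta)
    layer followed by a CZ gate."""
    psi = np.zeros(4)
    psi[0] = 1.0                                     # vacuum state |00>
    psi = np.kron(ry(np.pi / 4), ry(np.pi / 4)) @ psi
    for t0, t1 in np.asarray(theta).reshape(-1, 2):  # two parameters per layer
        psi = CZ @ (np.kron(ry(t0), ry(t1)) @ psi)
    return psi

def fitness(theta):
    # H = |00><00|, so <psi|H|psi> = |<00|psi>|^2 and f = (1 - <psi|H|psi>)^2
    psi = circuit_state(theta)
    return (1.0 - abs(psi[0])**2)**2
```

For a single layer with θ = (−π/4, −π/4) the rotations undo the initial layer, the state returns to |00⟩, and the fitness vanishes, which is exactly the global optimum NES searches for.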
The parameters of the quantum circuits are used to define a search distribution: a multivariate Gaussian with the circuit parameters as its centers, which is then optimized using NES as discussed in Section II. For xNES the parameter distributions are correlated through the covariance matrix, while for sNES the distributions are independent. The estimates of the fitness function are obtained by calculating the average outcome of the circuit at k different task-coordinate values. The number k of these evaluations, which we will refer to as walkers from here on, is independent of the number of parameters and qubits in the circuit and can be fixed to a constant for a given problem. We perform different numerical simulations to show the application of NES for estimating the ground state energy of the water molecule in the STO-3G basis using a unitary coupled cluster (UCC) ansatz with randomly initialized parameters. We also consider a more abstract example of state preparation. In both cases we focus on the initial part of the optimization landscape to demonstrate the ability to optimize in cases with randomly initialized parameters.

IV. NUMERICAL SIMULATIONS
The circuit parameters are initialized as a multivariate normal distribution with randomly chosen centers and standard deviations of 0.1. The parameters used for the optimization are given in Table I and are the same as those considered in Ref. [37]. The whole framework is implemented using TEQUILA [43], an open-source package, with QULACS [44] as the backend for the execution of all the numerical simulations.

A. Search Direction for NES
The problem with randomly initialized unstructured quantum circuits is that both the mean and the variance of the gradient tend to zero for random initializations [24]. This implies that all directions to search for the optimal parameters are equally likely, hence the difficulty in optimization. Adding random noise cannot change the average value of the gradient, but we want to demonstrate that we have a valid search direction in the optimization landscape. The search direction is defined by an estimate of the gradient of the search distribution with respect to its parameters, and we show that this estimate does not have a vanishing variance. For this we have repeated the calculations of Ref. [24] for the circuit in Figure 1(a) for 18 qubits with a local cost function. We average each point over 10000 random initializations. For each random initialization, we have calculated the estimate of the search gradient with respect to the first parameter with a varying number of walkers k from 1 to 8. Figure 2 shows the results of the calculations for the mean of the surrogate search gradients obtained with σ_init taking the values π/8, π/16, π/32.
The plots show that the variance can be amplified arbitrarily as the width of the Gaussian noise is reduced. This should not be surprising, as this parameter appears in the denominator of the gradient estimate in Algorithm 1. Decreasing the width of the noise also means that the NES samples less far around the current parameters, so less information about the optimization surface is included. As the number of walkers k is increased, the stochastic approximation better approaches the true gradient and the variance is reduced. Both the width of the noise and the number of walkers are important and need to be chosen appropriately. In what follows we will let the width of the noise be a variable to optimize and the number of walkers a hyperparameter of our algorithm.
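The two scalings discussed above can be checked numerically with the raw estimator of Algorithm 1; the rapidly oscillating cosine below is a hypothetical stand-in for a flat, featureless cost landscape, not one of the paper's circuits.

```python
import numpy as np

def search_grad_estimate(f, mu, sigma, k, rng):
    # raw Algorithm-1 estimator: (1 / (k * sigma)) * sum_n f(z_n) * s_n
    S = rng.standard_normal((k, len(mu)))
    fs = np.array([f(mu + sigma * s) for s in S])
    return fs @ S / (k * sigma)

# toy stand-in for a flat, rapidly oscillating cost landscape
f = lambda z: np.cos(10.0 * z.sum())
rng = np.random.default_rng(1)
mu = np.zeros(4)

def grad_variance(sigma, k=8, trials=2000):
    # empirical variance of the first component of the estimate
    g0 = [search_grad_estimate(f, mu, sigma, k, rng)[0] for _ in range(trials)]
    return np.var(g0)

# shrinking sigma amplifies the estimate (sigma sits in the denominator),
# while increasing the number of walkers k suppresses the variance
```

With these toy settings, halving σ roughly quadruples the empirical variance, while quadrupling k roughly quarters it, matching the 1/(kσ²) behaviour discussed in the text.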
This section demonstrates that we can achieve an arbitrary amplification of the gradient for randomly initialized quantum circuits by replacing the exact gradient with a stochastic surrogate. That does not necessarily mean that optimization following this gradient would be possible. It is possible that our estimator leads to local minima or poor convergence behavior. In the next sections we show that optimization following this gradient is possible for a wide array of problems.

B. State preparation
We carry out numerical simulations for the optimization of different quantum circuits with varying numbers of qubits and layers to prepare the corresponding |00...0⟩ state. The number of walkers for all the simulations is chosen to be 16 unless stated otherwise.

FIG. 2: Variance of the surrogate search gradients for the circuit in Figure 1(a). The plot corresponding to 'exact' represents the variance of the analytical gradient.

We plot the results from the simulations using NES and the two different ansätze, RPQCs and ALPQCs, with different numbers of layers in Figure 3. It can be seen from the plots in Figure 3 that the optimization is possible in all cases when using sNES, but xNES fails to find the global optimum in cases with a large number of parameters. This finding agrees with the expected behavior of sNES for these high-dimensional and multi-modal problems with highly redundant global optima, as pointed out for various cases earlier [37]. The failure of xNES for high-dimensional problems suggests that it is not advantageous to retain the full covariance matrix for optimization and that including the correlations between these parameters leads to bad performance. Thus, instead of considering the full covariance matrix for the optimization of deep quantum circuits, a viable strategy could be using a block matrix correlating only some of the parameters, which we employ in Section IV D. It is evident that the rate of convergence of the optimization slows down with increasing number of parameters for both techniques, which is expected as the search space becomes very large.
In Figure 4, we compare the performance of NES with that of gradient descent for the optimization of RPQCs. The plots show that optimization with sNES performs equally well in all cases in terms of finding the global minimum, but that the xNES performance gets worse for circuits with more layers. An important point to make here is that an update of the parameters using NES only requires k circuit evaluations, while a standard gradient descent calculation with the parameter-shift procedure typically needs twice the number of parameters (2QL) in function evaluations. Even though gradient descent reaches the global minimum in fewer iterations, it does so at a much higher cost in function evaluations. An interesting observation from the plots is that the initial rate of convergence with NES for circuits with 10 qubits or more is higher than that of gradient descent. This is important in the context of barren plateaus, where typically no suitable search direction can be found. It opens the possibility for hybrid strategies, where NES can be used to provide the initial direction for gradient descent when optimizing large RPQCs.

C. VQE for the H2O molecule
In the context of chemistry, PQCs are commonly used to represent the wave function of molecules. Although the circuits here typically use structured, physically motivated ansatzes, the number of parameters becomes large as the system size grows. In this simulation we estimate the ground state energy of a water molecule at equilibrium geometry using the STO-3G basis (14 qubits) with the UCCSD ansatz (34 parameters). The parameters of the circuit are initialized randomly, which can lead to convergence issues due to the presence of barren plateaus, as observed in [45]. We plot the results from the numerical experiments in Figure 5, and it can be seen from the plots that both sNES and xNES can be used to optimize the parameters for estimating the ground state energy of the water molecule using VQE. The xNES performs better for this example than the sNES, even though the size of the water molecule problem is beyond the limit at which we previously found that xNES loses performance compared to sNES. This suggests that in this case keeping the covariance matrix gives an advantage and that the parameters of this problem are actually correlated. The degree of randomness in the circuit topology can be a factor in selecting the optimal NES flavor.
FIG. 5: Convergence to the ground state energy for H 2 O molecule using VQE with randomly initialized UCC ansatz.

D. Batch optimization
As seen in Figure 3, the rate of convergence decreases with increasing number of parameters. This leads to long optimization runs for growing circuits, while at the same time the overhead of the full NES strategy grows. To alleviate these effects, we investigate the use of a batch optimization strategy for optimizing RPQCs of larger depth, similar to another recent study [46] in which the authors use layerwise learning techniques for the optimisation of quantum neural networks. In batch optimization, only a limited number (the batch size) of parameters is adjusted in each iteration to lower the cost. We carry out various simulations to analyze the effect of the size of the batches, the number of walkers, and different partitioning strategies on the convergence of the optimization.
We first investigate the effect of the batch size and the number of walkers on the optimization by carrying out different numerical simulations with varying batch size. We use RPQCs with 10 qubits and 50 layers for the simulations, and the batches of parameters are selected randomly but kept fixed over the entire calculation. The results in Figure 6 show clearly that a minimum batch size is required, and that xNES needs more walkers than sNES to give meaningful results. The curves show convergence in both the batch size and the number of walkers at around 16 walkers and a batch size of 50. We will use these values in the following calculations; it is expected that these values have some problem dependence and need to be tuned accordingly.
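The batch strategy can be sketched as follows, assuming the simple baseline-corrected estimator from Algorithm 1 rather than the full sNES/xNES machinery; the batch size, learning rate, and quadratic test function are illustrative choices, not the paper's settings.

```python
import numpy as np

def batched_nes_minimize(f, theta0, batch_size=25, k=16, sigma=0.1,
                         eta=0.05, sweeps=40, seed=0):
    """Batch optimization sketch: perturb and update only one randomly
    chosen (but fixed) batch of parameters per inner step."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    perm = rng.permutation(len(theta))     # random partition, fixed for the run
    batches = [perm[i:i + batch_size]
               for i in range(0, len(theta), batch_size)]
    for _ in range(sweeps):
        for batch in batches:
            S = rng.standard_normal((k, len(batch)))
            fs = np.empty(k)
            for n, s in enumerate(S):
                z = theta.copy()
                z[batch] += sigma * s      # perturb only this batch
                fs[n] = f(z)
            grad = (fs - fs.mean()) @ S / (k * sigma)
            theta[batch] -= eta * grad
    return theta
```

Each inner step still costs only k evaluations, but the search distribution lives in a batch-sized subspace, which is what keeps the method tractable as the circuit depth grows.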
We next carry out simulations to investigate the effect of different strategies for generating the batches. The strategies we investigate are dividing the circuit parameters randomly, layer-wise, qubit-wise, layer-block-wise, and qubit-block-wise. We plot the results from simulations of 10-qubit, 50-layer RPQCs in Figure 7. It can be seen from the plot that the curves are nearly identical for all the strategies, with the layer-wise batch optimization strategy deviating slightly because of its different batch size. We also carried out different simulations with RPQCs of larger depth using the different partition strategies and found that, depending on the problem, one might get faster convergence using one of the strategies; we therefore recommend investigating the use of different strategies when dealing with larger RPQCs. Other techniques, such as parameter-efficient circuit training [47], can also be used together with NES to further improve the optimization procedure.
We also compared the performance of NES with the batch optimization strategy against gradient descent. We use RPQCs with 10 qubits and 50 layers, a batch size of 50 parameters, and 16 walkers for the simulations. The results from the simulation are plotted in Figure 8. The optimization with NES performs better than gradient descent initially, as seen previously in Figure 4.
Finally, we ran simulations to carry out the optimization of RPQCs with varying numbers of qubits and layers using both xNES and sNES, randomly dividing the parameters into batches of size 50 and using 16 walkers. We show the results for 10-qubit and 15-qubit circuits with different numbers of layers in Figure 9. The plots show that we can perform optimization of significantly deep quantum circuits. As the circuit depth increases the rate of convergence decreases, and one might require a large number of circuit runs for the final result. Faster convergence for deep RPQCs might be achieved by selecting specific values of different parameters such as the initial learning rates, the initial values of the covariance matrix, the number of walkers, and the batch size. We leave such hyperparameter tuning for a future study.

E. Hybrid strategy
As shown above, the convergence with NES is sometimes slower than that of gradient-based methods in regions with non-vanishing gradients. In this section we explore a hybrid strategy that uses separable natural evolution strategies (sNES) in areas with vanishing gradients and regular gradient descent once a search direction has been found. For this, we simulated RPQCs with 10 qubits and 50 layers, 15 qubits and 50 layers, and 18 qubits and 10 layers, with 500, 750, and 180 parameters respectively. We demonstrate the viability of this strategy by tracking the value of the analytical gradient with respect to all the parameters at different iterations of the optimization with NES, and plot the resulting distribution of the analytical gradients using violin plots in Figure 10.
We tracked the analytical gradient distribution over the full optimization run for the 10-qubit, 50-layer RPQC, plotted in Figure 10a; as expected, the distribution first spreads to larger values and then narrows around zero near the end of the optimization. This indicates that if one can increase the spread of the gradients at the beginning of the optimization, one might be able to use gradient-based methods to complete the process. Thus, we generate the distribution of the analytical gradients after every 5 iterations of optimization by sNES, and plot the distributions for the 15- and 18-qubit RPQCs in Figures 10b and 10c. The distributions of the analytical gradient in Figures 10b and 10c start to spread to higher values already after 5 iterations of optimization with sNES, which indicates that we can then use gradient-based methods in these cases for the full optimization. The choice of the number of NES optimization steps will depend on the problem, and one can use sNES until the magnitudes of the analytical gradients are large enough. These empirical results indicate that by using a surrogate gradient such as the one in sNES, one might be able to navigate through areas of vanishing gradients and then use gradient-based methods to find the optimal solution.
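The switching logic of such a hybrid scheme can be sketched as follows; the gradient-norm threshold, step sizes, and plain gradient descent are illustrative choices of ours, and the toy quadratic in the usage example only exercises the mechanics of the switch, not the barren-plateau behaviour itself.

```python
import numpy as np

def hybrid_minimize(f, grad_f, theta0, sigma=0.1, eta=0.05, k=16,
                    switch_norm=1.0, max_nes=50, gd_iters=200, seed=0):
    """Hybrid sketch: take surrogate-gradient (NES-style) steps while the
    analytical gradient is small, then switch to plain gradient descent."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(max_nes):
        if np.linalg.norm(grad_f(theta)) > switch_norm:
            break                              # a search direction has been found
        S = rng.standard_normal((k, len(theta)))
        fs = np.array([f(theta + sigma * s) for s in S])
        theta -= eta * (fs - fs.mean()) @ S / (k * sigma)
    for _ in range(gd_iters):                  # gradient-descent phase
        theta -= eta * grad_f(theta)
    return theta
```

In practice, grad_f would be the parameter-shift gradient of the circuit, and the threshold would be chosen from the tracked gradient distributions as described above.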

V. CONCLUSION AND OUTLOOK
We presented a strategy of representing the parameters of quantum circuits as multivariate normal distributions and using natural evolution strategies for optimizing randomly initialized PQCs. We have shown that the search gradient used in NES can be amplified for randomly initialized parameterized quantum circuits, and that with an appropriate number of walkers one can use it for optimization in all cases. We numerically analyzed the performance of NES for parameterized quantum circuits and showed that it achieves accuracy comparable to gradient-based methods. While sNES works well for deep PQCs, xNES performs comparably for shallow PQCs and outperforms sNES on problems where the parameters are correlated. We observe that the convergence curves for NES and gradient-based methods are very similar; however, the number of circuit evaluations decreases significantly for NES, and the initial rate of convergence appears to be higher as well.
We also extended the optimization to deep quantum circuits by using a batch optimization strategy and showed that one can use NES with batch optimization for deep PQCs. Based on various numerical simulations, we observe that the rate of convergence can be increased by choosing appropriate values of different hyperparameters: the learning rate, the number of walkers, and the batch size. We also showed that a random partition of the parameters into batches works comparably to structured partitioning.
Finally, we proposed a hybrid strategy, showing that NES can amplify the gradients significantly, after which gradient-based methods can be used for faster convergence. Our results indicate that NES can be used as an alternative to gradient-based methods in regions of vanishing gradients and can be a useful tool for exploring deep quantum circuits for different tasks.
Our study only explores the basic use of NES for the optimization of randomly initialized PQCs; we believe that it can be combined with different gradient-based methods to improve the overall optimization of deep quantum circuits. Other possibilities, such as sampling the parameters from other distributions, combining strategies to partition the parameters into batches, or adjusting the hyperparameters for the problem at hand, can be used to further improve the training. We leave these directions for future studies, as they require additional research.