Variational Quantum Reinforcement Learning via Evolutionary Optimization

Recent advances in classical reinforcement learning (RL) and quantum computation (QC) point to a promising direction: performing RL on a quantum computer. However, potential applications of quantum RL are limited by the number of qubits available on modern quantum devices. Here we present two frameworks for deep quantum RL tasks using gradient-free evolutionary optimization. First, we apply the amplitude encoding scheme to the Cart-Pole problem; second, we propose a hybrid framework in which the quantum RL agents are equipped with a hybrid tensor network-variational quantum circuit (TN-VQC) architecture to handle inputs whose dimension exceeds the number of qubits. This allows us to perform quantum RL in the MiniGrid environment with 147-dimensional inputs. We demonstrate the quantum advantage of parameter saving afforded by amplitude encoding. The hybrid TN-VQC architecture provides a natural way to perform efficient compression of the input dimension, enabling further quantum RL applications on noisy intermediate-scale quantum (NISQ) devices.

Such limitations make the development of near-term quantum algorithms highly non-trivial.
Numerous efforts have been made to utilize these NISQ resources, and one of the notable achievements is the class of variational quantum algorithms (VQAs) [25,26]. In such a framework, the parts of a given computational task that can leverage the strength of quantum physics are placed on the quantum computer, while the rest remains on the classical computer. The outputs from the quantum computer are channeled into the classical computer, where a predefined algorithm determines how to adjust the parameters of the quantum circuit.
In classical RL, evolutionary optimization has been shown to reach similar or even superior performance compared to gradient-based methods on certain difficult RL problems [27]. It is therefore natural to consider the potential application of this method in quantum RL. To the best of our knowledge, applying evolutionary algorithms to quantum RL optimization has not been studied extensively. Moreover, current quantum devices offer only a small number of qubits, precluding potential use cases in environments with large input dimensions. Here we present an evolutionary, gradient-free method to optimize the quantum circuit parameters of RL agents. We show that this method can successfully train a quantum deep RL model to achieve state-of-the-art results on the Cart-Pole problem with only a few parameters, demonstrating a potential quantum advantage. In addition, we demonstrate that the evolutionary method can optimize models combining a tensor network (TN) and a variational quantum circuit (VQC) in an end-to-end manner, opening up more opportunities for the application of quantum RL on NISQ devices.
In this work, we present an evolutionary deep quantum RL framework to demonstrate the potential quantum advantage. Our contributions are:
• Demonstrating the quantum advantage of parameter saving via amplitude encoding. In the Cart-Pole environment, we successfully use amplitude encoding to encode a 4-dimensional input vector into a two-qubit system.
• Demonstrating the capabilities of the hybrid TN-VQC architecture in quantum RL scenarios. The hybrid architecture efficiently compresses the large-dimensional input into a small representation that can be processed on a NISQ device.
The paper is organized as follows. In Section II, we introduce the basics of reinforcement learning. In Section III, we describe the testing environments used in this work. In Section IV, we introduce the basics of variational quantum circuits. Section V describes the quantum circuit architecture for the Cart-Pole problem. Section VI introduces tensor network methods and describes the hybrid TN-VQC architecture for the MiniGrid problem.
Section VII explains the evolutionary method used to optimize quantum circuit parameters.
The performance of the proposed models is shown in Section VIII, followed by further discussion in Section IX. Finally, we conclude our work in Section X.

II. REINFORCEMENT LEARNING
Reinforcement learning is a machine learning paradigm in which a given goal is to be achieved by an agent interacting with an environment $E$ over a sequence of discrete time steps [1]. At each time step $t$, the agent observes a state $s_t$ and selects an action $a_t$ from a set of possible actions $A$ according to its current policy $\pi$. The policy is a mapping from a state $s_t$ to the probabilities of selecting each action in $A$. After performing the action $a_t$, the agent receives a scalar reward $r_t$ and the state of the next time step, $s_{t+1}$. For episodic tasks, the process proceeds over a number of time steps until the agent reaches the terminal state. An episode comprises all the states the agent experiences throughout this process, from a randomly selected initial state to the terminal state. At each state $s_t$ during training, the agent's overall goal is to maximize the expected return, which is quantified by the value function at state $s$ under policy $\pi$, $V^{\pi}(s) = \mathbb{E}[R_t \,|\, s_t = s]$, where $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the return, the total discounted reward from time step $t$. The discount factor $\gamma \in (0, 1]$ controls the influence of future rewards on the agent's decision making. A large discount factor $\gamma$ forces the agent to take into account the farther future, whereas a small $\gamma$ lets the agent focus on immediate rewards and ignore future rewards beyond a few time steps. The value function can be expressed as $V^{\pi}(s) = \sum_{a \in A} Q^{\pi}(s, a)\,\pi(a|s)$, where the action-value (or Q-value) function $Q^{\pi}(s, a) = \mathbb{E}[R_t \,|\, s_t = s, a]$ is the expected return of choosing action $a \in A$ in state $s$ according to the policy $\pi$. Selecting the best policy among all possible policies yields the maximal action-value, given by the optimal action-value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$, which in turn produces the maximal expected return.
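To make the return concrete, the following minimal Python sketch (the function name and the toy reward sequence are ours, for illustration only) computes $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ for every step of an episode:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for every t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 5-step episode with a reward of +1 at every step (Cart-Pole style):
print(discounted_return([1.0] * 5, gamma=0.9))
# [4.0951 3.439  2.71   1.9    1.   ]
```

Note how a smaller $\gamma$ shrinks the early-step returns toward the immediate reward, matching the discussion above.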

A. Cart-Pole
We first study the performance of a simple VQC model on the classic Cart-Pole problem, demonstrating the validity of quantum RL. Cart-Pole is a common testing environment for benchmarking simple RL models and has been a standard example in the OpenAI Gym [28] (see Figure 1). In this environment, a pole is attached by an un-actuated joint to a cart moving horizontally along a frictionless track. The pole initially stays upright, and the goal is to keep it as close to the initial state as possible by pushing the cart leftwards and rightwards.
The RL agent learns to output the appropriate action according to the observation it receives at each time step.
The Cart-Pole environment mapping is:
• Observation: a four-dimensional vector $s_t$ comprising the cart position, cart velocity, pole angle, and pole velocity at the tip.
• Action: there are two actions, +1 and −1, in the action space, corresponding to pushing the cart rightwards and leftwards, respectively. How the action is chosen with a variational quantum circuit is described in Sec. V B.
• Reward: a reward of +1 is given for every time step in which the pole remains close to upright. An episode terminates when the pole is angled more than 15 degrees from vertical or the cart moves more than 2.4 units away from the center. (A minimal interaction loop is sketched below.)
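The following sketch rolls out one episode of CartPole-v1 under a random policy. It uses the classic OpenAI Gym API (newer Gymnasium releases return `(obs, info)` from `reset` and a 5-tuple from `step`); the random action is a stand-in for the VQC policy of Sec. V:

```python
import gym

env = gym.make("CartPole-v1")
state = env.reset()       # 4-dim: [cart pos., cart vel., pole angle, tip vel.]
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()      # stand-in for the VQC policy
    state, reward, done, info = env.step(action)
    total_reward += reward                  # +1 per surviving time step
print("episode score:", total_reward)
```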

B. MiniGrid
We study the performance of our hybrid TN-VQC architecture (described in Sec. VI) in the more complex MiniGrid environment [29]. This environment also follows the standard OpenAI Gym API but has a much larger observation input, making it a desirable choice for studying our hybrid RL model. In this environment, the RL agent receives a 7 × 7 × 3 = 147-dimensional observation vector and must choose an action from the action space A, which contains six possibilities. As shown in Figure 2, the agent (the red triangle) is expected to find the shortest path from the starting point to the goal (shown in green).
The MiniGrid environment mapping is:
• Observation: a 147-dimensional vector $s_t$.
• Action: there are six actions, 0, . . . , 5, in A, corresponding to turning left, turning right, moving forward, picking up an object, dropping an object, and toggling an object, respectively.
• Reward: a reward of 1 is given when the agent reaches the goal, from which a penalty proportional to the number of steps taken is subtracted:
$$\text{reward} = 1 - 0.9 \times \frac{\text{step count}}{\text{max steps}}.$$
The maximum number of steps allowed is 4 × n × n, where n is the grid size [29]; in the present study we consider n = 5, 6, 8. Such a reward scheme is challenging since it is sparse: the agent receives no reward along the way until it reaches the goal, so most actions elicit no immediate response from the environment. (A sketch of this reward computation appears below.)
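The penalized reward is easy to reproduce (the function name is ours); note that it yields exactly the maximum score of 0.955 quoted for the 5×5 grid in Sec. VIII, corresponding to a 5-step solution:

```python
def minigrid_reward(step_count: int, grid_size: int) -> float:
    """Reward on reaching the goal, penalized by episode length.

    Mirrors the scheme described above; max_steps = 4 * n * n.
    """
    max_steps = 4 * grid_size * grid_size
    return 1.0 - 0.9 * (step_count / max_steps)

print(minigrid_reward(5, 5))     # 0.955: reaching the 5x5 goal in 5 steps
print(minigrid_reward(100, 8))   # ~0.648: a slow agent in the 8x8 grid
```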

IV. VARIATIONAL QUANTUM CIRCUITS
Variational quantum circuits are a class of quantum circuits with parameters tunable via iterative optimization, typically implemented with either gradient-based [30,31] or gradient-free methods [32]. In the present study, we employ a gradient-free approach based on an evolutionary algorithm. The architecture of a generic VQC is illustrated in Figure 3. The U(x) block serves as the state preparation part, which encodes the classical data x into the circuit's quantum state and is not subject to optimization, whereas the V(θ) block is the variational part containing the trainable parameters θ, which in this study are optimized through evolutionary methods. The output is obtained by measuring a subset or all of the qubits, thereby retrieving a classical bit string.

A. Quantum Encoding
For a quantum circuit to process data, the classical input vector must first be encoded into a quantum state. A general $N$-qubit quantum state can be written as
$$|\psi\rangle = \sum_{(q_1, q_2, \ldots, q_N) \in \{0,1\}^N} c_{q_1, q_2, \ldots, q_N} \, |q_1\rangle \otimes |q_2\rangle \otimes \cdots \otimes |q_N\rangle,$$
where the $c_{q_1, q_2, \ldots, q_N} \in \mathbb{C}$ are the amplitudes of the basis states and $q_i \in \{0, 1\}$.
The absolute square of each amplitude $c_{q_1, q_2, \ldots, q_N}$ is the probability of measuring the state $|q_1\rangle \otimes |q_2\rangle \otimes \cdots \otimes |q_N\rangle$, and all the probabilities sum to 1, i.e.
$$\sum_{(q_1, q_2, \ldots, q_N) \in \{0,1\}^N} \left| c_{q_1, q_2, \ldots, q_N} \right|^2 = 1.$$
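As a minimal numerical check of this normalization for a 2-qubit state:

```python
import numpy as np

# Four complex amplitudes c_{q1 q2} for |00>, |01>, |10>, |11>:
c = np.array([0.5, 0.5j, -0.5, 0.5], dtype=complex)
probs = np.abs(c) ** 2                 # measurement probabilities |c|^2
print(probs, probs.sum())              # [0.25 0.25 0.25 0.25] 1.0
```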
There are several kinds of encoding schemes commonly used in quantum ML applications [81]. Different encoding methods provide varying degrees of quantum advantage, and some are not readily implemented on real quantum hardware due to the large circuit depth. In this work, we employ two different encoding schemes, chosen according to the problem at hand: amplitude encoding (described in Section V A) for the Cart-Pole problem and variational encoding (described in Section VI C) for the MiniGrid problem.

V. QUANTUM ARCHITECTURE FOR THE CART-POLE PROBLEM
In this problem, the observation input is four-dimensional and can be readily encoded into a quantum circuit. The quantum circuit for the Cart-Pole experiment is shown in Figure 4.

FIG. 4. Quantum circuit architecture for the Cart-Pole problem. U(x) is the quantum routine for amplitude encoding, and the grouped box is the variational circuit block with tunable parameters $\alpha_i$, $\beta_i$, $\gamma_i$. In the evolutionary quantum RL architecture for the Cart-Pole problem, the grouped box repeats 4 times. The total number of parameters is 2 × 3 × 4 + 2 = 26, where the extra 2 parameters are the bias added after the measurement.

A. Amplitude Encoding
As the observation space of the Cart-Pole environment is continuous, it is impossible to use computational basis encoding (which is for discrete spaces, as used in the previous work [62]) to encode the input state. In this task, we employ the amplitude encoding method to transform the observation into the amplitudes of a quantum state. Amplitude encoding maps a normalized real vector $(\alpha_0, \cdots, \alpha_{2^N - 1})$ onto an $N$-qubit quantum state $|\Psi\rangle = \alpha_0 |00\cdots0\rangle + \cdots + \alpha_{2^N - 1} |11\cdots1\rangle$. A potential advantage is that an $m$-dimensional vector requires only $\lceil \log_2 m \rceil$ qubits to encode. The details of this operation are described in Appendix A. The whole quantum circuit simulation is performed with the package PennyLane.
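A minimal sketch of this step using PennyLane's built-in `qml.AmplitudeEmbedding` template (in place of the explicit routine of Appendix A; the example observation values are ours):

```python
import numpy as np
import pennylane as qml

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def encode(observation):
    # 4-dim Cart-Pole observation -> amplitudes of a 2-qubit state;
    # normalize=True rescales the input to unit norm, as required.
    qml.AmplitudeEmbedding(observation, wires=[0, 1], normalize=True)
    return qml.probs(wires=[0, 1])

obs = np.array([0.03, -0.20, 0.05, 0.31])   # a hypothetical observation
print(encode(obs))                          # |alpha_i|^2 of the encoded state
```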

B. Action Selection
The selection of the next action is similar to that used in quantum deep Q-learning [62]. Specifically, the output of the quantum circuit for this 2-qubit system, after classical post-processing, is a 2-tuple [a, b]. If a is the larger of the two values, the corresponding action is −1; if b is the larger, the action is +1, i.e. a simple argmax rule (sketched below).
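A sketch of this greedy rule (the function name is ours; the two bias terms are the 2 extra parameters counted in Fig. 4):

```python
import numpy as np

def select_action(expectations, bias):
    """Greedy action from the post-processed 2-qubit output."""
    q_values = np.asarray(expectations) + np.asarray(bias)
    return -1 if q_values[0] >= q_values[1] else +1

print(select_action([0.12, 0.40], bias=[0.0, 0.0]))   # +1: push rightwards
```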

VI. HYBRID TN-VQC ARCHITECTURE FOR THE MINIGRID PROBLEM
One of the key challenges of the NISQ era is that quantum computers are typically equipped with a limited number of qubits and can only execute quantum algorithms with small circuit depth. To process data whose input dimension exceeds the number of available qubits, some form of dimensionality reduction must first be applied to compress the input data. For example, in Ref. [44], the authors applied a classical pre-trained convolutional neural network to reduce the input dimension and then used a small VQC model to classify the images. However, the pre-trained model is already sufficiently powerful on its own, and it is not clear whether the VQC plays a critical role in the whole process. On the other hand, in Ref. [49], the authors explored the possibility of using a TN for feature extraction and training the TN-VQC hybrid model in an end-to-end fashion, showing that such a hybrid architecture succeeds in classification tasks. However, to the best of our knowledge, the potential of such an architecture has not yet been explored in other machine learning tasks. Since the observation in the MiniGrid environment is a 147-dimensional vector, which is impossible to process on current NISQ devices, we propose a hybrid TN-VQC agent architecture (see Figure 5) to achieve an efficient dimensionality reduction.

A. Tensor Network
Tensor networks (TNs) are a technique originally developed in the field of quantum many-body physics [83-89] for efficiently expressing the quantum wave function $|\Psi\rangle$. The matrix product state (MPS), among others, is a type of one-dimensional TN that decomposes a large tensor into a series of matrices. A general $N$-qubit quantum state can be written as
$$|\Psi\rangle = \sum_{i_1, i_2, \ldots, i_N} T_{i_1 i_2 \cdots i_N} \, |i_1\rangle \otimes |i_2\rangle \otimes \cdots \otimes |i_N\rangle,$$
where $T_{i_1 i_2 \cdots i_N}$ is the amplitude of each basis state $|i_1\rangle \otimes |i_2\rangle \otimes \cdots \otimes |i_N\rangle$. As the number of amplitudes grows exponentially with $N$, one can largely reduce the space in which $|\Psi\rangle$ resides by decomposing the tensor $T$ into a product of matrices [90]:
$$T_{i_1 i_2 \cdots i_N} = \sum_{\alpha_1, \ldots, \alpha_{N-1}} A^{i_1}_{\alpha_1} A^{i_2}_{\alpha_1 \alpha_2} \cdots A^{i_N}_{\alpha_{N-1}},$$
where the matrices $A$ are indexed from 1 to $N$ and the $\alpha_j$ are the virtual indices. Each virtual index $\alpha_j$ has a dimension $m$, called the bond dimension, which serves as a tunable hyperparameter of the MPS approximation. It is known that for a sufficiently large $m$, an MPS can represent any tensor [91]. In machine learning applications, the bond dimension $m$ is typically used to tune the number of trainable parameters and thereby the expressive power of the MPS. Figure 6 illustrates tensors and MPS. We refer to [92] for an in-depth introduction to tensor networks.
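The decomposition itself can be sketched in a few lines of NumPy (names ours): successive singular value decompositions peel off one site at a time, and truncating to a finite `max_bond` gives the approximation controlled by the bond dimension $m$:

```python
import numpy as np

def tensor_to_mps(T, max_bond=None):
    """Decompose an N-index tensor (each index of dimension 2) into MPS
    factors via successive SVDs, optionally truncating the bond dimension."""
    N = T.ndim
    mps, bond = [], 1
    M = T.reshape(bond * 2, -1)
    for _ in range(N - 1):
        U, S, Vh = np.linalg.svd(M, full_matrices=False)
        keep = len(S) if max_bond is None else min(len(S), max_bond)
        mps.append(U[:, :keep].reshape(bond, 2, keep))    # one A tensor
        bond = keep
        M = (np.diag(S[:keep]) @ Vh[:keep]).reshape(bond * 2, -1)
    mps.append(M.reshape(bond, 2, 1))                     # last site
    return mps

# Exact MPS of a random 6-qubit amplitude tensor:
T = np.random.randn(*([2] * 6))
print([A.shape for A in tensor_to_mps(T)])
# [(1,2,2), (2,2,4), (4,2,8), (8,2,4), (4,2,2), (2,2,1)]
```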
Since the pioneering work [93], great efforts have been made to apply TNs in the field of machine learning. TN-based methods have been utilized for applications such as classification [93-99], generative modeling [100-102], and sequence modeling [103]. It has also been shown that TN-based architectures have deep connections to the building of quantum machine learning models [104]. Specifically, it is possible to encode a quantum-inspired TN architecture such as an MPS into a quantum circuit with single- and two-qubit gates [105].

B. MPS Operation
In the TN-VQC architecture of this study, we use an MPS-based feature extractor as the TN part to reduce the input dimension. For an MPS to process an input vector $v$, a feature map $\Phi$ is first applied to lift $v$ into a tensor product space,
$$\Phi(v) = \phi(v_1) \otimes \phi(v_2) \otimes \cdots \otimes \phi(v_N),$$
where each $\phi$ is a $d$-dimensional feature map, mapping each $v_j$ into a $d$-dimensional vector. The value $d$ is known as the local dimension. In this work, we choose $d = 2$, with $\phi(v_j)$ a fixed two-component map of each input entry. The input vector $v$, a state/observation perceived by the agent, is thus encoded into a tensor product state, which is then contracted with the trainable MPS to produce a vector:
$$y_l = \sum_{i_1, i_2, \ldots, i_N} T^{l}_{i_1 i_2 \cdots i_N} \, \phi(v_1)^{i_1} \phi(v_2)^{i_2} \cdots \phi(v_N)^{i_N},$$
where $i_1, i_2, \cdots, i_N \in \{0, 1\}$ and $T^{l}_{i_1 i_2 \cdots i_N}$ is defined as in Equation 5 but with an additional rank-3 tensor in the middle carrying an open leg $l$ that represents the 8-dimensional output, i.e. the compressed representation, as shown schematically in Figure 5. In Figure 5, the feature-mapped input and the trainable MPS are shown as red and blue circles (nodes), respectively.
As the observation input in MiniGrid is a 147-dimensional vector, there are in total 147 input nodes and (147 + 1) MPS nodes. A small numerical sketch of this contraction is given below.
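A NumPy sketch of the feature extractor (names are ours; the sinusoidal local map and bond dimension below are illustrative assumptions, standing in for the exact choices of this section):

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_map(v):
    # Illustrative d = 2 local map; any two-component map fits the scheme.
    return np.stack([np.cos(np.pi * v / 2), np.sin(np.pi * v / 2)], axis=-1)

def make_mps(n_sites, bond, out_dim, center):
    """n_sites input tensors (bond, 2, bond) plus one rank-3 tensor
    (bond, out_dim, bond) at `center` carrying the open output leg."""
    tensors = []
    for j in range(n_sites + 1):
        dl = 1 if j == 0 else bond
        dr = 1 if j == n_sites else bond
        mid = out_dim if j == center else 2
        tensors.append(0.1 * rng.standard_normal((dl, mid, dr)))
    return tensors

def mps_extract(v, tensors, center):
    phi = feature_map(v)               # (n_sites, 2) local features
    carrier, k = np.ones((1, 1)), 0    # (output-so-far, bond) carrier
    for j, A in enumerate(tensors):
        if j == center:                # open the output leg
            carrier = np.einsum('ol,lmr->omr', carrier, A).reshape(-1, A.shape[2])
        else:                          # contract the input leg with phi
            carrier = carrier @ np.einsum('i,lir->lr', phi[k], A)
            k += 1
    return carrier[:, 0]               # the compressed representation

v = rng.uniform(size=147)              # a MiniGrid observation
mps = make_mps(147, bond=4, out_dim=8, center=74)
print(mps_extract(v, mps, center=74).shape)   # (8,) -> fed to the VQC
```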

C. VQC Processing
For the VQC part, we adopt the variational encoding method to encode the compressed representation into a quantum state. The initial state $|0\rangle \otimes \cdots \otimes |0\rangle$ first undergoes the $H \otimes \cdots \otimes H$ operation to become the unbiased state $|+\rangle \otimes \cdots \otimes |+\rangle$. For an $N$-qubit system, the corresponding unbiased state is
$$\left(H|0\rangle\right)^{\otimes N} = \frac{1}{\sqrt{2^N}} \sum_{(q_1, \ldots, q_N) \in \{0,1\}^N} |q_1\rangle \otimes \cdots \otimes |q_N\rangle.$$
This unbiased quantum state subsequently goes through the encoding part, which consists of $R_y$ and $R_z$ rotations parameterized by the compressed representation vector $x = (x_1, x_2, \cdots, x_8)$. For the $i$-th qubit, we choose the $R_y$ and $R_z$ rotation angles to be $\arctan(x_i)$ and $\arctan(x_i^2)$, respectively.
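A minimal PennyLane sketch of this encoding block (the $\arctan(x_i^2)$ angle follows the scheme of [62], as filled in above; the variational layers and measurements that follow are omitted):

```python
import numpy as np
import pennylane as qml

n_qubits = 8
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def variational_encode(x):
    for i in range(n_qubits):
        qml.Hadamard(wires=i)                  # |0> -> |+>: unbiased state
        qml.RY(np.arctan(x[i]), wires=i)       # encode x_i
        qml.RZ(np.arctan(x[i] ** 2), wires=i)  # encode x_i^2
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

x = np.random.randn(8)     # output of the MPS feature extractor
print(variational_encode(x))
```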

D. Action Selection
The selection of the next action is similar to that used in quantum deep Q-learning [62]. Specifically, the output of the quantum circuit for this 8-qubit system, after classical post-processing, is a 6-tuple [a, b, c, d, e, f]. Whichever of a/b/c/d/e/f is the maximum among the six values determines the corresponding action 0/1/2/3/4/5.

VII. QUANTUM CIRCUIT EVOLUTION
Here we elucidate our quantum circuit evolution algorithm, inspired by the work [27].
The essential concept of this approach is to first generate a population of agents with random parameters and then let them evolve through a number of generations with a certain mutation rate. In each generation, the fittest agents are selected to produce the next generation. The details of each step are explained below; see Appendix B for the pseudocode of the whole quantum circuit evolution algorithm.

A. Initialization
We first initialize a population P of N agents, each given randomly generated initial parameters θ sampled from the normal distribution N(0, I) and multiplied by a factor of 0.01. The multiplication factor 0.01 sets the parameters near zero, rendering the training process more stable.

B. Running the Agents
In each generation, the fitness of every agent is evaluated as follows. Each agent plays the game $R_1$ times, and the average score, which represents the fitness, is calculated as
$$S^{\text{avg}}_i = \frac{1}{R_1} \sum_{r=1}^{R_1} s^{(r)}_i,$$
where $s^{(r)}_i$ is the score of the $r$-th trial obtained by the $i$-th agent. The score here is simply the sum of rewards within an episode. Given the fitness of all the agents, we then select the top $T$ agents according to their average scores $S^{\text{avg}}_i$. The resulting group, called the parents, is used to generate the next generation.

C. Mutation and the Next Generation
The N children of the next generation are generated via two separate procedures. The first is to generate a group of N − 1 children, each a single agent randomly selected from the parent group and slightly mutated. Specifically, the parameter vector θ of the parent agent undergoes the mutation operation $\theta \leftarrow \theta + \sigma \epsilon$, where $\sigma$ is the mutation power and $\epsilon$ is Gaussian noise sampled from the normal distribution $\mathcal{N}(0, I)$.
This is distinct from commonly used gradient-based methods in that the optimization direction is chosen randomly, a feature that can potentially circumvent local optima and efficiently optimize the parameters in environments with sparse rewards [27]. The second procedure is to find the elite, the N-th child, which is not mutated. To make the selection more robust against noise, we let each agent from the parent group play the game $R_2$ times and obtain the average scores $E^{\text{avg}}_j$. The agent with the highest average score $E^{\text{avg}}_j$ is selected as the elite child. A sketch of the full loop follows.
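The whole loop, as described across Secs. VII A-C, can be sketched as follows (the toy fitness function is ours; in the experiments, `fitness` runs the VQC agent in the environment and averages the episode scores):

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve(fitness, n_params, N=500, T=5, sigma=0.02,
           generations=100, R1=3, R2=5):
    """Truncation-selection evolution of flat parameter vectors."""
    population = [0.01 * rng.standard_normal(n_params) for _ in range(N)]
    for _ in range(generations):
        scores = [fitness(theta, R1) for theta in population]
        parents = [population[i] for i in np.argsort(scores)[-T:]]
        # Elite: re-evaluate the parents more carefully (R2 plays each).
        elite = parents[int(np.argmax([fitness(p, R2) for p in parents]))]
        # N-1 mutated children plus the unmutated elite:
        population = [parents[rng.integers(T)]
                      + sigma * rng.standard_normal(n_params)
                      for _ in range(N - 1)] + [elite]
    return elite

# Toy stand-in fitness: negative distance to a target parameter vector.
target = np.ones(26)
best = evolve(lambda theta, repeats: -np.linalg.norm(theta - target),
              n_params=26, N=50, generations=200)
print(float(-np.linalg.norm(best - target)))   # approaches 0 as it converges
```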

VIII. EXPERIMENTS AND RESULTS
We first demonstrate the quantum advantage of the VQC with amplitude encoding in the standard Cart-Pole benchmark environment. We then show the capabilities of our hybrid TN-VQC architecture in processing larger-dimensional input states in the MiniGrid environment. The evolutionary optimization procedure is the same for both experiments.

A. Cart-Pole
In this experiment, we set the number of generations to 1700, the population size N = 500, the truncation selection number T = 5, the mutation power σ = 0.02, the number of repetitions R_1 = 3 (for evaluating all the agents), and the number of repetitions R_2 = 5 (for evaluating the parents). The simulation results of this experiment are shown in Figure 8. After about 250 generations, the average score of the top 5 agents is steadily above 400, and after around 1300 generations, the top 5 agents all converge to the optimal policy and reach a score of 500, which corresponds to the maximum number of steps allowed in the environment.
A notable achievement is that we use only 26 parameters to reach the optimal result, at least one order of magnitude fewer than typical classical neural networks require. Empowered by amplitude encoding as well as the nature of the VQC, we significantly reduce the number of parameters for this specific problem. It is thus highly desirable to explore the feasibility of applying such an encoding method to other quantum RL problems, which could bring about a quantum advantage by reducing the model complexity, as quantified by the number of parameters, to as little as poly(log n), in contrast to the poly(n) parameters typically required in standard neural networks, where n is the dimension of the input vector.

B. MiniGrid
Here we consider the three configurations shown in Figure 2. The observation is a 147-dimensional vector and there are 6 possible actions, as described in Section III B. In this experiment, we employ the hybrid TN-VQC architecture combining the MPS and the VQC. The MPS feature extractor receives the 147-dimensional input state from the environment and outputs an 8-dimensional vector to be encoded into the VQC. Empowered by the tensor network method, we successfully compress the large input vector into a small vector representation favorable for quantum circuit processing. This opens the possibility of studying other complex RL problems with quantum circuits via such a dimensionality reduction technique. We set the number of generations to 500. The simulation results for MiniGrid-Empty-5x5-v0, MiniGrid-Empty-6x6-v0, and MiniGrid-Empty-8x8-v0 are shown in Figure 9.
In the MiniGrid-Empty-5x5-v0 environment (with a maximum score of 0.955), the simplest of the three, the average score of the top-5 agents reaches a near-optimal value in fewer than 40 generations. In the MiniGrid-Empty-6x6-v0 environment (with a maximum score of ∼0.956), which is harder than the previous one, the average score of the top-5 agents rises above 0.9 within 40 generations and reaches a near-optimal value after around 120 generations. In the MiniGrid-Empty-8x8-v0 environment (with a maximum score of ∼0.961), the most difficult of the three, it takes about 350 generations for the average score of the top-5 agents to rise and stay steadily above 0.9. Our model clearly performs worst in the last environment, in terms of both convergence speed and final score.

IX. DISCUSSION

A. Relevant Studies
Early work on quantum reinforcement learning can be traced back to [106], which requires loading the environment into quantum superposition states; this is not generally applicable to classical environments. In [107], the authors consider the situation where computational agents are coupled to environments that are quantum-mechanical. More recent studies introduce and facilitate the use of variational quantum circuits in reinforcement learning applications [62,63,66-68]. In contrast to the present study, all of them use gradient-based methods to optimize the policy and/or value function. For the Cart-Pole testing environment, which is studied in both works [67,68], we observe that these two works and ours all reach the optimal solution. However, our model requires only 2 qubits, significantly fewer than the 4-qubit VQC architectures employed in [67,68]. An interesting research direction is to what extent the choice between gradient-free and gradient-based optimization actually affects the performance of VQCs on RL problems. Could certain VQC architectures benefit more from a particular kind of optimization method? We leave this for future investigation.
For a more detailed review of recent developments in quantum reinforcement learning, we refer the interested readers to [64,108].

B. Complex Benchmarks
In this work, we further extend the complexity of the testing environments in comparison to previous works, including the study [62], which considers discrete observations/states. In particular, we push the boundary of quantum RL by incorporating a quantum-inspired tensor network into a VQC-based architecture. Despite this success, there is still a significant gap between current quantum RL and classical RL in terms of the capability to process high-dimensional inputs.

C. Evolutionary Quantum Circuits
The use of evolutionary methods to optimize quantum circuits can also be found in recent works [32,109,110]. Both [32] and [109] use evolutionary methods to optimize VQE problems; notably, [32] introduces an evolutionary approach involving structural mutation to optimize the quantum circuit. The work [110] utilizes a graph-encoding method to encode a quantum circuit and then adopts an evolutionary method to optimize the quantum model for certain classification tasks.
However, none of these works considers the direct application of evolutionary optimization to quantum reinforcement learning problems. Our work not only demonstrates the first successful implementation of an evolutionary method for quantum RL, but also touches on another rarely studied aspect: the end-to-end training of a hybrid model consisting of an MPS and a VQC.

D. More from Classical Neuroevolution
In the present study, we employ evolutionary algorithms specifically to optimize the quantum circuit parameters. In classical neuroevolution, the whole architecture as well as the neural network parameters can be optimized through evolution. A framework called NeuroEvolution of Augmenting Topologies (NEAT) [111] has been proposed for evolving a classical neural network's topology along with its weights. It is thus intriguing to investigate the prospect of applying such concepts to evolving quantum machine learning architectures. For evolving a model with a complex architecture and a sizable number of parameters, it is crucial to encode the model itself in an efficient fashion. Recent advances in neuroevolution [112] could serve as guidance for designing high-performance evolutionary algorithms in quantum ML.
A major issue yet to be addressed in our work is that our hybrid model can only reach sub-optimal results on harder problems. In particular, it is challenging for the current model to achieve the maximum score when the rewards are sparse. One potential solution to this issue is novelty search, developed in classical neuroevolution [113,114].
The idea behind novelty search is that the agent is not trained to achieve the objective in a conventional way; rather, novelty search rewards the agents that behave differently [115]. In classical deep RL, novelty search has been shown capable of solving hard problems with sparse rewards [116]. A potential direction of future research is thus to investigate whether such a framework works in the quantum regime.
Evolutionary algorithms also play a critical role in the security of RL models. For example, in [117], the authors explore the potential of applying evolutionary algorithms to attack a deep RL model. Hence, another potential research direction is to study the robustness of quantum RL agents [118].

E. Training on a Real Quantum Computer

Given that currently available quantum computers suffer seriously from noise, in this research we consider only noise-free simulation. Although previous results [33-35] have indicated that VQC-based algorithms may be resilient to noise, by absorbing these undesirable effects into the tunable parameters, the limitations of current cloud-based quantum computing resources make it impractical to run the whole training process on a real quantum computer to verify customized models such as the TN-VQC one we propose. We expect such issues to be resolved as commercial quantum devices become more reliable and accessible.

X. CONCLUSION
In this study, we present two quantum reinforcement learning frameworks based on evolutionary algorithms, of which one is purely quantum and the other has a hybrid quantum-classical architecture. In particular, we study two input loading schemes that reduce the number of qubits the VQC requires, amplitude encoding and tensor network compression, and demonstrate the performance of each through numerical simulation. First, through the Cart-Pole problem (with an input dimension of 4), we show that with amplitude encoding, a VQC-based framework can provide a quantum advantage in terms of parameter saving.

Appendix A: Amplitude Encoding

The rotation angles $\beta^{s}_{j}$ can be shown to be [82]
$$\beta^{s}_{j} = 2 \arcsin\left( \frac{\sqrt{\sum_{l=1}^{2^{s-1}} \left| \alpha_{(2j-1)\,2^{s-1} + l} \right|^2}}{\sqrt{\sum_{l=1}^{2^{s}} \left| \alpha_{(j-1)\,2^{s} + l} \right|^2}} \right),$$
where the $\alpha_i$ are the amplitudes $(\alpha_0, \cdots, \alpha_{2^n - 1})$. To utilize the aforementioned quantum routine to perform amplitude encoding, we can simply invert each operation and apply them in reverse order to the initial quantum state $|00\cdots0\rangle$ [81]. We provide the example for the 2-qubit system used in our Cart-Pole experiment in Figure 11.