Uncovering Instabilities in Variational-Quantum Deep Q-Networks

Deep Reinforcement Learning (RL) has considerably advanced over the past decade. At the same time, state-of-the-art RL algorithms require a large computational budget in terms of training time to converge. Recent work has started to approach this problem through the lens of quantum computing, which promises theoretical speed-ups for several traditionally hard tasks. In this work, we examine a class of hybrid quantum-classical RL algorithms that we collectively refer to as variational quantum deep Q-networks (VQ-DQN). We show that VQ-DQN approaches are subject to instabilities that cause the learned policy to diverge, study the extent to which this afflicts the reproducibility of established results based on classical simulation, and perform systematic experiments to identify potential explanations for the observed instabilities. Additionally, and in contrast to most existing work on quantum reinforcement learning, we execute RL algorithms on an actual quantum processing unit (an IBM Quantum device) and investigate differences in behaviour between simulated and physical quantum systems that suffer from implementation deficiencies. Our experiments show that, contrary to claims in the literature, it cannot be conclusively decided whether known quantum approaches, even if simulated without physical imperfections, provide an advantage over classical approaches. Finally, we provide a robust, universal and well-tested implementation of VQ-DQN as a reproducible testbed for future experiments.

At the same time, state-of-the-art deep RL methods require an exorbitant computational budget to match or exceed human performance on seemingly simple tasks, such as playing arcade video games. As an example, Badia et al. [15] report training times of roughly 53 000 hours, distributed over 256 machines, to achieve superhuman performance on all 57 Atari games of the Arcade Learning Environment benchmark [16]. Also, the learning dynamics of these approaches, both in terms of stability and optimality, are not yet fully understood and remain a subject of current research [17,18,19,20].
Concurrent to these developments, quantum computing [21] has started to receive increasing interest in real-life applications. It promises computational speedups, especially for selected weakly structured search problems like integer factoring [22] or unstructured database search [23,24], by exploiting fundamental phenomena of quantum mechanics (see Sec. 2.2). Reinforcement learning can be regarded as a search problem (in terms of seeking an optimal policy, as we outline in Sec. 2.1). Consequently, it is natural to ask whether a quantum speedup is realisable in this domain.
Limitations on achievable speedups have been studied in detail [25], and lower bounds are known for several important fundamental problems [26]. Despite numerous technological challenges rooted in, amongst others, noise and imperfections of near-term, intermediate-scale quantum devices [27], sufficient margins for industrially relevant improvements remain [28,29], but necessitate a more precise understanding and a critical evaluation of the performance of quantum approaches on currently available hardware designs. Since RL, like other machine learning approaches, relies on stochastic components that may amplify variations in algorithmic performance (and, more generally, challenge replication efforts), this is another aspect that requires careful consideration.
In this article, we examine and extend a class of recent hybrid quantum-classical approaches to reinforcement learning that we collectively refer to as Variational-Quantum Deep Q-Networks (VQ-DQN). Originally proposed by Chen et al. [30] and later refined by Lockwood and Si [31], VQ-DQN builds upon the deep Q-networks (DQN) algorithm [32,33], replacing the core neural network component with a quantum machine learning model, namely, a variational quantum circuit (VQC) [34]. Although the results published in [30,31] promise interesting properties, we show that VQ-DQN approaches are subject to instabilities that ultimately cause the learned policy to diverge. Policy divergence is obviously detrimental to the practical utility of the approach, especially if it already happens in perfect simulations of quantum systems. Quantum computers that can be manufactured under the constraints of current technological limitations additionally suffer from noise, imperfections, and severely limited numbers of available quantum bits; they are referred to as noisy intermediate-scale quantum (NISQ) computers. To understand the additional degradation caused by these imperfections on the performance of RL approaches, we perform comparative experiments on actual quantum hardware: a gate-based IBMQ device (Falcon r4) operated in Ehningen, Germany.
In general, our investigation is part analysis and part reproduction study, and we provide a reproduction package with a well-tested implementation 2 of VQ-DQN written in TensorFlow [35] and Qiskit [36] as an open testbed for future experiments.
The paper is structured as follows: Section 2 provides a concise introduction to DQN (2.1), VQCs (2.2), and the VQ-DQN algorithm (2.3). Section 3 reviews related work. Section 4 describes our methodological approach towards finding and characterising instabilities. Section 5 summarizes our experiments. Section 6 explains the validation experiment on a real quantum device. Further, we compare a DQN with a variational quantum circuit against a DQN with a classical neural network in Section 7. Finally, we conclude in Section 8.

Background
To introduce the concepts used in this study, the following paragraphs discuss notation and basic principles of both machine learning and quantum computation.

Deep Q-Learning
Most formulations of RL center around the notion of a Markov decision process (MDP) [37], where an agent interacts with an environment at discrete time steps t. In each time step, the current configuration of the environment is summarised by the state S_t ∈ S. Based on this information, the agent selects an action A_t ∈ A according to a policy π(s, a) = P[A_t = a|S_t = s]. Executing the selected action causes a transition of the environment to a next state S_{t+1}; simultaneously, the agent receives a scalar reward R_{t+1} that quantifies the contribution of the selected action towards solving the task. The agent's goal is to maximize the return, i.e., the discounted sum of rewards, G_t = Σ_{t'=t}^{T−1} γ^{t'−t} R_{t'+1}, until a terminal state S_T is reached. In that, the discount factor γ controls how much the agent favors immediate over future rewards. Both S_{t+1} and R_{t+1} are assumed to obey the Markov property (i.e., conditional independence of previous states and actions given S_t, A_t). However, the MDP's dynamics, P[S_{t+1}, R_{t+1}|S_t, A_t], are typically unknown to the agent, which necessitates learning a policy by trial and error.
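As an illustration of the quantities defined above, the discounted return can be computed backwards over an episode's reward sequence. The following minimal sketch is environment-agnostic and not part of the reproduced implementations:

```python
# Discounted return G_t = sum_{t'=t}^{T-1} gamma^(t'-t) * R_{t'+1},
# computed backwards over an episode's reward sequence.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# A three-step episode with a reward of 1 at every step:
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # -> [1.75, 1.5, 1.0]
```

The backward recursion makes the role of γ explicit: with γ = 0.5, rewards two steps ahead contribute only a quarter of their value to G_t.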
The fundamental idea of Deep Q-Learning (also referred to as deep Q-networks, DQN) [32,33] is to learn the optimal state-action value function Q*(s, a) = max_π E[G_t|S_t = s, A_t = a, π], that is, the return expected when taking action a in state s and then following an optimal policy in all future states. Once Q*(s, a) is known, an optimal policy can be easily recovered by selecting actions greedily, that is, π*(s) = arg max_a Q*(s, a). This is achieved by training a neural network to satisfy the well-known Bellman Optimality Equation (BOE) that relates the value of a state-action pair to the value of the next state:

Q*(s, a) = E[R_{t+1} + γ max_{a'} Q*(S_{t+1}, a') | S_t = s, A_t = a].    (1)

More concretely, the deep Q-network is trained to minimize the difference between the left- and right-hand side of this equation (also known as the temporal difference error or TD-error), under some loss function (e.g., L2), evaluated on mini-batches of transitions (S_t, A_t, R_{t+1}, S_{t+1}) sampled by the agent. These transitions are sampled using an off-policy approach: instead of applying the current greedy policy (also called the target policy), an ε-greedy behavior policy that selects a random action with probability ε is chosen. Decaying ε over the course of training allows the agent to explore the environment, while guaranteeing that the behavior policy and target policy (and hence, the underlying data distributions) converge eventually.
As Mnih et al. [38] point out, learning Q* with a high-capacity function approximator leads to convergence problems. To this end, DQN makes use of (1) a target network, which is a copy of the deep Q-network with temporarily fixed weights to evaluate the right-hand side of Eq. (1), and (2) an experience replay buffer [39] from which experienced transitions are re-sampled for mini-batch gradient descent. For a detailed discussion of these specifics, we refer the interested reader to [32,33].
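The role of the target network in the TD-error can be sketched as follows; this is a simplified illustration in which `q_target` is a hypothetical stand-in for the frozen target network, not the paper's implementation:

```python
def td_targets(batch, q_target, gamma=0.99):
    """Compute DQN regression targets R_{t+1} + gamma * max_a' Q_target(S_{t+1}, a')
    for a mini-batch of transitions (s, a, r, s_next, done)."""
    targets = []
    for (s, a, r, s_next, done) in batch:
        if done:
            targets.append(r)  # terminal transitions carry no bootstrap term
        else:
            targets.append(r + gamma * max(q_target(s_next)))
    return targets

# Hypothetical target network returning fixed Q-values for two actions:
q_target = lambda s: [0.0, 1.0]
batch = [((0.0,), 1, 1.0, (0.1,), False),   # regular transition
         ((0.1,), 0, 0.0, None, True)]      # terminal transition
print(td_targets(batch, q_target, gamma=0.5))  # -> [1.5, 0.0]
```

The deep Q-network is then fitted to these targets under the chosen loss; keeping `q_target` fixed between synchronisation points is what stabilises the regression problem.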

Variational Quantum Circuits
Quantum computation uses the qubit as the fundamental unit of information. In contrast to classical bits, a set of n qubits can not only assume the 2^n classical basis states (i.e., 0, 1, . . . , 2^n − 1), but also any superposition of these basis states. Note that superimposable quantum states reside in a continuous state space that is vastly larger than that of their classical counterparts, which is often seen as an indication of increased computational capabilities, although the exact source of possible quantum speedups remains elusive [40].
The variational quantum circuit is a machine learning model based on quantum circuits [34]. Similar to neural networks, VQCs consist of sequential layers that represent parameterised transformations on the VQC's quantum state. In particular, VQC layers apply e.g. learnt single-qubit rotations (in X-, Y -, and Z direction using the corresponding Pauli operators [21]) to each qubit of the circuit. Entanglement can be generated by applying a series of CNOT-gates [21] to pairs of qubits. The specific single-qubit rotation parameters are learned via gradient-descent on an error signal, computed over the expected measurements in Z direction of one or more output qubits.
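To make the layer structure concrete, the following toy statevector simulator (pure Python, written for this exposition and unrelated to the cited implementations or circuit architectures) applies a learnt RY rotation and a CNOT to two qubits and reads out a Z expectation value:

```python
import math

def apply_1q(state, gate, q, n):
    """Apply a 2x2 gate to qubit q (qubit 0 = most significant bit) of an n-qubit state."""
    new = [0j] * len(state)
    shift = n - 1 - q
    for i in range(len(state)):
        b = (i >> shift) & 1
        j = i ^ (1 << shift)  # partner index with qubit q flipped
        if b == 0:
            new[i] = gate[0][0] * state[i] + gate[0][1] * state[j]
        else:
            new[i] = gate[1][0] * state[j] + gate[1][1] * state[i]
    return new

def ry(theta):
    """Single-qubit rotation about the Y axis by angle theta."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]

def cnot(state, control, target, n):
    """Flip the target qubit wherever the control qubit is 1."""
    new = list(state)
    cs, ts = n - 1 - control, n - 1 - target
    for i in range(len(state)):
        if (i >> cs) & 1:
            new[i] = state[i ^ (1 << ts)]
    return new

def z_expectation(state, q, n):
    """<Z> on qubit q: +|amp|^2 where the qubit is 0, -|amp|^2 where it is 1."""
    shift = n - 1 - q
    return sum(((-1) ** ((i >> shift) & 1)) * abs(a) ** 2 for i, a in enumerate(state))

# One variational layer on 2 qubits: a learnt RY rotation followed by a CNOT.
n = 2
state = [1 + 0j, 0j, 0j, 0j]              # |00>
state = apply_1q(state, ry(math.pi / 2), 0, n)
state = cnot(state, 0, 1, n)              # entangles the two qubits
print(abs(round(z_expectation(state, 1, n), 6)))  # -> 0.0 (maximally entangled marginal)
```

In a real VQC, the rotation angles are the trainable parameters and the Z expectations of designated output qubits form the model's output; the CNOT layer is what lets the model represent correlations between qubits.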

Q-value extraction
For a given input MDP state, Q-values are predicted for all |A| actions simultaneously by taking the expectation value of a measurement (in Z direction) of a corresponding number of output qubits. The resulting measurements lie within [−1; 1]; obtaining valid action values thus requires further processing, for instance by scaling the measured results by a learnt multiplicative factor.

Input encoding
To input a (classical) MDP state s ∈ S to the VQC, that state needs to be represented as a quantum state |Ψ(s)⟩ using the available qubits. Chen et al. [30] address this problem by only considering MDPs with discrete state spaces and associating each MDP state with one of the 2^N quantum basis states. Lockwood and Si [31] and Skolik et al. [41] extend this method to MDP states with continuous components via a simple encoding scheme, with which the authors report results on the "Blackjack" and "CartPole-v0" environments (see Ref. [42] for implementation details). In particular, each component of the input state s is encoded by applying parameterised Pauli rotation gates [21] to one respective qubit in the circuit (initialised to |0⟩). Lockwood and Si [31] propose two encoding schemes: Scaled (S) encoding, which determines a rotation angle by scaling finite-domain input components to [0, 2π], and Directional (D) encoding, which encodes infinite-domain inputs by rotating the qubit by π if the input is greater than 0. Skolik et al. [41] additionally present Continuous (C) encoding, which computes rotation angles as the arctan of the respective input component.
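The three schemes reduce to simple per-component angle computations. The sketch below is our own paraphrase of the cited schemes, so the exact scaling conventions may differ from the referenced implementations:

```python
import math

def scaled_angle(x, lo, hi):
    """Scaled (S): map a finite-domain component in [lo, hi] to [0, 2*pi]."""
    return 2 * math.pi * (x - lo) / (hi - lo)

def directional_angle(x):
    """Directional (D): rotate by pi iff the component is positive."""
    return math.pi if x > 0 else 0.0

def continuous_angle(x):
    """Continuous (C): arctan of the component."""
    return math.atan(x)

# CartPole pole angle is limited to roughly +/-0.21 rad (assumed bounds):
print(round(scaled_angle(0.0, -0.21, 0.21), 4))  # -> 3.1416 (pi, the midpoint)
print(directional_angle(-0.5))                   # -> 0.0
print(round(continuous_angle(1.0), 4))           # -> 0.7854
```

The information loss of directional encoding is visible here: every positive value maps to the same angle, which is consistent with its subpar performance reported later.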

Deep Q-Learning and its instabilities
The DQN approach dates back to Watkins' Q-Learning [43] and has seen sustained interest over the years due to its learning capabilities; Deep Q-Learning itself remains an active field of research because of its versatility in end applications. Nevertheless, as versatile as the end applications are, the algorithm leaves room for improvement in its stability and speed of convergence to a solution [17,44,45,46,47,48,49,15]. In particular, Q-learning approaches, i.e., off-policy learning with function approximation and bootstrapping, are known to diverge in certain scenarios. This divergence occurs more often when the Q-value is approximated using a non-linear function approximator such as a deep neural network. However, the root causes are still unknown [50,51,52,18].

Quantum Reinforcement Learning
Over the past few years, there have been several attempts to improve the performance of reinforcement learning algorithms via a possible 'quantum advantage' using quantum computing. As in the classical realm, no single method has emerged as the superior approach in performance or generality. The first quantum reinforcement learning (QRL) algorithm (to our knowledge) was proposed by Dong et al. [53], which uses a modified version of Grover's algorithm [23] to learn a state-value function. As in the classical reinforcement learning family, whose members vary in algorithm and methodology, various algorithms for QRL have been studied [54,55,56,57]. The VQ-DQN algorithm was originally proposed by Chen et al. [30], where the authors used variational quantum circuits to solve two different discrete environments, namely 'cognitive radio' and 'frozen lake', both of which have finite state spaces. The next study on the VQ-DQN algorithm was conducted by Lockwood and Si [31], where the authors used a VQC to solve both continuous and discrete environments. Another study that analyses the learning performance and behavior of VQ-DQN was conducted by Skolik et al. [41]; here, the authors explore the effects of having a VQC as a Q-value approximator along with techniques like data re-uploading and a hybrid quantum-classical model.

Reproduction study
To gauge the learning capability of VQ-DQN, we first reproduce the results published by Lockwood and Si [31] on the CartPole-v1 task (cf. Sec. 2.3). We train five VQ-DQN agents and evaluate their performance during training using the source code 3 published by the authors. The results are visualised in Fig. 1. The blue line indicates episode returns. The red line represents a moving average of the (up to) 20 previous returns. 4 While our measurements reproduce the computational outcome of the published results, we identify two notable methodological aspects that require careful consideration and interpretation: Training frequency: A step of mini-batch gradient descent is carried out only once per episode (namely, after its termination). This differs substantially not only from the original DQN algorithm, but also from the pseudo-code provided by Lockwood and Si [31], where training is executed in regular intervals after a set number of trajectories has been sampled by the agent. We are not aware of other approaches in the literature that pursue or analyse this approach, and conjecture that it might have a detrimental effect on learning, since the distribution of transitions in the replay buffer grows faster than the amount of data that the agent perceives. The adaptation also complicates the comparison between independent runs of the algorithm, depending on the length of the experienced episodes.
Performance evaluation: Measuring agent performance in terms of a moving average over previous runs is not a good indicator of learning success: the averaged returns have been generated by different policies, that is, policies trained on increasing numbers of transitions at different stages of ε-decay. Further, the averaging approach shadows any underlying instabilities indicated by the raw episode returns: in all five runs, the blue line oscillates strongly between low and high return values, indicating that the underlying policy network/circuit fails to converge towards an optimal policy. Note that in complex environments, DQN convergence can be non-monotonic in terms of measured returns (see, e.g., Ref. [32]). Observing oscillations of this magnitude on CartPole (which can be learnt in an approximately monotonic fashion by a simple neural network with DQN, cf. Sec. 7) does not give a promising outlook on VQ-DQN's capability to generalise to more challenging tasks.
Besides, we would like to explicitly point out that the experiment is based on CartPole-v1, where return values of up to 500 can be achieved. In contrast, returns in CartPole-v0 cannot exceed 200, which is important to take into account when judging the closeness to optimality of particular approaches, especially when the visual display of episode return time series uses clipped axes. One other study that overcame these instabilities using a VQ-DQN algorithm to solve the CartPole environment was conducted by Skolik et al. [41].
Here, the authors use slightly different gate connectivity in their VQC compared to Lockwood and Si [31]. Apart from the change in VQC architecture, the authors also perform a gradient descent optimization step after every 30 sampling steps, and they present the total reward attained in each episode averaged over ten different agents rather than a moving average. Skolik et al. [41] have studied and tested various combinations of pure and quantum-classical hybrid VQC architectures in their work; however, the pure VQC model exhibited the same instabilities as Lockwood and Si's model. To overcome these instabilities, Skolik et al. [41] used a hybrid VQC model where the inputs to and outputs from the VQC were multiplied with classical weights, along with the data re-uploading strategy [58]. Data re-uploading is a strategy where the encoding circuit is reintroduced at multiple instances in a VQC; reintroducing the encoding circuit increases the expressivity of the model [59]. Even though the hybrid model exhibited relatively stable learning behavior, the impact of the classical weights on the overall training process is neither isolated nor studied. The results of our reproduction attempt of the work by Skolik et al. are shown in Fig. 2. These reproduction experiments were conducted based on the parameters given in the appendix of Ref. [41]. The measurement results shown in Fig. 2 confirm the published results.

Experiments
Previous implementations of the VQ-DQN approach are either not published [41] or show various methodological issues [31] that we have discussed in detail in the previous section. To (1) obtain a stable and uniform VQ-DQN framework that coincides with classical RL practices, (2) provide a replication of existing results on top of mere reproduction, and (3) be able to conveniently integrate extensions into the approach, we re-implement the original deep Q-learning algorithm as described in [32,33] in TensorFlow [35] (Sec. 5.1 discusses implementation details).
Using this implementation, we run a set of experiments to systematically evaluate the observed instabilities. Throughout all our experiments, we used the CartPole-v0 environment to ensure comparability with [41] and [31], and also to keep computational cost at bay. Sec. 5.2 investigates the effects of the chosen input encoding and Q-value extraction method on performance and stability. Using these insights, we run an extensive cross-validation study described in Sec. 5.3. Additionally, we investigated properties of the VQC parameter space as a potential cause for instabilities; as the experiments conducted based on this speculation did not lead to a justifiable root cause, we focus in this paper only on the experiments on input encoding, Q-value extraction methods, and cross-validation mentioned above. However, we have included a brief discussion in Appendix A for reference.

Methodology
To describe our methodology, let us first set the employed conventions: By sampling steps, we refer to the transitions sampled from the ε-greedy behavior policy. By training step, we understand one iteration of gradient descent. Words in monospaced font indicate configurable parameters of the algorithms.
To ensure comparability between our different experimental setups, and especially between previous research and our dedicated experiments, we choose sampling steps as the fundamental unit of training time. Each experiment is run for 50 000 sampling steps. We deliberately use a long time horizon to capture any phenomena that may materialise late in the learning process due to slow convergence, but retain the possibility to terminate successful runs prematurely, as described in detail below. Initially, the replay memory is pre-filled with train_after=1000 sampling steps, corresponding to at least five full episodes, using a uniform random policy with ε = 1.
A sampling step does not necessarily entail a training step; instead, a training step is carried out every train_every sampling steps. As backpropagation [60] on quantum devices is computationally intensive due to gradients being estimated via the parameter-shift rule [61,62], we introduced this parameter as a means to keep the number of training steps per episode feasible. We note, however, that in this paper, we only report validation results on quantum hardware, while the agent has been trained in simulation. Similarly, we update the target network parameters to equal the policy network parameters every update_every sampling steps. After the initial warm-up phase, we decay ε linearly over epsilon_duration sampling steps in total, starting at a value of epsilon_start=1 and ending at a value of epsilon_end=0.01. Keeping ε > 0 ensures continued exploration with a near-greedy policy.
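The cost of the parameter-shift rule mentioned above is easy to see in a toy example: every gradient component requires two additional circuit evaluations. The sketch below uses a single-parameter toy "circuit" (the cosine expectation of RY applied to |0⟩), not one of the circuits studied in this paper:

```python
import math

def expectation(theta):
    """Toy observable: <Z> after RY(theta) on |0> equals cos(theta)."""
    return math.cos(theta)

def parameter_shift_grad(f, theta, shift=math.pi / 2):
    """Parameter-shift rule for gates generated by Pauli operators:
    df/dtheta = (f(theta + pi/2) - f(theta - pi/2)) / 2.
    Each gradient component costs two extra circuit evaluations, which is
    why limiting the number of training steps matters on real hardware."""
    return (f(theta + shift) - f(theta - shift)) / 2

theta = 0.3
print(round(parameter_shift_grad(expectation, theta), 5))  # -> -0.29552
print(round(-math.sin(theta), 5))  # -> -0.29552 (matches the analytic gradient)
```

Unlike finite differences, the rule is exact for such gates, but the two evaluations per parameter per sample add up quickly for circuits with many trainable rotations.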
Since performance of the ε-greedy policy is not indicative of learnt performance when ε is large [63], we estimate the expected return achieved by the current greedy policy in regular intervals. Specifically, we measure the return over a single episode on a copy of the training environment every validate_every=100 sampling steps (note that this parameter does not influence the actual training process, and is just used for performance monitoring). If the average validation return over the past 25 consecutive validation steps reaches 196 (recall that the maximum return is 200, and that we need to allow for some jitter), we regard the task as solved and terminate training early. While this differs from the official CartPole-v0 benchmark (see https://gym.openai.com/envs/CartPole-v0/), which necessitates a return of at least 195 sustained over 100 episodes, we find that training is very unlikely to diverge past this point, given that ε has decayed sufficiently. 5
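The early-stopping criterion described above amounts to a moving-average check over validation returns; a minimal sketch (parameter names follow the conventions above):

```python
def solved(validation_returns, window=25, threshold=196.0):
    """Early-stopping check: the average of the last `window` validation
    returns must reach `threshold` (the maximum return in CartPole-v0 is 200)."""
    if len(validation_returns) < window:
        return False  # not enough validation episodes yet
    recent = validation_returns[-window:]
    return sum(recent) / window >= threshold

print(solved([200.0] * 24))            # -> False (window not yet filled)
print(solved([190.0] + [200.0] * 24))  # -> True (average 199.6)
```

The window of 25 validation steps tolerates occasional sub-optimal episodes while still requiring sustained near-optimal performance.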

Encoding and Extraction Methods
After experimentally verifying the correctness of our implementation, we replace the Q-network by a VQC using the circuit architectures proposed in Refs. [31,41]. The need for mapping input parameters onto quantum states has already been discussed in Sec. 2.3; we evaluate the encoding schemes introduced there, with scaled encoding applied to finite-domain input components and directional encoding otherwise. Along with the encoding strategies, we also investigate the impact of different Q-value extraction methods on agent performance. This is necessary due to the mismatch between VQC outputs and Q-values. In particular, we distinguish between: (1) Local Scaling: each output is scaled by a dedicated trainable weight as described in Ref. [41]. (2) Global Scaling (GS): all outputs are scaled by a single trainable weight. (3) Global Scaling with Quantum Pooling (GSP): quantum pooling as described in Ref. [31], followed by global scaling.

Initial Experiment
We conducted experiments for each combination of input encoding, Q-value extraction method, and circuit architecture, totalling 18 runs. To this end, we adapted hyperparameters from Ref. [41] to our slightly modified algorithm described in Sec. 5.1 (without data re-uploading). VQC weights are initialised to zero and classical weights to 1 to avoid the vanishing gradient problem, as suggested in Ref. [64]. Results are shown in Fig. 3. As is apparent, instabilities occur in every run and are not tied to a specific encoding/extraction setting. Nevertheless, some models only achieve comparatively low returns on average: in particular, runs involving directional encoding tend to perform subpar, which we attribute to the high information loss incurred by the encoding scheme. Directional encoding is therefore not considered in further experiments.
We additionally provide results on the accompanying website for runs that traverse the maximum number of episodes; none of them shows a difference in convergence behaviour depending on the convergence criterion used. However, for experiments on the IBM Quantum device, a reduced number of episodes is crucial to ensure practical feasibility of the calculations.
To minimize the number of classical parameters, we focus on global scaling (with and without pooling) in further experiments. While local scaling did not perform worse or less stably, the additional classical parameters increase model capacity, and might therefore shadow deficiencies in the quantum parts.

Cross-Validation
As instabilities persist throughout our experiments, we turn to hyperparameters as a potential source of instabilities. To this end, we re-utilize the above setting (C, SC/GS, GSP) with hyperparameters from Ref. [41] as a starting point. Following recommendations [65,66,67,68] from classical supervised learning, we add a linear decay to the learning rate η. In particular, we decrease η over a period of eta_duration training steps from eta_start towards a target value of eta_end=0.01*eta_start. Additionally, we progressively increase the update_every parameter as learning progresses. This choice is motivated by the observation that the delta between target and policy network decreases as the agent becomes more proficient at the task. Finally, to optimize resource utilization and minimize training time, we increase the batch size from 16 to 32, since this does not have a major impact on the agent's performance [69].
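The annealing schedules above (for ε and for η) share the same linear form; a sketch, where the duration value of 10,000 is a placeholder rather than a hyperparameter from our experiments:

```python
def linear_decay(step, start, end, duration):
    """Linearly anneal a value from `start` to `end` over `duration` steps,
    then hold it at `end`; used for both epsilon and the learning rate."""
    frac = min(step / duration, 1.0)
    return start + frac * (end - start)

eta_start = 1e-3
eta_end = 0.01 * eta_start  # eta_end = 0.01 * eta_start, as in the text
print(round(linear_decay(0, eta_start, eta_end, 10_000), 8))       # -> 0.001
print(round(linear_decay(5_000, eta_start, eta_end, 10_000), 8))   # -> 0.000505
print(round(linear_decay(20_000, eta_start, eta_end, 10_000), 8))  # -> 1e-05
```

Clamping `frac` at 1.0 ensures that runs lasting longer than the decay period keep training at the final value rather than overshooting it.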

Baseline
Results for the baseline case are depicted in Fig. 4 and Tab. 2. We only present a selection of the best-performing hyperparameter constellations due to space constraints, but provide the full set of results on the accompanying website. As evident from the figure, almost every model was able to achieve stable optimal performance (according to our early-stopping criterion). Generally, the SC encoding tends to converge faster than models with continuous (C) encoding.

Baseline with data re-uploading

From Fig. 4, it is evident that the performance of the VQ-DQN algorithm also suffers from the choice of encoding strategy in combination with a bad choice of hyperparameters. For example, the agent with the continuous encoding format does not learn an optimal policy in many cases. To increase the expressivity of the model, we can use techniques such as data re-uploading [41,58]. The results for the baseline case with data re-uploading are depicted in Fig. 5 and Tab. 2. As in Sec. 5.3.1, we only present a selection of the best-performing hyperparameter constellations due to space constraints. From the results shown in Fig. 5, we can conclude that the data re-uploading strategy does not significantly increase the VQ-DQN algorithm's performance. Though it increases the expressive power of the model, which in turn allows the agent to learn optimal behavior in some cases (for example, the agent with Continuous (C) encoding), the performance change is negligible or even negative in most cases. Moreover, the data re-uploading strategy increases the gate count in the VQC architecture, and this increase in gate count is not ideal for NISQ devices due to noise.

Validation on IBM Quantum Device
Results from Sec. 5.3.1 and Sec. 5.3.2 illustrate that a VQC can learn a stable policy to solve the CartPole-v0 environment using the DQN algorithm if the right set of hyperparameters is used. In order to gauge the detrimental influence of device noise on an agent trained using an ideal simulator, we tested the trained model on an actual IBM Quantum device [71]. As a first step, we had to port the VQ-DQN algorithm from the tensorflow+tensorflow-quantum API [35,72] to the pytorch+qiskit API [36,73], as the IBM Quantum devices use the qiskit API [36] as their primary programming library. There is one significant difference between the qiskit API [36] and the tensorflow-quantum API [72] to be noted here: the tensorflow-quantum API [72] calculates the expectation value analytically, whereas the qiskit API [36] estimates the expectation value by simulating the ideal quantum device and measuring its outcomes. Likewise, the expectation values are estimated on the IBM Quantum device [71] by measuring the outcome multiple times. Further, we trained the best-performing model without data re-uploading from Sec. 5.3.1 using the qiskit qasm_simulator [36] and verified the correctness of our implementation in comparison to the results from Sec. 5.3.1. We chose a model without data re-uploading because currently available quantum devices are prone to noise; hence, adding more gates via data re-uploading on NISQ devices seems counter-productive. Once the correctness was verified, we uploaded the weights trained using the qasm_simulator to the IBM Quantum (ibmq_ehningen) device and validated the learned policy. The results of these validation runs are shown in Fig. 6.
Though the agents trained on the ideal simulator learned an optimal policy to solve the CartPole-v0 environment, testing the trained agent on the ibmq_ehningen device did not reproduce the optimal behavior. This degradation in behavior is due to the noise present in the IBM Quantum device. An agent trained on the IBM Quantum device from scratch might reduce the effect of noise and learn a policy close to the optimal policy. Additionally, different types of error mitigation techniques can be employed to reduce the effects of noise at the cost of additional overhead. However, when we attempted to train the agent from scratch on the IBM Quantum device, the training turned out to be infeasible due to the following practical issues: (1) We observed waiting times in the queue to start a job execution (referred to as the fair-share queue for jobs in IBM Quantum systems) on the cloud-based IBM Quantum device that were typically two orders of magnitude (or more) larger than the actual job execution time. As (roughly speaking) a single action selection corresponds to a single job in the fair-share queue, even completion of a single episode takes a substantial amount of time. (2) The overall time it takes to achieve low-variance estimators of expectation values can become quite large due to the large number of shots (i.e., measurement samples) taken for a single circuit instance.
Here, the first hindrance can be overcome in time, as the availability of quantum devices and resources is expected to increase in the near future. As improvements in hardware and in the orchestration of quantum and classical computational resources progress, we might also witness an increased number of circuit layer operations per second (CLOPS) [74]. When we started the training process on the ibmq_ehningen device, the job execution time for each action selection took between 15 and 30 seconds, and each training step took around 3 minutes (as the training step performs gradient descent via the parameter-shift rule). These long execution and waiting times make the training process on real quantum devices impractical for algorithms like VQ-DQN, where the agent has to interact with the environment sequentially.

Comparison to classical Neural Network
A popular "quantum advantage" claimed by a good fraction of the QRL literature is that the VQC has a better state-action pair representation, samples efficiently, and learns an optimal policy faster than a classical neural network [30,31,41]. Hence, to compare the sample efficiency of a VQ-DQN agent trained on an ideal simulator against a classical neural network, we trained a simple fully-connected network with one hidden layer to solve the CartPole-v0 environment. To ensure a fair comparison, we restricted the total number of parameters of the network to 58, and performed cross-validation on the same set of hyperparameters as explained in Sec. 5.3. The results shown in Fig. 7 indicate that, initially, the VQC seems to learn faster than the neural network. For a more rigorous discussion, we resort to Ref. [75], where the sample efficiency of an algorithm is defined for an online learning setting as the number of time steps from which on an agent trained by the algorithm perceives an average reward exceeding a certain threshold V_thresh with high probability.
For a weaker statement adapted to a numerical treatment, we propose to use significance testing under the null hypothesis that the mean reward is smaller than V_thresh. Thus, we define sample efficiency as the number of time steps from which on the null hypothesis is rejected with respect to the given threshold. As the statistical test we propose a one-sample t-test [76,77], in particular its one-sided version, as we compare the performance of a particular algorithm against a given threshold. Thus, we perform sufficiently many independent runs of each algorithm and fix the significance level at α = 0.05.
With respect to this metric, the variational quantum circuit indeed crosses V_thresh = 120 faster than the classical network; for larger threshold values, however, no definite statement can be made.
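The proposed significance test can be sketched as follows. The reward-array layout and the stop-at-first-rejection simplification are illustrative assumptions (the definition above asks for rejection from that time step onward); the one-sided alternative in SciPy's one-sample t-test tests against H0: mean reward ≤ V_thresh:

```python
import numpy as np
from scipy import stats

def sample_efficiency(rewards, v_thresh, alpha=0.05):
    """First time step at which the one-sided one-sample t-test rejects
    H0: mean reward <= v_thresh at significance level alpha.

    rewards: array of shape (n_runs, n_steps), one row per independent run.
    Returns None if the threshold is never crossed with significance.
    """
    n_steps = rewards.shape[1]
    for t in range(n_steps):
        # Test the per-run rewards observed at time step t against the threshold.
        _, p_value = stats.ttest_1samp(rewards[:, t], v_thresh,
                                       alternative="greater")
        if p_value < alpha:
            return t
    return None
```

Applied to, e.g., 20 independent runs whose mean reward jumps from well below to well above the threshold, the function returns the index of the first post-jump time step.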

Conclusion
We have systematically studied the performance of quantum-assisted reinforcement learning schemes on both simulators and physical quantum computers. We find, not quite unexpectedly, that at the current, early state of technological development, quantum computers do not bring any measurable advantage in this scenario. We even find that simulated quantum systems do not bring clear advantages over classical approaches.
Nonetheless, a number of constructive insights can be drawn from our experiments. Following previous work, we have trained models on classical simulators and only performed the execution step on quantum hardware. This approach, albeit practically necessitated by current-day hardware, creates a mismatch in terms of handling noise: for future work, we recommend including noise in the training process, especially since Ref. [78] suggests for small-scale systems that existing noise models lead to a good match between simulation and hardware, and therefore provide a more faithful basis for comparing algorithmic performance on simulated and physical hardware.
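One lightweight way to obtain such a matched noise model is to fit a tunable noise parameter so that simulated expectation values reproduce hardware measurements. The sketch below does this for a single-qubit depolarizing channel; the circuit, the channel, and the synthetic "hardware" data are illustrative assumptions, not the setup of Ref. [78]:

```python
import numpy as np

def ideal_expectation(theta):
    # Ideal single-qubit cost: <Z> after RY(theta) applied to |0>.
    return np.cos(theta)

def depolarized(theta, p):
    # A depolarizing channel of strength p shrinks <Z> linearly towards 0:
    # <Z>_noisy = (1 - p) * <Z>_ideal.
    return (1.0 - p) * ideal_expectation(theta)

def fit_noise_strength(thetas, hardware_vals):
    # Least-squares fit of the single tunable parameter p so that the
    # simulated noisy expectations best match the hardware measurements.
    ideal = ideal_expectation(thetas)
    scale = np.dot(ideal, hardware_vals) / np.dot(ideal, ideal)
    return float(np.clip(1.0 - scale, 0.0, 1.0))
```

Training against the simulator configured with the fitted strength then exposes the learner to (an approximation of) the noise it will face at execution time, rather than deferring the mismatch to deployment.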
Most importantly, our results do not corroborate observations made when reinforcement learning on quantum computers was first introduced into the literature in Ref. [30]: while the authors of that approach upload weights determined by classical training onto a quantum machine, as we do in this paper, they find that executing the model does not vary much between simulation and NISQ machine. We, on the contrary, observe a total mismatch in performance. We consider the most probable explanations for this discrepancy to lie in the size of the machine (five versus 27 qubits) and in the choice of problem (cognitive radio versus CartPole; a random policy, as would be caused by growing amounts of noise from NISQ devices, is obviously better suited to the former than the latter).
We encounter additional hindrances to the practical application of quantum computers: queue waiting times in a shared, cloud-like environment are a major practical issue, which will, however, be alleviated by the broader availability of quantum chips. Nonetheless, the temporal contributions of sequential elements of the algorithms to the overall computation time would also occur in a non-shared setting and substantially increase wall-clock run times, which is an obvious impediment to practical utility.
As long as noise and imperfections are unavoidable, we find that adapting algorithms and approaches to account for these issues is a major design challenge for quantum algorithms. One possible approach would be to equip simulated QPU designs with appropriate, yet tunable and physically realistic noise behaviour. By seeking optimal models and parameters under these unavoidable constraints, an "ideal" noise model can be identified, and future QPUs can then be built with design trade-offs such that the resulting hardware closely mimics the identified noise and imperfection behaviour. In other words, we hypothesise that in the space of hardware design decisions, and assuming that hardware imperfections impact different computations in different ways, this opens a degree of freedom that can be leveraged to design algorithm-specific hardware.
Funding: This work was supported by the German Federal Ministry of Education and Research (BMBF), funding program "quantum technologies -from basic research to market", grant number 13N15645.