Introduction

Executing algorithms on future quantum information processing devices will rely on the ability to continuously monitor the device’s state via quantum measurements and to act back on it, on timescales much shorter than the coherence time, conditioned on prior observations1,2. Such real-time feedback control of quantum systems, which offers applications, e.g., in qubit initialization3,4,5,6, gate teleportation7,8 and quantum error correction9,10,11, typically relies on an accurate model of the underlying system dynamics. With the increasing number of constituent elements in quantum processors, such accurate models are in many cases not available. In other cases, obtaining an accurate model requires significant theoretical and experimental effort. Model-free reinforcement learning12 promises to overcome such limitations by learning feedback-control strategies without prior knowledge of the quantum system.

Reinforcement learning has had success in tasks ranging from board games13 to robotics14, but it has only recently begun to be applied to complex physical systems, with training performed either on simulations15,16,17,18,19,20,21 or directly in experiments22,23,24,25,26,27,28,29, for example in laser22,25,29, particle23,24, soft-matter26 and quantum physics27,28. Specifically in the quantum domain, a number of theoretical works over the past few years have pointed out the great promise of reinforcement learning for tasks covering state preparation30,31,32,33,34, gate design35, error correction36,37,38, and circuit optimization/compilation39,40, making it an important part of the machine learning toolbox for quantum technologies41,42,43. In first experimental applications to quantum systems, reinforcement learning was deployed with training mostly performed on simulations, specifically to optimize pulse sequences for the quantum control of atoms and spins17,18,21. Beyond that, two pioneering works demonstrated training directly on experiments27,28, which was used to optimize pulses for quantum gates27 and to accelerate the tune-up of quantum dot devices28. However, none of these experiments17,18,21,27,28 featured real-time quantum feedback, which is crucial for applications such as fault-tolerant quantum computing44; realizing it using deep reinforcement learning in an experiment has remained an important open challenge. Very recently, a step in this direction was made in ref. 45, which demonstrates the use of reinforcement learning for quantum error correction. In contrast to what we present in this paper, those experiments45 relied on searching for the optimal parameters of a controller with fixed structure.

Here, we realize a reinforcement learning agent that interacts with a quantum system on a sub-microsecond timescale. This rapid response time enables the use of the agent for real-time quantum feedback control. We implement the agent using a low-latency neural network architecture, which processes data concurrently with its acquisition, on a field-programmable gate array (FPGA). As a proof of concept, we train the agent using model-free reinforcement learning to initialize a superconducting qubit into its ground state without relying on a prior model of the quantum system. The training is performed directly on the experiment, i.e., by acquiring experimental data with updated neural network parameters in every training step. In repeated cycles, the trained agent acquires measurement data, processes it, and applies pre-calibrated pulses to the qubit conditioned on the measurement outcome, until it terminates the initialization process. We study the performance of the agent during training and demonstrate convergence in less than three minutes of wall clock time, after training on less than 30,000 episodes. Furthermore, we explore the strategies of the agent in more complex scenarios, i.e., when performing weak measurements or when resetting a qutrit.

Results

Reinforcement learning for a qubit

In model-free reinforcement learning, an agent interacts with the world around it, the so-called reinforcement learning environment (Fig. 1). In repeated cycles, the agent receives observations s from the environment and selects actions a according to its policy π and the respective observation s. In the important class of policy-gradient methods12, this policy is realized as a conditional probability distribution πθ(a∣s), which can be modeled as a neural network with parameters θ. To each sequence of observation-action pairs, called an episode, one assigns a cumulative reward R. The goal of reinforcement learning is to maximize the reward \(\bar{R}\) averaged over multiple episodes, by updating the parameters θ, e.g., via gradient ascent \(\Delta\boldsymbol{\theta} \sim \nabla_{\boldsymbol{\theta}}\bar{R}\)12. Such a policy-gradient procedure is able to discover an optimal policy even without access to an explicit model of the dynamics of the reinforcement learning environment.
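As a concrete illustration, a minimal REINFORCE-style policy-gradient step for a discrete-action policy could be sketched as follows; the network size, optimizer settings and three-action layout are illustrative assumptions rather than the implementation used in this work.

```python
# Minimal REINFORCE-style policy-gradient step (illustrative sketch only;
# network size, learning rate and action set are assumed, not taken from the paper).
import torch
import torch.nn as nn

policy = nn.Sequential(              # models pi_theta(a|s)
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 3),                # logits for three discrete actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(observations, actions, rewards):
    """One gradient-ascent step on the batch-averaged reward R-bar."""
    logits = policy(observations)                                   # shape (batch, 3)
    log_prob = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # REINFORCE estimator: grad R-bar ~ E[ R * grad log pi_theta(a|s) ]
    loss = -(rewards * log_prob).mean()                             # minimize -R-bar
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```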

Fig. 1: Concept of the experiment.

A reinforcement learning (RL) agent, realized as a neural network (NN, red) on a field-programmable gate array (FPGA), receives observations s (Observ., blue trace) from a quantum system, which constitutes the reinforcement learning environment. Here, the quantum system is realized as a transmon qubit coupled to a readout resonator fabricated on a chip (see photograph). The agent processes observations on sub-microsecond timescales to decide in real time on the next action a applied to the quantum system. The update of the agent’s parameters is performed by processing experimentally obtained batches of observations and actions on a PC.

In the present work, we use reinforcement learning to learn strategies for real-time control of quantum systems. Here, observations are obtained via quantum measurements, actions are realized as unitary gate operations, and the reward is measured in terms of the speed and fidelity of initializing the quantum system into a target state, see schematic in Fig. 1. In our experiment, the quantum system is realized as a transmon qubit with ground state \(\left|g\right\rangle\), excited state \(\left|e\right\rangle\), and second excited state \(|f\rangle\), dispersively coupled to a superconducting resonator46 (see Supplementary Note 1 for details). We probe the qubit with a microwave field, which scatters off the resonator and is amplified and digitized to result in an observation vector s = (I, Q), where I and Q are time traces of the two quadrature components of the digitized signal47,48,49 (see Supplementary Note 2 for details and Supplementary Note 3 for averaged time traces). Depending on s, the agent selects, according to its policy π, one of several discrete actions in real time. In the simplest case, it either idles until the next measurement cycle, performs a bit-flip as a unitary swap between \(\left|g\right\rangle\) and \(\left|e\right\rangle\), or terminates the initialization process.

To train the agent, we transfer batches of episodes to a personal computer (PC) serving as a reinforcement learning trainer. The reinforcement learning trainer computes the associated reward for each episode and updates the agent’s policy accordingly (see Supplementary Note 4 for details), before returning the updated network parameters θ to the FPGA.
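A minimal sketch of this training loop is given below; the FPGA interface calls, the episode format and the reward helper are hypothetical placeholders, not the actual software used in the experiment, and the gradient update is assumed to follow the policy-gradient step sketched above.

```python
# Hypothetical outline of the PC-side reinforcement learning trainer.
# fpga.collect_batch, fpga.upload_weights, compute_reward and the episode
# attributes are placeholder names introduced for illustration only.
import torch

def train(fpga, n_updates=100):
    for _ in range(n_updates):
        episodes = fpga.collect_batch()                        # recorded with the current weights
        observations = torch.stack([ep.observations for ep in episodes])
        actions = torch.tensor([ep.action for ep in episodes])
        rewards = torch.tensor([compute_reward(ep) for ep in episodes])
        policy_gradient_step(observations, actions, rewards)   # update as sketched above
        fpga.upload_weights(policy.state_dict())               # return updated parameters theta to the FPGA
```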

Implementation of the real-time agent

We implement this scheme in an experimental setup, in which the agent, for each episode, can perform multiple measurement cycles j, in each of which it receives a qubit-state-dependent observation sj and selects an action aj, until it terminates the episode, see Fig. 2a. If the agent selects the bit-flip action, a π-pulse is applied to the qubit after a total latency of τEL,tot = 451 ns, dominated by analog-to-digital and digital-to-analog converter delays. The agent’s neural network contributes only τNN = 48 ns to the total latency as it is evaluated mostly during qubit readout and signal propagation (see Supplementary Note 2 for a detailed discussion of the latency). To provide the agent with a memory of past cycles, we feed downsampled observations (sj−1, …, sj−l) and actions (aj−1, …, aj−l) from up to l = 2 previous cycles into the neural network. To characterize the performance of the agent, we perform a verification measurement sver after termination.
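A sketch of how such a network input could be assembled from the current trace and the memory of previous cycles is shown below; the downsampling factor, the one-hot action encoding and the array layout are illustrative assumptions.

```python
# Illustrative assembly of the network input for cycle j from the current trace
# and a memory of up to l previous cycles; encoding choices are assumptions.
import numpy as np

def build_network_input(traces, actions, j, l=2, n_down=8):
    """traces[j]: array of shape (2, T) holding (I, Q); actions[j]: integer action index.
    Assumes T is divisible by n_down."""
    parts = [traces[j].ravel()]                                   # full current observation s_j
    for k in range(1, l + 1):
        if j - k >= 0:
            iq = traces[j - k]
            down = iq.reshape(2, n_down, -1).mean(axis=2).ravel() # downsampled s_{j-k}
            act = np.eye(3)[actions[j - k]]                       # one-hot a_{j-k}
        else:
            down, act = np.zeros(2 * n_down), np.zeros(3)         # pad before the first cycle
        parts.extend([down, act])
    return np.concatenate(parts)
```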

Fig. 2: Schematic of neural-network-based real-time feedback control.

a Timing diagram of an experimentally realized reinforcement learning episode. In each cycle j, the observation sj resulting from a measurement (Meas., blue) is continuously fed into a neural network (NN, red) which determines the next action aj (green). The process is terminated after a number of cycles determined by the agent. Then, a verification measurement is performed. b Schematic of the neural network implemented on an FPGA. The neural network consists of fully connected (red lines) layers of feed-forward neurons (red dots) and input neurons (blue dots for observations, green dots for actions). The first layers form the preprocessing network (yellow background). During the evaluation of the low-latency network (blue background), new data points from the signal trace sj are fed into the network as they become available. The network outputs the action probabilities for the three actions. Only the execution of the last layer (red background) contributes to the overall latency.

Any neural-network agent used for real-time system control greatly benefits from short latencies in the signal processing. For our FPGA implementation, we therefore introduce a network architecture in which new measurement data is processed as soon as it becomes available, thereby keeping latencies at a minimum. More specifically, we sequentially feed elements \(I_k^{j}, Q_k^{j}\) of the digitized time trace sj = (Ij, Qj) into each layer of the neural network concurrently with its evaluation, see Fig. 2b and Supplementary Note 5 for details. We have also explored the use of the same type of neural network for quantum state discrimination in a supervised-learning setting50,51,52 (see Supplementary Note 3).
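The following sketch illustrates the streaming idea behind this architecture: each hidden layer consumes the previous layer’s output together with the samples that have arrived in the meantime, so that only the final layer must be evaluated after the trace is complete. Layer shapes and the ReLU/softmax choices are illustrative assumptions and do not reproduce the exact FPGA design of Supplementary Note 5.

```python
# Conceptual sketch of the low-latency streaming evaluation (Fig. 2b); shapes,
# activation and output normalization are assumptions, not the exact FPGA design.
import numpy as np

def streaming_forward(trace_chunks, hidden_weights, output_weights):
    """trace_chunks[i]: the (I, Q) samples that become available while layer i is evaluated.
    Each weight matrix must have as many columns as the previous activation plus its chunk."""
    activation = np.empty(0)
    for w, chunk in zip(hidden_weights, trace_chunks):
        x = np.concatenate([activation, chunk])   # previous activations + newest samples
        activation = np.maximum(0.0, w @ x)       # this layer runs concurrently with acquisition
    logits = output_weights @ activation          # only this final step adds post-measurement latency
    return np.exp(logits) / np.exp(logits).sum()  # probabilities for idle / flip / terminate
```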

Training with experimental data

We train the agent based on experimentally acquired episodes to maximize the cumulative reward R = Vver/ΔV − nλ (see Supplementary Note 4 for details). Here, Vver is the integrated observation in the final verification measurement, Vver = ws ⋅ sver, with weights ws chosen to maintain the maximal signal-to-noise ratio under Gaussian noise49,50,53,54. Therefore, Vver/ΔV is a good indicator for the ground-state population, with a normalization factor \(\Delta V = \mathbf{w}_{\mathbf{s}} \cdot (\langle \mathbf{s}_g \rangle - \langle \mathbf{s}_e \rangle)\) setting the scale. The second term penalizes each cycle by a constant λ. For larger λ, trajectories requiring more cycles until termination achieve a lower reward. Consequently, the strategy maximizing the average reward 〈R〉 for larger λ results in shorter trajectories, i.e., a lower average number of cycles 〈n〉, at the cost of a larger initialization error 1 − Pg. Thus, λ controls the trade-off between short episode length and high initialization fidelity. We note that for training and applying the agent, we do not require the explicit functional forms of 〈n〉(λ) and (1 − Pg)(λ), which in general depend on the properties of the quantum system.
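A minimal sketch of this reward evaluation, as it could be carried out by the trainer on the PC, is shown below; the matched-filter form of the weights follows the description above, while the variable names are assumptions.

```python
# Illustrative per-episode reward R = V_ver / Delta_V - n * lambda; the weights
# w_s ~ <s_g> - <s_e> follow the text, variable names are assumed.
import numpy as np

def episode_reward(s_ver, s_g_mean, s_e_mean, n_cycles, lam):
    w_s = s_g_mean - s_e_mean                      # weights maximizing the SNR under Gaussian noise
    v_ver = np.dot(w_s, s_ver)                     # integrated verification measurement V_ver
    delta_v = np.dot(w_s, s_g_mean - s_e_mean)     # normalization Delta_V = w_s . (<s_g> - <s_e>)
    return v_ver / delta_v - n_cycles * lam        # lambda penalizes each additional cycle
```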

We first train the agent to initialize the qubit using fast, high-fidelity readout. In this regime, an initialization strategy based on weighted integration and thresholding is close to optimal, and we can thus easily verify and benchmark the strategies discovered by the agent. To study the agent’s learning process, we monitor the average number of cycles 〈n〉 until termination and the initialization error 1 − Pg, inferred from a fit to the measured distribution of Vver (see Supplementary Note 2 for details), see Fig. 3a, b. The agent learns how to initialize the qubit for both prepared initial states, starting from either the equilibrium state (red) or its counterpart with populations inverted by a π-pulse (dark blue). The initialization error 1 − Pg converges to about 0.2% after training on only about 30,000 episodes, which includes 100 parameter updates by the reinforcement learning trainer on the PC. The training process takes only three minutes of wall clock time. This relatively short training duration, limited mainly by data transfer between the PC and the FPGA, enables frequent readjustment of the neural network parameters and thus allows the agent to account for drifts in experimental parameters.

Fig. 3: Experimental data for reinforcement learning with a network-based real-time agent.

a Initialization error 1 − Pg and b average number of cycles 〈n〉 until termination vs. number of training episodes NTrain, when preparing an equilibrium state (red squares) and when inverting the population with a π-pulse (dark blue circles) for three independent training runs (solid and transparent points). Each datapoint is obtained from an independent validation data set with ~180,000 episodes. c Probability of choosing an action P(a) vs. the integrated measurement signal V. Actions chosen by the threshold-based strategy are shown as background colors (also for (e)). d Initialization error 1 − Pg vs. average number of cycles 〈n〉 until termination for an equilibrium state for the reinforcement learning agent (red circles) and the threshold-based strategy (black crosses). Stars indicate the strategies used for the experiments in (c) and (e). The dot-dashed black line indicates the thermal equilibrium (thermal eq.). Error bars indicate the standard deviation of the fitted initialization error 1 − Pg. e Histogram of the integrated measurement signal V for the initial equilibrium state (blue circles), for the verification measurement (red triangles) and for the measurement in which the agent terminates (green diamonds). Lines are bimodal Gaussian fits, from which we extract ground state populations as shown in the inset. The dashed black line in (d) and (e) indicates the rethermalization (retherm.) limit (see main text).

Policy for strong measurements

After the training has been completed, we visualize the agent’s strategy by plotting the action probabilities P(a) vs. V, see Fig. 3c. We compare this strategy to a thresholding strategy, in which the action is chosen based on the value of V only. We observe that the agent follows this simple strategy in regimes of high certainty. In between, the transitions of the individual probabilities are smooth. This is not due to some deliberate randomization of action choices, but rather a sign that the agent’s policy depends on additional information beyond the integrated signal V shown here, as the agent has access to the full measured time trace.

To evaluate the agent’s performance, we analyze the trade-off between the initialization error 1 − Pg and the average cycle number 〈n〉 as a function of the control parameter λ, see Supplementary Note 4 for details. As expected, we find that an increase in 〈n〉, controlled by lowering λ, results in a gain in initialization fidelity until 1 − Pg converges to about 0.18% (for 〈n〉 ≥ 1.1 cycles), about a tenfold reduction compared to the equilibrium state, see Fig. 3d, e. We attribute the remaining infidelity mostly to rethermalization of the qubit between the termination and the verification cycle, and, possibly, to state mixing during the final verification readout. In our experiment, the rethermalization rate is Neq/T1 ≈ 1 kHz with Neq = 1.4%, contributing ~0.07% to the infidelity. As anticipated, the agent’s performance matches that of simple, close-to-optimal thresholding strategies, in which we vary the acceptance threshold to control the average cycle number 〈n〉 (black crosses). This indicates that the strategies discovered by the agent are also close to optimal. In addition, we note that state transitions during readout are rare, because the measurement time τ is significantly shorter than the relaxation time T1. Therefore, the ability of the neural network to detect state transitions from the readout time trace does not result in a significant change in performance in the presented experiments. We have also studied the ability of the neural network to distinguish different quantum states as a function of the measurement time τ (see Supplementary Note 3), for which we observe pronounced improvements in performance with increasing τ.
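For reference, the threshold-based baseline used in this comparison can be summarized as in the following sketch; the sign convention (V close to 1 for the ground state and close to 0 for the excited state) follows the normalization introduced above, and the threshold values are placeholders that set the average cycle number 〈n〉.

```python
# Sketch of a threshold-based baseline policy acting on the integrated signal V only.
# Threshold values are placeholders; the convention V ~ 1 for |g> and V ~ 0 for |e>
# is implied by the normalization Delta_V above.
def threshold_policy(v, accept_above=0.85, flip_below=0.5):
    if v >= accept_above:
        return "terminate"    # confidently in the ground state: accept and stop
    if v <= flip_below:
        return "flip"         # likely excited: apply a pi-pulse, then measure again
    return "idle"             # ambiguous outcome: measure again next cycle
```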

Weak measurements and qutrit readout

The observations up to this point demonstrate that our real-time agent performs well and trains reliably on experimentally obtained rewards. Next, we discuss regimes in which good initialization strategies are more complex. As a first example, we investigate the agent’s strategy and performance when only weakly measuring the qubit. We reduce the power of the readout tone, while keeping its duration and frequency unchanged, such that the bimodal Gaussian distributions of a prepared ground and excited state overlap by 25% (see Supplementary Note 2). In this case, we find that the agent benefits from memory if it is permitted to access information from l previous cycles, see Fig. 4a. Whenever the current measurement hints at the same state as the previous measurement (upper right and lower left in each panel), the agent gains certainty about the state and thus becomes more likely to terminate the process (green region in the lower left corner) or to swap the \(\left|g\right\rangle\) and \(\left|e\right\rangle\) states (blue region in the upper right corner). As for strong measurements, we find a trade-off between 〈n〉 and 1 − Pg when varying λ, see Fig. 4b. Importantly, we observe that agents making use of memory (l = 2, red circles) require fewer rounds 〈n〉 to reach a given initialization error than agents without memory (l = 0, green triangles) or a thresholding strategy (black crosses). In addition, we note that the agent without memory (l = 0) needs slightly more rounds than the thresholding strategy to reach a given initialization error, although both methods have approximately the same amount of information available. We have not investigated this effect in detail, but one possible explanation is that decay and rethermalization rates varied during the several days over which the data were acquired.

Fig. 4: Reinforcement learning results for weak measurements and three-level systems.

a Probability P(a) of selecting the action indicated in the top left corner vs. the integrated signal Vt of the current cycle and Vt−1 of the previous cycle, when the agent is permitted to access information from l = 2 previous cycles. The radii of the black circles indicate the standard deviation around the means (black dots) of the fitted bimodal Gaussian distribution. Black lines are the state discrimination thresholds (normalized to 0, see Supplementary Note 2). P(a) is shown for each bin with at least a single count. Empty bins are colored white (also for (d)). b Initialization error 1 − Pg vs. 〈n〉 for weak measurements for an initially mixed state for l = 2 (red circles) and l = 0 (green triangles) of the neural network (NN) and for a thresholding strategy (black crosses). c Probability P(a) of choosing the indicated action vs. V and W. Black circles indicate the standard deviation ellipse around the means (black dots) of the fitted tri-modal Gaussian distribution. Black lines are state discrimination thresholds (see Supplementary Note 2). d Initialization error 1 − Pg for a completely mixed qutrit state vs. 〈n〉 when the agent can select to idle, ge-flip and terminate (red circles), and when the agent can in addition perform a gf-flip (blue triangles). The dashed black line in (b) and (d) indicates the rethermalization (retherm.) limit (see main text), the solid black line indicates the thermal equilibrium (thermal eq.). Error bars indicate the standard deviation of the fitted initialization error 1 − Pg.

In addition, we have studied the performance of the agent when also considering the second excited state \(|f\rangle\), which we have neglected so far. The \(|f\rangle\) state is populated with a certain probability due to undesired leakage out of the computational states \(\left|g\right\rangle\) and \(\left|e\right\rangle\) during single-qubit, two-qubit and readout operations54. Thus, schemes which also reset \(|f\rangle\) into \(\left|g\right\rangle\) are required. For this purpose, we enable the agent to also swap the \(|f\rangle\) and \(\left|g\right\rangle\) states by adding a fourth action, and we train the agent on a qutrit mixed state with one-third population in each of \(\left|g\right\rangle\), \(\left|e\right\rangle\) and \(|f\rangle\). For this qutrit system, state assignment typically relies on two different projections of the measurement trace, V = wV ⋅ sver and W = wW ⋅ sver, where wV and wW form an orthonormal set of weights. Here, we use V and W to visualize the agent’s strategy. Whenever the measurements firmly indicate that the qutrit is in a given state, the agent proceeds with the corresponding action, while the agent’s policy is more complex and harder to predict when measurements fall in between such clear outcomes, see Fig. 4c.
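One way to construct such an orthonormal pair of weight vectors from the mean qutrit traces is sketched below; the Gram-Schmidt construction is an illustrative choice and not necessarily the one used in Supplementary Note 2.

```python
# Illustrative construction of two orthonormal projections (V, W) of a readout
# trace for qutrit state assignment; the Gram-Schmidt choice is an assumption.
import numpy as np

def qutrit_projections(s_ver, s_g_mean, s_e_mean, s_f_mean):
    w_v = s_g_mean - s_e_mean
    w_v = w_v / np.linalg.norm(w_v)                  # first weight vector w_V
    w_w = s_g_mean - s_f_mean
    w_w = w_w - np.dot(w_w, w_v) * w_v               # orthogonalize against w_V
    w_w = w_w / np.linalg.norm(w_w)                  # second weight vector w_W
    return np.dot(w_v, s_ver), np.dot(w_w, s_ver)    # the projections V and W
```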

We find that an agent that can swap \(|f\rangle\) to \(\left|g\right\rangle\), in addition to the other actions, efficiently resets the transmon from a qutrit mixed state with an initialization error 1 − Pg ≈ 0.2% for an average number of cycles 〈n〉 ≈ 2 (blue triangles), see Fig. 4d. In contrast, an agent which cannot access the gf-flip action needs significantly more rounds until termination to reach a similar initialization error, as it needs to rely on decay from the \(|f\rangle\) level, which in our setup has a lifetime of \({T}_{1}^{(f)}=6\) μs. For the agent that cannot access the gf-flip action, we also observe a sudden increase in 〈n〉 from 2.2 to 3.4 when decreasing λ from 0.22 to 0.10. For λ > 0.1, the agent only resets the \(\left|e\right\rangle\) level, as the loss in R associated with the additionally required cycles would be larger than the gain associated with the increase in initialization fidelity from resetting the \(|f\rangle\) level.

These examples demonstrate the versatility of the reinforcement learning approach to discovering state initialization strategies under a variety of circumstances.

Discussion

In conclusion, we have implemented a real-time neural-network agent with a sub-microsecond latency enabled by a network design which accepts data concurrently with its evaluation. The need for such optimized real-time control will increase due to the ever more stringent requirements on the fidelities of quantum processes as quantum devices grow in size and complexity. We have successfully trained the agent using reinforcement learning in a quantum experiment and demonstrated its ability to adapt its strategy in different scenarios, including those for which memory is beneficial. Our experiments are an example of reinforcement learning of real-time feedback control on a quantum platform.

While our experiments focused on the initialization of a single qubit into its ground state, a range of other conceivable real-time quantum feedback tasks operating on a single qubit are straightforward extensions of the demonstrated protocol. Initialization into an arbitrary superposition state can be achieved by applying a suitable final unitary operation after qubit initialization. Alternatively, one can perform all measurements in a suitably rotated basis in which the target state is one of the measurement basis states. The weak measurement scenario which we explored could be extended as well by measuring in different bases, slowly steering a quantum state towards the desired target without immediate projection.

There are a number of other possible scenarios for real-time quantum feedback control on a single qubit which are less directly related to what we have demonstrated in this work. For example, in the qutrit scenario, one may realize a measurement which does not distinguish between two of the three qutrit states. Such a measurement would enable the detection of decay processes out of the corresponding subspace and allow for a subsequent reset into it. One could also extend the presented work to settings in which the qubit is driven, e.g., designing an agent to learn the stabilization of Rabi oscillations, in the spirit of the approach discussed in ref. 55. Finally, multi-qubit scenarios can be explored in the future, expanding on the techniques presented in this paper.

Understanding the scaling of neural networks with the size of the quantum system and overcoming hardware restrictions on FPGAs are important steps towards applying these methods to larger systems. Such advances will enable the discovery of new strategies for tasks like quantum error correction36,37,38 and many-body feedback cooling31,32,33,34.