Reinforcement learning pulses for transmon qubit entangling gates

The utility of a quantum computer depends heavily on the ability to reliably perform accurate quantum logic operations. For finding optimal control solutions, it is of particular interest to explore model-free approaches, since their quality is not constrained by the limited accuracy of theoretical models for the quantum processor, in contrast to many established gate implementation strategies. In this work, we utilize a continuous-control reinforcement learning algorithm to design entangling two-qubit gates for superconducting qubits; specifically, our agent constructs cross-resonance and CNOT gates without any prior information about the physical system. Using a simulated environment of fixed-frequency, fixed-coupling transmon qubits, we demonstrate the capability to generate novel pulse sequences that outperform the standard cross-resonance gates in both fidelity and gate duration, while maintaining a comparable susceptibility to stochastic unitary noise. We further showcase an augmentation in training and input information that allows our agent to adapt its pulse design abilities to drifting hardware characteristics, importantly with little to no additional optimization. Our results clearly demonstrate the advantages of unbiased, adaptive-feedback, learning-based optimization methods for transmon gate design.


I. INTRODUCTION
Quantum computing holds immense potential to revolutionize various fields, such as optimization, simulation, and cryptography, in some cases promising exponential computational speedup compared to its classical counterpart. However, a central obstacle to harnessing this potential is the challenge of realizing reliable quantum operations. Therefore, achieving high-fidelity quantum gates is a crucial prerequisite to unlocking the full potential of quantum computing for practical applications.
A common approach is to optimize control protocols based on effective models of the physical platform. With a suitable model at hand, high-fidelity strategies can often be achieved through analytical insights, gradient-based optimization methods, or error amplification techniques [1][2][3][4][5][6]. However, present-day quantum systems are characterized by substantial levels of noise, decoherence, and other environmental disturbances, for which accurate models are rarely known. Moreover, even when a good model is known, it is often not exactly solvable, limiting its usefulness for optimal quantum gate design. A route to circumvent these shortcomings is to resort to model-free approaches, which facilitate gate optimization through direct interactions with the quantum device.
Recently, an adaptive approach using reinforcement learning (RL) has become increasingly popular in quantum gate design due to its model-free nature, obviating the need for a precise description of all details of the system [7]. When trained in a simulation, the RL approach rivals optimal control techniques in synthesizing high-precision quantum gates using discretized control for qubit-based [8][9][10] and qudit-based [11,12] systems, including potentially drastic improvements in exploration [10] and sample efficiency [13]. Similarly, RL was found successful in optimizing continuous controls for a generic qubit model [14] as well as a hardware-specific gmon model [15]. Ref. [15] further demonstrates the resilience of the RL-designed pulse sequences to stochastic noise when optimized with knowledge of a noisy environment. Discrete control algorithms have also been adapted to successfully learn faster single-qubit gates from scratch using experimental data from IBM's superconducting platform [16,17]; Ref. [17] additionally uses RL to improve upon the standard structure of an analytical cross-resonance pulse sequence.
While many aspects of the RL approach are more broadly applicable, in this work we will specifically address the gate design problem for fixed-frequency, fixed-coupling transmon qubits. For this platform, the model-based approach has yielded valuable insights in the pursuit of crafting high-fidelity entangling gates by utilizing the cross-resonance interaction [18,19]. An effective approximate analytical model, capable of capturing the dominant Hamiltonian terms generated by the primary cross-resonance drive as well as undesired cross-talk [20], has paved the way for the development of various error suppression techniques, including echo sequences [21], selective-darkening/active cancellation [22][23][24], optimal control theory [25], rotary pulsing [26], and most recently derivative pulse shaping [27]. However, the intricate nature of real hardware and its inevitable imperfections persist, hindering our ability to achieve flawless quantum gate operations. Moreover, while the analytical insight motivates a specific family of control pulses, it remains unknown whether even better solutions can be found by expanding the considered protocol space.
As mentioned above, RL approaches have been explored to design entangling gates for the transmon platform. However, one of the inherent strengths of RL, its capability to discover innovative strategies free from the confines of theoretical protocol sequences, has remained underutilized. The absence of such flexibility results in lengthy pulse sequences which, in turn, impose severe limitations on the fidelity of these operations. Moreover, the ability of RL agents to learn adaptive strategies that include optimal reactions to the feedback received when they are deployed has so far received little attention. For example, although RL solutions display a degree of temporal robustness due to exposure to changing underlying system characteristics during training [17], explicitly leveraging the adaptability of the RL agent to deal with such fluctuations remains largely unexplored.
In this work, we address these and related open questions by deploying a continuous-control RL algorithm to construct piece-wise constant (PWC) pulse sequences for cross-resonance and CNOT gates without any prior knowledge about the controlled system. We emphasize that this model-free approach only requires feedback from the environment (simulated or experimental) and has no information about the physical model for the environment's dynamics. Our RL training agent only has access to the quantum state and the gate fidelity, which, in principle, can be obtained experimentally via tomography and fidelity benchmarking; however, in this work we train the RL agent in simulation. We tailor the simulated environment to fixed-frequency, fixed-coupling transmons using realistic system characteristics, so as to directly compare our RL results with the existing error suppression techniques in superconducting platforms. Note that in this particular transmon architecture, the qubit frequency depends on the fabrication of the transmon chip itself and cannot be controlled throughout the gate duration.
We first demonstrate that our unbiased RL agent is capable of generating novel high-fidelity control solutions that outperform current state-of-the-art cross-resonance pulse sequences. By effectively navigating the vast design space of multi-segment PWC functions to identify high-quality pulses for multiple continuous control drives simultaneously, our agent addresses an up to 120-dimensional control problem, compared to the 20-dimensional problem considered in Ref. [17]. Without compromising the fidelity, our agent additionally discovers control solutions with large drive amplitudes that reduce the gate duration by up to 30%, while remaining feasible to implement on modern NISQ devices. We further show how to augment our RL approach so that our agent can learn to adapt to drifts in the underlying hardware parameters (characteristics), a common issue that plagues near-term superconducting devices. This adaptation offers a twofold advantage: immediate, high-fidelity control solutions without any extra optimization when dealing with moderate drifts, or a reduction in the number of training iterations required to address more significant changes in hardware parameters. These findings underscore the practicality of the RL approach as a potent alternative for tackling the quantum gate design problem.
The remainder of this paper is organized as follows. We start by defining the quantum gate design problem for one- and two-qubit gates in Sec. II. We give a brief overview of state-of-the-art gate implementations in Sec. III. We then present our reinforcement learning approach in Sec. IV and our simulated results in Sec. V. Finally, in Sec. VI, we conclude and discuss future directions.

II. QUANTUM GATE DESIGN PROBLEM
A quantum gate design task aims to realize a logical operation over one or more qubits via optimizing a set of available time-dependent external control fields {d_j(t)} over some gate duration T. The effect of these fields is described by a control Hamiltonian H_ctrl(t) = H_ctrl[{d_j(t)}], while the intrinsic dynamics of the qubits is captured by the system Hamiltonian H_sys. Together, they generate the full unitary evolution

U = T exp( -i ∫_0^T [H_sys + H_ctrl(t)] dt ),

where T denotes the time-ordering operator. We measure the accuracy of approximation of the target operation U_target by the resultant unitary U via the average gate fidelity [28]

F_avg(U, U_target) = ( Tr[M M†] + |Tr M|² ) / ( n(n+1) ),

where M = U_qubit U†_target and U_qubit = Π_qubit U Π_qubit is the unitary map projected to the qubit subspace of dimension n. The average is taken over all initial states |ψ_0⟩ distributed uniformly according to the Haar measure. Here we focus on superconducting qubit platforms where local Z rotations can be performed virtually [29], i.e., without incurring any additional time. We include this degree of freedom in the unitary by augmenting U_qubit to V_Z(θ)U_qubit, in which the near-optimal angles θ are given by the matrix elements of M = U_qubit U†_target, see App. A 2.
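For concreteness, the projected average gate fidelity can be evaluated numerically. The sketch below is an illustrative implementation under the conventions above (a three-level transmon whose first two levels span the qubit subspace); the function name and truncation are ours, not from the paper.

```python
import numpy as np

def average_gate_fidelity(U, U_target, n=2):
    """Average gate fidelity F_avg = (Tr[M M^dag] + |Tr M|^2) / (n(n+1)),
    with M = U_qubit U_target^dag and U_qubit the projection of the full
    evolution U onto the n-dimensional qubit subspace (here: first n levels)."""
    U_qubit = U[:n, :n]                 # projected map, generally non-unitary
    M = U_qubit @ U_target.conj().T
    return (np.trace(M @ M.conj().T).real + abs(np.trace(M)) ** 2) / (n * (n + 1))

# Identity evolution on a 3-level transmon versus an identity target gate:
print(average_gate_fidelity(np.eye(3, dtype=complex), np.eye(2, dtype=complex)))  # -> 1.0
```

With leakage present, U_qubit loses norm and the first trace term drops below n, so leakage automatically penalizes this fidelity.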
We consider a target gate fidelity of 99.9%, whose corresponding error is an order of magnitude below that of the 99% surface code threshold, for two reasons. First, this level of fidelity is expected to enjoy a drastic reduction in the number of physical qubits when using the surface code [30]. Second, for the typical two-qubit gate durations considered in this work, e.g., 248.9 ns and 177.8 ns (the smallest time unit is the inverse sampling rate dt = 2/9 ns for the considered device), the gate fidelity is coherence limited at 99.9% and 99.93%, respectively. These limits are determined by computing the average gate error under a channel with the amplitude damping time T1 = 300 µs and the phase damping time T2 = 300 µs [31], which are achievable in current devices [32]. Thus, having in mind any realistic decoherence in the near term, our target fidelity of 99.9% coincides with the coherence-limited fidelities for the considered range of gate durations.
In addition to the average gate fidelity, we also investigated the worst-case fidelity as an alternative figure of merit. However, we did not find any discernible advantage and report this additional result in App. E 2.
In the following, we provide the explicit form of the Hamiltonian used to model superconducting transmon qubits, while state-of-the-art gate implementations are discussed next in Sec. III.

A. Single-qubit Hamiltonian
We begin by modeling a single transmon in the Duffing approximation [33], with the lab-frame Hamiltonian (ℏ = 1)

H_sys = ω b†b + (α/2) b†b†bb,

where ω and α denote the |0⟩ ↔ |1⟩ transition frequency and anharmonicity, respectively, and b, b† are ladder operators. This transmon can be driven at frequency ω_d via a control Hamiltonian

H_ctrl(t) = Ω_d Re[d(t) e^{iω_d t}] (b + b†),

where we have factored out the drive strength Ω_d to keep the real and imaginary parts of the complex control signal d(t) normalized to [−1, 1]. By rotating to the driving frame via the transformation R(t) = e^{−iω_d t b†b} and ignoring fast rotating terms (cf. App. A 1 for details), we arrive at the rotating-frame Hamiltonian

H_1(t) = (ω − ω_d) b†b + (α/2) b†b†bb + (Ω_d/2) [d(t) b† + d*(t) b].
To make the connection to single-qubit rotations explicit, we consider only the first two levels and drive on resonance (ω_d = ω), which reduces the rotating-frame Hamiltonian to

H_1(t) = (Ω_d/2) [Re d(t) X + Im d(t) Y],   (6)

where X and Y denote respectively the Pauli-X and Pauli-Y matrices. Evidently, turning on the complex control field d(t) induces qubit rotation around the x and y axes, which, for long enough gate duration, is sufficient for realizing any single-qubit gate. With the Euler-angle decomposition, one can achieve any desired single-qubit gate by merely calibrating the R_X(±π/2) rotations, and, as discussed above, Z gates can be implemented virtually. Note that in practice much shorter gate durations are desirable, in which case a number of errors (such as higher-state population [34]) will unavoidably arise and therefore need to be counteracted, see Sec. III.
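As a sanity check of this two-level picture, a short numerical sketch (the drive strength below is an illustrative value, not a device parameter): evolving under a constant real drive d = 1 for a time t = (π/2)/Ω_d reproduces R_X(π/2).

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)

def qubit_drive_hamiltonian(d, omega_d):
    """Two-level on-resonance drive: H = (Omega_d/2)(Re[d] X + Im[d] Y)."""
    return 0.5 * omega_d * (d.real * X + d.imag * Y)

def evolve(H, t):
    """U = exp(-i H t) via eigendecomposition of the Hermitian H."""
    evals, evecs = np.linalg.eigh(H)
    return evecs @ np.diag(np.exp(-1j * evals * t)) @ evecs.conj().T

# A constant real drive d = 1 applied for t = (pi/2)/Omega_d rotates the
# qubit about x by pi/2, i.e., U should match R_X(pi/2) = exp(-i (pi/4) X)
omega_d = 2 * np.pi * 50e6          # illustrative drive strength in rad/s
t = (np.pi / 2) / omega_d
U = evolve(qubit_drive_hamiltonian(1.0 + 0j, omega_d), t)
```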

B. Two-qubit Hamiltonian
We now extend the model to describe a pair of transmons by combining two Duffing Hamiltonians coupled by a resonator. The resonator acts as a bus for coherent communication between the quantum states of the two Hamiltonians that will define the qubits. Although a variety of different logical two-qubit gates are possible with this setup [35], we will discuss here the cross-resonance interaction Hamiltonian, which is the current standard for fixed-frequency architectures.
When the resonator's fundamental frequency is much larger than both |0⟩ ↔ |1⟩ transition frequencies of the individual transmons, we can project the Hamiltonian onto the zero-excitation subspace of the bus resonator to obtain the following lab-frame effective Hamiltonian:

H = Σ_{j=0,1} [ ω̃_j b_j†b_j + (α_j/2) b_j†b_j†b_j b_j ] + J (b_0†b_1 + b_0 b_1†).   (8)

Here, ω̃_j denotes the resonator-dressed qubit frequency and J denotes the effective coupling strength [20].
In addition to a standard on-resonance control field d(t) on each transmon, an entangling operation can be realized by driving one qubit at the frequency of another via a cross-resonance (CR) control field u(t). The two-transmon control Hamiltonian then becomes

H_ctrl(t) = Σ_{j=0,1} Ω_dj Re[d_j(t) e^{iω̃_j t}] (b_j + b_j†) + Ω_u01 Re[u_01(t) e^{iω̃_1 t}] (b_0 + b_0†) + Ω_u10 Re[u_10(t) e^{iω̃_0 t}] (b_1 + b_1†).   (9)

Here, u_01(t) refers to the cross-resonance pulse sent to qubit 0 when driven at the frequency of qubit 1, and vice versa for u_10(t). Moving into the frame rotating at ω_d for both transmons using the transformation R(t) = e^{−iω_d t (b_0†b_0 + b_1†b_1)}, and ignoring the fast rotating terms, we obtain the rotating-frame Hamiltonian

H_2(t) = Σ_{j=0,1} [ δ_j b_j†b_j + (α_j/2) b_j†b_j†b_j b_j + (Ω_dj/2)(d_j(t) b_j† + h.c.) ] + (Ω_u01/2)(u_01(t) b_0† + h.c.) + (Ω_u10/2)(u_10(t) b_1† + h.c.) + J (b_0†b_1 + b_0 b_1†),   (10)

where δ_j = ω̃_j − ω_d defines the detuning for the j-th transmon. The pair of transmons, whose dynamics is described by H_2(t), is illustrated in Fig. 1. In this work, we simulate the dynamics in the frame rotating at the second transmon's frequency, i.e., setting δ_1 = 0. With the first transmon as control and the second as target, the main effect of the cross-resonance drive can be studied by setting u_01 to a constant value Ω and the other control fields to zero:

H_CR = δ_0 b_0†b_0 + Σ_{j=0,1} (α_j/2) b_j†b_j†b_j b_j + (Ω/2)(b_0 + b_0†) + J (b_0†b_1 + b_0 b_1†).   (11)

To obtain the effective ZX interaction rate (or strength) within the qubit subspace while accounting for higher levels, one can employ perturbation theory [20] to obtain the following approximate effective CR Hamiltonian

H_eff = Σ_{A,B} (ω_AB / 2) A ⊗ B,   (12)

where A ∈ {I, Z} and B ∈ {I, X, Z}. In the presence of classical cross-talk and incorrect phases in control drives, B can be extended to include the Pauli-Y matrix [24]. Within perturbation theory for small coupling J and small drive Ω, the interaction rates scale to leading order as

ω_ZX, ω_IX ∝ J Ω,   ω_ZI ∝ Ω²,   ω_ZZ ∝ J².   (13)

The resultant dominant ZX term can then be used to implement the entangling operation

ZX(π/2) = exp(−i (π/4) Z ⊗ X),   (14)

known as the cross-resonance (CR) gate, which is locally equivalent to the popular CNOT gate. Such entangling operations, together with the capacity to realize any single-qubit gate, enable universal quantum computation in the superconducting transmon platform.
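The rotating-frame two-transmon Hamiltonian can be assembled numerically from truncated ladder operators. The sketch below is our own reading of the model; all parameter values are illustrative stand-ins, not those of Table II.

```python
import numpy as np

DIM = 3  # keep |0>, |1>, and the leakage level |2> of each transmon

def ladder(dim=DIM):
    """Annihilation operator b for a truncated anharmonic oscillator."""
    return np.diag(np.sqrt(np.arange(1, dim)), k=1)

def two_transmon_hamiltonian(delta, alpha, J, amps, strengths):
    """Rotating-frame two-transmon Hamiltonian: detunings delta, anharmonicities
    alpha, exchange coupling J, complex drive amplitudes amps = (d0, d1, u01, u10)
    with corresponding strengths Omega. Drive-line assignment is our assumption:
    d0 and u01 act through transmon 0, d1 and u10 through transmon 1."""
    b, I = ladder(), np.eye(DIM)
    b_ops = [np.kron(b, I), np.kron(I, b)]
    H = J * (b_ops[0].conj().T @ b_ops[1] + b_ops[0] @ b_ops[1].conj().T)
    for j in range(2):
        bd = b_ops[j].conj().T
        H = H + delta[j] * bd @ b_ops[j] \
              + 0.5 * alpha[j] * bd @ bd @ b_ops[j] @ b_ops[j]
    for amp, omega, j in zip(amps, strengths, [0, 1, 0, 1]):
        H = H + 0.5 * omega * (amp * b_ops[j].conj().T + np.conj(amp) * b_ops[j])
    return H

# Illustrative parameters in rad/s; a constant CR tone u01 = 0.5 is on
H = two_transmon_hamiltonian(
    delta=[2 * np.pi * 100e6, 0.0],
    alpha=[-2 * np.pi * 330e6, -2 * np.pi * 330e6],
    J=2 * np.pi * 3.8e6,
    amps=[0j, 0j, 0.5 + 0j, 0j],
    strengths=[2 * np.pi * 30e6] * 4)
```

Such a Hermitian 9x9 matrix can then be exponentiated segment by segment to produce the PWC evolution used throughout this work.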

C. Leakage
Although only the first two levels of a transmon are used to represent a qubit, the higher levels are nevertheless still present and can be populated as the system evolves. We capture the most prominent leakage outside the ideal computational qubit subspace by including the second excited state of the anharmonic oscillator |2⟩ (cf. Fig. 1). The full state space can be decomposed into a direct sum of the computational subspace χ_1 and the leakage subspace χ_2. The projectors onto these subspaces are denoted as I_1 and I_2, respectively [36].
Under a unitary quantum channel E(ρ) = UρU†, the state leakage averaged over all initial pure states in the qubit subspace is given by

L = Tr[ I_2 E( I_1 / dim(χ_1) ) ].
Here, we have used the fact that the average of |ψ_0⟩⟨ψ_0| results in the maximally mixed state I_1/dim(χ_1). For a system of two transmons, the computational subspace χ_1 is spanned by {|00⟩, |01⟩, |10⟩, |11⟩}, and thus dim(χ_1) = 4. Intuitively, the average leakage L quantifies the population fraction initially prepared in the computational subspace that ultimately ends up outside of this subspace [34]. While a prerequisite for achieving a high-fidelity quantum gate is to minimize leakage at gate completion, conventional anti-leakage schemes typically suppress leakage throughout the entire gate duration. This is due to the difficulty in restoring population into the computational subspace at the end, as well as (historically) disproportionately larger decoherence rates for higher levels. Nevertheless, high-fidelity control solutions, with considerable excursion beyond the computational subspace during the gate duration, do in fact exist and are achievable with the use of RL optimization, as we will demonstrate later, e.g., see the data reported for RL protocols in Fig. 7 and the corresponding discussion in Sec. V B.
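The average leakage formula can be evaluated directly. The sketch below assumes two three-level transmons with basis ordering |q0 q1⟩ mapped to index 3·q0 + q1, an ordering we choose for illustration.

```python
import numpy as np

def average_leakage(U):
    """Average leakage L = Tr[I2 U (I1/dim(chi1)) U^dag] for two 3-level
    transmons (9-dim space); chi1 is spanned by |00>, |01>, |10>, |11>,
    i.e., basis indices 0, 1, 3, 4 under the assumed ordering 3*q0 + q1."""
    comp = [0, 1, 3, 4]
    P1 = np.zeros((9, 9))
    P1[comp, comp] = 1.0                 # projector I1 onto chi1
    P2 = np.eye(9) - P1                  # projector I2 onto the leakage subspace
    rho = U @ (P1 / 4.0) @ U.conj().T    # evolve the maximally mixed qubit state
    return np.trace(P2 @ rho).real

print(average_leakage(np.eye(9)))  # -> 0.0: the identity causes no leakage
```

A unitary that swaps |11⟩ with |02⟩, for instance, leaks exactly one quarter of the averaged population, since one of the four computational basis states exits the subspace.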

D. Entanglement
In addition to the fidelity, an important goal of a two-qubit operation is to generate entanglement. Among a number of different options, we select a simple metric called the linear entropy, which quantifies the entanglement of a joint density matrix ρ describing the pure state of both qubit A and qubit B as follows:

S_lin(ρ) = 1 − Tr[ (Tr_B ρ)² ],

where Tr_A (Tr_B) denotes partially tracing out qubit A (B). To calculate the linear entropy of an initial state |ψ_0⟩ after a unitary operation U, we simply substitute ρ = U |ψ_0⟩⟨ψ_0| U†. When applied to a multilevel system like transmons, we make sure to normalize the final state after projecting to the qubit subspace.
To assess the entanglement capacity of a quantum gate, we draw inspiration from the widely adopted entangling power of unitary operations [37], defined as the average linear entropy produced by a unitary operator when acting on the space of all two-qubit product states. Since the average is taken over two single-qubit Haar measures instead of a single joint two-qubit Haar measure, it can be computed exactly using the set of tensor products of all six Pauli eigenstates {|0⟩, |1⟩, |±⟩, |±i⟩}. Of the resulting 36 two-qubit product states, 16 become maximally entangled while 20 remain separable under gates in the class of locally equivalent CNOT operations, including the ZX(π/2) gate. Within the scope of our work, the linear entropy averaged over these 16 initial states is sufficient to capture the entangling power of a unitary operation. We shall therefore define this quantity as the average linear entropy S̄_lin.
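A minimal sketch of the linear entropy for a two-qubit pure state, verified on a maximally entangling example: CNOT acting on |+⟩|0⟩ produces a Bell state, whose linear entropy is the maximal two-qubit value of 1/2.

```python
import numpy as np

def linear_entropy(psi):
    """S_lin = 1 - Tr[rho_A^2] for a normalized two-qubit pure state psi (length 4)."""
    m = psi.reshape(2, 2)        # amplitudes psi[a, b] for |a>_A |b>_B
    rho_A = m @ m.conj().T       # reduced density matrix of qubit A
    return 1.0 - np.trace(rho_A @ rho_A).real

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)
plus0 = np.kron(np.array([1, 1]) / np.sqrt(2), np.array([1, 0])).astype(complex)
s = linear_entropy(CNOT @ plus0)   # maximal for two qubits: 1/2
```

Averaging this quantity over the 16 entangling product states mentioned above gives the S̄_lin metric used in the text.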
In the context of driving one qubit at the frequency of another in order to implement an entangling gate, it is common to attribute the entanglement generated entirely to the use of such a cross-resonance drive. However, this might not be the case when an on-resonance drive is used simultaneously with the cross-resonance drive. As we shall see, studying the average linear entropy S̄_lin of optimized pulses implementing two-qubit gates reveals that in some of the control solutions discovered by RL, the roles of different drives are not as isolated as one might initially believe; e.g., see Fig. 9, indicating the existence of an entirely new class of solutions.
FIG. 2. Standard error suppression techniques for implementing gates on transmon-qubit devices. The analytical waveforms are discretized at the inverse sampling rate dt = 2/9 ns. a) RX(π/2) implemented using the DRAG scheme with an in-phase Gaussian pulse d(t) and its out-of-phase derivative (blue). ZX(π/2) implemented using b) an echoed pulse and c) an echo-free/direct pulse, consisting of a main cross-resonance pulse u01(t) (orange) along with on-resonance drives, d0(t) and d1(t), on the control (blue) and target (green and red) qubits.

III. STATE-OF-THE-ART IMPLEMENTATIONS OF TRANSMON GATES
To provide a meaningful point of comparison for our approach, we review the conventional methods for implementing quantum gates within a superconducting platform, and specifically the theoretical foundations underpinning each ansatz. As concrete examples, we showcase the standard error suppression techniques for both the single-qubit R_X(π/2) gate and the two-qubit ZX(π/2) gate in Fig. 2.
The most basic implementation of a single-qubit rotation around the x-axis involves driving the target transmon resonantly with a real-valued pulse envelope d(t), according to Eq. 6. Common choices for the pulse shape include Gaussian and Gaussian Square waveforms, as they offer smooth ramp-up and ramp-down. The standard error suppression approach utilizes an additional out-of-phase component equal to the derivative of the in-phase part, see Fig. 2a, which has been shown to significantly reduce gate error, including leakage to the second excited level. This is known as the Derivative Removal by Adiabatic Gate (DRAG) scheme [38,39], in which the amplitude of the real Gaussian pulse, the detuning, and the scaling factor of the imaginary derivative component can be optimized. Calibration of these parameters on current superconducting hardware can reliably achieve average gate fidelities above 99.95% [40].
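The DRAG waveform can be sketched as follows; only dt = 2/9 ns is taken from the text, while the amplitude, width fraction, and derivative scaling β are arbitrary illustrative values rather than calibrated parameters.

```python
import numpy as np

def drag_pulse(n_samples, dt=2 / 9, sigma_fraction=0.25, amp=0.5, beta=0.3):
    """Discretized DRAG envelope: an in-phase Gaussian plus an out-of-phase
    quadrature proportional to its time derivative. amp, sigma, and beta are
    the calibration knobs mentioned in the text; values here are illustrative."""
    t = np.arange(n_samples) * dt
    t0, sigma = t[-1] / 2, sigma_fraction * t[-1]
    gauss = amp * np.exp(-((t - t0) ** 2) / (2 * sigma ** 2))
    deriv = -(t - t0) / sigma ** 2 * gauss   # d/dt of the Gaussian
    return gauss + 1j * beta * deriv         # complex envelope d(t)

d = drag_pulse(160)   # a ~35.6 ns pulse at dt = 2/9 ns
```

The real part peaks at the pulse center while the imaginary (derivative) part is antisymmetric about it, which is the signature DRAG structure visible in Fig. 2a.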
For two-qubit entangling gates such as ZX(π/2), the standard implementation makes use of a cross-resonance pulse u_01 along with resonant drives (d_0, d_1) on both the control and the target qubit, according to Eq. 10. These components can be combined in an echoed or a direct fashion [40]. As illustrated in Fig. 2b, the echoed scheme employs a pulse sequence where the CR pulse is broken into two halves (yellow envelopes), with the second one inverted (Ω → −Ω) and positioned between two π-pulses applied to the control qubit. The amplitude inversion changes the sign of ω_ZX and ω_IX according to the relations in Eq. (13), while the addition of the two π-pulses can be understood as a conjugation by XI of every term in the effective CR Hamiltonian, leading to the following contribution from the second half:

(XI) H_eff(−Ω) (XI) ≈ (ω_ZX/2) ZX − (ω_IX/2) IX − (ω_ZI/2) ZI − (ω_ZZ/2) ZZ.

When combined with the first half, this should ideally lead to a complete cancellation of the unwanted IX, ZI, and ZZ terms. Nevertheless, experimental results reveal a significant IY component as well as a small ZY term, which can be attributed to classical crosstalk. This issue can be rectified by applying an on-resonance tone to the target qubit with a waveform identical to the CR pulse, known as active cancellation [24]. On the other hand, the direct scheme employs an echo-free sequence with the same symmetric active cancellation tone, while introducing an additional asymmetric rotary component.
In particular, the symmetric part reduces the effects of the IX and IY terms, whereas the asymmetric part helps offset the ZZ and ZY terms. For both schemes, calibration of the amplitudes and phases of the main CR pulse, in tandem with calibration of the additional tones, achieves between 99% and 99.7% average gate fidelity [24,40].
As seen in the above examples, the standard pulse designs rely heavily on a theoretical understanding of the platform, i.e., the types of interaction induced when certain control drives are activated or when certain error processes are present. On a real device, however, deviation from the theoretical model is unavoidable, and closed-loop optimization is required to mitigate the unwanted effects. Additionally, the perturbative approach of deriving the effective interaction rates breaks down at high control amplitudes, preventing exploration of potential solutions in the strong-drive, short-time regime.

FIG. 3. Basic reinforcement learning loop. The agent interacts with its environment (different from the conventional definition of environment in physics) by taking actions and in turn receiving information about the environment's new state. In addition, the agent receives a reward indicating the usefulness of the last action for achieving the given task.
While these theoretical ansätze offer the advantage of straightforward calibration procedures with a minimal number of parameters, they may also impose significant limitations and/or necessitate longer gate durations to compensate for errors not captured by the relevant theoretical model. Moreover, should previously unidentified errors come to light, it will be necessary to develop and implement novel error suppression strategies. Established alternative approaches usually involve gradient-based optimization, such as GRAPE [3], which still requires detailed knowledge about the model and access to the gradient of the loss function. A model-free approach like reinforcement learning is therefore highly desirable, since it offers adaptability to system dynamics by learning from "direct interactions", which we will define in the next section. Even when equipped with a relatively simple but flexible design space, such as piece-wise constant pulses, RL has the potential to unearth control solutions that are out of reach for conventional methods [10]. Furthermore, RL leaves us with a representation of the knowledge gained from the control problem, i.e., the agent, which can be reused and analyzed for additional insights (cf. Sec. V).

IV. REINFORCEMENT LEARNING
Reinforcement learning operates on the simple principle of trial and error. A generic problem involves an agent learning to make decisions to complete a task by interacting with an environment. Therefore, it is natural to formulate a reinforcement learning problem using a finite Markov decision process, in which a decision is made based solely on the current state of the system and not the entire history. We illustrate the basic reinforcement learning loop in Fig. 3. In this framework, at every step i, the agent selects an action a_i based on a probability distribution or policy π(a_i|s_i), conditioned on the current state of the environment, s_i. After execution, the agent observes a new state s_{i+1} along with a reward r_{i+1}, which indicates progress towards completing a particular task. The process terminates once the task is completed or the number of steps reaches a set limit, defining the end of an episode. Training the agent then involves running many of these episodes to gather experience, while exploration is encouraged by adding randomness to the action selection procedure (the "trial" process). At the same time, the policy is iteratively adjusted to maximize the expected cumulative reward E[Σ_i r_{i+1}] at the end of each episode, guiding the agent away from unproductive actions (the "error" process). Together, trial and error allow the RL agent to explore new actions effectively and eventually arrive at a highly rewarded behavior.
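The loop described above can be sketched in a few lines; the scalar toy environment and linear policy below are stand-ins for exposition only, not the transmon environment or the DDPG agent used in this work.

```python
import random

class ToyEnv:
    """Stand-in environment: the task is to steer a scalar state toward 0."""
    def reset(self):
        self.s, self.i = 1.0, 0
        return self.s

    def step(self, a):
        self.s += a
        self.i += 1
        done = self.i >= 10                     # fixed episode length
        reward = -abs(self.s) if done else 0.0  # sparse reward at episode end
        return self.s, reward, done

def run_episode(env, policy, noise=0.05):
    """One episode: act, observe, and collect (s, a, r, s') transitions."""
    s, done, transitions = env.reset(), False, []
    while not done:
        a = policy(s) + random.uniform(-noise, noise)  # exploration noise
        s_next, r, done = env.step(a)
        transitions.append((s, a, r, s_next))
        s = s_next
    return transitions

transitions = run_episode(ToyEnv(), policy=lambda s: -0.1 * s)
```

In an actor-critic method such as DDPG, these (s, a, r, s') tuples would be stored in a replay buffer and sampled to update the policy and value networks.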
An agent trained exclusively on a single environment typically excels only within that specific context, making it less adaptable when confronted with a new environment. To mitigate this limitation, one effective strategy is to expand the agent's training scope to encompass a variety of environments. Moreover, equipping the agent with some context information about its current environment can significantly enhance its learning process and overall decision-making capability. This framework is commonly referred to as reinforcement learning with context [41], and it has been demonstrated to be particularly valuable for tasks that require generalization to a range of environment parameters.
We now adapt the RL framework to designing quantum gates, in which we task an agent to build a piece-wise constant (PWC) pulse to realize a target operation on a transmon environment, as illustrated in Fig. 4. In the following, we detail our simulated environment, motivate our choice of states, actions, and rewards, and describe the selected RL algorithm.

A. Environment
Our environment simulates the dynamics of two transmons according to the Hamiltonian in Eq. 10, considering them as directly coupled anharmonic oscillators which can be controlled via external microwave pulses (cf. the "Environment" box in Fig. 4). The first two levels of the oscillators act as qubits, and the main contribution of leakage out of the qubit subspace is captured via the inclusion of the third level. The environment is then completely characterized by a set of system parameters, including the detunings {δ_0, δ_1}, anharmonicities {α_0, α_1}, control drive strengths {Ω_d0, Ω_u01, Ω_d1, Ω_u10}, and coupling J, which we collect into a single vector p⃗ = [J, Ω_u01, ...]. We denote the main set of system parameters used in this work as p⃗_0, whose components are summarized in Table II. Any drifts in the system characteristics are considered with respect to p⃗_0 via the relative change ∆p⃗/p⃗_0 = (p⃗ − p⃗_0)/p⃗_0, where we have used an element-wise vector division.
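The relative-drift vector is a simple element-wise computation; the parameter values below are hypothetical placeholders, not the entries of Table II.

```python
import numpy as np

# Nominal parameters p0 = [J, Omega_u01, Omega_d1, Omega_u10] and a drifted
# set p (all values here are illustrative placeholders, in Hz)
p0 = np.array([3.5e6, 50e6, 50e6, 30e6])
p = np.array([3.6e6, 49e6, 51e6, 30e6])

drift = (p - p0) / p0   # element-wise relative change fed to the agent
```

Normalizing by p⃗_0 keeps every entry at the same few-percent scale regardless of the wildly different magnitudes of J and the drive strengths, which is friendlier for a neural-network input.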
Action: The RL agent interacts with the transmon environment by directly modifying the complex-valued control pulses [u_01(t), d_1(t), ...]. Using the PWC ansatz, pulse shaping is equivalent to picking an amplitude A at each discrete time step ∆t until an N-segment pulse is complete, resulting in a gate duration T = N∆t. To maintain experimental viability and avoid unrealistic oversampling, the time step is chosen such that 1/∆t remains below the sampling rate (and bandwidth) of standard control electronics [42], i.e., ∆t > dt = 2/9 ns in this work. From the RL point of view, each complete pulse constitutes an episode, after which the environment is reset so a new pulse can be tried out. Allowing the pulse amplitude A to take any value at every step tends to result in highly volatile pulses, similar to those typically obtained from unconstrained optimal control using methods like GRAPE [3]. Instead, we aim for slowly varying solutions by defining the agent's action to be the relative amplitude change and restricting it to some continuous window [−w, w]. By setting u_01,i ≡ u_01(i∆t) and d_1,i ≡ d_1(i∆t), we can write the action at step i as

a_i = [∆u_01,i, ∆d_1,i, ...],   u_01,i+1 = u_01,i + ∆u_01,i,   d_1,i+1 = d_1,i + ∆d_1,i,   (18)

where the real and imaginary parts of each component are restricted to a drive-dependent window, |Re ∆u_01,i|, |Im ∆u_01,i| ≤ w_u and |Re ∆d_1,i|, |Im ∆d_1,i| ≤ w_d. Hence, the action space dimension corresponds to twice the number of available control fields, as shown in Fig. 4a. By choosing the windows w_u and w_d to be small, e.g., less than 10% of the maximum allowed range, we systematically restrict the action space, which additionally improves learning. Finally, we clip the resultant amplitudes to [−1, 1] to ensure that the control fields do not exceed the maximum allowed drive strengths.
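The action-to-pulse construction can be sketched as follows, with the policy replaced by a random stand-in; the window size and segment count are illustrative choices.

```python
import numpy as np

def apply_action(amps_prev, action, windows):
    """Bounded relative update of the PWC amplitudes: the agent's increments
    are clipped to [-w, w] per quadrature, and the result to [-1, 1]."""
    deltas = np.clip(action, -windows, windows)
    return np.clip(amps_prev + deltas, -1.0, 1.0)

# Build an N-segment pulse for two drives (u01 and d1 -> 4 real quadratures)
N, w = 40, 0.1
windows = np.full(4, w)
amps = np.zeros(4)
segments = []
for i in range(N):
    action = np.random.uniform(-0.2, 0.2, size=4)  # stand-in for the policy
    amps = apply_action(amps, action, windows)
    segments.append(amps.copy())
pulse = np.array(segments)   # shape (N, 4): the finished PWC sequence
```

By construction, neighboring segments never differ by more than w per quadrature, which is what enforces the slowly varying character of the learned pulses.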

State:
The evolution of the transmon system due to external control fields and internal dynamics is characterized by a unitary map U(t, 0) computed from the Hamiltonian in Eq. 10. Given a set of basis states {ψ_j}, the evolution of an arbitrary pure initial state reads

|ψ(t)⟩ = U(t, 0) |ψ(0)⟩ = Σ_j c_j |ψ_j(t)⟩,   |ψ(0)⟩ = Σ_j c_j |ψ_j⟩.

Thus, tracking the time-evolved unitary map is equivalent to tracking the time-evolved basis states {ψ_j(t)}. As we aim to design target operations between two-level systems, we assume no occupation beyond the qubit subspace initially. That means, for example, in the single-qubit gate case, it is sufficient to track only the following basis states:

ψ_0(t) = U(t, 0) |0⟩,   ψ_1(t) = U(t, 0) |1⟩,

FIG. 4. Reinforcement learning for designing high-fidelity quantum gates. The RL framework involves two main entities: the environment, which is a system of two coupled transmons simulated as anharmonic oscillators truncated at three energy levels, and the RL agent, which uses the DDPG algorithm for learning continuous control drives. We focus on learning 2 control drives (cross-resonance u01 and qubit 1 rotation d1) in the main text, and report additional results for including a third control drive (qubit 0 rotation d0) in App. E 1. a) Step 1: Collecting data. At every step, the current state s of the environment is characterized by the time-evolved quantum state of the transmons {ψj(t)}, the previous control pulse amplitudes Aprev, and the relative changes in system parameters ∆p⃗/p⃗0. Based on that state s, the RL agent proposes an action a to determine control drive amplitudes that evolve the transmon environment forward in time. The environment outputs the next state s′ and a fidelity-based reward r (cf. Eq. 24), and the transition tuple (s, a, r, s′) is stored to an Experience Replay Buffer. An episode is complete when the RL agent fully constructs an N-segment pulse, and data from many episodes are collected for training. Here we consider a sparse reward scheme, meaning a non-zero reward is given only at the end of each episode. In addition, during data collection, some noise is injected into the RL agent's action to encourage exploration of new control solutions (cf. Alg. 1). b) Step 2: Training. Transition data from the Experience Replay Buffer are randomly sampled for batch-training the two networks in the DDPG algorithm: a value network Q, which learns to accurately predict the expected cumulative reward Q(s, a) of taking an action a from a state s, and a policy network µ, which learns to propose an action a = µ(s) that maximizes this Q-value. Outside of this training process, the RL agent typically refers to the policy network µ because it generates all of the agent's actions. c) Step 3: Testing. Once trained, the RL agent can deterministically construct pulses with fidelity ≳ 99.9%, not only for a fixed environment, but also for environments whose parameters have drifted.
Here, the simulation is truncated at three levels per transmon. Complete knowledge of the evolved basis states {ψ_j(t)} at every step allows the agent to discern the effect of its actions on the environment. Because we restrict the action space to contain only relative changes in the control fields, as in Eq. 18, we also include the pulse amplitudes A_prev of the previous time step in the state provided to the agent. In addition to designing gates for a fixed environment, we also wish to generalize the agent's design capability to environments where the system parameters have drifted from their original values. While the agent can indirectly discern this change through the evolution of the quantum state, we have observed that furnishing the agent with explicit information about the current system characteristics can enhance its learning process. Instead of directly feeding the agent the system parameter vector ⃗p, whose entries can take a wide range of values, we provide the same context information via the relative change in system parameters ∆⃗p/⃗p_0, transforming the RL input state into s = ({ψ_j(t)}, A_prev, ∆⃗p/⃗p_0), where ⃗p_0 denotes the original values in Table II.

Reward: RL approaches typically utilize a reward provided at every step to incentivize the agent to learn the correct actions. Alternatively, the agent can learn from a single reward granted at the end of each episode. In the case of a fidelity-based objective, and when considering a closed-loop implementation on an actual device, this sparse reward scheme demands fewer measurements during intermediate steps, making it more experimentally favorable. With this in mind, we have opted for the sparse reward scheme and have found it adequate for the agent's learning process. As the fidelity approaches unity, improvements tend to slow down, yielding increasingly marginal gains. To enhance the discernibility of positive signals, we define the reward to be the negative log infidelity at the final time step, r = −log₁₀(1 − F).
Here, it is important to note that an improvement of one unit in the reward corresponds to a one-order-of-magnitude enhancement in fidelity, e.g., r: 2 → 3 corresponds to F: 0.99 → 0.999.
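As a minimal illustration of this scaling, the sparse reward can be computed as:

```python
import math

def sparse_reward(fidelity):
    """Sparse reward: negative log10 infidelity, granted at the final step."""
    return -math.log10(1.0 - fidelity)

# one unit of reward corresponds to one order of magnitude in fidelity
r1 = sparse_reward(0.99)
r2 = sparse_reward(0.999)
```

Here `r1` evaluates to 2 and `r2` to 3 (up to floating-point error), matching the mapping described in the text.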

B. Algorithm
In our pursuit of designing quantum gates via pulse shaping, we have established a large design space of PWC functions for the RL agent to explore. We require an algorithm capable of handling continuous-valued actions to fully harness the flexibility of this design space for achieving high-fidelity solutions, and also to effectively utilize continuous control resources in realistic hardware. Furthermore, given the limited access to near-term quantum devices, an algorithm with efficient training data usage is highly desirable. Thus, we select the Deep Deterministic Policy Gradient (DDPG) algorithm, which satisfies all of these criteria [43].
We begin by laying the groundwork for DDPG, which is rooted in the concept of Q-learning. A Q-value quantifies the expected cumulative reward associated with taking an action from a specific state and subsequently following a particular policy π thereafter. In reinforcement learning, the expected cumulative reward is commonly discounted over future time steps in order to incentivize the agent to complete its objective faster and thereby receive a higher reward. Formally, the Q-value for a state-action pair (s_i, a_i) at time step i under a policy π(a|s) is defined as Q^π(s_i, a_i) = E_π[Σ_{k≥0} γ^k r_{i+k+1}]. Here, γ ∈ [0, 1] is the discount factor, and the expectation value E is taken over actions selected using the policy π. With the current action a_i already selected, there is only one possible value for the immediate reward r_{i+1} = r_{i+1}(s_i, a_i), allowing us to take it out of the expectation and substitute the Q-value definition for the next state-action pair. The optimal strategy is then to pick the highest-valued action at every step, which leads to the recursion relation for the optimal Q-value Q*, Q*(s_i, a_i) = r_{i+1} + γ max_{a′} Q*(s_{i+1}, a′), also known as the Bellman optimality equation [44]. As the dependence on the policy π is removed in the above, the optimal Q-value can be iteratively updated using any transition data tuple (s_i, a_i, s_{i+1}, r_{i+1}) regardless of the collecting policy, a process commonly known as off-policy training. In practice, observed transitions are stored in a replay buffer from which a mixture of new and old transitions is sampled to train the agent. A typical replay buffer stores about half a million transitions, allowing much more efficient re-use of old data compared to other RL methods.
When the numbers of discrete states and discrete actions are not too large, the corresponding Q-values are stored in a finite-size table that can be iteratively updated. As the state space becomes continuous, one instead approximates the optimal Q-value by a deep neural network with parameters ϕ as Q_ϕ ≈ Q*. To adapt this deep Q-learning method to continuous actions, DDPG additionally utilizes a deterministic policy network with parameters θ for action generation: a_i = μ_θ(s_i). During training, a noise process is injected into the agent's action to encourage exploration, as seen in Fig. 4a. With the main goal of maximizing the expected cumulative reward, we want not only a policy network that can generate actions with high Q-values, but also a value network that approximates the Q-values well according to Eq. 26, which leads to the following update rules: the value network is trained to minimize the loss [Q_ϕ(s_i, a_i) − y_i]² with the target y_i = r_{i+1} + γ Q_{ϕ′}(s_{i+1}, μ_{θ′}(s_{i+1})), while the policy network is updated to maximize Q_ϕ(s_i, μ_θ(s_i)), for each transition data tuple (s_i, a_i, s_{i+1}, r_{i+1}) (Eqs. 27). From these update rules, it is clear that updating one network changes the loss function of the other, creating a moving-target problem. Therefore, when computing the targets on the right-hand sides of Eqs. 27, we employ target networks (ϕ′, θ′) that slowly track the learned networks (ϕ, θ), which minimizes the effect of fast-moving targets. The complete algorithm is detailed in App. B. DDPG is known to struggle when the action space gets too large, which leads to exploding Q-values during training. One solution is to train two Q-networks and use the smaller value when computing the targets in Eqs. 27 to mitigate Q-value overestimation (a.k.a. the twin network trick). Another solution is to delay the policy network update so that the Q-networks converge better in between (a.k.a. the delayed policy trick). These techniques, when combined with standard DDPG, result in an augmented algorithm commonly known as Twin Delayed DDPG (TD3) [45]. Unfortunately, we could not obtain conclusive evidence as to whether TD3 outperforms DDPG in all situations for our problem. Therefore, in the main text we focus on DDPG and delegate discussion of a case in which TD3 provides an advantage to App. E 1.
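The target computation, twin-critic minimum, and slow target-network tracking can be sketched with toy linear critics; all names and the toy network form are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, tau = 0.99, 0.005          # discount factor and target-tracking rate

# toy linear "critics": Q(s, a) = w . [s, a]; two of them for the twin trick
def q_value(w, s, a):
    return float(w @ np.concatenate([s, a]))

w1, w2 = rng.normal(size=4), rng.normal(size=4)   # learned critic weights
w1_t, w2_t = w1.copy(), w2.copy()                 # target critic weights

def td3_target(r, s_next, a_next):
    # use the smaller target-critic estimate to curb Q-value overestimation
    q_min = min(q_value(w1_t, s_next, a_next), q_value(w2_t, s_next, a_next))
    return r + gamma * q_min

def soft_update(target_w, learned_w):
    # target networks slowly track the learned ones: w' <- (1-tau) w' + tau w
    return (1.0 - tau) * target_w + tau * learned_w

y = td3_target(1.0, np.array([0.1, -0.2]), np.array([0.3, 0.0]))
w1_t = soft_update(w1_t, w1)
```

The small `tau` makes the targets move slowly, which is precisely what mitigates the moving-target problem described above.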

V. RESULTS
Here we report our main results of utilizing the DDPG algorithm for continuous control to solve the two-qubit gate design problem for superconducting transmon qubits. To ensure the consistency and reproducibility of the RL approach, we repeat each case multiple times with different seeds. Each seed leads to a different set of random generators used for initializing and optimizing neural network parameters, sampling experience from the replay buffer, and injecting exploration noise into the agent's actions during training. Unless stated otherwise, the reported results appear typical within a handful of realizations. The best cases are discussed here in the main text, while training data over multiple seeds are included in App. C. We summarize the relevant training hyperparameters in Table I and the main set of system parameters in our quantum simulator in Table II.
We juxtapose the RL-designed strategies with the conventional error suppression schemes in Sec. III, where we employ the Nelder-Mead method to optimize the relevant parameters in each ansatz to maximize the average gate fidelity. For the single-qubit DRAG scheme, we simultaneously optimize 2 parameters for d(t): the amplitude of the real Gaussian pulse and the scaling factor of the imaginary derivative pulse. For two-qubit entangling gates, we find that first optimizing the amplitudes of the Gaussian Square pulses for u_01(t) and d_1(t), and then optimizing their phases, yields the best result.
In particular, the echoed scheme consists of two tunable Gaussian Square pulses, the cross-resonance drive (u_01) and the target cancellation tone (symmetric part of d_1), constituting a 4-parameter optimization problem. Meanwhile, the direct scheme contains an additional target rotary tone (asymmetric part of d_1), increasing the number of optimizable parameters to 6. These are in contrast with the 18-dimensional (single-qubit gate) and 80-dimensional (two-qubit gate) control problems addressed by our RL approach that we will see shortly.

FIG. 6. Pulses designed by our RL agent appear considerably different from the direct scheme, in both pulse shape and quantum state dynamics. Furthermore, our RL agent manages to shorten the gate duration to 177.8 ns without compromising 99.9% fidelity. RL training hyperparameters are given in the "Fixed Environment" section of Table I.
Our main results can be summarized as follows. First, we demonstrate that an RL agent can be trained via direct interaction with a simulated environment to successfully explore the vast design space of PWC functions (Sec. V A). Although the environment is treated as a black box, the PWC time step is chosen such that 1/∆t is below the sampling rate (and bandwidth) of standard control electronics [42], to avoid unrealistic oversampling. The discovered strategies are unbiased by prior theoretical knowledge and distinct from the established analytical solutions in Sec. III. Second, we illustrate the benefit of RL optimization with the flexible PWC ansatz in finding high-fidelity control solutions at shorter gate durations (Sec. V B). We then assess the novelty in the roles played by each drive (Sec. V C), followed by the robustness of optimized pulses to short-timescale stochastic noise (Sec. V D). Finally, we augment our agent to generalize and adapt to drifts in system characteristics (Sec. V E), making use of the retained representation of acquired knowledge, an advantage of RL over conventional control algorithms.
A. Learning without prior knowledge

Single-qubit gate in two-transmon setting
As a benchmark for our approach, we start by tasking the RL agent to learn the single-qubit π/2 rotation around the x-axis in a two-transmon setting, given by IX(π/2) = I ⊗ exp(−i(π/4)X). In Fig. 5, we report an RL-designed 9-segment pulse that implements the IX(π/2) gate using a single control drive d_1. The agent learns to construct both the real and imaginary parts of the pulse, thus tackling an 18-dimensional optimization problem. The RL-designed pulse achieves a 10 ns gate duration, over 3× faster than the 35.6 ns DRAG pulse. With triple the maximum pulse amplitude, the 10 ns pulse maintains a comparable fidelity of 99.9%, despite having leakage larger by a few orders of magnitude during intermediate steps, as seen in Fig. 5d. The data therefore suggest that the RL agent learns to exploit the presence of the second excited state to the advantage of reducing the gate duration. Such a speed-up already offers a significant reduction in operating time, as general quantum circuits consist mostly of single-qubit gates.
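For reference, the target unitary and a fidelity check can be written down numerically; the code below uses the standard average-gate-fidelity formula F = (|Tr(U†V)|² + d)/(d(d + 1)), which we assume as a generic metric rather than the paper's exact one:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
I2 = np.eye(2, dtype=complex)

# Rx(pi/2) = exp(-i pi X / 4) = cos(pi/4) I - i sin(pi/4) X
rx = np.cos(np.pi / 4) * I2 - 1j * np.sin(np.pi / 4) * X
target = np.kron(I2, rx)            # IX(pi/2): identity on qubit 0

def avg_gate_fidelity(u, v, d=4):
    """Average gate fidelity F = (|Tr(U^dag V)|^2 + d) / (d (d + 1))."""
    tr = np.trace(u.conj().T @ v)
    return float((abs(tr) ** 2 + d) / (d * (d + 1)))

# a perfect implementation scores 1, and a global phase does not matter
f_perfect = avg_gate_fidelity(target, np.exp(1j * 0.3) * target)
```

Because only |Tr(U†V)| enters the formula, global phases of the implemented unitary are correctly ignored.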

Two-qubit entangling gates
Applying the same algorithm to 20-segment waveforms, our RL agent successfully learns 248.9 ns pulses that implement the ZX(π/2) and CNOT gates completely from the ground up, achieving fidelity F > 99.9%. The agent constructs complex-valued pulses for a cross-resonance drive u_01 and an on-resonance drive on the target transmon d_1, constituting an optimization problem of dimension 80 (20 segments × 2 drives × 2 real numbers). In fact, our agent also finds equally high-fidelity solutions to an even higher-dimensional optimization problem when given access to three drives (d_0, u_01, d_1). However, these 3-drive control solutions require the ability to send two pulses at different frequencies simultaneously to the same transmon, which, to the best of our knowledge, is not a commonly used technique. Therefore, we defer the discussion of 3-drive results to App. E 1 and focus on constructing pulses using only the two drives (u_01, d_1) in our main results section.
In Fig. 6, we present a clear-cut contrast between the state-of-the-art direct scheme and the RL approach for the ZX(π/2) gate. First, the RL pulse envelope goes beyond the Gaussian Square structure of the direct scheme while having a higher maximum amplitude and a more prominent imaginary part, as seen in Fig. 6a-b. In Fig. 6c-d, we illustrate the evolution of the Bloch coordinates of the target qubit initialized at |0⟩, for two cases: when the initial control qubit state is either |0⟩ (blue) or |1⟩ (red). Maximal entanglement is achieved when these two time-evolved target qubit states are exactly opposite on the Bloch sphere. In the direct scheme, the conditioned target qubit states start out by rotating together around the x-axis at slightly different speeds. They increase their distance after a couple of revolutions and stop at their final destinations on each end of the y-axis. By contrast, the RL scheme appears to bring the states directly to their respective destinations with minor course corrections in between; this has to do with our choice of allowed action windows w_u and w_d for the u_01 and d_1 drives. We find that setting w_u = 10 w_d, in this case, yields the best training performance, which inadvertently restricts our agent to solutions with a considerably weaker drive on the target qubit. More interestingly, the evolution roughly splits into two parts (cf. middle and right columns of Fig. 6c-d): the state conditioned on 1 (red) moves while the state conditioned on 0 (blue) remains approximately stationary in the first half of the pulse sequence, and vice versa in the second. These observations suggest that the strategies learned by our RL agent are fundamentally different from the standard analytical protocols.

B. Achieving shorter gate duration
By extending our gate design study to different gate durations, we find that the RL approach, coupled with the flexibility of the PWC waveform, consistently outperforms the optimized direct and echoed schemes. In Fig. 7a, we observe that fidelities obtainable using these standard approaches drop below 99.9% when their gate durations approach 213 ns and 320 ns, respectively. Meanwhile, the RL pulse duration can be shortened significantly, down to 177.8 ns, while maintaining the same performance.

FIG. 7. Fidelity and leakage of optimized ZX(π/2) pulses at different gate durations. We compare results for the direct, echoed, and RL schemes. (a) Best fidelity achieved over a dozen runs as a function of gate duration. Short vertical lines mark the approximate entangling times obtained via numerical block-diagonalization for constant pulses with the average amplitudes of the three data points marked by the colored boxes. (b) Maximum population leakage throughout the gate duration for the same set of pulses. For the full evolution of population leakage throughout the gate duration, see Fig. 15. The increase in maximum population leakage coincides with the decrease in fidelity for the direct and echoed schemes at shorter gate durations. RL-designed pulses, however, maintain 99.9% fidelity down to 177.8 ns gate duration, potentially making use of large population leakage. RL training hyperparameters are given in the "Fixed Environment" section of Table I.
We first compare the gate durations of optimized control solutions with the approximate entangling time τ defined for a constant pulse of amplitude Ω. For the ZX(π/2) gate, we have τ(Ω) = (π/2)/ω_ZX(Ω), where ω_ZX is the effective ZX interaction rate obtained via numerical block-diagonalization of the two-transmon Hamiltonian in Eq. 11 (cf. Ref. [20] for more details). We compute τ(Ω) for the three cases considered in Fig. 6 using their average amplitudes, which are 58 MHz, 65 MHz, and 136 MHz, respectively. These approximate entangling times are displayed as short vertical lines in Fig. 7a and are color-coded to the corresponding points on the graph. On the one hand, we observe that τ practically equals the gate duration for the 248.9 ns direct pulse, which is not surprising as its Gaussian Square waveform can be well approximated by a constant pulse. On the other hand, for the RL pulses, the gate durations and approximate entangling times no longer agree, which can be attributed to their considerably more complicated PWC waveforms. This observation suggests non-trivial dynamics in the control solutions discovered by our RL agent, which we investigate in the following by examining the amount of population leakage as well as the evolution of entanglement generated in several initial states and rotation angles.
While the target operation involves only the first two levels, they are not isolated from other excited states, and the input control fields inevitably drive some population out of the computational subspace. In Fig. 7b, we show the maximum leakage at different gate durations and observe that an increase in leakage for direct and echoed pulses coincides with a decrease in fidelity. By contrast, RL-designed pulses manage to preserve their performance despite experiencing large state leakage, which inevitably arises as the agent explores high-amplitude solutions in order to shorten the necessary entangling time. Our results suggest a prominent presence of leakage processes beyond the effective model, and that our RL agent has found a way to make good use of them. Indeed, such behavior is supported by prior research indicating that improved gate speed can be attributed to the more strongly coupled higher energy levels [46].
Within the computational subspace, we first note the deviation from the effective model by looking at the entanglement generated in two initial states, |00⟩ and |10⟩; they are expected to remain separable throughout the evolution under the effective CR Hamiltonian given in Eq. 12. We show the evolution of the linear entropy S_lin of these states in Fig. 8a for both the direct and RL schemes, and observe small but non-zero amounts of entanglement. For the former, we observe more entanglement generated in the |10⟩ state, which can be attributed to the pulse ramp-up and ramp-down not taken into account in the effective Hamiltonian analysis [20]. For RL-designed pulses, on the other hand, especially the one with shorter gate duration, both states become considerably more entangled at intermediate time steps, indicating the presence of entangling processes beyond those expected from the effective CR model. This suggests that our RL agent has managed to remove these unwanted entangling processes by gate completion in order to achieve a high final fidelity.
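A minimal sketch of one common convention for the linear entropy of a two-qubit pure state, S_lin = 1 − Tr(ρ²) of the reduced state (the paper's precise normalization may differ), is:

```python
import numpy as np

def linear_entropy(psi):
    """S_lin = 1 - Tr(rho_A^2) for one qubit of a two-qubit pure state."""
    m = np.asarray(psi, dtype=complex).reshape(2, 2)   # amplitudes c_{jk}
    rho_a = m @ m.conj().T                             # reduced density matrix
    return float(1 - np.real(np.trace(rho_a @ rho_a)))

sep = np.kron([1, 0], [1, 0])                # |00>, a separable state
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)   # a maximally entangled state
```

With this convention, a separable state gives S_lin = 0 and a maximally entangled two-qubit state gives S_lin = 1/2.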
For a more detailed picture of both single-qubit rotation and entangling processes, we take a closer look at the strengths of different interactions in the Pauli basis as a function of time. To do so, we first compute the averaged Hamiltonian by taking the logarithm of the unitary U(t, 0) at time t, H_avg(t) = (i/t) ln U(t, 0), and project it onto the qubit subspace to obtain H^qubit_avg. We expand tH^qubit_avg in the Pauli basis P_i ∈ {I, X, Y, Z} as tH^qubit_avg = Σ_ij (θ_ij/2) P_i ⊗ P_j, where θ_ij defines the rotation angle that depends on the P_i ⊗ P_j interaction strength and duration t. We can then invert the relation and compute the rotation angle as θ_ij = (1/2) Tr[(P_i ⊗ P_j) tH^qubit_avg]. Computing ln U(t, 0) amounts to choosing an appropriate branch cut to obtain sensible results for the time-dependent rotation angles, a procedure we discuss in App. D 2. It is important to note that these branch cuts are chosen to reveal a semi-stable increase in θ_ZX, which in turn results in clear discontinuities in other angles.
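Assuming a θ_ij/2 coefficient per Pauli term (an assumption on the paper's normalization), the rotation angles can be extracted as follows; note that the naive principal branch used here does not reproduce the paper's App. D 2 branch-cut choice for a semi-stable θ_ZX:

```python
import numpy as np

PAULI = {"I": np.eye(2, dtype=complex),
         "X": np.array([[0, 1], [1, 0]], dtype=complex),
         "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
         "Z": np.array([[1, 0], [0, -1]], dtype=complex)}

def effective_th(u):
    """Return tH such that U = exp(-i tH), using the principal branch."""
    vals, vecs = np.linalg.eig(u)
    return vecs @ np.diag(-np.angle(vals)) @ np.linalg.inv(vecs)

def rotation_angle(u, label_i, label_j):
    """theta_ij in the expansion tH = sum_ij (theta_ij / 2) P_i x P_j."""
    p = np.kron(PAULI[label_i], PAULI[label_j])
    return float(np.real(np.trace(p @ effective_th(u))) / 2)

# sanity check: U = exp(-i (pi/4) Z x X) should give theta_ZX = pi/2
zx = np.kron(PAULI["Z"], PAULI["X"])
u = np.cos(np.pi / 4) * np.eye(4, dtype=complex) - 1j * np.sin(np.pi / 4) * zx
theta_zx = rotation_angle(u, "Z", "X")
```

The sanity check recovers θ_ZX = π/2 for an ideal ZX(π/2) generator, consistent with the expansion convention above.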
We can now analyze these rotation angles θ_ij to reveal the emergent interactions in greater depth. In Fig. 8b, we show the time evolution of the rotation angles for the XX and YX interactions. Their non-negligible presence, even in the direct scheme, is not expected from the effective CR model given in Eq. 12, which further supports our previous observation of the linear entropy in Fig. 8a. The saw-tooth time evolution of these interactions corresponds to the off-resonance precession of the control qubit, and this pattern differs for all three cases presented. It is worth noting that the short 177.8 ns RL pulse results in the largest precession rate as well as the most prominent XX and YX interactions, which we attribute to its significantly higher drive amplitude, as seen in Fig. 6. We thus confirm that the more complex waveforms in RL pulses feature an increased presence of entangling processes beyond the ZX term.
In Fig. 8c-d, we also display several other interactions that are expected from the effective CR model, including ZX, IX, IY, and IZ. First, we observe a similar gradual accumulation of θ_ZX across the board. The target qubit rotations, however, look completely different. In the direct scheme, an active cancellation tone was introduced mainly to suppress a large unwanted IX term generated by the bare CR pulse. The inclusion of the asymmetric rotary part provides additional degrees of freedom to suppress more unwanted terms. While having some success, these techniques are limited by the rigidity of the Gaussian Square waveform. This can be seen from the monotonic evolution of the IX, IY, and IZ rotation angles in the direct pulse, as illustrated in the left column of Fig. 8c-d. By contrast, the evolution of these rotation angles in RL pulses, shown in the middle and right columns of Fig. 8c-d, appears non-monotonic and significantly more flexible throughout the gate duration, suggesting a much more powerful error suppression potential resulting from the PWC waveform. Indeed, we observe that our RL agent successfully brings all unwanted interactions close to zero at gate completion [cf. Fig. 17 for the remaining terms], even in the case of the significantly higher-amplitude 177.8 ns pulse, where the effective CR model completely breaks down. These findings reveal a considerable deviation in environment characteristics beyond the perturbative effective dynamical model, and yet our PWC-equipped model-free RL agent remains unbiased and adjusts accordingly to obtain high-fidelity control solutions.
With the training setting used above, i.e., learning pulses for only 2 drives (u_01, d_1) via the standard DDPG algorithm, our RL agent only manages to find one 99.9%-fidelity 177.8 ns solution out of a dozen runs. This is because we need to increase the allowed relative change in pulse amplitudes at each step, effectively broadening the action space, in order to compensate for such a short gate duration. A larger action space tends to result in Q-value overestimation, which ultimately destabilizes the training process. By employing a few additional modifications to the training setting, such as expanding the agent's access to 3 drives (d_0, u_01, d_1), implementing the TD3 tricks, and training the agent for longer, we notice more stable training and an improved probability of success on some occasions, but not universally. Therefore, we postpone the discussion of these additional results to App. E 1.

C. Novelty in the roles of drives
We can further highlight the novelty in control solutions found by our RL agent by analyzing the role played by each drive in implementing a two-qubit operation. We do so by taking each optimized pulse sequence, removing the on-resonance component d_1, and comparing the left-over cross-resonance component u_01 with the original pulse. In the weak drive regime, we expect the cross-resonance drive u_01 to be the sole entanglement generator and the on-resonance drive d_1 to only affect the local rotation of the target qubit. By examining the changes in the linear entropy and the qubit control fidelity when the on-resonance drive is removed, we confirm that, in the direct scheme, d_1 has little to no effect on the entanglement generated and the motion of the control qubit, as illustrated by the overlapping orange curves in Fig. 9. This suggests that optimizing the Gaussian Square pulses in the direct scheme leads to control solutions exhibiting a clear separation of roles: the cross-resonance drive generates almost all entanglement and ensures that the control qubit ends up in the intended state; meanwhile, the on-resonance drive focuses on correcting the target qubit state. Interestingly, our RL agent discovers additional strategies where these roles get mixed to different degrees, represented by how much the blue curves in Fig. 9 deviate from one another. In these cases, the on-resonance drive d_1 actually works in tandem with the cross-resonance drive u_01 to generate entanglement and rotate the control qubit. Such behavior can be attributed to high driving amplitudes, which activate interactions beyond the desired ZX term, leading to the observed novelty in the solutions discovered by our RL agent.

FIG. 9. Effect of removing the target qubit drive d_1 from optimized ZX(π/2) pulses. a) Average linear entropy S_lin as defined in Sec. II D. b) Fidelity of the control qubit averaged over initial states |00⟩ and |10⟩. The target qubit drive practically only affects the target qubit rotation in the direct scheme. Meanwhile, the RL agent discovers solutions where this on-resonance drive works in tandem with the cross-resonance drive to generate entanglement and rotate the control qubit.

D. Robustness of optimized pulses
When implemented on a real device, the performance of optimized control solutions inevitably suffers from a variety of error sources, such as imperfect controls and noisy readouts. We simulate these effects by introducing Gaussian fluctuations on the system parameters ⃗p, allowing us to assess the robustness of the control solutions discussed so far. Specifically, at every step of the size of the inverse sampling rate dt = 2/9 ns, fluctuations are sampled from a zero-mean Gaussian distribution with a standard deviation of up to 3% of the original system parameters ⃗p_0 listed in Table II. For each deviation value, we collect 50 samples and report the results for optimized ZX(π/2) pulses in Fig. 10. RL-designed pulses outperform the direct implementation up to 1% deviation at the same gate duration (248.9 ns) and up to 0.5% deviation at the shorter gate duration (177.8 ns). The shorter RL solution degrades quickly as the deviation increases, possibly due to large jumps in its non-smooth PWC waveform. Since the 248.9 ns RL pulse has a much lower amplitude, the jumps in its PWC waveform are not as detrimental; as a result, it shows no discernible difference in degradation rate compared to its smooth Gaussian Square counterpart in the direct scheme. Overall, the mean fidelities drop to around 99-99.5% at a 3% deviation, which translates to about 66 kHz for the coupling value and several MHz for the remaining parameters. These results suggest a trade-off in the RL solutions between fidelity and gate duration in the presence of stochastic noise, which could stem from the non-smooth features of PWC pulses as well as proximity to the smallest duration necessary to generate sufficient entanglement.

FIG. 10. Robustness of optimized ZX(π/2) pulses to short-timescale stochastic noise. We simulate stochastic noise by adding uncertainty to system parameters at the inverse sampling rate dt = 2/9 ns during evaluation. RL-designed pulses outperform the direct scheme up to 0.5% noise. For larger noise, the same-duration RL solution behaves similarly to its direct counterpart, whereas the shorter-duration RL solution degrades at a faster rate. RL training hyperparameters are given in the "Fixed Environment" section of Table I.
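Drawing the fluctuating parameter sets can be sketched as follows; the parameter values are placeholders, not those of Table II:

```python
import numpy as np

rng = np.random.default_rng(1)
p0 = np.array([5.0e9, -3.3e8, 2.0e6])   # placeholder parameter values (Hz)

def perturbed_params(sigma_pct, n_samples=50):
    """Draw parameter sets with zero-mean Gaussian noise of sigma_pct percent."""
    noise = rng.normal(0.0, sigma_pct / 100.0, size=(n_samples, p0.size))
    return p0 * (1.0 + noise)

samples = perturbed_params(3.0)          # 3% standard deviation, 50 draws
```

Evaluating a fixed pulse on each sampled parameter set then yields the fidelity distributions reported per deviation value.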

E. Adapting to drifting system characteristics
In addition to the Markovian, short-timescale fluctuations, the system characteristics of superconducting transmon qubits are also known to drift over longer times, owing to, e.g., dilution refrigerator temperature or fabrication defects. Therefore, in the following, we explore the idea of generalizing a single RL agent to a range of system parameters so that no additional training is required when re-calibration is needed [47]. We additionally discover that knowledge of the change in system parameters can be utilized to strengthen the generalization capability of the agent. Finally, from this point on, we switch to the objective of learning the CNOT gate to further highlight our approach's applicability to different target gates. Traditionally, a CNOT gate can be achieved by performing a ZX(π/2) gate followed by a target qubit rotation. Since our RL agent learns these gates from scratch, its control solution for the CNOT gate does not have to follow the aforementioned decomposition and can be learned directly, resulting in the pulses reported in Fig. 18.

FIG. 11. Fidelity of fixed-environment control solutions in the presence of system drifts. We sample drifts in system parameters (see legend) of a single type (solid curves) and of all types simultaneously (dashed black curve), and then bin the data points according to the maximum drift. The binned mean fidelities (curves) and their standard deviations (shaded areas) are displayed. (a-b) Fidelities of pulses from the direct implementation and RL optimization are evaluated on new environments. The direct pulse is susceptible mostly only to drift in drive strength, whereas the RL solution is susceptible to drift in all parameters. (c) Fidelity of adaptive pulses found by the RL agent via interaction with each new environment; the solution remains susceptible to drift in detuning and anharmonicity while generalizing well for drift in drive strength. Data represent optimization of a 248.9 ns CNOT pulse, with RL training hyperparameters given in the "Drifting Environment" section of Table I.

Evaluating generalizability
We first discuss our method of evaluating the generalizability of an RL agent on multiple systems whose parameters have drifted from the original values ⃗p_0. We study two main situations: when only a single type of parameter has changed (e.g., only the coupling, or only the drive strengths), or when all parameters have changed simultaneously. We gather the relative drifts in system parameters into a context vector ∆⃗p/⃗p_0. We then sample these changes randomly, evaluate the fidelity of the tailored control solutions, and bin the data points according to the drift with the largest absolute value. Using a bin size of 0.2%, we collect samples such that the number of data points increases from ∼15 points in the central bins to ∼60 points in the ±7% bins. Finally, we study the binned mean and standard deviation of the fidelity as a function of the maximum drift.
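The binning procedure can be sketched as follows; the function name and toy data are illustrative:

```python
import numpy as np

def bin_by_max_drift(drifts, fidelities, bin_size=0.002):
    """Group fidelity samples by the signed drift of largest magnitude."""
    drifts = np.asarray(drifts, dtype=float)
    fidelities = np.asarray(fidelities, dtype=float)
    # signed value of the entry with the largest absolute drift, per sample
    key = drifts[np.arange(len(drifts)), np.argmax(np.abs(drifts), axis=1)]
    bins = np.round(key / bin_size).astype(int)
    stats = {}
    for b in np.unique(bins):
        sel = fidelities[bins == b]
        stats[float(b * bin_size)] = (sel.mean(), sel.std())
    return stats

drifts = np.array([[0.001, -0.003], [0.0011, 0.0029], [-0.0005, 0.0001]])
fids = [0.999, 0.9985, 0.9992]
stats = bin_by_max_drift(drifts, fids)
```

Each dictionary key is the bin center (in relative drift), and each value holds the binned mean fidelity and its standard deviation, matching the curves and shaded areas described in the text.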
We can now examine the generalizability of the control solutions discussed so far, namely the direct and RL pulses optimized for the original system parameters, as shown in Fig. 11. We focus on a 6% range of maximum drift on each side of the original values. In our simulation, since the coupling value is much smaller, a drift of the same percentage in this parameter has significantly less effect than in the others (green curves). While generalizing well for most parameters due to its smoothness, the direct pulse in Fig. 11a is highly susceptible to drift in drive strength, as the rotation angle is directly affected. The RL-designed pulse in Fig. 11b, on the other hand, degrades quickly for drifts in all parameters, as expected for a non-smooth PWC waveform.
Adaptive control solutions from the RL agent for different system parameters behave more interestingly, as can be seen in Fig. 11c. We find a large susceptibility to changes in detuning and anharmonicity, likely because they directly change the spacing between transmon energy levels and fundamentally alter the system's internal physics, e.g., the resonant frequencies. Meanwhile, the drive strengths are connected to external controls over which the RL agent has direct influence, resulting in much more robust behavior. Overall, the previous control solutions exhibit poor generalization performance when drift in all system parameters is considered.

Learning to generalize
We now identify several ingredients necessary for improving the generalization performance of our agent. We start with a simpler problem where we task the agent to adapt to drift in detuning while all other parameters remain fixed. By simply allowing the agent to interact with many systems with different detuning values during training, we observe an immediate and significant performance increase (cf. dotted blue curve in Fig. 12a). We achieve this by sampling the detuning from a uniform distribution over a 5% range around its original value at the beginning of each episode. As a result, our agent can observe the changing effects of its actions on drifted systems and thus learn to adapt accordingly. To further help the agent discern different systems more effectively, we can provide it with specific knowledge about the system with which it is currently interacting, namely the size of the detuning drift relative to its original value. We refer to this piece of information as context; this additional input to the agent remains constant within each episode, reflecting the assumption that the system characteristics remain constant throughout the entire gate duration. While having little effect on generalization performance here (dashed blue curve), the training time needed for a context-aware agent is cut in half.
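A minimal sketch of the episode-level domain randomization and constant context input described above; the class and attribute names are hypothetical stand-ins, not the paper's implementation:

```python
import numpy as np

class DriftingTransmonEnv:
    """Toy wrapper: resample the detuning drift at the start of each episode
    and expose the relative drift as a constant 'context' observation."""

    def __init__(self, base_detuning, drift_range=0.05, seed=0):
        self.base = base_detuning
        self.range = drift_range
        self.rng = np.random.default_rng(seed)
        self.context = 0.0
        self.detuning = base_detuning

    def reset(self):
        # Relative drift sampled uniformly in +/- drift_range per episode.
        self.context = self.rng.uniform(-self.range, self.range)
        self.detuning = self.base * (1.0 + self.context)
        return self._observation()

    def _observation(self):
        # The real state features would include the evolved quantum state
        # and previous pulse amplitudes; here they are placeholders. The
        # context is appended and stays constant for the whole episode.
        state_features = np.zeros(4)
        return np.concatenate([state_features, [self.context]])
```

Sampling all parameters (e.g., from a zero-mean Gaussian, as done later in the text) fits the same pattern, with a context vector replacing the scalar.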
We apply our findings to the full problem with drift in all parameters and plot the result in Fig. 12b. Training on a drifting environment remains necessary but is no longer sufficient to achieve good generalization results (dotted black curve). In fact, the best result we report when training without context is obtained after only a few training iterations; thereafter, the performance drops precipitously (cf. black curve in Fig. 14b). We believe that feedback from the quantum state alone is no longer enough for our agent to distinguish different environments; for example, different environments can reach the same state through different sets of actions. Therefore, without additional information, the agent encounters considerable confusion during training.
Providing our agent with context information about the drifts in all parameters greatly alleviates the problem (dashed black curve). Here, instead of sampling from a uniform distribution as in the previous case, we sample drifts for all parameters from a zero-mean Gaussian distribution with a 2% standard deviation, in order to highlight the effect on the generalization results. Indeed, instead of collapsing at 5% drift (blue dashed curve), the fidelity in this case decreases gradually (black dashed curve), since under the Gaussian distribution the agent actually gets to interact with system drifts beyond 5% during training. Overall, extending the training environment to include system drifts and providing the agent with context information about those drifts significantly stabilizes the generalization task when all system parameters are involved. More importantly, the resultant agent can immediately propose pulses with 99.9% fidelity at up to 4% drift without any further training. These simulations support the practicality of RL in the presence of reasonable system drift on near-term devices.
When dealing with more substantial drift, we find that fine-tuning is necessary to achieve 99.9% fidelity, although the number of training episodes required can be notably less than when starting from scratch. To investigate this phenomenon, we first reiterate that the RL agent used in Fig. 12b has been trained to generalize to system parameters sampled from a Gaussian distribution with 2% standard deviation around the original values ⃗p0, as in Fig. 13a (gray distribution). We then select a set of drifted system parameters, denoted as ⃗p_drifted [48], with a maximum drift of −5.7%, for which the generalized agent suggests a control solution achieving 95% fidelity. Subsequently, we specialize our agent by training it on a fixed environment with these specific system parameters (blue delta distribution).
Comparing this approach to training a separate agent entirely from scratch within the same environment, we observe a 1.3× reduction in the number of training episodes required to reach 99.9% fidelity, as illustrated in Fig. 13b. Intriguingly, when we instead specialize our agent to a drifting environment characterized by a Gaussian distribution with 0.2% standard deviation and mean at ⃗p_drifted (cf. purple distribution in Fig. 13a), the episode reduction jumps to 8.1×. These preliminary findings hint at the potential for substantial transfer learning in certain cases. However, further investigations are needed to understand the underlying causes for such a wide range of effectiveness, which we plan to pursue in future studies.

FIG. 13. Reduction in training episodes when fine-tuning a generalized agent. a) Training distribution of system parameters for three cases: generalizing to a Gaussian distribution P0 with 2% standard deviation (gray) as a starting point (same agent as the dashed black curve in Fig. 12b), then either fine-tuning to a narrower distribution P1 with 0.2% standard deviation (purple), or fine-tuning to a fixed environment, i.e., a delta distribution P2 (blue). P0 is centered at the original system parameters ⃗p0, whereas P1 and P2 are centered at ⃗p_drifted. b) Mean training fidelity as a function of training episodes, comparing fine-tuning (to P1 and P2) with training from scratch (under P2). The shaded region is bounded by the minimum and maximum training fidelity. For each case, we perform 3 runs with different seeds and report the best learning curve. RL training hyperparameters are given in the "Drifting Environment" section of Table I. Fine-tuning to P1 and P2 offers an 8.1× and 1.3× reduction, respectively, in episodes required to reach 99.9% mean fidelity, implying great potential for transfer learning.

VI. CONCLUSION AND OUTLOOK
In this study, we have showcased the advantages of harnessing reinforcement learning for the design of cross-resonance gates, fully independent of known theoretical protocols and pre-existing error suppression techniques.
Our unbiased approach employs an off-policy agent to customize continuous control parameters, shaping complex-valued pulses concurrently for both the cross-resonance and the target on-resonance drives. Compared to established optimal control methods, RL has the advantage of (i) being suited for closed-loop optimization due to its model-free nature; (ii) being capable of nonlocal exploration; and (iii) generating a representation of gained knowledge as a valuable by-product.
Using the RL methodology, we demonstrated the discovery of novel control solutions that fundamentally differ from conventional error suppression techniques for two-qubit gates, such as the direct and echoed schemes, while surpassing them in both fidelity and execution time. At the typical gate duration of 248.9 ns (for transmon devices), where the direct and echoed schemes achieve 99.937% and 99.501% fidelity respectively, RL-designed solutions roughly halve the error of the better scheme, achieving F_RL ≳ 99.966% for both ZX(π/2) and CNOT gates. Moreover, our agent identified a potential maximum reduction of 30% in gate duration while maintaining fidelity above 99.9%. This can be attributed to the flexibility of the piece-wise constant ansatz, capable of managing leakage out of the computational subspace as well as unwanted coherent processes that inevitably arise at large drive amplitudes.
Furthermore, we illustrated the possibility of augmenting our approach to enable our agent to flexibly adapt its design capability to accommodate drifts in the underlying hardware. We found that exposing the agent to an environment with drifting system parameters during training, while providing it with context information about these drifts, allows our agent to learn the appropriate control solutions and generalize well across a range of drifted system parameters. Concretely, our context-aware agent can readily propose control solutions with ∼99.9% fidelity when all system parameters, including detuning, anharmonicity, coupling strength, and drive strength, are allowed to drift within a 4% range around their original values. In instances of more substantial drifts, our generalized agent serves as a valuable starting point for fine-tuning, resulting in a remarkable 1.3−8.1× acceleration in optimization iterations when compared to starting from scratch.
Based on these findings, we can assert that the RL approach alleviates the necessity for a precise model, presenting a versatile framework applicable to designing various cross-resonance-based gates. When combined with a piece-wise constant protocol space, our RL agent demonstrates its capacity to devise innovative pulse shapes that surpass the capabilities of conventional ansätze in terms of both fidelity and gate execution duration. The quest for shorter, high-fidelity pulses is particularly significant, given that various calibration methods are nearing the coherence limit imposed by state-of-the-art gate durations and qubit relaxation times. Furthermore, our context-aware RL approach effectively addresses hardware drifts, indicating the possibility of reducing or even eliminating additional training, and thus expensive calibration experiments, as long as system characteristics remain within a reasonable range.
When applied to experiments conducted on real-world hardware, our off-policy method carries the potential for significant data efficiency gains, as the agent can be trained on data collected by any policy. Consequently, while the initial training phase may incur high costs, subsequent retraining can be expedited thanks to the collected dataset. Additionally, actual drives delivered to the qubits are generally smoothed out from the raw jagged PWC input pulses, which should enhance the robustness of the optimized solutions to control fluctuations.
As the quantum computing community progresses toward larger platforms, the capacity of a single agent to extend its design capabilities across diverse system characteristics becomes increasingly pivotal for scalability [47]. In fact, as the number of qubits grows, it becomes inevitable that certain qubits will exhibit overlapping system characteristics [49]. In such a scenario, our context-aware agent, trained to generalize within a specific region of system parameters, can readily be applied to a group of qubits sharing similar characteristics. Moreover, these experiments can be conducted simultaneously, as qubits with similar parameters are likely to be positioned at a considerable distance from each other in the first place, further enhancing the efficiency of our RL agent.
In the immediate future, we are eager to integrate our approach into established gate optimization procedures for superconducting devices, as well as to extend its utility to various quantum computing platforms. At the same time, we also aim to broaden the applicability of our RL agent to handle more intricate operations, such as the SWAP gate, multi-qubit gates, or gates on qudits. With recent advancements enabling better control of the |1⟩ ↔ |2⟩ transition [50,51] and the potential of improving the quantum speed limit by expanding beyond the qubit subspace [52], the synthesis of even faster qubit gates as well as qutrit gates emerges as an intriguing and imminent application for our RL protocol. On the algorithmic front, we emphasize the significance of enhancing generic RL algorithms through generalization and transfer learning techniques to bolster the method's practicability, especially for large-scale platforms. With the field of reinforcement learning, and machine learning in general, growing at an unprecedented rate, we hope to continue leveraging these powerful advancements toward the development of practical quantum computers.
yields a near-optimal solution θ* = (θ0*, θ1*) that is sufficiently accurate for the fidelity to reach the threshold we set: indeed, we find that the error in computing the maximum average fidelity using this protocol, compared to numerically optimizing both angles simultaneously, remains below 10^−5, which is negligible for the fidelity levels discussed in our work.
In our simulation, where we work in the rotating frame of the second transmon, more Z error accumulates on the first transmon. We observe that optimizing for the larger angle yields a more accurate result, whence the order in Eq. A9.

Evolving method
We provide details on our simulator of transmon quantum dynamics, which is unitary in the absence of decoherence processes. Under piece-wise-constant controls, the unitary map naturally simplifies into a product of time-local propagators,

U(T, 0) = U(N∆t, (N−1)∆t) · · · U(2∆t, ∆t) U(∆t, 0),

where ∆t is the discretized time step and N is the number of segments in the PWC pulse. Each propagator, corresponding to one segment, can be computed by solving the time-dependent Schrödinger equation (TDSE) from t to t + ∆t. For example, in the two-transmon setting, we set H(t) = H2(t) given in Eq. 10. In general, the propagator is given by a time-ordered integral,

U(t + ∆t, t) = T exp(−i ∫_t^{t+∆t} H(t′) dt′),

which we numerically obtain using QuTiP's TDSE solver [55]. This is necessary because of the detuning phase factors e^{iδt}, even though the control fields u and d are constant within each segment. For the majority of this work, however, we focus on shaping only two control fields, u01 and d1, and the phase factors drop out in the frame rotating at the target qubit frequency. As a result, the piece-wise Hamiltonian is constant and can be directly exponentiated to obtain the unitary that solves the time-independent Schrödinger equation (TISE), providing a significant computational speedup. We verified that the TISE solution converges to the TDSE solution as the step size approaches zero.
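In the frame where the segment Hamiltonians are constant, the TISE propagation reduces to a product of matrix exponentials. A minimal sketch (assuming real amplitudes multiplying Hermitian control operators; the actual simulator uses QuTiP, and the function name is ours):

```python
import numpy as np
from scipy.linalg import expm

def pwc_unitary(h_static, h_controls, amplitudes, dt):
    """Propagate a piece-wise-constant pulse by direct exponentiation (TISE).

    h_static    -- time-independent part of the Hamiltonian (d x d array)
    h_controls  -- list of Hermitian control operators
    amplitudes  -- (N, n_controls) real segment amplitudes
    dt          -- duration of each of the N segments

    Valid in a frame where the detuning phase factors drop out, so that each
    segment Hamiltonian is constant; otherwise a TDSE solver is needed.
    """
    u = np.eye(h_static.shape[0], dtype=complex)
    for seg in amplitudes:
        h = h_static + sum(a * hc for a, hc in zip(seg, h_controls))
        u = expm(-1j * h * dt) @ u  # time-ordered: later segments on the left
    return u
```

Because each segment is a single `expm` call, this scales with the number of segments N rather than the much finer time grid an ODE solver would require.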

Algorithm 1 Deep Deterministic Policy Gradient
Require: Initial Q-network and policy parameters ϕ and θ
Require: Initial target-network parameters ϕ′ and θ′
1: for step = 1, 2, . . ., M do
2:  Reset environment to state s0 if end of episode
3:  Sample an exploration noise N
4:  Select action ai = µθ(si) + N
5:  Execute ai; observe reward ri+1 and next state si+1
6:  Store transition (si, ai, si+1, ri+1) in replay buffer B
7:  if time to update then
8:   Sample a batch of transitions from B
9:   Compute targets yi = ri+1 + γQϕ′(si+1, µθ′(si+1))
10:  Update Q-network by minimizing the loss Σi (Qϕ(si, ai) − yi)²
11:  Update policy by maximizing the Q-values Σi Qϕ(si, µθ(si))
12:  Update target networks via Polyak averaging (Eq. A14)
13:  end if
14: end for

Appendix B: DDPG Algorithm

In this work, we employ Deep Deterministic Policy Gradient (DDPG), an off-policy Q-learning algorithm suitable for continuous action spaces. We summarize the training procedure in Algorithm 1.
The two neural networks for estimating the optimal Q-value and the agent's deterministic policy are randomly initialized at the beginning of training. For a continuous action space, exploration is implemented by adding noise N directly to the policy network's output: ai = µθ(si) + N. The exploration noise is scaled down over time using the Ornstein-Uhlenbeck process as implemented in the DDPG paper [43].
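The exploration noise process can be sketched as follows; the parameter values are illustrative defaults, not necessarily those used in this work:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise, as used in the original
    DDPG paper: a mean-reverting random walk producing temporally
    correlated noise that decays toward zero."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.dim, self.theta, self.sigma, self.dt = dim, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.x = np.zeros(self.dim)

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.dim))
        self.x = self.x + dx
        return self.x
```

The correlated structure encourages smoother excursions in the continuous action space than independent Gaussian noise would.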
Transitions (si, ai, si+1, ri+1) are collected by following the agent's noisy policy and stored in a large replay buffer B. For the first 10,000 steps, the buffer is filled without learning. After that, batches of transitions are used, along with two independent Adam optimizers, to update both networks. To maintain quasi-stable targets throughout training, we soft-update both target networks (ϕ′, θ′) via Polyak averaging; see Eq. A14.
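The Polyak soft update takes one line per parameter tensor; a sketch with plain arrays standing in for network weights (the value of τ here is an illustrative default):

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.005):
    """Soft-update target-network parameters toward the online network:
    phi' <- tau * phi + (1 - tau) * phi'. A small tau keeps the Q-learning
    targets quasi-stable between updates."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

With τ ≪ 1 the target networks trail the online networks slowly, which is the mechanism that stabilizes the bootstrapped targets yi.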
We employ the DDPG (and TD3) implementation from RLlib, an open-source industry-grade library for RL [54]. We conducted a routine exploration of hyperparameters to identify an effective setting, which we maintained consistently throughout the study. Due to the complex interplay between hyperparameters in high-dimensional analysis, altering one may impact others, making it challenging to provide a comprehensive account. Our focus is on identifying an effective set of hyperparameters and, after that, minimizing additional adjustments to maintain the stability of our approach. The detailed hyperparameters used in this work are summarized in Table I. Any other hyperparameters not mentioned are unchanged from the default settings of RLlib version 2.0.0.

Appendix C: Training procedure
We report learning curves for the RL results discussed in the main text, plotting the mean fidelity of pulses encountered as a function of training episodes. Fig. 14a shows the learning curves for RL training on a fixed environment for different gate durations. Our DDPG agent consistently finds ≥ 99.9% fidelity control solutions after about 150,000 episodes for gate durations ≥ 248.9 ns, which corresponds to about 18 hours of training. Below this duration, training becomes more challenging as we increase the action windows wu and wd to compensate for the shorter physical time. An increase in action space often leads to exploding Q-values, resulting in a lower success rate over multiple runs. It should be pointed out that implementing the TD3 tricks [45] stabilizes training but, in general, slightly degrades the achievable fidelity compared to DDPG.
Fig. 14b-c show learning curves for RL training on an environment with changing system parameters. Unlike the two stable blue curves, the black curve is only stable when context is included, suggesting the importance of context information for adapting RL to a more realistic situation where all system parameters can drift away from their original values. The generalized agent typically converges around 500,000 episodes, which corresponds to about 4 days of training.
Each training instance initiates 4 workers for sampling interactions with the simulated environment and 1 worker for agent training, utilizing 5 cores simultaneously. Without any significant difference in runtime, training is done either on a typical laptop (M1 3.2 GHz or Intel i7 2.7 GHz) or on a node within the JUWELS cluster (Intel Xeon Platinum 8168 2.7 GHz).

FIG. 15. Full evolution of population leakage throughout the gate duration for RL pulses with different gate durations. We display the high-resolution evolution of the leakage value as well as its moving average. Crosses mark the maximum leakage data points reported in Fig. 7. Larger population leakage is observed for shorter gate durations, which can be attributed to the correspondingly higher drive amplitudes.

Rotation angles
Any two-qubit unitary map U_qubit(t, 0) can be expressed in terms of a generating averaged Hamiltonian as follows:

U_qubit(t, 0) = exp(−iH^qubit_avg t) = exp(−(i/2) Σij θij(t) Pi ⊗ Pj), (D1)

where we have expanded H^qubit_avg in the Pauli basis given by Pi ∈ {I, X, Y, Z}, and the rotation angle θij(t) in the ij direction depends on the duration t and the Pi ⊗ Pj interaction strength. For example, the cross-resonance gate can be written as ZX(π/2) = exp(−iπZX/4).
For three-level transmons, we first compute the averaged Hamiltonian by taking the logarithm of the unitary U(t, 0) and then project it onto the qubit subspace. This allows us to quantify the strength of the different interactions in the unitary at time t via the rotation angle

θij(t) = Tr[ i Πqubit ln U(t, 0) Πqubit (Pi ⊗ Pj) ] / 2, (D2)

where Πqubit is the projector onto the qubit subspace.
Computing the logarithm of a matrix is non-trivial due to the existence of branch cuts, which lead to different θij from the same U(t, 0). To see this, we let V be the unitary that diagonalizes Havg and write

U(t, 0) = exp(−iHavg t)
        = V diag(e^{−iE1 t}, e^{−iE2 t}, . . .) V†
        = V diag(e^{−iE1 t − 2in1π}, e^{−iE2 t − 2in2π}, . . .) V†
⇒ i ln U(t, 0) = V diag(E1 t + 2n1π, E2 t + 2n2π, . . .) V†, (D3)

where we have taken into account the periodicity of the complex exponential via a list of integers {ni} corresponding to the eigenvalues {Ei}. Note that a choice of {ni} specifies a particular branch cut, where the principal branch cut corresponds to {ni = 0, ∀i}. Since we are most interested in the ZX interaction, we would like to pick a branch cut where θZX behaves nicely, without large jumps. With that goal in mind, we first consider the principal branch for a rough idea of how θZX evolves. After that, we search through {ni = ±1} at every time step, via brute force, to find the branch cuts that result in a smooth evolution of θZX, as seen in Fig. 16. These branch cuts can then be applied to obtain the evolution of the rotation angles for the other interactions. Even though our approach is not perfect, as can be seen from the outliers in Fig. 8 and Fig. 17, the resultant data points are sufficiently accurate to reflect the main features of each evolution.

within the same run time. Despite slower training, TD3 on 3 drives exhibits an improved probability of successful runs. These additional findings suggest potential benefits when simultaneous control of all three drives is accessible.
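The rotation-angle extraction of Eq. D2 can be sketched on the principal branch as follows; the brute-force branch-cut search is omitted, and the variable names are ours:

```python
import numpy as np
from scipy.linalg import expm, logm

# Pauli matrices and the qutrit-to-qubit map for two three-level transmons.
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Y = np.array([[0.0, -1.0j], [1.0j, 0.0]])
Z = np.diag([1.0, -1.0])
PAULIS = {"I": I2, "X": X, "Y": Y, "Z": Z}
P3 = np.zeros((2, 3))
P3[:, :2] = np.eye(2)            # single-transmon qubit <- qutrit map
PI_QUBIT = np.kron(P3, P3)       # 4 x 9 isometry onto the qubit subspace

def rotation_angle(u, label):
    """theta_ij(t) = Tr[i Pi ln U(t,0) Pi (P_i (x) P_j)] / 2, evaluated
    on the principal branch of the matrix logarithm (Eq. D2)."""
    h_eff = 1j * PI_QUBIT @ logm(u) @ PI_QUBIT.conj().T
    pauli = np.kron(PAULIS[label[0]], PAULIS[label[1]])
    return float(np.real(np.trace(h_eff @ pauli)) / 2)

# Sanity check: a pure ZX(pi/2) rotation embedded in the two-qutrit space.
h_zx = PI_QUBIT.conj().T @ ((np.pi / 4) * np.kron(Z, X)) @ PI_QUBIT
assert np.isclose(rotation_angle(expm(-1j * h_zx), "ZX"), np.pi / 2)
```

Shifted branches would be implemented by adding 2π{ni} to the eigenphases before reassembling the logarithm, as in Eq. D3.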

RL optimization with worst-case fidelity reward
Here we summarize our investigation of using worst-case fidelity as an alternative figure of merit. Let us first discuss the standard approach of estimating worst-case fidelity over an ensemble of initial states restricted to the qubit subspace. The restriction is valid as we focus on implementing quantum logic operations between two-level systems. Under this assumption, an arbitrary pure initial state can be written in terms of the computational basis as |ψ0⟩ = Σi ci |i⟩, where i ∈ {0, 1} for one qubit and i ∈ {0, 1, 2, 3} for two qubits. The worst-case fidelity of a unitary map U with respect to U_target is defined in Eq. E1, where U_qubit is the unitary map projected onto the qubit subspace. The complex-valued coefficients {ci} can be recast into 3 (7) real values for one (two) qubit(s), after subtracting a global phase degree of freedom. Numerical optimization is then carried out via the Sequential Least Squares Programming (SLSQP) method, which we find to be the fastest and most stable of all methods available in SciPy's library. It should be emphasized that the worst-case fidelity can be estimated by simply evolving a few states initially in the computational basis, suggesting a straightforward implementation on near-term devices.
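A sketch of the SLSQP-based estimate, parameterizing the initial state by 3 real values for one qubit (7 for two) with random restarts; the function name is ours, and this is not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def worst_case_fidelity(u_qubit, u_target, n_qubits=1, restarts=8, seed=0):
    """Minimize |<psi0| U_qubit U_target^dagger |psi0>|^2 over pure qubit
    states. The state is parameterized by 2*dim - 1 real values (c0 taken
    real to remove the global phase; normalization imposed explicitly)."""
    dim = 2 ** n_qubits
    m = u_qubit @ u_target.conj().T
    rng = np.random.default_rng(seed)

    def fidelity(x):
        c = np.concatenate([[x[0]], x[1:dim] + 1j * x[dim:]])
        c = c / np.linalg.norm(c)
        return np.abs(c.conj() @ m @ c) ** 2

    best = 1.0
    for _ in range(restarts):  # SLSQP is a local method; restart randomly
        res = minimize(fidelity, rng.standard_normal(2 * dim - 1),
                       method="SLSQP")
        best = min(best, res.fun)
    return best
```

For example, for a Z rotation by π/2 evaluated against the identity, the minimum overlap is cos²(π/4) = 1/2, attained on the equator of the Bloch sphere.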
In fact, the estimation of the worst-case fidelity for a single qubit can be further improved by adopting a density matrix perspective. Working with a three-level system, a general qutrit density matrix can be written as

ρ = I/3 + (1/√3) r · λ,

where r is an 8-dimensional Bloch vector for the qutrit state and λ is a vector of Gell-Mann matrices [56]. The normalization condition for a pure state implies |r| = 1.
Restricting the initial state to the qubit subspace leads to ri = 0 for i = 4, . . ., 7 and r8 = 1/2, resulting in

ρ(0) = (1/2) (Πqubit + Σ_{i=1}^{3} ni λi),

where we have rescaled the Bloch vector to the 3-dimensional unit sphere via ri = √3 ni/2 so that |n| = 1. The relevant Gell-Mann matrices λi for i = 1, 2, 3 are the Pauli matrices {σi} embedded in the qubit subspace and padded with zeros. The fidelity of ρ(0) evolved under a unitary U = U(t, 0) with respect to the target state can then be expanded as a quadratic function of n,
where we have symmetrized the quadratic term and used n^T n = |n|² = 1. Evidently, minimizing the fidelity over all possible initial qubit states is equivalent to minimizing a quadratic function over a sphere. This spherically constrained quadratic programming (SCQP) problem can be efficiently solved using the algorithm outlined in Ref. [57]. Indeed, the algorithm requires no initial guess, converges to a single solution within machine precision over multiple runs, and enjoys a ∼10× speedup compared to the standard SLSQP method.
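The SCQP step can be sketched via the eigendecomposition of A and a one-dimensional root find for the Lagrange multiplier. This is the standard trust-region-subproblem construction shown only for the generic "easy case", not a reproduction of the exact algorithm of Ref. [57]:

```python
import numpy as np
from scipy.optimize import brentq

def min_quadratic_on_sphere(a, b):
    """Globally minimize n^T A n + b^T n over the unit sphere |n| = 1.
    Assumes the generic 'easy' case (b has a component along the
    eigenvector of the smallest eigenvalue of A); no initial guess needed."""
    alpha, q = np.linalg.eigh(a)          # A = Q diag(alpha) Q^T
    beta = q.T @ b
    # Stationarity: n(lam) = -0.5 (A + lam I)^{-1} b with lam > -alpha_min;
    # the multiplier is fixed by the constraint |n(lam)| = 1.
    def constraint(lam):
        return np.sum((0.5 * beta / (alpha + lam)) ** 2) - 1.0
    lo = -alpha[0] + 1e-10                # just above the pole at -alpha_min
    delta = 1.0
    while constraint(-alpha[0] + delta) > 0.0:
        delta *= 2.0                      # expand until the root is bracketed
    lam = brentq(constraint, lo, -alpha[0] + delta)
    n = q @ (-0.5 * beta / (alpha + lam))
    return n, float(n @ a @ n + b @ n)
```

The constraint function is strictly decreasing on the bracketed interval, so the root (and hence the minimizer) is unique, which is why no initial guess is required.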
Finally, we note that a similar analysis for two qubits results in a quadratic programming problem for a 15-dimensional Bloch vector with highly non-trivial constraints beyond the normalization condition [58], rendering the efficient SCQP algorithm inapplicable. Moreover, optimizing 15 parameters with multiple convoluted constraints turns out to be much harder and less stable than optimizing the 7 parameters of Eq. E1. Therefore, we deem the reparameterization unnecessary for two qubits and adhere to the standard approach using the SLSQP algorithm.
With the outlined methods, we train our RL agent to learn both single- and two-qubit gates using worst-case fidelity as the figure of merit and find similarly high-fidelity control solutions. Due to the additional optimization, training with worst-case fidelity is slightly slower than training with average fidelity. Moreover, the uncertainty in its estimation using the SLSQP solver appears to occasionally destabilize training. This issue is resolved for learning a single-qubit gate when a more robust solver such as SCQP is employed. Despite having no obvious advantage within this work, the worst-case fidelity remains an interesting alternative figure of merit to be studied in future investigations.
FIG. 4. Reinforcement learning for designing high-fidelity quantum gates. The RL framework involves two main entities: the environment, a system of two coupled transmons simulated as anharmonic oscillators truncated at three energy levels, and the RL agent, which uses the DDPG algorithm for learning continuous control drives. We focus on learning 2 control drives (cross-resonance u01 and qubit 1 rotation d1) in the main text, and report additional results for including a third control drive (qubit 0 rotation d0) in App. E 1. a) Step 1: Collecting data. At every step, the current state s of the environment is characterized by the time-evolved quantum state of the transmons {ψj(t)}, the previous control pulse amplitudes Aprev, and the relative changes in system parameters ∆⃗p/⃗p0. Based on that state s, the RL agent proposes an action a to determine control drive amplitudes that evolve the transmon environment forward in time. The environment outputs the next state s′ and a fidelity-based reward r (cf. Eq. 24), and the transition tuple (s, a, r, s′) is stored in an Experience Replay Buffer. An episode is complete when the RL agent fully constructs an N-segment pulse, and data from many episodes are collected for training. Here we consider a sparse reward scheme, meaning a non-zero reward is given only at the end of each episode. In addition, during data collection, some noise is injected into the RL agent's action to encourage exploration of new control solutions (cf. Alg. 1). b) Step 2: Training. Transition data from the Experience Replay Buffer are randomly sampled for batch-training the two networks in the DDPG algorithm: a value network Q, which learns to accurately predict the expected cumulative reward Q(s, a) of taking an action a from a state s, and a policy network µ, which learns to propose an action a = µ(s) that maximizes this Q-value. Outside of this training process, "RL agent" typically refers to the policy network µ, because it generates all of the agent's actions. c) Step 3: Testing. Once trained, the RL agent can deterministically construct pulses with fidelity ≳ 99.9%, not only for a fixed environment, but also for environments whose parameters have drifted.

FIG. 5. Optimization for the single-qubit gate X(π/2) in the two-transmon setting. (a) IBM 35.6 ns DRAG pulse. (b) 10 ns RL-optimized pulse with training hyperparameters given in the "Fixed Environment" section of Table I. (c) Corresponding evolution of Bloch coordinates for the controlled qubit. (d) Population leakage to |2⟩ for the RL pulse is up to a few orders of magnitude higher than for DRAG during the evolution. The RL-optimized pulse is 3× faster with a similar average gate fidelity above 99.9%, and makes use of the presence of level |2⟩ at the expense of accessing three times larger amplitudes.

FIG. 6. Optimization for the cross-resonance gate ZX(π/2). We display results with fidelity over 99.9% for the direct and RL approaches at gate durations 248.9 ns and 177.8 ns. (a-b) Optimized pulse envelopes for the cross-resonance drive u01 and target qubit drive d1. (c-d) Corresponding evolution of Bloch coordinates for the target qubit when the control state is |0⟩ or |1⟩. Pulses designed by our RL agent appear considerably different from the direct scheme, in both pulse shape and quantum state dynamics. Furthermore, our RL agent manages to shorten the gate duration to 177.8 ns without compromising 99.9% fidelity. RL training hyperparameters are given in the "Fixed Environment" section of Table I.
FIG. 9. Effect of removing the target qubit drive d1 from optimized ZX(π/2) pulses. a) Average linear entropy Slin as defined in Sec. II D. b) Fidelity of the control qubit averaged over initial states |00⟩ and |10⟩. In the direct scheme, the target qubit drive practically only affects the target qubit rotation. Meanwhile, the RL agent discovers solutions where this on-resonance drive works in tandem with the cross-resonance drive to generate entanglement and rotate the control qubit.
FIG. 11. Fidelity of fixed-environment control solutions in the presence of system drifts. We sample drifts in system parameters (see legend) of a single type (solid curves) and of all types simultaneously (dashed black curve), and then bin the data points according to the maximum drift. The binned mean fidelities (curves) and their standard deviations (shaded areas) are displayed. (a-b) Fidelities of pulses from the direct implementation and RL optimization are evaluated on new environments. The direct pulse is susceptible mostly to drift in drive strength, whereas the RL solution is susceptible to drift in all parameters. (c) Fidelity of adaptive pulses found by the RL agent via interaction with each new environment; the solution remains susceptible to drift in detuning and anharmonicity while generalizing well for drift in drive strength. Data represent optimizing a 248.9 ns CNOT pulse, with RL training hyperparameters given in the "Drifting Environment" section of Table I.
FIG. 12. Improved generalization fidelity when using the augmented RL approach. As reference, we import fidelity curves (solid) from an agent trained on a fixed environment from Fig. 11c. We display improved results for training on an environment with drifting system parameters, when the agent has no knowledge (dotted) or full knowledge (dashed) of the drifting parameters, i.e., context information. (a) Drift in detuning frequency only. Training on environments with drifting system parameters is sufficient to improve the fidelity to ≳ 99.9% within a 5% drift. Having context information provides a slight improvement in performance while cutting the training time in half. (b) Drift in all parameters. Context is essential to stabilize training, and provides the best generalization result. RL training hyperparameters are given in the "Drifting Environment" section of Table I.

Appendix D: Dynamics of optimized pulses

1. Leakage for RL pulses

FIG. 16. Evolution of the ZX rotation angle under different branch cuts. θZX is first computed in the principal branch cut (blue), i.e., without any phase shifts. Then, a phase shift is determined and added at every time step, resulting in the shifted branch cut (orange), which ensures a well-behaved evolution of rotation angles. Without shifting the branch cut, the observed large jumps obscure meaningful interpretation of the accumulated rotation angles.
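The branch-cut shifting described here is closely analogous to standard phase unwrapping. A minimal sketch of the procedure, assuming a 2π period for the rotation angle (the actual period and the paper's exact shifting rule may differ; cf. numpy's built-in np.unwrap):

```python
import numpy as np

def shift_branch_cut(angles, period=2 * np.pi):
    """Remove artificial jumps in a time series of rotation angles by
    adding an integer multiple of `period` at each step, so that the
    evolution stays continuous rather than wrapping at the branch cut."""
    angles = np.asarray(angles, dtype=float)
    out = angles.copy()
    shift = 0.0
    for k in range(1, len(out)):
        step = angles[k] + shift - out[k - 1]
        # choose the multiple of the period that minimizes the jump
        shift -= period * np.round(step / period)
        out[k] = angles[k] + shift
    return out

# A smooth ramp wrapped into (-pi, pi] is recovered exactly
true_angles = np.linspace(0, 3 * np.pi, 50)
wrapped = (true_angles + np.pi) % (2 * np.pi) - np.pi
recovered = shift_branch_cut(wrapped)
print(np.allclose(recovered, true_angles))  # → True
```

Without such shifting, the accumulated angle jumps by a full period whenever the principal-branch value crosses the cut, which is exactly the artifact visible in the blue curve of the figure.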

FIG. 17. Remaining rotation angles of optimized ZX(π/2) pulses. Complementary to Fig. 8 in the main text, we display the remaining rotation angles, categorized into: a) control qubit rotations, b) small entangling interactions expected from the CR Hamiltonian, and c) small entangling interactions not expected from the CR Hamiltonian. The distinct evolution of the rotation angles implies distinct physical processes in all three control solutions.

FIG. 18. RL optimized pulses for 3 drives. When the single-qubit drive d0 on the control transmon is included, in addition to the drives considered in the main text, our DDPG RL agent effectively solves a 120-dimensional optimization problem. The control solutions found by our agent retain fidelity above 99.9%. RL training hyperparameters are given in the "3 drives" section of Table I.

min_{|ψ0⟩} |⟨ψ0| Uqubit U†target |ψ0⟩|² = min_{ci} |Σij c*i cj ⟨i| Uqubit U†target |j⟩|²,  (E1)

Coupled transmons simulated as Duffing oscillators. The simulation is truncated at three energy levels per transmon (the faded fourth level is shown but not considered) and performed in a rotating frame. The first two levels act as qubits (dashed boxes). External control drives (purple) include on-resonance and cross-resonance complex control fields, denoted by d(t) and u(t) in the main text, respectively. The full Hamiltonian in Eq. 10 is completely characterized by the detuning δj and anharmonicity αj of each transmon, the drive strengths {Ωd0, Ωu01, Ωd1, Ωu10} of the 4 external controls, and the direct coupling J.
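The static part of the Hamiltonian described in this caption can be assembled directly from the stated parameters. A minimal numpy sketch for two three-level Duffing oscillators with exchange coupling in the rotating frame, using the common form H = Σj [δj nj + (αj/2) nj(nj - 1)] + J(a0† a1 + a0 a1†); the numerical parameter values below are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

d = 3  # levels kept per transmon (the lowest two form the qubit)
a = np.diag(np.sqrt(np.arange(1, d)), k=1)  # annihilation operator
n = a.conj().T @ a                          # number operator
I = np.eye(d)

def duffing_hamiltonian(delta, alpha, J):
    """Static two-transmon Hamiltonian in the rotating frame:
    H = sum_j [delta_j n_j + (alpha_j/2) n_j (n_j - 1)] + J (a0^dag a1 + a0 a1^dag).
    delta, alpha: length-2 sequences of detunings and anharmonicities; J: coupling."""
    h = lambda j: delta[j] * n + 0.5 * alpha[j] * n @ (n - I)
    H = np.kron(h(0), I) + np.kron(I, h(1))       # local Duffing terms
    H = H + J * (np.kron(a.conj().T, a) + np.kron(a, a.conj().T))  # exchange coupling
    return H

# Illustrative (non-paper) values in angular-frequency units, GHz scale
H = duffing_hamiltonian(delta=[0.0, 2 * np.pi * 0.1],
                        alpha=[-2 * np.pi * 0.33, -2 * np.pi * 0.33],
                        J=2 * np.pi * 0.004)
print(H.shape, np.allclose(H, H.conj().T))  # → (9, 9) True
```

The drive terms d(t) and u(t) would enter as additional time-dependent operators proportional to a + a† on the respective transmons, with the strengths Ωd and Ωu listed in the caption.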

TABLE I. Training hyperparameters for an RL agent designing quantum gates on the simulated transmon environment. Unmentioned hyperparameters necessary for the DDPG algorithm are set to their default values in RLlib's implementation [54], version 2.0.0. *When studying the adaptability of our RL agent to drifting system characteristics, we discover a particular region around +2% drift on all system parameters where the 20-segment ansatz yields no solution with fidelity better than 99.5%. The problem goes away when we increase the number of segments to 28, the result of which is reported in the main text.
FIG. 19. Population leakage throughout the gate duration for RL ZX(π/2) pulses with different gate durations.
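Leakage as plotted here can be computed as the population outside the two-qubit computational subspace. A minimal sketch, assuming the projector-based definition consistent with the three-level truncation used in the simulation (the state layout and function name are illustrative):

```python
import numpy as np

d = 3  # levels per transmon; the computational subspace is levels {0, 1}

def leakage(psi):
    """Population outside the computational subspace, i.e. the probability
    that at least one transmon occupies its second excited level."""
    pops = np.abs(np.asarray(psi).reshape(d, d)) ** 2  # pops[i, j] = |<ij|psi>|^2
    return 1.0 - pops[:2, :2].sum()

# Example: state with 1% population in |20> of the two-transmon system
psi = np.zeros(9, dtype=complex)
psi[0] = np.sqrt(0.99)  # |00>
psi[6] = np.sqrt(0.01)  # |20>  (flat index = 2*d + 0)
print(leakage(psi))  # ≈ 0.01
```

Evaluating this quantity at every time step of the propagated state yields the leakage-versus-time curves shown in the figure.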