A differentiable programming method for quantum control

Optimal control is highly desirable in many current quantum systems, especially to realize tasks in quantum information processing. We introduce a method based on differentiable programming to leverage explicit knowledge of the differential equations governing the dynamics of the system. In particular, a control agent is represented as a neural network that maps the state of the system at a given time to a control pulse. The parameters of this agent are optimized via gradient information obtained by direct differentiation through both the neural network \emph{and} the differential equation of the system. This fully differentiable reinforcement learning approach ultimately yields time-dependent control parameters optimizing a desired figure of merit. We demonstrate the method's viability and robustness to noise in eigenstate preparation tasks for three systems: a~single qubit, a~chain of qubits, and a quantum parametric oscillator.


Introduction
In many applications, effective manipulation of physical systems requires an optimization of the available control parameters, such as control fields. Since classical intuition often fails for quantum mechanical systems, devising optimal control strategies can be very challenging [1]. A wide range of traditional optimization methods obtain optimal control pulse sequences by gradient-based maximization of a task-specific objective [2][3][4][5][6][7]. Additional constraints besides the main objective can be implemented efficiently using automatic differentiation (AD) [8], which also allows these methods to take advantage of GPU acceleration.
In recent years, astonishing advances in reinforcement learning (RL) have led to great opportunities for control optimization. In model-free RL, the training of the agent is based purely on collecting rewards and subsequently modifying the network parameters to maximize the expected reward over all possible trajectories, without any explicit representation of the system [9]. In contrast to standard optimal control algorithms, perturbations of the initial state are naturally captured within this framework. Successful demonstrations of model-free RL in physics include, e.g., the manipulation of a quantum-spin chain [10], the inversion of the quantum Kapitza oscillator [11], and quantum gate-control optimization [12]. The authors of reference [10] also show that the performance is comparable to standard optimal control algorithms. By implementing physical knowledge through a sophisticated reward scheme, a student/teacher network was able to discover quantum error correction strategies from scratch [13].
In contrast, model-based RL approaches learn an explicit representation of the system, i.e. a function that explicitly models state transitions and rewards, simultaneously with a value function or policy. Building this model of the environment from the (potentially sparse) reward signals is, however, computationally expensive. Nevertheless, if feasible, model-based RL has the potential to learn optimal policies from a smaller number of environment interactions and to handle changing objectives efficiently as well [14,15]. For further reading on the different approaches within the RL framework, see, for example, OpenAI's Spinning Up project [16].
Deep-learning methods model the response of a control agent with highly parametrized functions in the form of neural networks (NNs). A useful search direction for the optimization is then indicated by the gradients of a given figure of merit with respect to these parameters. If the environment is implemented in a computer simulation where primitives for all derivative operations are defined [17], AD allows us to backpropagate directly through the time trajectory, effectively switching from model-free to model-based RL [18,19]. Given such a differentiable simulation, a great speed-up of the optimization process has been demonstrated [20,21]. This approach is part of a paradigm named differentiable programming (DP), in which programs are designed in a fully differentiable way [22,23]: learnable structures, such as NNs, are embedded in otherwise standard procedural code to obtain the resulting program. Among the first examples, it was demonstrated that DP finds optimal solutions much faster than model-free RL for standard classical problems [18,22]. In physics, tensor network algorithms have been programmed in a fully differentiable way [24].
In this paper, we follow this pivotal shift in the usage of RL and show that using the DP method, a control agent can be trained to provide successful control strategies for noisy inputs within different quantum mechanical systems. We start by introducing and illustrating the workflow of our method using the most basic quantum system of a single qubit in section 6.1. We then evaluate the method on two further examples: In section 6.2, we investigate the preparation of the GHZ state in a chain of qubits. In section 6.3, the eigenstate preparation in the case of a quantum parametric oscillator is considered.

Quantum control problem
Suppose a quantum system is described by a Hamiltonian

H(t) = H_0 + Σ_{k=1}^{K} u_k(t) H_k,

where H_0 is a time-independent part (which we will refer to as the drift term) and the H_k represent control fields with respective scalar amplitudes u_k(t). The total number of independent control fields is K. The dynamics of a state |ψ(t)⟩, represented as a vector in a Hilbert space, is governed by the Schrödinger equation i∂_t |ψ(t)⟩ = H(t)|ψ(t)⟩ given some initial state |ψ(t_0)⟩ ≡ |ψ(t = 0)⟩ (we use ℏ = 1 throughout the paper). The goal of quantum control is to optimize u_k(t) over a given time interval according to a certain figure of merit, e.g. to maximize the overlap with some target state. The overlap F(t) = |⟨ψ(t)|ψ_tar⟩|² is generally called fidelity. Many numerical approaches discretize the time interval into N steps within which the control amplitudes are considered constant. The pulse sequences u_k(t_i), i = 1…N, are then optimized by gradient methods. Popular examples are the GRAPE algorithm [4] and the class of algorithms generally attributed to Krotov [3]. Alternative approaches like the CRAB algorithm [25,26] do not rely on evaluating gradients but rather map the problem to finding the extrema of a multivariate function using direct search methods such as Simplex methods [25].
A common feature of the state preparation methods listed above is their sensitivity to the input state: if the input state is altered, the control fields will in general no longer be optimal. It should be noted, though, that proposals to generalize these approaches have already been put forward, see references [5,27,28]. RL formulates control tasks in terms of Markov decision processes, which map an individual state |ψ(t_i)⟩ to associated actions. Noisy inputs can therefore be treated very naturally, provided that the agent was exposed to sufficiently many training examples. Our differentiable programming method, as described in the next section, uses a similar strategy.
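As a concrete reference point, the piecewise-constant discretization described above can be sketched in a few lines of numpy. The single-qubit Hamiltonian, the constant pulse value, and the step count below are illustrative assumptions, not settings from this paper:

```python
import numpy as np

def step(psi, H, dt):
    """Propagate |psi> by exp(-i H dt) for a piecewise-constant Hermitian H."""
    evals, V = np.linalg.eigh(H)
    return V @ (np.exp(-1j * evals * dt) * (V.conj().T @ psi))

def fidelity(psi, target):
    """F = |<target|psi>|^2."""
    return abs(np.vdot(target, psi)) ** 2

# Illustrative single-qubit setup (assumed parameters):
# H(t) = 0.5*w*sigma_z + u(t)*sigma_x with a constant pulse u = 0.8
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
w, dt, N = 1.0, 0.1, 50
psi = np.array([0, 1], dtype=complex)          # initial state |down>_z
target = np.array([1, 0], dtype=complex)       # target state |up>_z
for i in range(N):
    u_i = 0.8                                  # u_k(t_i), here a constant pulse
    psi = step(psi, 0.5 * w * sz + u_i * sx, dt)
F = fidelity(psi, target)
```

Gradient methods such as GRAPE would differentiate F with respect to the N pulse values u_k(t_i) of exactly such a discretized evolution.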

Quantum optimal control with differentiable programming
Let us now describe how a differentiable programming approach allows us to solve the quantum optimal control problem as defined in section 2. The method consists of three main building blocks: the predictive model, the physical system, and the loss function L. The connection between these building blocks is depicted in figure 1. Starting from some initial state |ψ(t_0)⟩ and vanishing control fields, in each time step i the predictive model receives the current quantum state, |ψ(t_i)⟩, as well as the last control field strengths, u_k(t_{i−1}), as input. The predictive model maps this input to the next control field strengths, u_k(t_i). Adopting the terminology of RL, we also refer to the predictive model as the agent and to the control fields as actions. In our experience, if the agent is aware of its last actions, the optimal strategy can be found more easily and often provides smoother control signals. Moreover, the output of the predictive model can easily be reformulated to return the gradients (increments) of the control fields rather than the actions themselves. In that case, the maximal fluctuations of the control fields between time steps can be bounded naturally, which may be required in experimental applications. Solving the ordinary differential equation (ODE) of the physical system, as determined by the Hamiltonian and the Schrödinger equation, leads to the next state |ψ(t_{i+1})⟩. This state, as well as the actions taken, u_k(t_i), enter the computation of the loss function, which maps its inputs to a scalar value. This procedure is iterated for all time steps.

Figure 1. Sketch of our differentiable programming approach for quantum optimal control. In the forward pass, starting from an initial state |ψ(t_0)⟩ (and vanishing control fields), in each time step i the predictive model maps the quantum state |ψ(t_i)⟩ to the associated optimal control amplitudes u(t_i). These amplitudes determine the time evolution of the physical system from |ψ(t_i)⟩ to |ψ(t_{i+1})⟩, as computed by an ODE solver. Both the state |ψ(t_{i+1})⟩ and the amplitudes u(t_i) contribute to the loss function L(t_i). This procedure is iterated for all N steps. In the backward pass, automatic differentiation allows us to compute the gradients of L(t_i) with respect to the parameters of the predictive model by direct differentiation through the ODE solver. These gradients are used to optimize the model's parameters in a series of epochs. To ensure sufficiently smooth control fields and to achieve faster learning progress, we also give the last amplitudes u(t_{i−1}) to the predictive model as an input (dashed arrows).

Figure 2 shows the architecture of the predictive model. We use artificial NNs with linear (fully-connected) layers and rectified linear units (ReLUs) as the activation functions. We divide the full NN into three separate networks: (i) a state-aware network f_s that takes the current quantum state, |ψ(t_i)⟩ ∈ C^D, and computes a map f_s: C^D → R^B, where D is the Hilbert space dimension and B is the number of output features in the final layer of the state-aware network; (ii) an action-aware network f_a that takes the last actions, u_k(t_{i−1}) ∈ R^K, and computes a map f_a: R^K → R^B, where K is the total number of control fields and B is identical to the above; and (iii) a combination-aware network f_c that post-processes the sum of the two maps, f_s(|ψ(t_i)⟩) + f_a(u_k(t_{i−1})), and predicts the next control fields, f_c: R^B → R^K, with B and K as above.
Here, we combined f_s and f_a by a simple addition, but other combinations such as concatenation are also possible. Which combination is most efficient in a given situation will depend on the particular physical system and the number of parameters of the model. All our code is implemented in PyTorch [29], which has no complex tensor support yet. Therefore, we map the Schrödinger equation to real-valued vectors and matrices via the isomorphism

|ψ⟩ ↦ (ψ_Re, ψ_Im)^T,  H ↦ [[H_Re, −H_Im], [H_Im, H_Re]],

where the subscripts Re/Im denote the real and imaginary part, respectively, and use the real vector (ψ_Re, ψ_Im)^T as the input for the state-aware network. For the time evolution of the physical system, as described by the ODE problem (5), we use Heun's method [30] to advance to the next time point i + 1.
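A minimal numpy sketch of this real-valued rewriting together with one Heun step; it is checked against the exact single-qubit solution for H = σ_x, which is an assumed test case rather than one of the paper's systems:

```python
import numpy as np

def real_generator(H):
    """Real matrix A with d/dt (psi_R, psi_I)^T = A (psi_R, psi_I)^T,
    equivalent to i d/dt |psi> = H |psi> under the isomorphism above."""
    HR, HI = H.real, H.imag
    return np.block([[HI, HR], [-HR, HI]])

def heun_step(A, x, dt):
    """One step of Heun's method (explicit trapezoidal rule) for dx/dt = A x."""
    k1 = A @ x
    k2 = A @ (x + dt * k1)
    return x + 0.5 * dt * (k1 + k2)

# Convergence check against the exact solution for H = sigma_x:
H = np.array([[0, 1], [1, 0]], dtype=complex)
psi0 = np.array([1, 0], dtype=complex)
x = np.concatenate([psi0.real, psi0.imag])
A, dt, N = real_generator(H), 1e-3, 1000
for _ in range(N):
    x = heun_step(A, x, dt)
psi = x[:2] + 1j * x[2:]
exact = np.array([np.cos(1.0), -1j * np.sin(1.0)])  # exp(-i*sigma_x*t)|0>, t = 1
err = np.linalg.norm(psi - exact)
```

Heun's method is second order, so the global error here scales as O(dt²); unlike the exact propagator it is not exactly norm-preserving, but the deviation per step is of higher order.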
The subsequent computation of the loss function consists of a weighted sum of individual contributions L_µ that each have a well-defined physical meaning. These terms are explained in detail in section 5. Importantly, since the physical system is implemented entirely in PyTorch and is differentiable, the derivative of the loss function with respect to the network's weights can be computed at any point in time.
We use the Adam optimizer [31] to train the model in a series of epochs, where b simulations are carried out in parallel per epoch, see figure 2. Hyperparameters, such as the coefficients of the individual loss terms as well as the number of output features of the linear layers, are obtained either empirically or using the Bayesian Optimization and Hyperband method (BOHB) [32] which is a combination of random search (bandit-based method) with Bayesian optimization.
The embedding of the physical model into the backpropagation pass (which is not the case in model-free RL) has led to far more effective control strategies and faster training for classical environments [18]. The potential robustness of the method with respect to noise in the input is desirable for practical purposes, where inputs cannot be prepared to arbitrary precision. To test the performance of our scheme, we apply it to different physical systems, which are introduced in the next section.

Qubit chain with nearest-neighbor interaction
We consider a chain of M identical qubits (or spins 1/2) with the Hamiltonian

H(t) = J Σ_{i=1}^{M−1} σ_z^i σ_z^{i+1} + Σ_{i=1}^{M} [u_x^i(t) σ_x^i + u_z^i(t) σ_z^i],    (6)

where σ_x^i and σ_z^i are the Pauli matrices acting on the ith site of the chain. For any single qubit we have two independent control fields with amplitudes u_x^i(t) and u_z^i(t). The dimension of the Hilbert space scales as D = 2^M. The set of vectors representing all possible configurations of the z projection at the individual sites forms a basis, so a general state vector |ψ⟩ can be expanded as

|ψ⟩ = Σ_{n=0}^{D−1} C_n |n⟩_z,    (7)

with unique coefficients C_0, …, C_{D−1}. The drift term, the first sum in equation (6), has the form of a nearest-neighbor interaction. As we set J > 0, the ground state is given by the doubly degenerate manifold spanned by the Néel states |ψ^1_gs⟩ = |↑↓↑…⟩_z and |ψ^2_gs⟩ = |↓↑↓…⟩_z. In section 6.1 we consider a simplified model case with only one qubit and the Hamiltonian

H(t) = (ω/2) σ_z + u_x(t) σ_x,    (8)

where ω is the qubit frequency. Note that the single control field u_x(t) is sufficient to realize any rotation, as σ_y can be obtained as a commutator of σ_x and σ_z.
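The drift term and its doubly degenerate Néel ground manifold can be verified numerically with a few Kronecker products; J = 1 and M = 4 are assumed values for this check:

```python
import numpy as np
from functools import reduce

def op_at(op, i, M):
    """Embed a single-site operator at site i of an M-site chain (Kronecker)."""
    mats = [np.eye(2)] * M
    mats[i] = op
    return reduce(np.kron, mats)

sz = np.array([[1.0, 0.0], [0.0, -1.0]])
J, M = 1.0, 4                                  # assumed parameters for the check
H0 = J * sum(op_at(sz, i, M) @ op_at(sz, i + 1, M) for i in range(M - 1))

evals = np.linalg.eigvalsh(H0)
# Neel state |up down up down> corresponds to basis index 0b0101 = 5
E_neel = H0[5, 5]
```

For J > 0 and an open chain, every bond contributes −J in a Néel configuration, so both Néel states sit at energy −(M−1)J below the rest of the spectrum.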

Quantum parametric oscillator
The other system considered is a driven non-linear optical resonator whose dynamics can be described by the following Hamiltonian in the rotating frame (see, for example, reference [33]):

H(t) = U a†a†aa + G (a†a† + aa) + u(t) (a† + a).    (9)

The first term is a Kerr (non-linear) interaction between the photons, which are annihilated and created by the respective operators a, a†. We can represent a general state |ψ⟩ in the Fock basis, i.e. the basis of photon occupation number states |n⟩, as

|ψ⟩ = Σ_n c_n |n⟩.    (10)

The mean photon number in the state |ψ⟩ is given as n̄_ψ = ⟨ψ|a†a|ψ⟩. In the following, we fix the amplitude of the coherent two-photon drive relative to the strength of the Kerr non-linearity, G = −4U. Our control scheme consists of one tunable parameter, u(t), which controls the amplitude of the single-photon drive. Its action can be intuitively understood from the classical counterpart of a similar system. The classical potential V(x) is obtained from equation (9) by writing the operators as c-numbers, √2 a = x + ip, and setting p = 0. Figure 3(b) reveals a symmetric double-well potential when the control field is switched off. If the control field is switched on, the potential becomes tilted, as shown in figure 3(c).
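A numpy sketch of the truncated Fock-space representation; under the sign convention written above (an assumption of this sketch), the drift part of equation (9) with G = −4U can be factored as U(a†² − 4)(a² − 4) − 16U, so the cat state with α = 2 is an exact eigenstate with eigenvalue −16U, up to truncation error:

```python
import numpy as np

D = 40                                         # Fock-space truncation
n = np.arange(D)
a = np.diag(np.sqrt(np.arange(1, D, dtype=float)), k=1)  # annihilation operator
ad = a.T                                       # creation operator

U = 1.0
G = -4.0 * U                                   # ratio fixed in the text
H0 = U * ad @ ad @ a @ a + G * (ad @ ad + a @ a)

# Coherent state |alpha = 2> via the recursion c_k = c_{k-1} * alpha / sqrt(k)
alpha = 2.0
coh = np.zeros(D)
coh[0] = np.exp(-alpha ** 2 / 2)
for k in range(1, D):
    coh[k] = coh[k - 1] * alpha / np.sqrt(k)
cat = coh + coh * (-1.0) ** n                  # |alpha> + |-alpha>
cat = cat / np.linalg.norm(cat)

# (a^2 - 4)|+-2> = 0 implies H0|cat> = -16 U |cat>
residual = np.linalg.norm(H0 @ cat + 16.0 * U * cat)
nbar = cat @ (ad @ a) @ cat                    # mean photon number of the cat
```

For the even cat the mean photon number is |α|² tanh(|α|²) ≈ 3.997, consistent with the two Gaussians sitting at amplitude ±2 in phase space.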

Loss functions
Multiple objectives can be passed on to the agent via a case-specific loss function L. We assume the generic form [8,26,34]

L = Σ_µ c_µ L_µ,    (11)

i.e. a linear combination of elementary loss functions L_µ encoding specific details of the control task. The agent will reflect these details according to their relative importance, given by the coefficients c_µ. These have to be tuned either empirically (as in reference [8]) or by means of hyperparameter optimization algorithms. The elementary loss functions used in this paper are summarized in table 1. The function L_F is a discounted sum over the infidelity at every time step i, with a real positive discount factor γ ≤ 1. This loss function serves as the main objective for our control problems, as its minimization leads to a maximal overlap with the target state over the entire control interval. Therefore, minimization of L_F also minimizes the time needed to reach the target state.
In the case of eigenstate preparation, once the target state has been reached, the control fields can be switched off. Then, the state remains unchanged under the time evolution according to the drift Hamiltonian. We add the loss function L_FN to ensure that the target state is prepared at the final time step. To focus on another particular point in time, L_FN can be adjusted accordingly.
The remaining loss functions in table 1 represent additional constraints on the final control pulse sequence. Namely, by employing L_amp or L′_amp the agent is forced to prefer smaller amplitudes. Of course, other elementary loss functions can be added to equation (11) according to the specific requirements of the control task.
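The elementary losses can be sketched as plain functions of the fidelity and pulse histories. The discounting direction of L_F and the exact form of the amplitude penalty are assumptions here, since table 1 is not reproduced:

```python
import numpy as np

def loss_F(F, gamma=0.99):
    """Discounted sum of per-step infidelities; weighting late steps most is
    an assumption about the discounting direction in table 1."""
    F = np.asarray(F, dtype=float)
    w = gamma ** np.arange(len(F) - 1, -1, -1)
    return float(np.sum(w * (1.0 - F)))

def loss_FN(F):
    """Infidelity at the final time step only."""
    return 1.0 - F[-1]

def loss_amp(u):
    """Mean squared control amplitude, pushing the agent to smaller pulses."""
    return float(np.mean(np.asarray(u) ** 2))

def total_loss(F, u, c_F=1.0, c_FN=1.0, c_amp=0.1):
    # L = sum_mu c_mu L_mu (equation (11)); the coefficients are illustrative
    return c_F * loss_F(F) + c_FN * loss_FN(F) + c_amp * loss_amp(u)
```

Here F is the list of fidelities F(t_i) over one trajectory and u the corresponding pulse sequence; the coefficients c_µ are exactly the hyperparameters tuned empirically or by BOHB.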

Optimal control of a single qubit
We start with a minimal quantum model to illustrate the workflow of our method. We consider a qubit in a magnetic field that can be manipulated by a control field in the x direction, described by the Hamiltonian (8). The general task is to move an arbitrary initial state |ψ(t_0)⟩ = cos(θ/2)|↑⟩_z + sin(θ/2) e^{iϕ}|↓⟩_z, with θ ∈ [0, π] and ϕ ∈ [0, 2π) uniformly sampled on the Bloch sphere, into the state |ψ_tar⟩ = |↑⟩_z with high fidelity in the shortest possible time. Note that this task differs significantly from the typical quantum optimal control setup, where the initial state has fixed values of θ and ϕ.
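Uniform sampling on the Bloch sphere amounts to drawing cos θ uniformly from [−1, 1] and ϕ uniformly from [0, 2π); reading "uniformly sampled on the Bloch sphere" as area-uniform sampling is our interpretation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_bloch_state(rng):
    """One state cos(theta/2)|up> + sin(theta/2) e^{i phi}|down>, uniform on
    the sphere: cos(theta) ~ U[-1, 1], phi ~ U[0, 2*pi)."""
    theta = np.arccos(rng.uniform(-1.0, 1.0))
    phi = rng.uniform(0.0, 2.0 * np.pi)
    return np.array([np.cos(theta / 2.0),
                     np.sin(theta / 2.0) * np.exp(1j * phi)])

batch = np.array([random_bloch_state(rng) for _ in range(512)])
norms = np.linalg.norm(batch, axis=1)
# The z expectation should average to ~0 over an area-uniform sample:
mean_z = float(np.mean(np.abs(batch[:, 0]) ** 2 - np.abs(batch[:, 1]) ** 2))
```

Drawing θ itself uniformly instead would oversample the poles; the arccos transform removes that bias.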
The predictive model is constructed as in figure 2, and the numbers of weights and biases are kept constant for all qubit experiments, see table A1 in appendix A. We use the Bayesian Optimization and Hyperband (BOHB) method [32] to tune the number of parallel simulations b, the learning rate of the Adam optimizer, and the coefficients c_F, c_FN and c′_amp of the loss function L, equation (11). The objective for BOHB must not depend explicitly on the c_µ, as they are also being optimized. A natural choice for the BOHB objective is L_F from table 1. More details of the hyperparameter optimization can be found in appendix B.
The results of the control task with the optimized hyperparameters are shown in figure 4. The two rows compare different choices for the number of hyperparameters in the loss function (11). The first two columns visualize the loss and the mean final-state infidelity as a function of the epochs. Smaller final infidelities are reached when several loss functions are combined. We applied the trained model to a test set of 512 random configurations. The results for the mean and the standard deviation of the fidelity as a function of time, and of the applied control fields, clearly show the different behaviors for the different choices of hyperparameters in the loss function. In particular, the term L′_amp pushes the actions to zero and hence helps to achieve a well-defined, shorter control sequence. A comparison with previous results [8] shows that we obtain a very similar control pulse when the initial state is chosen as the ground state of the drift term of the Hamiltonian (8). A comparison with an RL approach can be found in appendix C.

GHZ state preparation in a chain of qubits
In this section, we aim at the preparation of a GHZ state [35] in a chain of M qubits,

|GHZ⟩ = (1/√2) (|↑↑…↑⟩_z + |↓↓…↓⟩_z).

These states play an important role, e.g. for multi-particle generalizations of superdense coding [36] or in quantum secret sharing [37]. The initial state |ψ(t_0)⟩ is chosen as one of the ground states of the drift Hamiltonian from equation (6) (states |ψ^1_gs⟩ and |ψ^2_gs⟩, see section 4.1). We assume the input contains noise, which is implemented as flipping any single qubit in the initial state with 10% probability.

Figure 5. The third column shows the occupations |C_n|² in the basis of equation (7). The GHZ state corresponds to equal occupations of the leftmost and the rightmost columns in the histogram, i.e. |C_0|² = |C_{D−1}|² = 0.5. The evolution of the mean fidelity F̄ of the test set during the preparation process for each step i = 1…N is plotted in the fourth column. The blue bands highlight one standard deviation. The last column depicts the applied 2M independent actions averaged over the test set for each step i. All hyperparameters can be found in table A1 in appendix A.
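One way to implement the 10% flip noise (our reading: with probability 0.1 a single, uniformly chosen qubit of the Néel input is flipped) is the following sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def neel_state(M, start_up=True):
    """Neel product state as a 2^M z-basis vector (site 0 most significant,
    bit 0 = up, bit 1 = down)."""
    idx = 0
    for i in range(M):
        down = (i % 2 == 0) != start_up
        idx = (idx << 1) | int(down)
    psi = np.zeros(2 ** M, dtype=complex)
    psi[idx] = 1.0
    return psi

def flip_noise(psi, M, p=0.1, rng=rng):
    """Assumed noise model: with probability p, flip one uniformly chosen
    qubit of the basis state psi."""
    if rng.random() < p:
        site = int(rng.integers(M))
        idx = int(np.argmax(np.abs(psi)))
        out = np.zeros_like(psi)
        out[idx ^ (1 << (M - 1 - site))] = psi[idx]   # toggle that qubit's bit
        return out
    return psi.copy()
```

Sampling training batches through `flip_noise` exposes the agent to all single-flip perturbations of both Néel states.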
According to our goal, we set |ψ_tar⟩ = |GHZ⟩. Note that the target is composed of two eigenstates of the drift Hamiltonian with identical eigenvalues. Therefore, once the GHZ state has been prepared, the control fields can be switched off and the state evolves trivially with the drift Hamiltonian. Hence, the control problem fits well into the category of eigenstate preparation.
The coefficients c µ from equation (11) were obtained using the BOHB algorithm. The same algorithm was also used to optimize other hyperparameters, i.e. the learning rate, the number of parallel trajectories b, and the number of weights in each layer of the NN for any size of the system M. Their values are summarized in table A1 in appendix A.
The training phase of the NN as well as examples of its performance on test data sets are shown in figure 5. Compared to a typical RL method, the loss functions L show rapid initial progress and do not display any strong oscillations during the learning process. The performance we require from our control scheme is to reach fidelities above 99.9% by the end of the control interval at the latest. Obviously, the number of training epochs required to accomplish this grows with M (see the second column of figure 5). However, for all system sizes considered we manage to train the NN to reach the desired performance, as demonstrated in the third and fourth columns of figure 5. Indeed, even for the case of M = 6 qubits, where the initial noise leads to higher fluctuations in the test data, the final average fidelity F̄(t_N) meets our precision requirements while having effectively no dispersion. The last column in figure 5 shows the averaged signals of the control fields. The NN also learned to set all control fields to zero once the GHZ state has been reached.

Eigenstate preparation in a quantum parametric oscillator
As the last example, we aim at the preparation of a specific eigenstate of the drift Hamiltonian from equation (9). The initial state is, again, considered noisy. In our numerical implementation we truncate the Hilbert space of photons to a finite dimension D by considering only the first D Fock states |n⟩. Random noise ξ_n is uniformly sampled from an interval [−ξ, ξ], where ξ is a fixed parameter of the environment. An exponential factor guarantees that the contribution of the highest-lying Fock states is suppressed, i.e. |ψ(t_0)⟩ is dominantly distributed among the low-lying Fock states.
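A hedged sketch of such a noisy input; the base state (the vacuum) and the decay rate of the exponential factor are assumptions of this sketch, since the exact equation is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_initial_state(D, xi=0.1, decay=1.0, rng=rng):
    """Noisy input over the first D Fock states: uniform noise xi_n in
    [-xi, xi], damped by exp(-decay * n) so low-lying Fock states dominate.
    Base state and decay rate are assumptions, not the paper's equation."""
    n = np.arange(D)
    amp = np.zeros(D)
    amp[0] = 1.0                               # assumed base state: vacuum
    amp = amp + rng.uniform(-xi, xi, size=D) * np.exp(-decay * n)
    return amp / np.linalg.norm(amp)

psi0 = noisy_initial_state(32)
```

Normalizing after adding the noise keeps every sampled |ψ(t_0)⟩ a valid state regardless of the drawn ξ_n.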
Let |α⟩ be a coherent state, defined by a|α⟩ = α|α⟩, with annihilation operator a and complex α [38]. These states are often referred to as 'the most classical ones', as they satisfy the minimal Heisenberg uncertainty relation. We consider an example of a Schrödinger cat state, |cat_α⟩ = (1/√2)(|α⟩ + |−α⟩) [39]. The respective Wigner quasiprobability distribution is formed by two Gaussians in phase space located at the coordinates (x = √2 Re α, p = √2 Im α) with a non-classical interference pattern between them. Such states can serve as resources for quantum computing, since they can encode a qubit protected against phase-flip errors [40][41][42]. In our setting of the model, the specific cat state with α = 2 happens to be the first excited eigenstate of the drift Hamiltonian. We further refer to this state as the eigen-cat state. The goal of this section is to transfer the initial state |ψ(t_0)⟩ to the target |ψ_tar⟩ = |cat_2⟩.
The loss function L again has the form prescribed by equation (11). Together with the other hyperparameters, the coefficients c_µ are presented in table A1 in appendix A. In this case, they were tuned empirically to achieve a desirable performance of the agent. The results are collected in figure 6. As shown in panel (a), after around 300 initial epochs the agent starts learning quickly. The learning process is stable, without any significant oscillations. The performance of the trained NN is shown in panels (b)-(e). The mean fidelity of the test set reaches F̄ ≈ 0.93 with a small variance. The state at the end of the control interval for a randomly chosen example from the test set is depicted in panels (c) and (d). The respective applied pulse is shown in panel (e). The pulse oscillates nearly symmetrically around zero, which translates to an oscillatory tilting of the classical potential to the left and right, cf. figure 3(c). This qualitative behavior reflects the symmetry of the control task, where the target state is equally distributed between both wells and the initial one (though slightly perturbed) sits at the top of the barrier. As the fidelity approaches its final value, the amplitude of the control field decreases until it only oscillates noisily around zero.
Let us compare this to a known protocol for preparing an eigen-cat state, based on an adiabatic attenuation of the amplitude G of the parametric drive from equation (9); see reference [43] for recent experimental results. While this protocol is also easily revealed by our agent, we assume here the amplitude G to be constant over time and allow the agent only to control the amplitude of an additional single-photon drive. For this problem, the exact pulse shape is not easily anticipated, but our agent managed to find a non-trivial sequence with final fidelities comparable to reference [43].

Discussion and outlook
For the systems tested, the differentiable programming method showed promising results and reliably found successful protocols for eigenstate preparation. The optimal strategies can be obtained after a few hundred training epochs. The method is intrinsically robust towards uncertainties in the input data and thus provides more versatility than standard quantum control algorithms. Moreover, the convergence is smooth and stable with respect to different initial seeds and small variations of the hyperparameters.
Despite the uncertainty in the initial states, we managed to reach fidelities larger than 99.9% in the preparation of a GHZ state in a chain of M qubits. We exclusively used fully-connected layers, which limits the depth, and thus also the representational power, of the NN. Therefore, our current architecture is not well suited for high-dimensional inputs (dimensions above roughly 1000).
To handle even larger input sizes, modifications of the network architecture that leverage structural assumptions on the input wave function might become necessary. Our current algorithm relies on a classical simulation of a quantum many-body system and is limited by the rapid growth of the Hilbert space dimension with M. More sophisticated representations of the wave function, such as matrix product states [44] or the multiconfigurational time-dependent Hartree method for indistinguishable particles (MCTDH-X) [45,46], have the potential to further improve the scalability of the simulations. Modifications of our method towards classical-quantum hybrid algorithms that incorporate near-term quantum devices are also conceivable [47,48].
In the case of the quantum parametric oscillator, the eigen-cat state preparation was accomplished with high fidelities despite the noisy input data. In our current setting, the agent has full information about the state at each time step. To better represent experimental reality, some form of photon field measurement could be included. The evolution of such an open system can be described by a stochastic Schrödinger equation (SSE). Optimal control with stochastic dynamics has drawn attention recently, see references [34,49]. Our method can be applied to open systems in a straightforward way by replacing the current ODE by a SSE.
Furthermore, the robustness of the performance of the agent against experimental imperfections could be investigated within the current scheme. For instance, uncertainties of the pulse parameters could be treated in the same way as the noise in the input states, i.e. by adding noise to the respective control amplitudes and sampling over its distribution during the training.
In contrast to approaches leveraging data from experimental measurements [50], we require a model of the physical system to set up the control tasks. However, recent advances in parameter estimation from data have the potential to bridge this gap [49,51]. Furthermore, a student/teacher approach, similar to reference [13], is a conceivable solution.

Conclusion
In summary, we have introduced a differentiable programming method for quantum control that leverages gradient information obtained by direct differentiation through the dynamical equations of the system.

Appendix B. Hyperparameter optimization with BOHB

We use BOHB to optimize hyperparameters such as the coefficients c_µ of the loss function (equation (11)). BOHB is based on the well-known Hyperband approach [53] to determine how many hyperparameter configurations are evaluated at a certain budget, cf. the pseudocode in reference [32]. In our case, the budget is identical to the number of epochs used to train the predictive model. In contrast to Hyperband, BOHB replaces the random selection of configurations at the beginning of each iteration by a model-based search. Figure B1(a) visualizes the frequency distribution as a function of the BOHB objective, L_F, for model-based and random configurations. As desired, nearly all model-based picks achieve a very small value of L_F. For both distributions, an increase of the budget leads to smaller values of L_F on average, as expected. The differences between the three budgets can also be seen in the evolution of the final state infidelity, 1 − F̄(t_N), as a function of the wall clock time in figure B1(b). The wall clock time refers to the actual run time of the optimization, i.e. it measures the elapsed time between the start of the optimization and the end of a given task in the BOHB iterations. We stress that this time strongly depends on the hardware used for the simulation. Please note that the infidelity is plotted on a log scale and that many configurations reach a very high fidelity.

Appendix C. Differentiable programming versus REINFORCE
Here we compare the performance of our differentiable programming (DP) method with a vanilla policy gradient algorithm (REINFORCE) [9,16,54,55] on the qubit control problem from section 6.1. In a nutshell, our REINFORCE implementation is based on three substeps. Firstly, trajectories τ are sampled from the current Gaussian policy

π_θ(u_x(t_i) | s_i) = N(µ_θ(s_i), σ²),  s_i = {|ψ(t_i)⟩, u_x(t_{i−1})},

where the variance σ² = 0.04 is fixed and the mean µ = µ_θ({|ψ(t_i)⟩, u_x(t_{i−1})}) is determined by the predictive model, which is structured identically to the one in the DP case, see table A1 in appendix A. Secondly, based on the rewards-to-go R_i, built from the (properly adjusted) elementary loss functions from section 5, we approximate the gradient of the expected total reward over all trajectories, J_θ, as

∇_θ J_θ ≈ (1/|τ|) Σ_τ Σ_i ∇_θ log π_θ(u_x(t_i) | s_i) R_i.
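With a linear stand-in for the policy network (an illustrative assumption, the paper uses the NN from table A1), the Gaussian log-likelihood gradient and the resulting REINFORCE estimator can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 0.04                                   # fixed policy variance

def sample_action(theta, s, rng):
    """Draw u ~ N(mu_theta(s), sigma^2) with a linear mean mu = theta . s
    (illustrative stand-in for the network) and return u together with
    grad_theta log pi(u|s)."""
    mu = float(theta @ s)
    u = rng.normal(mu, np.sqrt(sigma2))
    return u, (u - mu) / sigma2 * s             # d log N / d mu, chain rule

def reinforce_gradient(theta, states, rewards_to_go, rng):
    """Monte-Carlo estimate of grad_theta J: average of grad log pi * R_i."""
    g = np.zeros_like(theta)
    for s, R in zip(states, rewards_to_go):
        _, glog = sample_action(theta, s, rng)
        g += glog * R
    return g / len(states)
```

The key contrast with the DP method is that this estimator never differentiates through the system dynamics; the physics enters only through the scalar rewards-to-go.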
Note that while the loss functions for DP are minimized, the rewards in RL are maximized. Note further that the rewards-to-go are normalized before they enter the approximation of the gradient [56]. Thirdly, we use the Adam optimizer with a learning rate of 0.00025 to update the network parameters in a series of epochs, each consisting of 1024 trajectories.

Figure C2. Preparation of the eigenstate |↑⟩_z of the drift term of the Hamiltonian (8) for the optimized hyperparameters containing the coefficients c_F, c_FN, c′_amp, based on a REINFORCE implementation. In the first column, the evolution of the mean reward R̄ during the training phase is shown. The second column shows the mean value of the final state infidelity as a function of the epochs. The third and fourth columns show the performance of the model when it is applied to a test set of size 512. The red curve/blue shaded region indicate the mean/standard deviation of the fidelity and the actions, respectively. The black dashed line highlights the special case with the initial state given as |ψ(t_0)⟩ = |↓⟩_z.

Figure C3. Optimal regions in the hyperparameter space consisting of c_F, c′_amp, c_FN and the learning rate lr of Adam, as established by BOHB for the DP approach (bottom row) and the RL approach (top row). The figures show projections from the four-dimensional space onto the respective two-dimensional subspaces. The color map encodes the main objective of BOHB to be minimized, i.e. the average infidelity over the entire time interval, L_F. Brighter areas correspond to better performance. The continuous map results from an interpolation between the discrete set of sampled configurations probed by the BOHB algorithm.

Figure C2 is analogous to figure 4, with RL replacing DP. In general, we obtain a noisier evolution of the total expected reward during training, as compared to the (relatively) smooth appearance of the training loss in the main text.
Furthermore, the outcome is more sensitive to small changes of hyperparameters including the employed random seed. We expect that more sophisticated approaches, such as trust-region policy optimization [57] or proximal policy optimization [58], in combination with generalized advantage estimation [59] could improve the performance. As expected, more epochs were needed to train the neural network. Also, the agent did not learn to push the control fields towards zero once the target state has been prepared.
A comparison of the mean infidelity L_F over all time steps as a function of the hyperparameters c_F, c_FN, c′_amp and the learning rate lr for both approaches is depicted in figure C3. Plotted are the two-dimensional projections of L_F onto the subspaces of the respective pairs of hyperparameters. The fact that the landscapes for the DP method (lower panel) reach lower infidelities than the REINFORCE implementation (top panel) shows that the DP method generally performs better. The bright regions, indicating successful performance, are also more extended for our approach, which implies that it is more stable with respect to changes in the hyperparameters. Remarkably, DP is trained for only 400 epochs, while 10 000 epochs are used in the case of the REINFORCE algorithm.