Classifying global state preparation via deep reinforcement learning

Quantum information processing often requires the preparation of arbitrary quantum states, such as all the states on the Bloch sphere for two-level systems. While numerical optimization can prepare individual target states, it lacks the ability to find general solutions that work for a large class of states in more complicated quantum systems. Here, we demonstrate global quantum control by preparing a continuous set of states with deep reinforcement learning. The protocols are represented by neural networks, which automatically group them into similar types; this could be useful for finding classes of protocols and extracting physical insights. As an application, we generate arbitrary superposition states for the electron spin in complex multi-level nitrogen-vacancy centers, revealing classes of protocols characterized by specific preparation timescales. Our method could help improve the control of near-term quantum computers, quantum sensing devices and quantum simulations.


Introduction
Deep reinforcement learning has revolutionized computer control over complex games [1][2][3], which are notoriously difficult to optimize with established methods [3]. Recently, reinforcement learning has also been successfully applied to a wide array of physics problems [4][5][6][7][8][9][10][11][12][13][14][15]. A particularly promising application is optimizing quantum control problems [6, 8-10, 12, 16], which are of utmost importance for quantum technologies and quantum devices for information processing and computation. Quantum systems are controlled by unitary operations, engineered via a set of physical operations on the quantum system that we refer to as a protocol. To efficiently manipulate quantum sensing devices and programmable quantum computers, a large set of different unitary operations and protocols has to be mastered. This is in general a non-trivial task. For example, to generate an arbitrary quantum state in a two-level system, a two-dimensional set of protocols has to be learned. Of course, for specific types of driving the parameterization is well known (e.g. in terms of rotations around the Bloch sphere) and there is a simple understanding of how to control the spin dynamics on the Bloch sphere. However, in higher-dimensional quantum systems, or in systems where the driving is constrained, no general description is available, which is a limiting factor in the control of quantum devices. Standard numerical tools can find control protocols for individual target quantum states; however, it is difficult to find a class of protocols that parameterizes all of them, as the solution space can be highly disordered with many near-optimal solutions [5]. Thus, we ask: how can one find classes of solutions to generate arbitrary states in higher-dimensional quantum systems and/or with constrained driving parameters? And can they be used to extract physical insights?

Figure 1. Overview of the deep reinforcement learning algorithm for learning protocols that generate any target state. (a), (b) Learning scheme to generate piece-wise constant control protocols with proximal policy optimization (PPO, details in the methods section): at the first iteration step, a random target state Ψtarget (parameterized by angles θ, ϕ) and the initial state Ψ(t_0) are chosen. At iteration step n + 1 of the algorithm, the neural network takes in Ψtarget and the current wavefunction Ψ(t_n). The output of the neural network generates the driving parameters (timestep ∆t, driving strength Ω) used to manipulate the quantum system to reach Ψ(t_{n+1}), which is fed back to the neural network. These steps are repeated until the final step N_T. The fidelity of the generated state with the target state is calculated and used to update the weights of the neural network. For training, PPO uses two separate neural networks, called the policy and value networks. The policy network decides the next driving parameter, whereas the value network estimates the quality of that choice [17]. This concludes one epoch of the training protocol. Over many training epochs, the neural network learns the control protocols for arbitrary target states parameterized by θ, ϕ. (c) After the training, the neural network returns optimized piece-wise constant control protocols to reach any target state Ψtarget.
Here, instead of finding the optimal driving protocol for a single state, we propose a scalable method to learn all the driving protocols for global state preparation (over the continuous two-dimensional subspace represented by the Bloch sphere, embedded in a higher-dimensional Hilbert space) using deep reinforcement learning, parameterizing the protocols with neural networks. We discover that this approach automatically finds clusters of similar protocols. For example, it groups protocols according to how much time is needed to generate specific target states, which could be used to identify patterns and physical constraints in the protocols. For this multi-dimensional problem, conventional optimization often finds uncorrelated protocols in the target space, especially if there are multiple distinct protocols achieving similar fidelity, which makes it very difficult to interpolate between protocols. With our approach, the clustering of similar protocols is a consequence of the neural network effectively interpolating between nearby protocols within the same cluster, thereby achieving effective arbitrary state preparation. The neural network is trained with random target states as input, which makes it naturally suited for parallelization and could allow it to scale to higher dimensions. In essence, our proposal shows a path past the key limitations of single-state learning and of the transfer-learning approach of [12].
As a demonstration of our method, we apply it to control the electron spin in multi-level nitrogen-vacancy (NV) centers [18][19][20][21]. The multiple levels give rise to complicated coherent and dissipative dynamics, ruling out simple driving protocols such as those for two-level systems (e.g. quantum dots, transmons) and further increasing the difficulty of the optimization. The electron spin triplet ground states are characterized by a long lifetime, which makes them ideal candidates for solid-state qubits used in quantum information processing [22][23][24]. However, these states cannot be coupled directly using lasers, which would be important for many applications [25, 26]. Optical control can instead be achieved by indirect optical driving via a manifold of excited states, which renders finding fast control protocols a difficult problem. With our algorithm we find protocols that prepare arbitrary superposition states within about half a nanosecond, which is much faster than protocols based on STIRAP control [19] and reduces the impact of dissipation on the state preparation. Additionally, our protocols require only nine steps, considerably fewer than comparable approaches [13, 21]. More importantly, our algorithm finds near-optimal protocols that are automatically grouped into distinct classes by the neural network. We find that the protocols within a class have similar driving sequences as well as similar total protocol times T. Each class prepares a subset of all the target states on the Bloch sphere with high fidelity.

Learn global control protocols
A quantum system is evolved by a unitary operator Û, which takes a quantum state Ψ to the final state Ψ′ = ÛΨ. The unitary is physically realized by evolving with the Hamiltonian Ĥ(Ω), parameterized by controllable parameters Ω, over a time t, with Û = exp(−iĤt). We can engineer more sophisticated unitaries by piece-wise applying N_T different unitaries, Û = Û^(N_T) Û^(N_T−1) ⋯ Û^(2) Û^(1), each unitary given by Û^(k) = exp(−iĤ(Ω^(k)) ∆t^(k)) with driving strength Ω^(k) and timestep ∆t^(k). To apply the unitary on a physical system, a total time T = Σ_k ∆t^(k) is needed. The minimal time needed to perform a specific task may vary depending on the target unitary and the driving Hamiltonian [27]. We define a protocol as a vector β = {Ω^(1), ∆t^(1), …, Ω^(N_T), ∆t^(N_T)} that contains all the control parameters to generate a specific target state Ψ_target.
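The piece-wise composition above can be sketched in a few lines. The two-level Hamiltonian below is a toy stand-in (not the multi-level NV Hamiltonian of the paper), and the protocol values are arbitrary:

```python
import numpy as np
from scipy.linalg import expm

def hamiltonian(omega):
    """Toy two-level driving Hamiltonian H(Omega): static splitting
    plus a transverse drive of strength omega."""
    sx = np.array([[0, 1], [1, 0]], dtype=complex)
    sz = np.array([[1, 0], [0, -1]], dtype=complex)
    return 0.5 * sz + omega * sx

def apply_protocol(psi0, beta):
    """Evolve psi0 under a protocol beta = [(Omega_1, dt_1), ...],
    i.e. U = U^(N_T) ... U^(1) with U^(k) = expm(-i H(Omega_k) dt_k)."""
    psi = psi0.astype(complex)
    for omega, dt in beta:
        psi = expm(-1j * hamiltonian(omega) * dt) @ psi
    return psi

psi0 = np.array([1.0, 0.0])
beta = [(0.8, 0.3), (0.2, 0.5), (1.1, 0.2)]   # hypothetical 3-step protocol
psi_final = apply_protocol(psi0, beta)
# the norm is preserved, since each step is unitary
```

The total protocol time is simply `sum(dt for _, dt in beta)`, matching T = Σ_k ∆t^(k).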
We now outline the algorithm to generate control protocols β_θ,ϕ(t) that create all possible target states Ψ_target(θ, ϕ), parametrized by angles θ and ϕ (see figure 1). At time t_k, the input to the neural network is a randomly sampled target state Ψ_target and a description of the system state (e.g. the current wavefunction Ψ(t_k)). The neural network output determines the parameters (driving strength Ω^(k), timestep ∆t^(k)) of the piece-wise constant protocol used to evolve the wavefunction to the new state Ψ(t_{k+1}) = Û^(k) Ψ(t_k), with the unitary Û^(k) = exp(−iĤ(Ω^(k)) ∆t^(k)). This new wavefunction is then input to the neural network to determine the next protocol step. This is repeated until the final timestep N_T, which concludes one training episode. The goal is to drive the system such that the final wavefunction is as close as possible to the target state Ψ_target, measured by the fidelity F = |⟨Ψ_target|Ψ(t_{N_T})⟩|², which is used as the reward function. In each training episode, the neural network is trained with a randomly sampled target state Ψ_target(θ, ϕ) as input, until it converges and learns to represent protocols for all possible target states. The input to the neural network is a vector of real numbers: the angles θ and ϕ, which parametrize the target state, and the current wavefunction Ψ(t_k), given by the real and imaginary parts of the probability amplitudes of each quantum level. The neural network is trained using PPO [17] and is composed of two parts: the critic estimates at every step the final quality of the protocol to optimize itself, while the actor at each step returns a part of the protocol. We use the PPO algorithm implemented in the OpenAI Spinning Up library [28].
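One training episode of the loop just described might look as follows. `toy_policy` and `toy_evolve` are hypothetical stand-ins for the trained PPO actor and the NV-center propagation; only the episode structure (random target, N_T network calls, fidelity reward at the end) reflects the algorithm:

```python
import numpy as np
from scipy.linalg import expm

def episode(policy, evolve, n_steps=9, rng=np.random.default_rng(0)):
    # sample a random target state on the Bloch sphere
    theta, phi = rng.uniform(0, np.pi), rng.uniform(0, 2 * np.pi)
    target = np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])
    psi = np.array([1.0, 0.0], dtype=complex)        # initial state
    for _ in range(n_steps):
        # network input: target angles plus Re/Im amplitudes of the state
        obs = np.concatenate([[theta, phi], psi.real, psi.imag])
        omega, dt = policy(obs)
        psi = evolve(psi, omega, dt)                 # apply U^(k)
    return abs(np.vdot(target, psi)) ** 2            # fidelity = reward

def toy_policy(obs, rng=np.random.default_rng(1)):   # untrained random actor
    return rng.uniform(0.0, 1.0), rng.uniform(0.1, 0.5)

def toy_evolve(psi, omega, dt):                      # single-qubit x-drive
    h = omega * np.array([[0, 1], [1, 0]], dtype=complex)
    return expm(-1j * h * dt) @ psi

fidelity = episode(toy_policy, toy_evolve)
```

In the actual algorithm the returned fidelity is fed to PPO to update the policy and value networks after each episode.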
Similar results can be achieved using the numerically more complex Trust Region Policy Optimization method (see results in supplementary materials (available online at stacks.iop.org/MLST/2/01LT02/ mmedia)) [29].

Electron spin control in NV center
The goal is to achieve coherent control between the states |−1⟩ and |+1⟩ of the triplet ground state manifold {|−1⟩, |0⟩, |+1⟩} via coupling to a manifold of excited states. Starting from the ground state Ψ_0 = |−1⟩, we would like to reach a general superposition state on the Bloch sphere spanned by the two states, Ψ_target = cos(θ/2)|−1⟩ + e^{iϕ} sin(θ/2)|+1⟩, where θ ∈ [0, π] and ϕ ∈ [0, 2π]. These two degenerate ground states are not coupled directly. The degeneracy can be lifted with an external magnetic field B_ext. To coherently control them, we couple the ground states via two lasers, in the optical frequency range, to the excited-state manifold. Through dissipative couplings (see Methods), a further metastable state can be occupied; thus the NV center is described by an open ten-level system. We investigate the limit where the protocol time is much faster than the dissipation, so we can approximately treat the system as an effective closed eight-level system (a comparison with the full dynamics is given in the supplementary materials). We apply two driving lasers with time-dependent strengths Ω_1(t), Ω_2(t) chosen by the protocol. They have detunings δ_1, δ_2 relative to the energy differences between the states |−1⟩ (|+1⟩) and |A_2⟩ (see section 5). This system resembles the well-known Λ system with three levels, however with an additional complicated set of levels that has to be excited and controlled. The excited levels also interact non-trivially with one another. For such systems, simple or analytic solutions are difficult to find, especially in the non-adiabatic regime considered here.
The goal is to learn protocols that reach arbitrary superposition states of the two levels parameterized by the angles θ and ϕ. To evaluate how similar protocols are, we introduce the following measures. Firstly, to evaluate the local change of protocols, we define the norm of the protocol gradient with respect to the target parameters θ and ϕ, G(θ, ϕ) = ||∂β/∂θ||_1 + ||∂β/∂ϕ||_1 (equation (3)), where β(θ, ϕ) is the protocol for a given θ and ϕ and ||.||_1 the L1 norm. This measures how much the protocol changes across the Bloch sphere. Secondly, to measure how close two different protocols are, we define the protocol distance d(β, β′) = ||β − β′||_1 (equation (4)). If two protocols are identical, this measure is zero; otherwise, the protocol distance is positive. Our protocols β are vectors of length 3N_T, consisting of N_T timesteps with the two driving strengths Ω_1^(k), Ω_2^(k) and the timestep length ∆t^(k). To calculate the distance measure and the gradient, we rescale the parameters to the range [−0.5, 0.5].
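Both measures are simple to evaluate numerically. A minimal sketch, with a finite-difference gradient on a grid of protocol vectors (function names our own):

```python
import numpy as np

def protocol_distance(b1, b2):
    """L1 distance between two protocol vectors; zero iff identical."""
    return float(np.abs(np.asarray(b1) - np.asarray(b2)).sum())

def gradient_norm(beta_grid, dtheta, dphi):
    """||d beta/d theta||_1 + ||d beta/d phi||_1, estimated by finite
    differences on a (n_theta, n_phi, 3*N_T) grid of protocols."""
    g_th = np.abs(np.gradient(beta_grid, dtheta, axis=0)).sum(axis=-1)
    g_ph = np.abs(np.gradient(beta_grid, dphi, axis=1)).sum(axis=-1)
    return g_th + g_ph

# sanity check: a grid of identical 9-step protocols has zero gradient
grid = np.zeros((5, 5, 27))
flat = gradient_norm(grid, 0.1, 0.1)
```

Rescaling every protocol parameter to [−0.5, 0.5] before applying these measures, as in the text, makes driving strengths and timestep lengths contribute on equal footing.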
The result after training is shown in figure 2. We see that the fidelity of reaching the target state is high in most areas, with very sharp lines of low fidelity (see figure 2(a)) that divide the sphere into three areas, marked R1 to R3. The sharpness of these lines increases with the number of neurons, as the neural network can then represent more complex features and abrupt changes. (A further demonstration of this feature on a simpler spin rotation problem is given in the supplementary materials (available online at stacks.iop.org/MLST/2/01LT02/mmedia).) At these lines, the protocol changes drastically, while it changes only slowly in the other areas. We visualize this with the protocol gradient of equation (3) in figure 2(b), which measures how much the protocol changes with θ and ϕ. The gradient is low within the three distinct areas, where the protocol changes only slowly and protocols are very similar to each other. We observe specific sharp lines where the gradient is very large. These correlate with the lines of low fidelity in figure 2(a) and form the borders between two different protocol classes. At the lines of low fidelity, the protocol changes drastically, as the neural network is trying to find a protocol between the two classes. As an analogy, we describe this phenomenon as a kind of phase diagram: the areas of low gradient represent 'phases' (classes of protocols of a specific type) that are separated by lines of high gradient, akin to a 'phase transition'. At these lines, the algorithm struggles to find good protocols. These 'phases' are clearly visible in the protocol time T, shown in figure 2(c), with distinct areas of similar T. This directly reveals physical insight into the state generation: for states with cos(θ) ≈ 1 (close to the initial state), fast protocols are sufficient, while for increasing θ, protocols require a longer duration. Our algorithm automatically groups protocols into areas of similar protocol time T.
Protocols of intermediate duration (area R2) are concentrated at ϕ < π, while protocols with long duration (area R3) are mostly located at ϕ > π. This asymmetry comes from the magnetic field B that lifts the degeneracy between the states |±1⟩. We also note that ϕ = 0 and ϕ = 2π give different results in terms of fidelity and protocol, although they represent the same quantum state. The reason is that the neural network receives as input only the angles θ and ϕ of the target state and does not know about the symmetry of the problem (e.g. that ϕ is periodic). This symmetry can be enforced by hand, yielding similar fidelities (see supplementary materials). In figure 2(d), we show the distances between different protocols using equation (4). We chose a 21 × 21 sampling grid on the Bloch sphere. Although the protocols are parameterized by a 3N_T-dimensional vector, using the t-SNE algorithm we can map the distances between different protocols (given by equation (4)) onto a 2D representation. The color indicates the protocol time T. We again see distinct clusters of similar protocols, corresponding to the areas marked in figure 2. Representative examples of driving protocols from the three different protocol 'phases' are shown in figures 3(a)-(c). The protocols belonging to different 'phases' look distinct, while protocols within the same 'phase' look very similar (see supplementary materials for further example protocols). Figure 3(c) shows the length of each timestep, where the total time (given by the integral over the steps) increases from R1 to R3. We also observe distinct shapes in the sequence of driving strengths, which vary from R1 to R3 (see figures 3(a) and (b)). In figures 3(d)-(f) we show the change of the protocol parameters for a cut at ϕ = π along the θ axis. We again observe the three distinct 'phases' in the driving parameters for varying θ. They change continuously with θ, yet remain similar within one 'phase'.
However, sudden jumps where the protocol changes drastically are observed at cos(θ) ≈ 0 and cos(θ) ≈ 0.75 (most visible for timesteps n_t = 4 and n_t = 8). At the border between two protocol 'phases', the protocols change drastically in order to choose between them, creating the observed sudden jumps. The protocols found are near-optimal solutions within a complicated optimization landscape [5]. As such, there are many possible solutions of nearly the same fidelity. By varying hyperparameters of the neural network (such as the number of neurons) or external parameters (such as the applied magnetic field), one may find different types of protocols in the two-dimensional subspace. However, in most cases there are either two or three distinct areas (see another example in the supplementary materials).
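The low-dimensional visualization of the protocol distances can be reproduced with any embedding method. The paper uses t-SNE (e.g. scikit-learn's implementation); a dependency-free classical MDS sketch in plain numpy illustrates the same idea of mapping pairwise distances to a 2D representation:

```python
import numpy as np

def mds_embed(dist, k=2):
    """Classical multidimensional scaling of a pairwise distance
    matrix into k dimensions (an illustrative stand-in for t-SNE)."""
    n = dist.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ (dist ** 2) @ j               # double-centered Gram matrix
    w, v = np.linalg.eigh(b)
    top = np.argsort(w)[::-1][:k]                # largest eigenvalues
    return v[:, top] * np.sqrt(np.maximum(w[top], 0))

# two well-separated clusters of 'protocols' stay separated in 2D
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (10, 27)), rng.normal(5, 0.1, (10, 27))])
dist = np.abs(pts[:, None, :] - pts[None, :, :]).sum(-1)   # L1 distances
emb = mds_embed(dist)
```

Coloring each embedded point by its protocol time T, as in figure 2(d), then makes the protocol classes visible directly.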
The hyperparameters of the neural network and of the reinforcement learning affect the quality of the learned protocols. In figure 4, we show the average fidelity over all target states for a varying number of protocol timesteps (figure 4(a)) and a varying number of neural network neurons (figure 4(b)). We observe that beyond a certain number of neurons or timesteps, the average fidelity does not increase anymore.
To compare with a standard method, we optimize the state preparation with Nelder-Mead (using the Python scipy implementation). The result is shown in figure 5: the two-dimensional space of target states is subdivided into a discrete grid and each grid point is optimized independently.

Figure 5. Grid optimization with Nelder-Mead; the 2D representation of the protocol distances (equation (4)) is obtained with the t-SNE algorithm, with color indicating the protocol time of each datapoint. Same parameters as in figure 2. The optimization took in total 5.7 × 10⁶ training episodes. Each grid point is optimized with Nelder-Mead for up to 20 000 optimization steps, repeated over up to 5 optimization runs starting from a random initially guessed protocol. The optimization is stopped early when a fidelity F > 0.99 is reached.

The achieved average fidelity over all target states is lower compared to reinforcement learning, as Nelder-Mead tends to get stuck in local optimization minima. To reduce this problem, each target state was optimized up to five times, using random initial guesses. This optimization also required nearly an order of magnitude more training episodes, as every grid point is learned independently and sequentially, instead of training all target states at the same time. The resulting optimized protocols across ϕ and θ are quite different compared to those achieved by reinforcement learning. As an example, we investigate the protocol time T of the optimized protocols in figure 5(b). For reinforcement learning we find a connected landscape with well-defined classes. For the grid optimization, however, there is a large spread in the protocol parameters, and T varies strongly between different target states, even for neighboring θ and ϕ. Thus, a small change in the target state can translate into a very different protocol. If one wants to find a protocol for a target state lying between two grid points by interpolating between them, the interpolated protocol achieves only a low fidelity. A two-dimensional representation of the distance between the different protocols reveals no clustering or order (figure 5(c)), in contrast to reinforcement learning (figure 2(d)). From this result, we gather the following explanation: the control problem has many nearly equivalent solutions [5, 30], which yield similar fidelity but can have widely different control schemes. Reinforcement learning tends to find solutions belonging to the same control class, which produces the observed clusters, whereas Nelder-Mead converges to one of the solutions at random, giving uncorrelated protocol schemes. Further, this optimization method takes more training epochs than reinforcement learning since each grid point is optimized individually.
Scaling to a higher density of discretization points or more than two target parameters would require even more training time for the grid approach. The reinforcement learning approach in [12] also suffers from similar scalability issues, whereas our deep learning algorithm can overcome such scalability problems as all target states can be trained in parallel.
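The grid baseline can be sketched with `scipy.optimize.minimize`. The toy two-level system and the chosen target below are illustrative (not the NV model); only the structure of the comparison (independent restarts from random initial protocols, keeping the best run per grid point) follows the text:

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def infidelity(beta, target):
    """1 - |<target|psi(T)>|^2 for a toy two-level system driven by a
    piece-wise constant x-drive, beta = [Om_1, dt_1, Om_2, dt_2, ...]."""
    sx = np.array([[0, 1], [1, 0]], dtype=complex)
    psi = np.array([1, 0], dtype=complex)
    for om, dt in np.asarray(beta).reshape(-1, 2):
        psi = expm(-1j * om * sx * abs(dt)) @ psi
    return 1 - abs(np.vdot(target, psi)) ** 2

# one grid point: up to 5 independent Nelder-Mead runs from random
# initial 3-step protocols, keeping the best result
theta, phi = np.pi / 2, 3 * np.pi / 2
target = np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])
rng = np.random.default_rng(0)
best = min(
    (minimize(infidelity, rng.uniform(0, 1, 6), args=(target,),
              method="Nelder-Mead") for _ in range(5)),
    key=lambda r: r.fun,
)
```

Repeating this over every point of a 21 × 21 grid makes the sequential cost, and hence the scaling problem discussed above, explicit.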

Discussion
We demonstrated how to learn all control protocols for a two-dimensional set of quantum states. The neural network is trained using randomly sampled target states. After training, the neural network knows how to generate all possible target states and automatically classifies the near-optimal protocols into specific groups, e.g. by the time needed to generate specific states as well as by the driving strengths. This clustering feature could be a useful tool to identify physical principles and find generalized driving protocols. For practical usage, the clustering of solutions is also advantageous, as the driving protocols within a cluster do not change drastically for a small change in target state. The drop in fidelity at the boundary between two clusters is a result of the finite capacity of neural networks to represent sharp changes and could be mitigated by choosing a discontinuous activation function. However, the exact mechanism of how the clusters are found by the neural network remains unclear and requires further studies. To speed up training, our algorithm could be easily parallelized by calculating multiple target states at the same time. This is more difficult for other optimization techniques such as transfer learning, which requires input from previously trained neural networks and scales poorly with the number of sampling points as well as with the dimension of the problem (i.e. the number of target parameters) [12].
In this article, we have shown how to optically control the electron spin in a complex multi-level NV center. The electron spin is only indirectly coupled via several other interacting levels, making it difficult to construct good control pulses. Our reinforcement learning approach gives us access to a tailored protocol for any superposition state. We are able to create arbitrary states within a short time T ≈ 0.5 ns, avoiding slower dissipative processes and outperforming adiabatic-type protocols for creating superposition states [19]. Our protocol requires only nine timesteps to achieve high fidelity for arbitrary states, considerably fewer than comparable approaches [13, 21]. Our results indicate that even fewer timesteps could be sufficient to reach high fidelity (see figure 4(a)). The control protocols we found could help to efficiently control NV centers for quantum processing [31] and could be applied to the optimal control of other physical systems [27].
Our neural network-based algorithm learns the full two-dimensional set of target states. The same concept could help to identify control unitaries for continuous variable quantum computation [32] or serve as an alternative to transfer learning in quantum neural network states [33]. The neural network finds patterns, which may be useful to identify phase transitions in quantum control [5], as well as identify physical concepts in the way the protocols create quantum states [34,35]. The resulting classification of protocols also opens up the potential of using reinforcement learning in identifying phase transitions in physical systems [36][37][38]. Furthermore, our approach could help correcting drifts in superconducting qubits [39]. When the system parameters change, the control unitaries have to be re-trained to accommodate the change. If one has multiple control unitaries, this may take a long time since every unitary has to be retrained individually. Our method allows one to retrain all the protocols at the same time, which can significantly shorten the time needed to correct errors and drifts. Finally, in our study we have trained the neural network by randomly sampling from all possible target states. It would be interesting to study training from a reduced subset of target states, and then see how well the neural network is able to generalize to yet unseen target states or whether over-fitting occurs.

NV center
In this paper we consider the 10-level model of the NV⁻ center, which comprises three ground states, six excited states and one metastable state. The ground states form a spin-1 triplet with a zero-field splitting of D_gs ≈ 2π × 2.88 GHz between the m_s = ±1 and m_s = 0 sublevels. The energy gap of E_g = 1.94 eV between the ground states and excited states due to the Coulomb interaction gives rise to the well-known zero-phonon line (ZPL) optical transition. The level structure is shown in figure 6.
The ground-state Hamiltonian can be written in the basis {|−1⟩, |0⟩, |+1⟩} ≡ {|m_s = −1⟩, |m_s = 0⟩, |m_s = +1⟩} as (setting ℏ = 1 and the energy of |0⟩ as zero) [21] Ĥ_gs = D_gs Ŝ_z² + g_gs μ_B B_ext Ŝ_z, where g_gs = 2.01 is the Landé g-factor of the ground state, μ_B is the Bohr magneton, and B_ext is the external magnetic field applied along the NV quantization axis. The magnetic field splits the degeneracy of the states |−1⟩ and |+1⟩. At low temperatures, the phononic effects in the diamond are suppressed (and not considered here), while the splittings in the excited state due to spin-spin and spin-orbit interactions become significant. Taking these interactions into account, the excited-state Hamiltonian can be written in the basis of the six excited states, where I is the 6 × 6 identity matrix; the level splittings in the excited-state manifold are given by ∆ = 2π × 1.55 GHz, D_es = 2π × 1.42 GHz and ∆″ ≈ 2π × 0.2 GHz (spin-spin interactions), l_z = 2π × 5.3 GHz is the axial spin-orbit splitting, and g_es ≈ 2.01 is the Landé g-factor of the excited state. Coherent control of the NV⁻ center is accomplished by applying two laser fields with frequencies ω_1 and ω_2 and Rabi frequencies Ω_1 and Ω_2.
The electric-dipole coupling between the ground and excited states is given by an interaction Hamiltonian with a 3 × 6 coupling matrix, whose rows correspond to the ground-state basis and whose columns correspond to the excited-state basis, with ϵ_x = 2Ω_1 cos(ω_1 t) + 2Ω_2 cos(ω_2 t). Neglecting the counter-rotating terms, the ϵ_x terms in the interaction Hamiltonian are replaced by ϵ′_x = Ω_1 cos(δ_1 t) + Ω_2 cos(δ_2 t), with the detunings δ_i = ω_i − E_g, i = 1, 2. In addition, there is a metastable state |m⟩ in the Hamiltonian, giving a ten-level system in total. The NV center is subject to dissipation via the decay of excited states, as shown in table 1. The full system including dissipation is solved using the Lindblad master equation [40], ∂ρ/∂t = −i[Ĥ, ρ] + Σ_m (L̂_m ρ L̂_m† − ½{L̂_m† L̂_m, ρ}), where the L̂_m are the Lindblad operators describing the dissipation. For timescales shorter than the fastest dissipation channel, the effect of dissipation can be neglected; for the NV center, this holds for T ≪ 13 ns. In this limit, the dynamics is effectively governed by the coherent Hamiltonian only and reduces to an eight-level system, as both the state |0⟩ and |m⟩ are only accessible via dissipation.
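The Lindblad right-hand side is straightforward to evaluate numerically. A minimal numpy sketch on a toy two-level decay (not the ten-level NV model) that checks its trace-preserving structure:

```python
import numpy as np

def lindblad_rhs(rho, h, lops):
    """d rho/dt = -i[H, rho] + sum_m (L rho L^+ - 1/2 {L^+ L, rho})."""
    out = -1j * (h @ rho - rho @ h)
    for l in lops:
        ld = l.conj().T
        out += l @ rho @ ld - 0.5 * (ld @ l @ rho + rho @ ld @ l)
    return out

# toy two-level system with spontaneous decay; the rate 1/13 per ns
# mimics the ~13 ns scale quoted above
h = np.diag([0.0, 1.0]).astype(complex)
decay = np.sqrt(1 / 13.0) * np.array([[0, 1], [0, 0]], dtype=complex)
rho = np.array([[0.5, 0.5], [0.5, 0.5]], dtype=complex)
drho = lindblad_rhs(rho, h, [decay])
# the RHS is traceless and Hermitian, so tr(rho) = 1 and rho = rho^+
# are preserved under the evolution
```

Integrating this right-hand side with any ODE solver gives the dissipative dynamics compared against the closed eight-level approximation in the supplementary materials.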

Deep reinforcement learning
Here, we describe our machine learning algorithm in more detail. We learn the driving protocol via a deep Q-learning network [41], utilizing the actor-critic method with PPO [17]. We use the implementation of the algorithm from [28] in Tensorflow [42]. A sketch is shown in figure 7. The quantum system is controlled by an agent that, depending on the state s_t of the system, acts with an action a_t (e.g. the driving parameters at time t) using the probabilistic policy π(a_t|s_t). At every timestep, a reward (e.g. the fidelity of the quantum state) is paid out. The goal is to repeatedly interact with the quantum system and learn the policy that gives the highest final reward. One normally starts with a random policy that explores many possible trajectories. Over the course of the training, the policy is refined and converges (hopefully) to the optimal (deterministic) policy. However, optimizing the policy directly can be difficult, as one round of the protocol is played out over N_T timesteps, and the question is how to optimize the policy at each step such that one finds the optimal final reward. A common problem in optimization is that one does not find the globally optimal solution, but only a local maximum of the optimization landscape.
It has been shown that one can overcome the difficulties of policy learning with Q-learning. The idea is to find the Q-function Q^π(s_t, a_t) that estimates the future reward (from the point of view of timestep t) that is paid out at the end of the full protocol under this policy. The goal is to learn a policy that prioritizes long-term rewards over smaller short-term gains. The optimal Q-function is determined by the Bellman equation, Q^π(s_t, a_t) = E[r_t + γ max_{a_{t+1}} Q^π(s_{t+1}, a_{t+1})], where E[.] indicates sampling over many instances and γ ≤ 1 is a discount factor that weighs future rewards against immediate rewards.
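As a small illustration of the Bellman update, a tabular Q-learning step (a much simpler setting than the neural-network PPO actually used here):

```python
import numpy as np

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step toward the Bellman target
    r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])
    return q

# 3 states, 2 actions; one update after receiving reward 1.0
q = np.zeros((3, 2))
q = q_update(q, s=0, a=1, r=1.0, s_next=1)
```

With all Q-values initially zero, this single step moves Q(0, 1) a fraction alpha of the way toward the reward, illustrating how value estimates propagate backward through the protocol.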
PPO is based on the idea of combining both methods, policy learning and value learning, into the actor-critic method. The idea is to have two neural networks: a policy network and a value network. The policy network (actor) decides on the next action by determining the parameters of the policy. The value-based network (critic) evaluates the taken action on how well it solves the task and estimates the future expected reward. It is used as an input to train the policy network.
Better performance can be achieved if the Q-function is split into two parts [43], Q(s_t, a_t) = V(s_t) + A(s_t, a_t), where A(s_t, a_t) is the advantage function and V(s_t) the value function. V(s_t) gives the expected future reward averaged over the possible actions according to the policy; this is the output of the critic network. A(s_t, a_t) gives the improvement in reward for action a_t compared to the mean over all choices.
Learning is achieved by optimizing the network parameters with a loss function via gradient descent [44]. The loss function of the value network is the squared difference between the value function of the network and the predicted reward at the next timestep, L_V(θ) = E_t[(V_θ(s_t) − y_t)²], where θ are the current network parameters and y_t = r_t + V_{t+1}, with V_{t+1} the output of the value network for the next timestep (set to zero if this is the last timestep).
The advantage function A(s_t, a_t) tells us how good a certain action a_t is compared to other possible actions, and it is the input used to train the policy network (the actor). Following the idea of PPO [17], the goal is to minimize the loss function of the policy network, L_p(θ) = −E_t[b_t(θ) A(s_t, a_t)], where θ are the network parameters, θ_old are the network parameters of a previous instance, and b_t(θ) = π_θ(s_t, a_t)/π_θold(s_t, a_t) is the probability ratio. Minimizing L_p(θ) over many sampled instances guides the distribution π_θ(s_t, a_t) such that it returns actions a_t with maximal advantage. However, the ratio b_t(θ) can acquire excessively large values, causing too large changes in the policy in each training step and making convergence difficult. For PPO, it was therefore proposed to clip the ratio with a parameter ε [17]: L_p(θ) = −E_t[min{b_t(θ)A(s_t, a_t), clip(b_t(θ), 1 − ε, 1 + ε)A(s_t, a_t)}], such that the update at each step stays within reasonable bounds. The input state s_t to the neural network consists of the wavefunction Ψ(t_n) (or density matrix ρ(t_n)) at timestep n, as well as a random target state Ψ_target. The output of the policy network are the parameters of the policy π(a_t|s_t, μ, σ), where the actions (pulse amplitude and timestep duration) are sampled from a normal distribution with mean μ and width σ. μ is chosen by the neural network (given the input state), and σ is a global variable that is initially large (ensuring that driving parameters are initially sampled mostly randomly to explore many possible trajectories). It is optimized via the loss function and decreases over the training, until at convergence it is close to zero and the sampled driving parameters converge to μ. We constrain the possible output values for the driving parameters by punishing values outside the desired range with a negative reward.
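The clipped surrogate can be written compactly. A numpy sketch, taking log-probabilities as inputs (function and argument names our own):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.05):
    """Clipped surrogate loss L_p(theta): the probability ratio
    b_t = pi_theta / pi_theta_old is clipped to [1-eps, 1+eps] so a
    single gradient step cannot change the policy too much."""
    b = np.exp(logp_new - logp_old)
    surrogate = np.minimum(b * adv, np.clip(b, 1 - eps, 1 + eps) * adv)
    return -np.mean(surrogate)

# a ratio of 2 with positive advantage is capped at 1 + eps = 1.05
loss = ppo_clip_loss(np.log([2.0]), np.log([1.0]), np.array([1.0]), eps=0.05)
```

The clip value ε = 0.05 matches the one used in the training runs reported below.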
We optimize the neural network over many epochs N_E. In each epoch, the Schrödinger equation (or master equation) is propagated for a total time T with N_T discrete timesteps of width ∆t, at respective times t_n. In one epoch, the system runs the network N_T times. From the policy network, the driving parameters for the (n + 1)-th timestep are sampled. The output of the value network is used to train the policy network. Each network is composed of two hidden layers of fully connected neurons of size N_H with ReLU activation functions. The neural networks are trained with the loss function after calculating the full time evolution to time T over N_T timesteps. Training data is sampled from a buffer storing earlier encountered trajectories. For the actual implementation, we choose the following parameters: a learning rate for both value and policy network of α = 10⁻⁵, training over N_E = 800 000 epochs, and a clip ratio ε = 0.05. Out-of-bounds driving parameters are punished with a negative reward of factor 0.2.
To improve the mean fidelity over all target states, we choose the target states not completely at random, but biased towards areas of lower fidelity. This is achieved by laying a 20 × 20 grid across the θ and ϕ space, then binning the last 10 000 results and the achieved fidelities. We choose the next target state by sampling from the probability distribution P(θ, ϕ) = (1 − F(θ, ϕ))^η / N, where F(θ, ϕ) is the binned fidelity, N is a normalization factor, and η determines how strongly the sampling is biased toward low fidelity. We chose η = 0.5.
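The biased sampling is a few lines of numpy. The form (1 − F)^η / N used here is an assumed concrete realization consistent with the description above (normalization factor N, bias strength η), not a verbatim reproduction of the implementation:

```python
import numpy as np

def sample_target(fidelity_grid, eta=0.5, rng=np.random.default_rng(0)):
    """Sample a grid cell with probability proportional to (1 - F)^eta
    (assumed form), so low-fidelity regions are trained more often."""
    p = (1.0 - fidelity_grid.ravel()) ** eta
    p /= p.sum()                         # normalization factor N
    idx = rng.choice(p.size, p=p)
    return np.unravel_index(idx, fidelity_grid.shape)

# cells with F = 1 are never drawn; the low-fidelity cell dominates
cell = sample_target(np.array([[0.0, 1.0], [1.0, 1.0]]))
```

With η = 0.5 the bias is deliberately mild, so well-learned regions still receive occasional training episodes.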