Improving the dynamics of quantum sensors with reinforcement learning

Recently proposed quantum-chaotic sensors achieve quantum enhancements in measurement precision by applying nonlinear control pulses to the dynamics of the quantum sensor while using classical initial states that are easy to prepare. Here, we use the cross-entropy method of reinforcement learning to optimize the strength and position of control pulses. Compared to quantum-chaotic sensors with periodic control pulses in the presence of superradiant damping, we find that optimized control counteracts decoherence even more effectively and enhances measurement precision further. In some examples, we find enhancements in sensitivity by more than an order of magnitude. By visualizing the evolution of the quantum state, the mechanism exploited by the reinforcement learning method is identified as a kind of spin-squeezing strategy adapted to the superradiant damping.


I. INTRODUCTION
The rise of machine learning [1] has led to intense interest in using machine learning in physics, and in particular in combining it with quantum information technology [2,3]. Recent success stories include discriminating phases of matter [4][5][6] and efficient representation of many-body quantum states [7][8][9].
In physics, many problems can be described within control theory which is concerned with finding a way to steer a system to achieve a goal [10]. The search for optimal control can naturally be formulated as reinforcement learning [11][12][13][14][15][16][17][18][19], a discipline of machine learning.
Reinforcement learning (RL) has been used in the context of quantum control [17], to design experiments in quantum optics [20], and to automatically generate sequences of gates and measurements for quantum error correction [16,21,22]. RL has also been applied to control problems in quantum metrology [2]: In the context of global parameter estimation, i.e., when the parameter is a priori unknown, the problem of optimizing single-photon adaptive phase-estimation was investigated [23][24][25]. There, the goal is to estimate an unknown phase difference between the two arms of a Mach-Zehnder interferometer. After each measurement, an additional controllable phase in the interferometer can be adjusted dependent on the already acquired measurement outcomes. The optimization with respect to policies, i.e., mappings from measurement outcomes to controlled phase shifts, can be formulated as a RL problem and tackled with particle swarm [23,24,26,27] or differential evolution [25,28] algorithms, where the results of the former were recently applied in an experiment [29].
Also in the regime of local parameter estimation, where the parameter is already known to high precision (typically from previous measurements), actor-critic and proximal-policy-optimization RL algorithms were used to find policies to control the dynamics of quantum sensors [30][31][32]. There, the estimation of the precession frequency of a dissipative spin-1/2 particle was improved by adding a linear control to the dynamics in the form of an additional controlled magnetic field [32].
Recently it was shown theoretically that the sensitivity (in the regime of local parameter estimation) of existing quantum sensors based on precession dynamics, such as spin-precession magnetometers, can be increased by adding nonlinear control to their dynamics in such a way that the dynamics becomes non-regular or (quantum-)chaotic [33,34]. The nonlinear kicks (described by a "nonlinear" Hamiltonian ∝ J_y² compared to the "linear" precession Hamiltonian ∝ J_z, where J_x, J_y, J_z are the spin angular momentum operators) lead to a torsion, a precession with rotation angle depending on the state of the spins.
Adding nonlinear kicks to the otherwise regular dynamics comes along with a large number of new degrees of freedom that have so far remained unexplored: Rather than kicking the system periodically, always with the same strength and about the same preferred axis as in Ref. [33], one can try to optimize each kick individually, i.e., vary its timing, strength, or rotation axis. The number of parameters increases linearly with the total measurement time (assuming a fixed upper bound on kicks per unit time), and rapidly becomes too large for brute-force optimization.
In this work, we use cross-entropy RL to optimize the kicking strengths and times in order to maximize the quantum Fisher information, whose inverse constitutes a lower bound on the achievable measurement variance. The cross-entropy method is used to train a neural network that takes the current state as input and outputs an action on the current state (the nonlinear kicks). In this way, the neural network generates a sequence of kicks that represents the policy for steering the dynamics.
This represents an offline, model-free approach aimed at long-term performance: the optimization is done based on numerical simulations, without being restricted to a specific class of policies, and with the goal of maximizing the quantum Fisher information only after a given time and not, as would be the case for greedy algorithms, at each time step. We show that this can lead to greatly enhanced sensitivity even compared to the already enhanced sensitivity of the quantum-chaotic sensor with constant periodic kicks [33].

II. QUANTUM METROLOGY
The standard tool for evaluating the sensitivity with which a parameter can be measured is the quantum Cramér-Rao bound [35][36][37]. It gives the smallest uncertainty with which a parameter ω encoded in a quantum state (density matrix) ρ_ω can be estimated. The bound is optimized over all possible measurements (POVMs, positive operator-valued measures, including but not limited to standard projective von Neumann measurements of quantum observables) and all possible data-analysis schemes, in the sense of using arbitrary unbiased estimator functions ω̂ of the obtained measurement results. It can be saturated in the limit of a large number M of measurements, and hence gives the ultimate sensitivity that can be reached once technical noise has been eliminated and only the intrinsic fluctuations due to the quantum state itself remain.
FIG. 1. Schematic representation of parameter encoding in quantum metrology. Panel (a) shows the standard protocol: the parameter ω is encoded in the initial state ρ through the dynamics, the resulting state is measured, and the parameter is inferred by (classical) post processing of the measurement outcomes. In panel (b), the dynamics is given by the kicked top model: the encoding of the parameter ω through linear precession R z (ω) about the z-axis is periodically disrupted through parameter-independent, nonlinear, controlled kicks (green triangles) with kicking strength k that can render the dynamics chaotic. In panel (c), the dynamics is given by a generalized kicked top model: the kicking strengths k and times t between kicks are optimized in order to maximize the sensitivity with which ω can be inferred (varying k are indicated by different sizes of the green triangles). Variation of the kicking axis is possible but beyond the scope of this work.
The quantum Cramér-Rao bound for the smallest possible variance of the estimate ω̂ reads

Var(ω̂) ≥ 1/(M I_ω),    (1)

where I_ω is the quantum Fisher information (QFI). For a state given in diagonalized form, ρ_ω ∶= ∑_{l=1}^{d} p_l |ψ_l⟩⟨ψ_l|, where d is the dimension of the Hilbert space, the QFI is given by [38]

I_ω = 2 ∑_{l,m} |⟨ψ_l| ∂_ω ρ_ω |ψ_m⟩|² / (p_l + p_m),    (2)

where the sum runs over all l, m such that p_l + p_m ≠ 0, and ∂_ω ρ_ω ∶= ∂ρ_ω/∂ω.
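The spectral form of the QFI can be checked numerically. The following sketch (the function name and interface are our own choices) takes the state and its parameter derivative and evaluates the double sum directly:

```python
import numpy as np

def qfi(rho, drho, tol=1e-12):
    """Quantum Fisher information from the state rho and its parameter
    derivative drho = d rho / d omega, via the spectral formula
    I = 2 * sum_{l,m} |<psi_l| drho |psi_m>|^2 / (p_l + p_m),
    restricted to pairs with p_l + p_m > tol."""
    p, psi = np.linalg.eigh(rho)              # rho = sum_l p_l |psi_l><psi_l|
    A = psi.conj().T @ drho @ psi             # A[l, m] = <psi_l| drho |psi_m>
    denom = p[:, None] + p[None, :]
    mask = denom > tol
    return 2.0 * np.sum(np.abs(A[mask]) ** 2 / denom[mask])
```

As a sanity check, for the pure state |+⟩ of a spin-1/2 precessing about z (∂_ω ρ = −i[J_z, ρ] at t = 1), this reproduces the textbook value I = 4 Var(J_z) = 1.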

III. THE SYSTEM
We consider a spin model based on the angular momentum algebra, with spin operators J_x, J_y, J_z satisfying J_z |j, m⟩ = ℏ m |j, m⟩ and J² |j, m⟩ = ℏ² j(j+1) |j, m⟩, where j and m are angular momentum quantum numbers. Note that the model can be implemented not only with physical spins but with any physical system with quantum mechanical operators that fulfill the angular momentum algebra. The Hamiltonian of our model is given by

H = ℏ ω J_z + (ℏ κ / 2j) J_y² ∑_l τ δ(t − t_l).

The first summand describes a precession about the z-axis with precession frequency ω. The second summand describes the nonlinear kicks, i.e., a torsion about the y-axis, see Fig. 1. In an atomic spin-precession magnetometer, as discussed in Ref. [33], the first summand corresponds to a Larmor precession characterized by the Larmor frequency ω = g μ_B B/ℏ with Landé g-factor g, Bohr magneton μ_B, and magnetic field strength B, which is the parameter that one wants to estimate. The nonlinear kicks can, for example, be generated with off-resonant light pulses exploiting the ac Stark effect. We introduce a dimensionless kicking strength as k ∶= κτ and, for the sake of simplicity, we set τ = 1 and ℏ = 1.
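The angular momentum algebra is straightforward to realize numerically. A minimal helper of our own (ℏ = 1, basis ordered m = j, …, −j) builds the spin matrices from the ladder operator:

```python
import numpy as np

def spin_operators(j):
    """Matrices J_x, J_y, J_z in the |j, m> basis (hbar = 1), m = j, ..., -j."""
    dim = int(round(2 * j)) + 1
    m = j - np.arange(dim)                     # m = j, j-1, ..., -j
    Jz = np.diag(m).astype(complex)
    # raising operator: J+ |j, m> = sqrt(j(j+1) - m(m+1)) |j, m+1>
    Jp = np.zeros((dim, dim), dtype=complex)
    Jp[np.arange(dim - 1), np.arange(1, dim)] = np.sqrt(
        j * (j + 1) - m[1:] * (m[1:] + 1))
    Jm = Jp.conj().T
    return (Jp + Jm) / 2, (Jp - Jm) / (2 * 1j), Jz
```

The defining relations [J_x, J_y] = i J_z and J² = j(j+1) 𝟙 then hold by construction and serve as a quick consistency check.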
For a pure state, the unitary time evolution of the system between kicks at times t_{l−1} and t_l is given by

|ψ(t_l)⟩ = U_ω(k_l) |ψ(t_{l−1})⟩,

where the unitary transformation

U_ω(k_l) = exp(−i k_l J_y² / 2j) exp(−i ω (t_l − t_{l−1}) J_z)

propagates the state according to the Hamiltonian, i.e., a precession for time t_l − t_{l−1} followed by a kick of strength k_l. The kick occurs at the end of the interval. For the standard kicked top (KT), see Fig. 1, the kicking strengths are constant, k_l = k, and the kicking times are given by t_l = lτ = l, with l ∈ ℕ. The dynamics of the standard KT is non-integrable for k > 0 and has a well-defined classical limit that shows a transition from regular to chaotic dynamics when k is increased. In Ref. [33] the behavior of the QFI for regular and chaotic dynamics was studied in this transition regime (for k = 3 and ω = π/2), which manifests itself in a mixed classical phase space with both regular and chaotic regions.
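The propagation over one interval, precession followed by a torsional kick, can be sketched as follows (a sketch of our own; the 1/(2j) normalization of the torsion follows the standard kicked-top convention, and `expmi` is a small helper for exponentials of Hermitian matrices):

```python
import numpy as np

def expmi(H, t):
    """exp(-1j * t * H) for a Hermitian matrix H, via eigendecomposition."""
    E, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * E * t)) @ V.conj().T

def kicked_step(psi, Jy, Jz, omega, dt, k, j):
    """One interval of the pure-state dynamics: precession about z for a
    time dt, followed by a torsional kick of strength k about y."""
    U = expmi(Jy @ Jy, k / (2 * j)) @ expmi(Jz, omega * dt)
    return U @ psi
```

Since both factors are unitary, the norm of the state is preserved exactly, which provides a simple numerical check.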
Quantum chaos is defined as quantum dynamics that becomes chaotic in the classical limit.
In contrast to classical chaos, quantum chaos does not exhibit exponential sensitivity to changes of initial conditions, due to the properties of unitary quantum evolution, but it can be very sensitive to parameters of the evolution [39]. The kicked top has been realized with atomic spins in a cold gas [40] and with a pair of spin-1/2 nuclei using NMR techniques [41]. Here, we generalize the standard KT to kicks of strength k_l at arbitrary times t_l, as described above.

Any new quantum metrology method needs to demonstrate its viability in the presence of noise and decoherence. We study two different versions of the KT which differ in the decoherence model used: phase damping and superradiant damping. Both can be described by Markovian master equations and are well-studied models for open quantum systems [42][43][44][45]. While phase damping conserves the energy and only leads to decoherence in the |j, m⟩ basis, superradiant damping leads in addition to a relaxation to the ground state |j, −j⟩. Its combination with periodic kicking in the chaotic regime is known to give rise to a non-equilibrium steady state in the form of a smeared-out strange attractor [45] that still conserves information about the parameter ω, whereas without the kicking the system in the presence of superradiant damping simply decays to the ground state. The master equations for both processes have the Kossakowski-Lindblad form [46,47], with

ρ̇(t) = −i ω [J_z, ρ(t)] + γ_pd ( J_z ρ(t) J_z − ½ { J_z², ρ(t) } )

for phase damping, where ρ̇(t) = dρ(t)/dt, and

ρ̇(t) = −i ω [J_z, ρ(t)] + γ_sr ( J_− ρ(t) J_+ − ½ { J_+ J_−, ρ(t) } )

for superradiant damping, where J_± ∶= J_x ± iJ_y are the ladder operators, and γ_pd and γ_sr denote the decoherence rates. With the generator Λ, defined by ρ̇(t) = Λρ(t), one has in both cases the formal solution ρ(t) = e^{Λt} ρ(0). Also for the superradiant master equation a formally exact solution has been found [48], and efficient semiclassical (for large j) expressions are available [49,50]. For our purposes it was simplest to solve the master equation numerically by diagonalization of Λ.
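Solving the master equation by diagonalizing the generator can be sketched as follows (our own sketch: column-stacking vectorization, superradiant dissipator only, and a propagator built from the eigendecomposition of the generally non-Hermitian generator):

```python
import numpy as np

def dissipator_sr(Jp, Jm, gamma):
    """Superoperator matrix of gamma (J- rho J+ - 1/2 {J+ J-, rho}) under
    column-stacking vectorization, vec(A rho B) = kron(B.T, A) vec(rho)."""
    d = Jp.shape[0]
    I = np.eye(d)
    JpJm = Jp @ Jm
    return gamma * (np.kron(Jp.T, Jm)
                    - 0.5 * np.kron(I, JpJm)
                    - 0.5 * np.kron(JpJm.T, I))

def propagate(L, rho0, t):
    """rho(t) = exp(L t) rho(0), evaluated by diagonalizing the generator L."""
    w, V = np.linalg.eig(L)                   # L is generally non-Hermitian
    c = np.linalg.solve(V, rho0.reshape(-1, order="F"))
    return (V @ (np.exp(w * t) * c)).reshape(rho0.shape, order="F")
```

For spin-1/2 this reproduces the expected relaxation: the excited-state population decays as e^{−γ t} while the ground state |j, −j⟩ fills up.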
Combining these decoherence mechanisms with the unitary evolution between two kicks is straightforward: the dissipative and the precessional parts of the evolution can be applied one after the other (in either order), because in both cases the dissipative generator commutes with the precession.
As initial state we use an SU(2) coherent state, which can be seen as the most classical state of a spin [51,52] and is usually easy to prepare (for instance by optically polarizing the atomic spins in a SERF magnetometer). It is equivalent to a symmetric state of 2j spin-1/2 particles all pointing in the same direction. With respect to the |j, m⟩ basis it reads

|θ, φ⟩ = ∑_{m=−j}^{j} √( (2j)! / [(j+m)! (j−m)!] ) cos^{j+m}(θ/2) sin^{j−m}(θ/2) e^{−i m φ} |j, m⟩.

We choose θ = π/2, φ = π/2.
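Numerically, the coherent state is just a binomial-weighted superposition of the |j, m⟩ states. A small sketch of our own (one common phase convention; a global phase is irrelevant):

```python
import numpy as np
from math import comb

def su2_coherent(j, theta, phi):
    """Spin coherent state |theta, phi> in the |j, m> basis, m = j, ..., -j."""
    m = j - np.arange(int(round(2 * j)) + 1)
    amp = (np.array([np.sqrt(comb(int(round(2 * j)), int(round(j - mm))))
                     for mm in m])
           * np.cos(theta / 2) ** (j + m)
           * np.sin(theta / 2) ** (j - m)
           * np.exp(-1j * m * phi))
    return amp / np.linalg.norm(amp)
```

For θ = 0 this reduces to |j, j⟩, and for θ = π/2 the mean ⟨J_z⟩ = j cos θ vanishes, as expected for a state in the equatorial plane.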

IV. OPTIMIZING THE KICKED TOP
A. The kicked top as a control problem

We consider the kicked top as a control problem and discretize the kicking strengths k_l and times t_l. The precise parameters of the discretized control problem vary between the following examples and are summarized in Appendix A. In the following, t_step denotes a discrete time step (measured in units of τ = 1), k_step is a discrete step of kicking strength, the RL agent optimizes the QFI at time T_opt, and we bound the total accumulated kicking strength, ∑_l k_l < 15000, a bound that is never reached by optimized policies. The frequency ω that we want to estimate is set such that it induces a rotation of the state by tπ/2 (t is measured in units of τ = 1).
Possible control policies are simply given by a vector of kicking strengths k = (k_1, . . . , k_N), with one entry per discrete time step. To each policy corresponds a QFI value, calculated from the resulting state ρ(T_opt), which quantifies how well the policy performs. To tackle this type of problem, various numerical algorithms are available, each with its own advantages and drawbacks [2,3,15]. We pursue the (in the context of physics) relatively unexplored route of cross-entropy RL.

B. Reinforcement learning

In general, the idea of reinforcement learning is to reinforce behaviour that leads to high rewards. The precise mechanism depends on the RL algorithm used.

C. The kicked top as a reinforcement learning problem

The system, the generalized kicked top as introduced in Section III, represents the RL environment. The agent can choose between only two actions: (i) increase the kicking strength (by k_step) or (ii) go on from the current position in time, l·t_step, to (l+1)·t_step. In this way, the vector k is built up step by step. After each action, the agent obtains an observation given by the full density matrix of the current state of the environment. Since we simulate the evolution of the environment, the density matrix is readily available.
Only after the total time T opt , a reward [the QFI of ρ(T opt )] is given to the agent. This concludes one episode, and the resulting vector k represents a policy. Then, the environment is reset [i.e., the spin is reinitialized with the coherent state at θ = π 2 , φ = π 2 , see Eq. (11)], and the next episode starts.
A neural network represents the RL agent: the observation is fed to the neural network's input neurons, while each output neuron represents one possible action, i.e., we have two output neurons for "kick" and "go on". The activation of these output neurons determines the probability of executing that action. The policy, however, is not given by the neural network directly. Since the environment is deterministic (i.e., the state evolves deterministically for a given vector k of kicking strengths), there is no point in keeping a stochastic policy such as the one defined by the neural network. Instead, a single choice of kicking strengths k represents the policy. We obtain it by first training the neural network using the cross-entropy method, then generating a few episodes with the trained neural network, and finally picking the episode with the largest QFI. The kicking strengths applied in that episode represent the policy.
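The episode structure above can be sketched as a minimal environment of our own (with loudly stated simplifications: unitary dynamics only, pure states, a uniform placeholder initial state, and a finite-difference estimate of the pure-state QFI as the terminal reward; all names are ours):

```python
import numpy as np

def spin_ops(j):
    """Return (Jy, Jz) in the |j, m> basis, m = j, ..., -j (hbar = 1)."""
    d = int(round(2 * j)) + 1
    m = j - np.arange(d)
    Jp = np.zeros((d, d), complex)
    Jp[np.arange(d - 1), np.arange(1, d)] = np.sqrt(
        j * (j + 1) - m[1:] * (m[1:] + 1))
    return (Jp - Jp.conj().T) / (2 * 1j), np.diag(m).astype(complex)

def expmi(H, t):
    E, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * E * t)) @ V.conj().T

def rollout(kicks, omega, j, t_step=1.0):
    """One episode: kicks[l] is applied after the l-th precession interval
    of length t_step. Returns the final pure state."""
    Jy, Jz = spin_ops(j)
    d = int(round(2 * j)) + 1
    psi = np.ones(d, complex) / np.sqrt(d)     # placeholder initial state
    for k in kicks:
        psi = expmi(Jy @ Jy, k / (2 * j)) @ expmi(Jz, omega * t_step) @ psi
    return psi

def reward(kicks, omega, j, eps=1e-4):
    """Terminal reward: pure-state QFI at T_opt, estimated from the fidelity,
    I ~ 8 (1 - |<psi(omega)|psi(omega + eps)>|) / eps**2."""
    f = abs(np.vdot(rollout(kicks, omega, j), rollout(kicks, omega + eps, j)))
    return 8 * (1 - f) / eps**2
```

Without kicks the reward reduces to the analytic pure-state value 4 t² Var(J_z), which makes the sketch easy to validate.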

D. Cross-entropy method
The RL cross-entropy method [53] we use works as follows: We first produce a set of episodes (i.e., we obtain several vectors k) with a neural network that is initialized randomly.
Then, we rank those episodes according to their reward. We select the best 10% of episodes (those with the highest reward) for further computations. Every episode can be split into several pairs of action and observation, and we use those pairs to train the neural network with the stochastic gradient-descent method called Adam [54]. As a result of this training, the weights of the neural network are adjusted, i.e., the agent learns from its experience. Future actions taken by the agent are influenced not only by randomness but also by this experience.
One run of producing episodes, ranking them, and using the best 10% to train the neural network is called an iteration. Training a neural network consists of several iterations. See Appendix C for pseudocode of this algorithm. For the parameters of the training process see Appendix A. In Appendix D we study the learning success for different numbers of episodes and iterations.
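The loop of producing episodes, ranking them, and training on the elite 10% can be sketched end-to-end on a toy control task. This is our own illustration, not the paper's code: a tiny numpy network (ReLU hidden layer, softmax output, cross-entropy loss, as in the text) trained with plain per-sample gradient descent in place of Adam; the toy task, observation features, network size, and learning rate are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N_STEPS, N_OBS, N_HID = 10, 3, 16

def init_params():
    return [rng.normal(0, 0.3, (N_OBS, N_HID)), np.zeros(N_HID),
            rng.normal(0, 0.3, (N_HID, 2)), np.zeros(2)]

def forward(params, x):
    W1, b1, W2, b2 = params
    h = np.maximum(x @ W1 + b1, 0.0)           # ReLU hidden layer
    z = h @ W2 + b2
    p = np.exp(z - z.max())
    return h, p / p.sum()                      # softmax action probabilities

def obs(t):
    return np.array([1.0, t / N_STEPS, (t / N_STEPS) ** 2])

def run_episode(params):
    xs, acts, rew = [], [], 0
    for t in range(N_STEPS):
        _, p = forward(params, obs(t))
        a = rng.choice(2, p=p)                 # sample an action
        xs.append(obs(t)); acts.append(a)
        rew += a                               # toy reward: prefer action 1
    return xs, acts, rew

def train(params, n_iter=40, n_ep=50, elite_frac=0.1, lr=0.1):
    for _ in range(n_iter):
        eps = sorted((run_episode(params) for _ in range(n_ep)),
                     key=lambda e: e[2], reverse=True)
        for xs, acts, _ in eps[:max(1, int(elite_frac * n_ep))]:
            for x, a in zip(xs, acts):         # cross-entropy SGD step
                W1, b1, W2, b2 = params
                h, p = forward(params, x)
                dz = p.copy(); dz[a] -= 1.0    # d(loss)/d(logits)
                dh = W2 @ dz; dh[h <= 0] = 0.0
                params = [W1 - lr * np.outer(x, dh), b1 - lr * dh,
                          W2 - lr * np.outer(h, dz), b2 - lr * dz]
    return params
```

On the toy task (reward counts how often action 1 is taken), the trained greedy policy should select action 1 at essentially every step, which is the expected fixed point of the cross-entropy iteration.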

V. RESULTS
We compare the QFI for different models: (i) the top (simple precession without kicks), (ii) the standard kicked top, as studied in Ref. [33], with periodic kicks (period τ = 1, i.e., a precession angle of π/2 per period) and constant kicking strength k, and (iii) the generalized kicked top (GKT) with RL-optimized kicking strengths and times; the prefixes SR- and PD- indicate superradiant and phase damping, respectively.

Let us first consider superradiant damping, with results presented in Fig. 3. The QFI for the SR-T exhibits a characteristic growth quadratic in time. However, due to decoherence, the QFI does not maintain this growth but starts to decay rapidly towards zero. The time when the QFI reaches its maximum was found to decay roughly as 1/(γ_sr j) with spin size j and damping rate γ_sr [33].
The situation changes with the introduction of nonlinear kicks. The QFI for the SR-KT shows the interesting behavior of not decaying to zero for large times. Instead it reaches a plateau value which was found to take surprisingly high values for specific choices of j and dissipation rates [33], in particular for j = 2. The system loses energy through superradiant damping while the nonlinear kicks add energy. This prevents the state from decaying to the ground state, which is an eigenstate of the precession and would lead to a vanishing QFI. From this perspective, the plateau results from a dynamical equilibrium between damping and kicking. However, the full potential of exploiting such effects and increasing the QFI with the help of nonlinear kicks is not reached with constant periodic kicks. Instead, the RL agent finds policies that make the QFI of the SR-GKT increase further even when the QFI of the SR-T has already decayed to zero and the QFI of the SR-KT has reached its plateau value.
Examples for j = 2 and j = 3 are presented in Fig. 3. The QFI of the SR-GKT is optimized for a total time T_opt, which is the largest time plotted in each example. At T_opt, the plateau value of the SR-KT for j = 3 is relatively low, and the RL-optimized policy achieves an improvement in sensitivity (associated with 1/√I_ω) of more than an order of magnitude.
Panels (a) and (b) show continuous growth of the QFI through an optimized kicking policy.
Only when the time T_opt (the time at which the QFI is optimized to be maximal) is increased further does the impressive growth of the QFI finally break down. Instead of increasing T_opt, we choose to increase the superradiant damping while keeping T_opt constant, which has a similar effect.
In that case, see panels (c) and (d), the RL agent chooses a policy which makes the QFI oscillate at a relatively high level before the time T opt is reached.
The superiority of the policies found by the RL agent can be understood by taking a look at the evolution of the quantum state, see Fig. 4. We represent the quantum state in the space of r = (x, y, z) = (⟨J_x⟩, ⟨J_y⟩, ⟨J_z⟩)/j, where ⟨J_i⟩ ∶= tr(ρ J_i); with this normalization, the classical dynamics conserves ∥r∥ = 1, which restricts the space to a sphere. This is represented in Fig. 4 with either a sphere parametrized with x, y, and z, or in a plane (the phase space) spanned by the z-coordinate and the azimuthal angle φ ∈ (−π, π], such that φ = z = 0 corresponds to the positive x-axis, φ = π/2, z = 0 to the positive y-axis, and z = ±1 with arbitrary φ to the positive (negative) z-axis. Due to the small spin size of j = 3, we are deep in the quantum mechanical regime, which manifests itself in an uncertainty of the initial spin coherent state that is relatively large compared to the total size of the phase space.

The distributions of the states evolved under dissipative dynamics exhibit remarkable differences for periodic and RL-optimized kicks. In the case of periodic kicks, we find that the initially localized distribution gets spread over the phase space. It exhibits a maximum on the negative z-axis, see panel (b1) of Fig. 4. This is reminiscent of the dissipative evolution in the absence of kicks, where the state is driven towards the ground state |j, −j⟩, which is centered around z = −1. The ground state |j, −j⟩ is an eigenstate of the precession and, thus, insensitive to changes in the frequency ω we want to estimate. Similarly, we interpret the part of the state distribution of the SR-KT that is centered around the negative z-axis as insensitive. However, the distribution also exhibits non-vanishing parts spread over the remainder of the phase space that can be understood as being sensitive to changes of ω and therefore explain the non-zero QFI of the SR-KT.
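The (φ, z) phase-space coordinates used in these visualizations can be computed from a density matrix as follows (a small helper of our own; we normalize by j so that spin coherent states lie on the unit sphere):

```python
import numpy as np

def phase_space_coords(rho, Jx, Jy, Jz, j):
    """Map a state rho to (phi, z): r = (<Jx>, <Jy>, <Jz>) / j,
    phi = atan2(y, x), so that phi = z = 0 is the positive x-axis."""
    x, y, z = (np.real(np.trace(rho @ J)) / j for J in (Jx, Jy, Jz))
    return np.arctan2(y, x), z
```

For a spin coherent state pointing along the positive y-axis this returns (π/2, 0), matching the convention stated above.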
The state corresponding to RL-optimized kicks looks like a strongly squeezed state that almost encircles the whole sphere. Similar to spin squeezing, which is typically applied to the initial state as part of the state preparation, we interpret the squeezed distribution as particularly sensitive with respect to the precession dynamics. This is due to the reduced uncertainty along the precession trajectories, i.e., with respect to the φ coordinate. We provide clips of the evolution of the state distributions over time that illustrate how the RL agent generates the squeezed state. In particular, the squeezed state distribution can be seen as a feature the RL agent is aiming for with its policy. The distribution of RL-optimized kicks is shown in Fig. 3 (in Appendix F, we provide a finer resolution of the distribution of kicks): it is roughly periodic, with a period corresponding to a precession angle of π. Also note that for the SR-GKT the Wigner distribution has negative contributions, which are associated with non-classicality of the quantum state [55].
An advantage of the superradiant dynamics lies in its well-defined, simple classical limit [45], see also Appendix E; the lower two rows of panels in Fig. 4 show the corresponding classical phase-space distributions.

A broad damping regime is found where gains can be achieved. In the regime of small decoherence rates γ_sr, the RL agent can fight decoherence in such a way that the QFI exhibits continuous growth over the total time T_opt [see panels (a) and (b) in Fig. 3]. In comparison with the SR-T, the RL agent benefits from stronger damping in this regime and, therefore, the gain increases with the dissipation rate γ_sr. For larger decoherence rates, the RL agent can no longer fight decoherence in the same manner [see panels (c) and (d)].

The RL-optimized QFI is associated with a bound on the sensitivity (see Eq. 1) for a given measurement time T_opt. If the measurement time can be chosen arbitrarily, the sensitivity is associated with max_t I_ω(t)/t [33]. This rescaled quantity represents the standard figure of merit reported for experimental parameter estimation because it takes time into account as a valuable resource; the sensitivity is then given in units of the parameter to be estimated per square root of Hertz. With RL we therefore also maximize max_t I_ω(t)/t with respect to policies. Fig. 6 compares the SR-T with the SR-GKT, where the latter was optimized with RL in order to maximize the rescaled QFI. Note that the initial spin coherent state is centered around the positive y-axis, which means it is an eigenstate of the nonlinear kicks; kicks cannot induce spin squeezing at the very beginning of the dynamics. This changes when the spin precesses away from the y-axis. Therefore, it makes sense that the RL agent applies the strongest kick only after a precession by about π/2. The actions that the RL agent takes after the rescaled QFI has reached its maximum are irrelevant and can be attributed to random noise generated by the RL algorithm.
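Evaluating the rescaled figure of merit from a simulated QFI trace is a one-liner; the following trivial helper (names ours) makes the convention explicit:

```python
import numpy as np

def best_rescaled_qfi(times, qfi_trace):
    """max_t I(t)/t over a sampled trace; the corresponding sensitivity
    scales as 1/sqrt(max_t I(t)/t)."""
    times = np.asarray(times, dtype=float)
    return float(np.max(np.asarray(qfi_trace, dtype=float) / times))
```

Note that the maximum may be attained well before the end of the trace, which is exactly why actions taken after that point are irrelevant for this figure of merit.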
As we have seen, the interplay of nonlinear kicks and superradiant damping is very special. However, the QFI can also be increased significantly for other decoherence models, for instance in the case of an alkali-vapor magnetometer [33]. To demonstrate the performance of the RL agent in connection with another decoherence model, we take a look at phase damping, see Fig. 7. The behavior of the QFI of the PD-T is qualitatively similar to the superradiant case. The introduction of kicks, however, has a qualitatively different effect on the QFI. The RL agent can achieve improvements of the QFI for the PD-GKT at time T_opt (the largest time plotted in each panel of Fig. 7) compared with the QFI of the PD-T at the same time. Compared to the superradiant case, the improvements are rather small.
Notably, the policies found by the RL agent also differ from those for superradiant damping; for instance, the RL agent avoids kicks for large parts of the dynamics.

VI. DISCUSSION
This work builds on recent results on quantum-chaotic sensors [33]. Our aim is to optimize the dynamical control that was used in Ref. [33] to render the sensor dynamics chaotic.
Due to the high dimensionality of the problem we use techniques from reinforcement learning (RL). The control policies found with RL are tailored to boundary conditions such as the initial state, the targeted measurement time, and the decoherence model under consideration.
Using the example of superradiant damping, we demonstrate improvements in measurement precision and an improved robustness with respect to decoherence. A drawback of RL often lies in the expensive hyperparameter tuning of the algorithm. Here, however, we show that a basic RL algorithm (the cross-entropy method) can be used for several choices of boundary conditions with practically no hyperparameter tuning (no hyperparameter search was necessary; the only parameters chosen for convenience were those that directly influence the computation time).
In the example of superradiant damping, we unveil the approach taken by RL by visualizing the quantum dynamics with the help of the Wigner distribution of the quantum state.
This reveals that RL favors a policy that is reminiscent of spin squeezing. However, instead of squeezing the state only at the beginning of the dynamics, the squeezing is refreshed and enhanced in roughly periodic cycles in order to fight against the superradiant damping.
In the spirit of Ref. [33], these findings emphasize the potential that lies in the optimization of the measurement dynamics. We are optimistic that reinforcement learning can be used to tackle other problems in quantum metrological settings in order to achieve maximum measurement precision with limited quantum resources.

Appendix A

Table I shows the parameters of the control problem and of the optimization used in each example. We train n_agents RL agents for n_iterations iterations with n_episodes episodes in each iteration. Each episode is simulated until a total time T_opt is reached. Then we produce n_samples sample episodes with each trained RL agent and choose the best episode to plot the sample policies and gains.

Appendix B

Here we give further information on the neural network and the hyperparameters of the algorithm.
The input layer of the neural network is defined by the observation. The output layer is determined by the number of actions (two), and we choose 300 neurons in the hidden layer. The layers are fully connected. The hidden layer has the rectified linear unit (ReLU) as its activation function and the output layer has the softmax function as its activation function [56]. As cost function we choose the categorical cross entropy [56]. The share of best episodes σ_share is always 10%. The number of iterations and the number of episodes vary between settings, see Table I. The code implementation is based on an example by Jan Schaffranek.

In the case of γ_sr = 0.01, for both j = 2 and j = 3, we find relatively similar distributions of kicks, see panel (a) in Fig. 9. The most striking difference between the two policies for j = 2 and j = 3 is the comparatively strong kicks at the beginning of the sequence. By observing the time evolution of the Wigner function (see Supplemental Material), we find that these kicks essentially rotate the state by an additional angle π/2 about the z-axis. This leads to a phase shift of π/2 between the two policies [see panels (