Control of Stochastic Quantum Dynamics with Differentiable Programming

Controlling stochastic dynamics of a quantum system is an indispensable task in fields such as quantum information processing and metrology. Yet, there is no general ready-made approach to design efficient control strategies. Here, we propose a framework for the automated design of control schemes based on differentiable programming ($\partial \mathrm{P}$). We apply this approach to state preparation and stabilization of a qubit subjected to homodyne detection. To this end, we formulate the control task as an optimization problem where the loss function quantifies the distance from the target state and we employ neural networks (NNs) as controllers. The system's time evolution is governed by a stochastic differential equation (SDE). To implement efficient training, we backpropagate the gradient information from the loss function through the SDE solver using adjoint sensitivity methods. As a first example, we feed the quantum state to the controller and focus on different methods to obtain gradients. As a second example, we directly feed the homodyne detection signal to the controller. The instantaneous value of the homodyne current contains only very limited information on the actual state of the system, buried in unavoidable photon-number fluctuations. Despite the resulting poor signal-to-noise ratio, we can train our controller to prepare and stabilize the qubit to a target state with a mean fidelity around 85%. We also compare the solutions found by the NN to a hand-crafted control strategy.


Introduction
The ability to precisely prepare and manipulate quantum degrees of freedom is a prerequisite for most applications of quantum mechanics in sensing, computation, simulation and general information processing. Many relevant tasks in this area can be formulated as optimal control problems, and therefore, quantum control is a rich and very active research field; see [1,2] for two recent textbooks that provide a theoretical background and [3] for a recent review of important issues.
A typical goal of quantum control is to find a sequence of operations or parameter values (e.g., external field amplitudes) such that the quantum system under consideration is maintained in a certain target state or evolves in a desired fashion subject to additional boundary conditions (e.g., the fastest evolution for a prescribed maximum strength of the control fields). In the case of feedback control, the control sequence is determined based on a signal coming from the system [3,4]. The control task and its boundary conditions are typically specified by a loss function. To convert the optimal control problem into an optimization task, one introduces a parametric ansatz for feedback schemes, also known as controllers, and explores the parameter space to minimize the loss function.
Reinforcement learning (RL) [5,6] has been proposed as a suitable framework for the development of control strategies. In this framework, the controller (or agent) optimizes its strategy (the policy) based on a loss function (the rewards) obtained from repetitive interactions with the system to be controlled. In the context of quantum physics, RL has proven useful, e.g., for reducing the error rate in quantum gate synthesis [7], for autonomous preparation of Floquet-engineered states [8], and for optimal manipulation of many-qubit systems [9]. RL is a black-box setting, i.e., the agent has no prior knowledge about the structure of the system it interacts with, and has to develop its policy (explore the space of control parameters) by relying solely on its past interactions with the system. This makes RL very versatile, but requires many training episodes to find a good strategy.
Optimal control design is rarely performed directly on the (unknown) physical system itself. Instead, one trains the controller on a physical model of the real system. This means that rather than learning how to interact with an unknown environment, one actually starts with a lot of prior scientific knowledge about the system, namely its precise dynamical model. In the simplest cases, one can even solve the optimal control problem for this model analytically. Generally, the use of prior information leads to more data-efficient training [10,11]. In the context of the present paper, we use a physical model of the system to efficiently compute the loss function's gradients with respect to the parameters of the control ansatz. Naturally, having access to the gradients of the loss function can streamline the optimal control design tremendously.
In the case of quantum control, the model consists of a quantum state space and a parametric equation of motion. The dynamics of the system can be solved for fixed values of the parameters of the model and of the controller. Usually, this is done numerically. Then, the most naive way to obtain the loss function's gradients is to solve the dynamics for a set of parameter values in the neighborhood of each point and use finite difference quotients. However, this method is infeasible if the number of parameters is large, it suffers from floating-point errors, and it may be numerically unstable. To circumvent these issues, automatic differentiation (AD) has been proposed as another approach to calculating gradients numerically [12,13], a paradigm also known as differentiable programming (∂P) [14,15]. By backpropagating the loss function's sensitivity through the numerical simulation, one can compute gradients with a similar time complexity to that of solving the system's dynamics [16]. Recently, these techniques have been merged with deep neural networks (NNs) as an ansatz for controllers [10,17,18]. This is possible because the training of deep NNs is based on stochastic gradient descent through a large parameter space and becomes efficient when used in conjunction with ∂P-compatible solvers for the system's dynamics.
In this work, we develop such a physics-informed RL framework based on ∂P and NNs to control the stochastic dynamics of a quantum system under continuous monitoring [2,19]. Continuous measurements, such as photon counting and homodyne detection allow one to gain information on the random evolution of a dissipative quantum system [2]. This information can be used to estimate the state of the quantum system [20][21][22], to implement feedback protocols [23][24][25][26][27], to generate nonclassical states [28][29][30][31], and to implement teleportation protocols [32,33]. Continuous homodyne detection can be realized experimentally in the microwave [34,35] and optical regimes [20,36]. The time evolution of a monitored quantum system is described by quantum trajectories, which are solutions of differential equations driven by a Lévy process.
To illustrate our framework, we focus on a qubit subjected to continuous homodyne detection [34,37] described by a stochastic Schrödinger equation. We engineer a controller which provides a control scheme that fulfils a given state preparation task based on the measured homodyne current. This situation extends an earlier study [10], where it was demonstrated that ∂P can be efficiently used for quantum control in the context of (unitary) closed-system dynamics, i.e., when the dynamics follows an ordinary differential equation. The stochastic nature of the problem analyzed here renders the control task more challenging because the controller must adapt to the random evolution of the quantum state in each trajectory. Moreover, the instantaneous value of the homodyne current does not determine the actual state of the qubit. It is correlated only with the projection of the state onto the x-axis, and this signal is hidden in the noise which dominates the measured homodyne current [38,39]. Thus, the information about the state of the qubit at a given time must be filtered out from the time series of measurement results. This paper is organized as follows: In section 2, we describe the proposed setup of a qubit in a leaky cavity subjected to homodyne detection and derive the stochastic Schrödinger equation that describes its dynamics. We then discuss two ways to use the record of the homodyne detection signal in a feedback scheme to engineer a drive that can be applied to the qubit to perform a desired control task, e.g., state preparation or stabilization. We also introduce the concept of adjoint sensitivity methods that we use to efficiently compute gradients with respect to solutions of stochastic differential equations (SDEs). Section 3 describes the first feedback scheme in detail: here, we assume that the controller has direct access to the quantum state, e.g., through an appropriate filtering procedure applied to the measurement records. 
We compare three strategies, viz. a hand-crafted control scheme, one in which an NN continuously updates the control drive based on knowledge of the state, and a numerically less demanding one with a piecewise-constant control drive.

Figure 1. (a) A qubit coupled to a leaky cavity is continuously monitored by homodyne detection. The radiation emitted by the qubit is mixed with a local-oscillator laser with a complex amplitude β at a beamsplitter. The signal J(t), which is the difference of the photocurrents of the two detectors, is fed to a control agent, which applies a drive Ω(t) to the qubit to generate and stabilize a target state. (b) The different structures of the controller considered in sections 3 and 4, respectively. (c) Example of a typical homodyne detection signal J(t). (d) The detected homodyne signal is proportional to the quadrature ⟨σ_x⟩, which is hidden in the noise of the measurement [note the different axis scales in (c) and (d)]. The data have been integrated over a measurement interval δt = 10⁻³ κ⁻¹. The parameters of the physical model are ∆ = 20κ, Ω = 2κ.

Section 4 presents the second feedback scheme, in which the measured record of the homodyne current is fed directly to the NN representing the controller. In this setup, the NN must first learn how to filter the data to obtain information on the state of the system; only after that can it propose an efficient control strategy. We conclude in section 5 and discuss potential future applications.

A qubit under homodyne detection
We consider a driven two-level system with states |g⟩ and |e⟩. In a rotating frame, its Hamiltonian reads (ħ = 1)
$$H = \frac{\Delta}{2}\,\sigma_z + \frac{\Omega(t)}{2}\,\sigma_x, \qquad (1)$$
where σ_x, σ_z are Pauli matrices, Ω(t) is the Rabi frequency of the driving laser, and ∆ = ω_eg − ω_laser is the detuning between the qubit and the laser. The qubit can spontaneously decay into a photon field a(t) via the interaction Hamiltonian
$$H_{\mathrm{int}} = i\sqrt{\kappa}\,\big[\sigma_- a^\dagger(t) - \sigma_+ a(t)\big], \qquad (2)$$
where κ is the decay rate. The field operators a(t) and a†(t) satisfy the commutation relation [a(t), a†(t′)] = δ(t − t′). We assume that the field is initially in the vacuum state, ⟨a†(t)a(t′)⟩ = 0. Physical examples of such a system include a two-level atom inside a leaky single-mode cavity that can be adiabatically eliminated or an artificial atom, e.g., a superconducting qubit, coupled to a waveguide. The radiation emitted from the two-level system is monitored using continuous homodyne measurement, as depicted in figure 1(a).
In appendix A (see also [2,19]), we show that the evolution of the qubit is governed by a stochastic Schrödinger equation
$$\mathrm{d}|\tilde\psi(t)\rangle = \left[\left(-iH - \frac{\kappa}{2}\,\sigma_+\sigma_-\right)\mathrm{d}t + \sqrt{\kappa}\,\sigma_-\,J(t)\,\mathrm{d}t\right]|\tilde\psi(t)\rangle, \qquad (3)$$
where |ψ̃⟩ denotes an unnormalized qubit state. The instantaneous value of the measured homodyne current J(t) is a random variable satisfying
$$J(t) = \sqrt{\kappa}\,\langle\sigma_x\rangle_{\psi(t)} + \xi(t). \qquad (4)$$
Here, ⟨σ_x⟩_ψ(t) is the expectation value of σ_x at time t, and ξ(t) is a stochastic white-noise term satisfying E[ξ(t)ξ(t′)] = δ(t − t′), which stems from the shot noise of the local oscillator. Heuristically, ξ(t) can be considered to be the derivative of a stochastic Wiener increment, ξ(t) = dW(t)/dt, such that the contribution of the noise to the current integrated over a short time interval dt is described by a Wiener process:
$$J(t)\,\mathrm{d}t = \sqrt{\kappa}\,\langle\sigma_x\rangle_{\psi(t)}\,\mathrm{d}t + \mathrm{d}W(t). \qquad (5)$$
The ensemble averages of the stochastic Wiener increment dW satisfy E[dW(t)] = 0 and E[dW(t)²] = dt.

Several remarks are in order to better understand the dynamics of the system. First, equation (5) implies that the value of the current over a short interval J(t)dt contains very little information about the state |ψ(t)⟩ of the qubit. This can be seen from the (heuristically stated) vanishing signal-to-noise ratio
$$\mathrm{SNR} \sim \frac{\sqrt{\kappa}\,\langle\sigma_x\rangle_{\psi(t)}\,\mathrm{d}t}{\sqrt{\mathrm{d}t}} \propto \sqrt{\mathrm{d}t} \to 0 \quad \text{for } \mathrm{d}t \to 0. \qquad (6)$$
If ⟨σ_x⟩_ψ(t) were a constant signal, one could simply integrate the current J(t) over a time interval longer than 1/κ to increase the signal-to-noise ratio. However, this is not possible because relaxation changes the state of the two-level system on the timescale 1/κ. Thus, the low signal-to-noise ratio is an intrinsic feature of this homodyne detection scheme. This is illustrated in figures 1(c) and (d), where we show a simulation of the homodyne current J(t) together with the respective value ⟨σ_x⟩_ψ(t) for a single quantum trajectory. Second, equation (3) shows that the infinitesimal time evolution, and thus the quantum trajectory, is fully determined by the record of the measured homodyne current J_t and the values of the applied drive Ω_t, which are vectors containing the respective values of J(t) and Ω(t) from the starting time t₀ = 0 until time t.
In appendix A.4, we derive a closed-form expression for the operator D_t = D_t[J_t, Ω_t], which gives the mapping
$$|\tilde\psi(t)\rangle = D_t[J_t, \Omega_t]\,|\psi(0)\rangle \qquad (7)$$
between the states of the qubit at times t₀ = 0 and t. The operator D_t[J_t, Ω_t] can be interpreted as a filter determining the state of the qubit at time t from the values of the measured homodyne current and the applied drive. Finally, equation (3) does not preserve the norm of the state |ψ̃⟩, as indicated by the tilde. This is not a problem if one integrates equation (3) numerically, since one can renormalize the state after each time step. For analytical calculations, it is useful to add some correction terms (see appendix A.5 or [2,19]) such that the norm of the state is preserved up to second order in dt:
$$\mathrm{d}|\psi(t)\rangle = f(|\psi(t)\rangle, t)\,\mathrm{d}t + g(|\psi(t)\rangle, t)\,\mathrm{d}W(t). \qquad (8)$$
Here, the nonlinear drift and diffusion terms are
$$f(|\psi\rangle, t) = \left[-iH - \frac{\kappa}{2}\,\sigma_+\sigma_- + \frac{\kappa}{2}\,\langle\sigma_x\rangle_\psi\,\sigma_- - \frac{\kappa}{8}\,\langle\sigma_x\rangle_\psi^2\right]|\psi\rangle, \qquad (9)$$
$$g(|\psi\rangle, t) = \sqrt{\kappa}\left[\sigma_- - \frac{1}{2}\,\langle\sigma_x\rangle_\psi\right]|\psi\rangle. \qquad (10)$$
Note that equation (8) is a stochastic Schrödinger equation in the Itô form with multiplicative scalar noise.
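As a rough illustration of how equation (8) is integrated numerically, the following Python/NumPy sketch performs one Euler–Maruyama step with renormalization. The paper itself uses a higher-order Runge–Kutta–Milstein solver from the Julia package StochasticDiffEq.jl; the drift and diffusion below are the standard homodyne-detection expressions and should be checked against appendix A, and all parameter values are placeholders.

```python
import numpy as np

# Pauli matrices and lowering operator in the basis |e> = (1,0), |g> = (0,1)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
sm = np.array([[0, 0], [1, 0]], dtype=complex)   # sigma_-: |e> -> |g>
sp = sm.conj().T                                  # sigma_+

def homodyne_step(psi, Omega, dt, kappa=1.0, Delta=20.0, rng=np.random):
    """One Euler-Maruyama step of the normalized Ito SSE, equation (8)."""
    H = 0.5 * Delta * sz + 0.5 * Omega * sx       # rotating-frame Hamiltonian
    x = np.real(psi.conj() @ sx @ psi)            # <sigma_x>_psi
    drift = (-1j * H - 0.5 * kappa * (sp @ sm)) @ psi \
            + 0.5 * kappa * x * (sm @ psi) - 0.125 * kappa * x**2 * psi
    diff = np.sqrt(kappa) * ((sm @ psi) - 0.5 * x * psi)
    dW = rng.normal(0.0, np.sqrt(dt))             # Wiener increment
    psi = psi + drift * dt + diff * dW
    psi /= np.linalg.norm(psi)                    # renormalize after each step
    J_dt = np.sqrt(kappa) * x * dt + dW           # integrated current, eq. (5)
    return psi, J_dt
```

Iterating this step while a controller updates Ω(t) reproduces the simulation loop of figure 2 in its simplest form.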

Feedback control overview
In our control protocol, the results of the homodyne detection measurement determine the drive Ω(t) to be applied to the qubit. We consider two different schemes which are outlined in figure 1(b).
In the first scheme, discussed in section 3, we filter the homodyne signal to extract the system's exact state |ψ(t)⟩ at time t. Subsequently, the controller receives this state as an input and determines the drive Ω(t) to be applied next. Equations (7) and (A.33) give an explicit filtering procedure D_t[J_t, Ω_t] that determines the state of the qubit at time t from the records of the homodyne measurements J_t and the drive Ω_t, which are known in the experiment. Since we train the agent on simulated trajectories, we know the system's state at each time step. Therefore, we can skip the filter in figure 1(b) and directly feed the solution of the SDE solver at each time step back to our controller; in other words, we assume perfect filtering at all times. This situation thus corresponds to the feedback used in [10], with the exception that the deterministic evolution is replaced by an SDE. We will use this control scheme in section 3 to test different backpropagation methods and to compare different control strategies.

Figure 2. The control loop used to train the controller. In the forward pass, a controller, which is, in this work, implemented by a NN, maps the present quantum state |ψ(t)⟩ (see section 3) or a measurement of the homodyne current J(t) (see section 4) to a drive Ω(t). Then, an SDE is solved to determine the subsequent state and homodyne detection current J(t). A loss function L modeling the state-preparation objective and possible constraints is evaluated based on a quantum trajectory, i.e., a sequence of states. In the backward pass, the gradient of the loss function with respect to the parameters θ of the controller is evaluated by (adjoint) sensitivity methods (see section 2.4). This step incorporates physical knowledge of the system into the training process and is numerically more efficient than a model-blind gradient estimation.
The gradient of the loss function with respect to the parameters of the controller is used to update the control strategy over a series of training epochs.
In the second scheme, discussed in section 4, the controller obtains the homodyne current record J τ (t) at time t measured over some time interval [t − τ , t]. The NN forming the controller now has to simultaneously learn how to filter the signal from the noise and to predict the next action Ω(t). Such an implementation of the control protocol based only on J(t) is a challenging task because the signal of the system quadrature σ x ψ(t) is hidden in the noise, as discussed in section 2.1.

Workflow
The learning scheme that controls the stochastic dynamics of the continuously monitored qubit based on ∂P consists of three building blocks, as outlined in figure 2: a parameterized controller C, based on a NN, a model of the dynamics (expressed as an SDE), and a loss function.
At the beginning of each run, we initialize the system in an arbitrary state on the Bloch sphere,
$$|\psi(0)\rangle = \cos(\vartheta/2)\,|e\rangle + e^{i\varphi}\sin(\vartheta/2)\,|g\rangle, \qquad (11)$$
where |e⟩ and |g⟩ are the excited and ground states of the qubit in the z-basis, respectively. To ensure that the controller performs optimally for any initial state, we sample the angles ϑ and ϕ uniformly from their intervals [0, π] and [0, 2π], respectively. Depending on the chosen control scheme, see figure 1(b), the controller's input is either the quantum state |ψ(t)⟩ (section 3) or the last homodyne detection record in the form of a vector J_τ(t) gathered over some time interval [t − τ, t] (section 4). Moreover, in section 4, the controller also receives a vector Ω_m(t) of the m last control actions applied prior to the time t.
The controller then maps this input to the next value of the drive, Ω(t). Given |ψ(t)⟩ and Ω(t), we use the Runge–Kutta–Milstein solver from the StochasticDiffEq.jl package [40–42] to calculate the next state |ψ(t + dt)⟩ according to the SDE (8). This loop between the control agent and the SDE solver is iterated for all time steps. We store the quantum states |ψ(t_i)⟩ and the drive values Ω(t_i) at N uniformly spaced time steps {t_i}, i = 1, …, N, so as to evaluate the loss function. Hereafter, the set {|ψ(t_i)⟩, Ω(t_i)} is called the checkpoints. We minimize a loss function of the form [10,12,13,43]
$$L = \sum_\mu c_\mu L_\mu, \qquad (12)$$
where the terms L_μ encode case-specific objectives of the optimization process, see, e.g., [12]. Their relative importance can be controlled by the weights c_μ. We choose the weights c_μ empirically but, if necessary, they could also be tuned by means of hyperparameter optimization techniques [10]. To enforce high fidelity with respect to the target state over the whole control interval, we include in the loss function the average infidelity of the checkpoints |ψ(t_i)⟩ with respect to the target state |ψ_tar⟩,
$$L_F = \frac{1}{N}\sum_{i=1}^{N}\left(1 - |\langle\psi_{\mathrm{tar}}|\psi(t_i)\rangle|^2\right). \qquad (13)$$
This form of the loss function also leads to a time-optimal performance of the controller. The sum in equation (13) can be straightforwardly adjusted to focus on specific time intervals, for example the last few steps. We will consider the target state |ψ_tar⟩ = |e⟩ in the following. In addition to L_F, we include the term
$$L_\Omega = \frac{1}{N}\sum_{i=1}^{N}\Omega(t_i)^2 \qquad (14)$$
in the loss function to favor smaller amplitudes of the drive Ω(t_i). In section 4, we will find that this term is important to suppress the collapse of the NN toward a strategy where constant maximal pulses are applied during training. At this stage, we have calculated a quantum trajectory and evaluated its value of the loss function. In the next step, we must update the control strategy to decrease the value of the loss function.
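Assuming the checkpoints are stored as state/drive pairs, the weighted loss described above can be evaluated along one trajectory as in the following Python sketch. The quadratic form of the drive penalty and the weight values are our assumptions for illustration.

```python
import numpy as np

def loss(checkpoints, psi_tar, c_F=1.0, c_Omega=1e-3):
    """Weighted loss: average infidelity plus a drive-amplitude penalty.

    `checkpoints` is a list of (psi, Omega) pairs along one trajectory;
    the weights c_F, c_Omega are placeholder values.
    """
    N = len(checkpoints)
    # average infidelity of the checkpoints with respect to the target state
    L_F = sum(1.0 - abs(np.vdot(psi_tar, psi))**2 for psi, _ in checkpoints) / N
    # penalty favoring small drive amplitudes (quadratic form is an assumption)
    L_Omega = sum(Omega**2 for _, Omega in checkpoints) / N
    return c_F * L_F + c_Omega * L_Omega
```

For a trajectory that sits in the target state with zero drive, the loss vanishes, as expected.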
The derivative of the loss function L with respect to the parameters of the NN provides a meaningful update rule toward a better control strategy. Thus, we need to calculate the gradient ∇ θ L. This can be computed efficiently using the sensitivity methods for SDEs discussed next.

Adjoint sensitivity methods
The loss function L = L({|ψ(t i ) , Ω(t i )}), defined in equation (12), is a scalar function which explicitly depends on the checkpoints and implicitly on the parameters θ of the controller. In contrast to score-function estimators [44][45][46], such as the REINFORCE algorithm [47], we will incorporate the physical model into the gradient computation, as outlined in figure 2.
AD is a powerful tool to evaluate the gradients of numeric functions at machine precision [48]. A key concept in AD is the computational graph [49], also known as the Wengert trace, which is a directed acyclic graph that represents the sequence of elementary operations that a computer program applies to its input values to calculate its output values. The nodes of the computational graph are the elementary computational steps of the program, known as primitives. The outcome of each primitive operation, called an intermediate variable, is evaluated in the forward pass through the graph.
In forward-mode AD, one associates with each intermediate variable υ_j the value of its derivative
$$\dot\upsilon_{ji} = \frac{\partial \upsilon_j}{\partial \theta_i} \qquad (15)$$
with respect to a parameter θ_i of interest. The derivatives υ̇_ji are calculated together with the associated intermediate values υ_j in the forward pass, i.e., the gradient is pushed forward through the graph. This procedure must be repeated for each parameter θ_i; therefore, forward-mode AD scales poorly, in terms of computation time, with an increasing number of parameters {θ_i}.
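Forward-mode AD can be illustrated with dual numbers: each primitive propagates a (value, derivative) pair through the computational graph. The following toy Python class (our illustration, not the Julia implementation used in this work) supports only addition and multiplication:

```python
class Dual:
    """Minimal forward-mode AD: carry (value, derivative) through each primitive."""
    def __init__(self, v, d=0.0):
        self.v, self.d = v, d
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.v + other.v, self.d + other.d)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule applied alongside the value computation
        return Dual(self.v * other.v, self.d * other.v + self.v * other.d)
    __rmul__ = __mul__

# d/dx (x*x + 3*x) at x = 2: seed the input derivative with 1
x = Dual(2.0, 1.0)
y = x * x + 3 * x        # y.v = 10.0, y.d = 2*x + 3 = 7.0
```

Note that obtaining the gradient with respect to a second parameter would require a second pass with a different seed, which is exactly the poor parameter scaling described above.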
In contrast, reverse-mode AD traverses the computational graph backward from the loss function to the parameters θ_i by defining an adjoint process
$$\bar\upsilon_i = \frac{\partial L}{\partial \upsilon_i}, \qquad (16)$$
which is the sensitivity of the loss function L with respect to changes in the intermediate variable υ_i. Reverse-mode AD is very efficient in terms of the number of input parameters because one needs just a single backward pass after the forward pass to obtain the gradient with respect to all parameters {θ_i}. Thus, we always implement AD for the NN in the reverse mode. However, reverse-mode AD might be very expensive in terms of memory, because all the intermediate variables υ_i from the forward pass need to be stored for the backward pass. Therefore, reverse-mode AD scales poorly in terms of memory if the number of steps and parameters of the SDE solver increases. Whether forward-mode or reverse-mode AD is more efficient to calculate gradients of the loss function depends on the details of the control loop shown in figure 2.
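A minimal reverse-mode counterpart of the dual-number example makes the memory cost tangible: the forward pass records every node of the graph together with its local derivatives, and a single backward pass accumulates the adjoint of each intermediate variable. This toy Python sketch is again our illustration, not the implementation used in the paper:

```python
class Node:
    """Minimal reverse-mode AD: the forward pass records the graph and the
    local derivatives; the backward pass accumulates adjoints dL/dv."""
    def __init__(self, v, parents=()):
        self.v, self.parents, self.adj = v, list(parents), 0.0
    def __add__(self, other):
        return Node(self.v + other.v, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Node(self.v * other.v, [(self, other.v), (other, self.v)])

def backward(out):
    # A full implementation visits nodes in reverse topological order;
    # a plain stack suffices for the simple graph below.
    out.adj = 1.0                       # dL/dL = 1
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, local in node.parents:
            parent.adj += local * node.adj   # chain rule
            stack.append(parent)

a, b = Node(2.0), Node(3.0)
L = a * b + a                           # L = a*b + a = 8
backward(L)                             # a.adj = b + 1 = 4, b.adj = a = 2
```

Every `Node` created in the forward pass must stay alive until `backward` runs, which is precisely why reverse mode through a long SDE integration becomes memory-hungry.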

Algorithm 2: Continuously updated control drive. Input: |ψ(t₀)⟩, t₀, Ω(t₀) = 0. Result: L. Compute and store checkpoints.

As we illustrate now, it is possible to combine different AD methods in different parts of the computational graph. In sections 3.3 and 4, we will use algorithm 1, where the drive Ω(t) is piecewise constant, i.e., a constant value of Ω is applied over N_sub successive time steps between two checkpoints. In this case, the memory consumption of the reverse-mode AD of the NN is moderate, since the number of parameters {Ω(t_i)} grows only with the number of checkpoints rather than with the number of time steps. In contrast, in the SDE solver, the evaluation of the time evolution between two checkpoints, {|ψ(t_i)⟩, Ω(t_i)} → {|ψ(t_{i+1})⟩, Ω(t_{i+1})}, only depends on the single parameter Ω(t_i), which makes forward-mode AD very efficient [50]. Therefore, we nest both methods and use an inner forward-mode AD for the SDE solver and an outer reverse-mode AD for the remaining parts, i.e., the NN and the computation of the loss function.
Restricting oneself to a piecewise-constant control drive, however, prevents the controller from reacting instantaneously to changes of the state. In order to implement a fast control loop, the controller has to be placed in the drift term of the SDE (8). Thus, the parameters entering the solver are the NN parameters θ, see algorithm 2. The continuous adjoint sensitivity method [51–53] circumvents the resulting memory issues by introducing a new primitive for the whole SDE in the backward pass of the code. This new primitive is determined by the solution of another SDE problem, the adjoint SDE. Formally, defining the adjoint process as
$$a_\psi(t) = \frac{\partial L}{\partial |\psi(t)\rangle}, \qquad (17)$$
the adjoint SDE problem in the Itô sense satisfies the differential equation
$$\mathrm{d}a_\psi(t) = -a_\psi(t)\,\frac{\partial \big[f(|\psi(t)\rangle, \theta, t) - C_{IS}^{\psi(t)}\big]}{\partial |\psi(t)\rangle}\,\mathrm{d}t - a_\psi(t)\,\frac{\partial g(|\psi(t)\rangle, \theta, t)}{\partial |\psi(t)\rangle}\,\mathrm{d}W(t), \qquad (18)$$
integrated backward in time with the initial condition
$$a_\psi(t_N) = \frac{\partial L}{\partial |\psi(t_N)\rangle}, \qquad (19)$$
where the standard conversion factor $C_{IS}^{\psi(t)} = \frac{\partial g(|\psi(t)\rangle, \theta, t)}{\partial |\psi(t)\rangle}\,g(|\psi(t)\rangle, \theta, t)$ in equation (18) accounts for the required transformation from the Itô to the Stratonovich sense, and vice versa [54]. The gradients of the loss function a_θ(t₀) = ∇_θ L with respect to θ are then determined by the integration of
$$\mathrm{d}a_\theta(t) = -a_\psi(t)\,\frac{\partial \big[f(|\psi(t)\rangle, \theta, t) - C_{IS}^{\psi(t)}\big]}{\partial \theta}\,\mathrm{d}t - a_\psi(t)\,\frac{\partial g(|\psi(t)\rangle, \theta, t)}{\partial \theta}\,\mathrm{d}W(t) \qquad (20)$$
with the initial condition a_θ(t_N) = 0_{Dim[θ]}. This continuous adjoint sensitivity method will be used in section 3.2. Further details regarding the adjoint method and our Julia implementation [55,56] within the SciML ecosystem [11,16,40] are discussed in appendix C.2.

SDE control based on full knowledge of the state
We first investigate the scenario in which the controller maps the quantum state |ψ(t)⟩ to a new control parameter, C : |ψ(t)⟩ → Ω(t) ∈ [−Ω_max, Ω_max]. From a learning perspective, this is a major simplification, because the NN does not need to learn how to filter the homodyne current J(t) to determine the state |ψ(t)⟩. From the practical perspective of controller design, this approach assumes that there is already a filter module, which allows one to predict the state of the system from the measurement record and past control actions. In section 2.1 and appendix A.4, we discuss the implementation of such a filter in our case of a qubit subjected to homodyne detection. Note that if the initial state of the qubit is unknown, only a mixed state ρ_t can be obtained when the detection record is too short. Nevertheless, it is always possible to obtain a pure-state estimate ρ_t → |ψ(t)⟩ and recover our scenario, e.g., by projecting ρ_t onto the surface of the Bloch sphere. Alternatively, one can straightforwardly generalize our approach to the case of a stochastic quantum master equation describing the evolution of the system's density matrix in the presence of homodyne detection, see appendix A.5.
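One simple realization of the projection ρ_t → |ψ(t)⟩ mentioned above is to rescale the Bloch vector of ρ_t to unit length and read off the corresponding pure state. The NumPy sketch below is our illustration (convention |e⟩ = (1, 0)ᵀ as in equation (11)); it assumes ρ_t is not maximally mixed, since the zero Bloch vector has no preferred direction:

```python
import numpy as np

def purify_estimate(rho):
    """Project a qubit density matrix onto the Bloch-sphere surface and
    return the corresponding pure state (basis |e>, |g>)."""
    # Bloch vector of rho = (I + r . sigma) / 2
    r = np.array([2.0 * np.real(rho[0, 1]),
                  2.0 * np.imag(rho[1, 0]),
                  np.real(rho[0, 0] - rho[1, 1])])
    r = r / np.linalg.norm(r)            # rescale to the sphere surface
    theta = np.arccos(np.clip(r[2], -1.0, 1.0))
    phi = np.arctan2(r[1], r[0])
    # pure state with the same Bloch direction
    return np.array([np.cos(theta / 2.0),
                     np.exp(1j * phi) * np.sin(theta / 2.0)])
```

For instance, a state that has partially decayed toward |g⟩ but still leans toward the excited state is mapped back to a pure state close to |e⟩.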
For all numerical experiments, we fix the parameters of the physical model at ∆ = 20κ and Ω max = 10κ. In appendix D, we discuss the variation of these parameters and their impact on the fidelity obtained.

Hand-crafted strategy
The control operator σ_x in equation (1) induces a rotation about the x-axis. Therefore, a simple but very intuitive strategy to move the state upward at each time step is to compute the expectation value ⟨σ_y⟩_ψ(t) and to choose the direction of rotation depending on its sign. Specifically, if ⟨σ_y⟩ > 0 (or ⟨σ_y⟩ < 0), the controller should rotate (counter-) clockwise about the x-axis, as summarized in algorithm 3. Figures 3(a) and (b) show the fidelity and the associated drive Ω for this control scheme. Although conceptually straightforward, this hand-crafted control function is very efficient. The mean fidelity over the whole control interval is F̄_hc = 0.90 ± 0.13. We visualize the control strategy in figure 3(c), where we show the applied drive Ω as a function of the state on the Bloch sphere. The Bloch sphere [with spherical coordinates ϑ, ϕ, see also equation (11)] is mapped onto the tangential plane at z = 0, described by polar coordinates (R, Φ), using the stereographic projection
$$R = \cot(\vartheta/2), \qquad \Phi = \varphi. \qquad (21)$$
Evidently, the stereographic projection maps the south pole to R = 0 and the north pole to R = ∞. Throughout this paper, we truncate the value of R to an interval [0, 1], so we plot the applied drive values for the states of the southern hemisphere, see figures 3(c) and (d).
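The hand-crafted rule described above can be sketched in a few lines of Python. The overall sign convention relating sign(⟨σ_y⟩) to the sense of rotation about the x-axis is an assumption here and would have to be matched to the conventions of equation (1); algorithm 3 itself is not reproduced in this sketch:

```python
import numpy as np

sy = np.array([[0, -1j], [1j, 0]])   # Pauli-y in the basis |e> = (1,0), |g> = (0,1)

def handcrafted_drive(psi, Omega_max=10.0):
    """Bang-bang rule: pick the rotation direction about the x-axis from the
    sign of <sigma_y>, always driving at maximum strength (sign convention
    is an assumption)."""
    y = np.real(psi.conj() @ sy @ psi)
    return -np.sign(y) * Omega_max
```

Because the output switches abruptly between ±Ω_max whenever ⟨σ_y⟩ changes sign, this rule produces exactly the sudden jumps in Ω(t) visible in figure 3(b).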

Continuously updated control drive
We now apply a NN as the controller. We use a controller which changes Ω(t) at every time step of the solver based on the current state |ψ(t)⟩. This implements a high-frequency feedback loop but renders discrete AD methods very inefficient, since all parameters of the NN contribute to the feedback loop. Consequently, we use the continuous adjoint sensitivity method described in section 2.4 to calculate the gradients of the loss function. Figure 4(a) shows the smooth evolution of the loss function throughout the training of the (fully connected) NN, which converges to a configuration that can quickly reach fidelities with a mean value of about 0.9 and a modest standard deviation, see figure 4(b). The mean fidelity over the whole control interval is F̄_c = 0.90 ± 0.13.

Piecewise-constant control drive
We now reduce the control frequency and assume that the controller changes the action Ω(t) only every N sub time steps. This scenario is crucial in many physical situations, where the control loop is not fast enough to follow high-frequency changes in the physical system. In practice, this means that we evolve the state for N sub substeps between checkpoints {|ψ(t i ) , Ω(t i )} using a fixed value of Ω(t i ). The resulting piecewise-constant control scheme allows us to use the discrete forward-mode adjoint sensitivity method through the SDE solver combined with an outer reverse-mode AD for the rest of the control loop, as described in section 2.4. Although we restrict the rate at which Ω(t) can change, we find that the NN converges to a similar learning behavior with large fidelities F(t) similar to those of section 3.2, see figure 4(e). The mean fidelity over the whole control interval is F pw = 0.89 ± 0.10.

Comparison of the control strategies
Based on the results from a set of 256 trajectories, all three control strategies perform nearly equally well in terms of their average fidelities F ≈ 0.9. The piecewise-constant controller slightly outperforms the other two approaches by having the smallest relative dispersion of the mean fidelity F pw , which we attribute to the larger number of training epochs and the larger NN, see table B1.
The hand-crafted strategy achieves a large average fidelity but generates sudden jumps in the drive Ω(t), as shown in figure 3(b). Such a drive is hardly feasible from an experimental perspective. We find that NNs with moderate depths provide a smooth mapping between the input states and the drive, see figures 3(c) vs. 3(d), while maintaining high fidelity in the control interval. The signals generated by these protocols are experimentally more accessible, see figures 4(d) and (f). When necessary, specific terms can be added to L in equation (12) to strengthen various requirements on the controller's performance (e.g., the smoothness of the drive or bounds on the power input) as discussed in section 2.2. This is not the case for hand-crafted strategies, where an efficient implementation of these constraints might be impossible. Furthermore, in some cases, e.g., the mountain car problem, it is not straightforward to come up with any hand-crafted strategy to start with. In contrast, the RL and ∂P approach is easily adjustable to different physical systems.
There are two principal reasons why the fidelity F̄ only reaches 0.9 in our setup. First, F̄ is averaged over time, and the controller needs some time to align the qubit with the target state (starting from an arbitrary state). Second, the controlled qubit rotates about an axis in the x–z plane whose direction depends on the ratio ∆/Ω. Hence, even in the case Ω ≫ ∆, when the qubit rotates essentially about the x-axis, the drive can only bring the qubit all the way up to the north pole of the Bloch sphere if the current state lies on the great circle perpendicular to this axis. Therefore, the ratio ∆/κ and the maximum value of Ω set a limit on how close the qubit can come to the target state on average. In appendix D, we study scenarios with lower κ and observe that the average fidelity can increase and approach unity. We can thus conclude that F̄ ≈ 0.9 is not a limit of our design but rather originates from the restricted capability of the control operations considered here.

SDE control based on homodyne current
In this section, we will construct a controller which directly obtains the (noisy) measurement record of the homodyne current and determines the optimal control field Ω(t_i) for each time interval [t_i, t_{i+1}], see algorithm 4. We consider a controller formed by a slightly augmented NN architecture with fully connected layers (see figure B1). The acquisition of input data for the NN consists of the following steps: (a) The controller generates a piecewise-constant drive Ω(t_i) between two checkpoints at times t_i and t_{i+1}, as described in section 3.3. In the time window [t_i, t_{i+1}], we integrate the SDE (8) using N_sub substeps of length δt, as outlined in figure 5(a). We label these substeps with k. The input that the NN uses to predict the next Ω(t_{i+1}) is the vector J_τ(t_i) = [δJ_i1, …, δJ_iN_sub]ᵀ, where τ = N_sub δt is the length of the time interval over which we gather the data. Experimentally, δt can be interpreted as the detection time window of the photodetectors. According to equations (4) and (5), the homodyne measurement at the kth substep is
$$\delta J_{ik} = \sqrt{\kappa}\,\langle\sigma_x\rangle_{ik}\,\delta t + \delta W_{ik}. \qquad (22)$$
The first term corresponds to the quadrature signal ⟨σ_x⟩_ik measured over the detection window δt, which we assume to be approximately constant on timescales of the order of δt. The second term, δW_ik ≡ W_i(k+1) − W_ik, is the Wiener increment during the kth substep. Note that the quantity δJ_ik is dimensionless and solely represents the number of detected photons, as discussed in appendix A. (b) In addition, we provide the NN with information about the m last control parameters Ω_m(t_i) = [Ω(t_{i−1}), …, Ω(t_{i−m})]ᵀ. This equips the NN with a 'memory' of its actions, such that it can take into account how the sequence of preceding control drive amplitudes affected the performance. We empirically choose m such that the length of the input vector Ω_m(t_i) corresponds to one tenth of the length of J_τ(t_i).
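The assembly of the controller input from steps (a) and (b) can be sketched as follows; N_sub = 100 and m = 10 are placeholder values chosen only to respect the one-tenth ratio mentioned above, and the function assumes the histories are at least that long:

```python
import numpy as np

def controller_input(dJ_history, Omega_history, N_sub=100, m=10):
    """Assemble the NN input at checkpoint t_i: the last N_sub integrated
    current increments J_tau(t_i) and the m previous drive values Omega_m(t_i)."""
    J_tau = np.asarray(dJ_history[-N_sub:], dtype=float)     # [dJ_i1, ..., dJ_iN_sub]
    Omega_m = np.asarray(Omega_history[-m:], dtype=float)    # memory of past actions
    return np.concatenate([J_tau, Omega_m])                  # fed to the NN
```

The concatenated vector of length N_sub + m is then what the fully connected network maps to the next drive value.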
Given these inputs, the function of the controller is to provide an optimal mapping C : {J_τ(t_i), Ω_m(t_i)} → Ω(t_{i+1}). Figure 5(b) shows the evolution of the loss function L during the learning phase as a function of the training epochs. Note that there are two distinct plateaus. First, after a couple of hundred epochs, the NN develops a general strategy consisting of the application of a periodic drive that prevents the qubit from decaying to the ground state. Figures 5(c) and (d) show examples of the performance of the NN at epoch 400 to illustrate this phase of the training. This strategy is state-independent, yet it manages to maintain the mean fidelity of the simulated trajectories at around 0.5 over the whole control interval. Around epoch 2500, after a transition phase in which L oscillates substantially, the NN starts to provide state-sensitive control fields. The final performance of the controller in figures 5(e) and (f) reaches a mean fidelity of F_J = 0.79 ± 0.17 over the whole control interval. During the last 50 time steps of the control interval, on which we specifically focus during training via an adjusted loss-function term of the type shown in (13), the average fidelity is F_J^50 = 0.86 ± 0.12. At this stage, we infer that the NN first learns how to extract the signal from the noisy data before it develops a learning strategy similar to that seen in section 3. In contrast to the filter mentioned in section 2.1 and appendix A.4, this universal approach based on ∂P will also work if some parameters of the model are a priori unknown. For example, if the detuning ∆ were unknown, one could train the controller on an ensemble of M randomly chosen parameters {∆_k}_{k=1}^M, such that it learns how to deal with the general situation of arbitrary detuning. In this case, a straightforward filtering of the signals to obtain the state is infeasible, as this would require solving the filter for all possible values of the model's parameters; see equation (A.33).
Other options for signal filtering would be (recursive) backward filtering methods [57,58] or recurrent NNs, because their structure allows them to capture temporal correlations in the data [39]. Note that such filter methods are compatible with the two-step control approach described in section 3.

Discussion
In this work, we proposed a framework based on ∂P to automatically design feedback control protocols for stochastic quantum dynamics. As a test bed, we used a qubit subjected to homodyne detection, whose dynamics was given by a stochastic Schrödinger equation. Note, however, that our method can straightforwardly be applied to different physical systems and that it can be generalized to the case of stochastic quantum master equations.
In section 3, we demonstrated that a controller using an NN can be trained to prepare and stabilize a target state starting from an arbitrary initial state of the system when the NN obtains full knowledge of the instantaneous state at any time. The method generates a smooth drive Ω(t) while maintaining high fidelity with the target state over the whole control interval. Additional constraints on the performance of the controller can be implemented by adding further terms to the loss function. This makes the ∂P approach more versatile than tailoring control functions manually, which requires a unique approach for every new system and can even be infeasible for large quantum systems.
The key feature of our ∂P framework is the inclusion of an explicit model of the system's dynamics in the computation of the gradients of the loss function with respect to the parameters of the NN, i.e., the controller. Specifically, in section 3.2, we employed the recently developed continuous adjoint sensitivity method for gradient estimation through an SDE solver, which is memory efficient and, thus, allows us to study a high-frequency controller. One of the authors implemented these new continuous adjoint sensitivity methods in the DiffEqSensitivity.jl package within the open-source SciML ecosystem [55].
In section 4, we showed that the feedback control scheme can be based on the direct provision of a record of the homodyne current measurements to the NN without the need to filter the information using the actual state beforehand. Therefore, the NN must first learn how to filter the input data (with a poor signal-to-noise ratio) before it can predict the optimal state-dependent values of the control drive. Ultimately, the trained NN was able to reach fidelities above 85% in a target time interval for random initial states.
In future studies, the optimization of the loss function based on stochastic trajectories using adjoint sensitivity methods could be compared to alternative approaches. First, the solution to the stochastic optimal control problem in the specific case of Markovian feedback (as in section 3) is a Hamilton-Jacobi-Bellman equation [11,54]. The solution of this partial differential equation, with the same dimensions as the original SDE, may directly give the optimal drive [59]. However, solving this partial differential equation with a mesh-based technique is computationally demanding, and mesh-free methods, e.g., those based on NNs, also require a (potentially costly) training procedure [60]. Second, the expected values of the loss functions could be optimized by leveraging the Koopman expectation for direct computation of the expected values from stochastic and uncertain models [61]. Additionally, one could approach this control problem by using an SDE moment expansion to generate ordinary differential equations for the moments and apply a closure relationship [62]. Additional research is required to ascertain the efficiency of these approaches in comparison to our method.
The results reported in this paper imply that ∂P is a powerful tool for the automated design of quantum control protocols. Further experimental needs, e.g., finite time lag between the measurement and the applied drive, finite-temperature effects, or imperfect homodyne detection, can be straightforwardly incorporated into this method. Thus, our work introduces a new perspective on how prior physical knowledge can be encoded into machine learning tools to construct a universal control framework. Besides the control application demonstrated here, the ∂P paradigm can also be adopted to solve other inverse problems, such as estimating model parameters from experimental data [63][64][65]. An interesting perspective for future work is to extend our framework to control-assisted quantum sensing and metrology.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/frankschae/Control-of-Stochastic-Quantum-Dynamics-with-Differentiable-Programming [66].

Appendix A. Continuous homodyne detection
A.1. Quantum trajectories: monitoring the spontaneous emission of a qubit

Consider a two-level atom interacting with a free photon. In the case of discrete photon modes, the interaction is given by the usual Hamiltonian Ĥ_int = ig(σ̂_+ â − σ̂_- â†), where ↠and â are the bosonic creation and annihilation operators, respectively, fulfilling the commutation relation [â, â†] = 1, and where g is a coupling constant with the dimension of energy. In contrast to the notation used in the main text, in this appendix we mark operators by a hat to distinguish them from scalars. This Hamiltonian results from the dipole interaction written in the form Ĥ_int ∝ (â + â†)σ̂_y and application of the rotating-wave approximation. However, when addressing a decay to the continuum, the photons are represented by quantum field operators with the commutation relation [â_t, â_s†] = δ(t − s), which thus have units of the square root of energy (or, equivalently, the inverse square root of time). The interaction Hamiltonian takes the form

Ĥ_int(t) = i√κ (σ̂_+ â_t − σ̂_- â_t†),

where κ is the decay rate, again with the dimension of energy, like g.
The field propagator in the interaction picture can be formally written as Û_τ = T exp(−i ∫_t^{t+τ} Ĥ_int(s) ds), where T is the time-ordering operator. Expanding Û_δt for short times δt, we find

Û_δt ≈ 1 + √(κδt) (σ̂_+ â − σ̂_- â†) − (κδt/2) σ̂_+σ̂_-,

with â = (1/√δt) ∫_t^{t+δt} â_s ds. Here, the second-order term provides an important O(δt) contribution. Higher-order terms only contribute at O(δt^{3/2}) and can be safely neglected, and we also assume that δt is short on the timescale of the internal qubit dynamics. Assuming that the free field is originally in the vacuum state |0⟩, we can write

Û_δt |ψ⟩|0⟩ = (1 − (κδt/2) σ̂_+σ̂_-)|ψ⟩|0⟩ − √(κδt) (σ̂_-|ψ⟩)|1⟩,

where |1⟩ = â†|0⟩ is a properly normalized single-photon state of the field, and |e⟩ denotes the excited qubit state. The simplest way to monitor the qubit state |ψ⟩ is then to continuously measure the intensity of the outgoing field. This defines a Poissonian process, as the probability of detecting a photon in a time interval δt reads p_1 = κδt ⟨ψ|σ̂_+σ̂_-|ψ⟩ = κδt |⟨e|ψ⟩|², and the probability of no detection is p_0 = 1 − p_1. Thus, we obtain the standard unraveling of the spontaneous decay process:

|ψ⟩ → σ̂_-|ψ⟩/‖σ̂_-|ψ⟩‖ with probability p_1, |ψ⟩ → (1 − (κδt/2) σ̂_+σ̂_-)|ψ⟩/√p_0 with probability p_0. (A.5)
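This Poissonian unraveling can be checked with a few lines of Monte Carlo. The sketch below (a minimal pure-Python illustration; the values of κ, δt, and the number of trajectories are arbitrary choices) propagates quantum-jump trajectories of a qubit initialized in |e⟩ and verifies that the ensemble average of the excited-state population reproduces the exponential decay e^(−κt):

```python
import math
import random

random.seed(0)

kappa, dt, T = 1.0, 0.01, 1.0     # decay rate, time step, total time (illustrative)
steps = int(T / dt)
n_traj = 2000

def trajectory():
    ce, cg = 1.0, 0.0                       # amplitudes; start in the excited state |e>
    for _ in range(steps):
        p1 = kappa * dt * abs(ce) ** 2      # photon-detection probability in this step
        if random.random() < p1:
            ce, cg = 0.0, 1.0               # jump: |psi> -> sigma_minus |psi>
        else:
            ce *= 1.0 - 0.5 * kappa * dt    # no-jump evolution
            norm = math.sqrt(abs(ce) ** 2 + abs(cg) ** 2)
            ce, cg = ce / norm, cg / norm   # renormalize the conditional state
    return abs(ce) ** 2

# Ensemble average of the excited-state population at time T.
pe = sum(trajectory() for _ in range(n_traj)) / n_traj
```

Averaged over many trajectories, `pe` approaches exp(−κT), recovering the unconditional decay law from the stochastic unraveling.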

A.2. Weak homodyning
An alternative to photon-counting detection is to mix the outgoing light with coherent light from a local oscillator laser using a beamsplitter, and to measure the intensities at both output ports of the beamsplitter.
To start with, we first consider the situation where the intensity of the local oscillator field is comparable to that of the emitted light. To lowest order, the local oscillator state is given by the expansion of a coherent state using only the vacuum and single-photon states, |β⟩ ≈ |0⟩ + β|1⟩, where the single-photon state of the b̂-mode is defined identically to that of mode â. Mixing the two states on a 50:50 beamsplitter yields a superposition of states with photon numbers n and m at the two output ports. Introducing the photocurrent variable q = n − m, which measures the intensity difference of the two outputs, we see that there are three possibilities, q = −1, 0, 1. The respective probabilities of the three outcomes are given by the norms of the three different terms in equation (A.7), and the three possible post-measurement states of the system are proportional to those terms. By setting β = 0 (i.e., no mixing with the local oscillator field), we recover the previous case where the qubit simply has a chance to decay, with p_{q=1} + p_{q=−1} = p_1. The only difference is that the emitted photon is randomly split between the two detectors, with no information about the state of the qubit contained in the sign of q.

A.3. Strong homodyning
Finally, we consider the situation treated in the main text. The standard homodyne measurement requires mixing the signal with a strong coherent local oscillator field, |β| ≫ 1. Recall that, here, the Fock states |n⟩ = (b̂†)ⁿ/√(n!) |0⟩ are defined using the creation operator b̂† = (1/√δt) ∫_t^{t+δt} b̂_s† ds, which describes the field impinging on the lower port of the beamsplitter during the interval δt [see figure 1(a)]. For modes â and b̂ to match, the local oscillator laser β has to be in resonance with the drive laser Ω because, in the lab frame, the field â_t picks up the phase of σ̂_- rotating at the frequency of the drive laser.
When the modes match, the 50:50 beamsplitter transformation takes the form â → (â + b̂)/√2, b̂ → (b̂ − â)/√2. In particular, this implies Û_BS |0, β⟩ = |β/√2, β/√2⟩, i.e., a coherent state |β⟩ of mode b̂ is split into two coherent states of half the intensity when mixed with the vacuum state |0⟩ of mode â on a 50:50 beamsplitter. Using the last two expressions, we can write down the overall state of the qubit, the spontaneously emitted radiation, and the homodyne laser field after the beamsplitter. To do so, let us compute Ũ_δt, the operator mapping the state of the qubit at time t to the state of the qubit and the two detected modes at time t + δt. We can further simplify this expression by noting the photon-number operator n̂ = â†â of mode â. Labeling the photon-number operator of the b̂-mode by m̂ = b̂†b̂, we can then rewrite equation (A.13) in a very intuitive form. Again, Ũ_δt |ψ_t⟩ directly gives us the state of the qubit and the detected modes, while

⟨n, m| Ũ_δt |ψ_t⟩ = √(p_{n,m}) |ψ_{t+δt}⟩|_{n,m} (A.16)

gives the conditional state of the qubit together with the probability p_{n,m} of detecting n and m photons, respectively. In the measurement process, the superpositions of states with different photon numbers collapse, such that the post-measurement state is a classical statistical mixture of the possible outcomes. One easily sees from equation (A.15) that |ψ_{t+δt}⟩|_{n,m} = |ψ_{t+δt}⟩|_{n−m}, i.e., only the difference (n − m) between the photon counts reveals information about the state of the qubit, while the sum (n + m) only describes the shot noise of the local oscillator. It is therefore sufficient to keep the difference q = n − m and discard the sum, which defines the post-measurement state. In a slight abuse of notation, we can formally introduce the joint quantum state of the qubit and the difference in the counts, q, as Û_δt |ψ_t⟩, with q̂|q⟩ = q|q⟩, which gives rise to the same post-measurement state. Here, the distribution of q in the local-oscillator state |Φ_β⟩ involves the modified Bessel function of the first kind, I_n(z).
Next, we make use of the fact that the distribution of q on the right-hand side of equation (A.20) is closely approximated by the normal distribution N(0, β²) in the limit β ≫ 1. Therefore, we can replace the state Φ_β of an integer-valued q in equation (A.19) with its counterpart for a continuously valued and normally distributed q, and we also introduce a continuously valued operator q̂.
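This Gaussian limit can be checked numerically. When each output port receives a coherent state of mean photon number β²/2, the count difference q = n − m follows a Skellam distribution with zero mean and variance β², which approaches N(0, β²) for β ≫ 1. A minimal sampling sketch (pure Python; the value of β² is an arbitrary choice, and only the local-oscillator shot noise is simulated, i.e., the qubit's single photon is neglected):

```python
import math
import random

random.seed(2)

beta2 = 100.0        # |beta|^2, local-oscillator photons in the window (assumption)
lam = beta2 / 2.0    # mean photon number at each beamsplitter output port

def poisson(lam):
    """Knuth's Poisson sampler; adequate for moderate lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

N = 20000
# photocurrent variable q = n - m: difference of the two detector counts
qs = [poisson(lam) - poisson(lam) for _ in range(N)]
mean = sum(qs) / N
var = sum((q - mean) ** 2 for q in qs) / N
```

For β² = 100, the sample mean stays near 0 and the sample variance near β², consistent with the N(0, β²) approximation invoked above.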
In the regime of interest, β ≫ 1, the actual value of β does not play a role. To eliminate it, recall that β² = Iδt is the intensity of the local oscillator laser in the time window δt, so it is more convenient to work with the laser power I, which is independent of the choice of the time window δt. Then, we can rescale the photon count difference to j = q/√I to eliminate the laser intensity. Equation (A.19) keeps its form under this rescaling, where the initial state |Φ⟩ of the rescaled j can be obtained from equation (A.21). This also allows us to define the homodyne current of the main text, J = j/δt, as the photon count difference per unit time. At this point, note that equation (A.23) already implies equation (3). Let us now show how the measured value of j is distributed. From the marginal distribution P(j) of j after the measurement, one easily deduces the expected value and the variance of j,

E[j] = √κ ⟨σ̂_x⟩ δt, Var[j] ≈ δt.

Clearly, j has the form of a Wiener process with drift,

δj = √κ ⟨σ̂_x⟩ δt + δW,

where δW is the increment of a Wiener process over an interval δt (a normally distributed random variable with zero mean and variance δt). The last step is to include the Hamiltonian of the qubit, Ĥ = (1/2)(∆σ̂_z + Ω(t)σ̂_x), and to set dt = δt such that I⁻¹ ≪ dt ≪ κ⁻¹, ∆⁻¹, Ω⁻¹, in order to get equation (A.28).

A.4. Formal solution of the stochastic dynamics as a filter
The expression for the infinitesimal time evolution Û_dt we just derived can be composed for all time intervals to define the evolution over a long period [0, t], where each time step introduces a new quantum system for the photon detection difference measured during the corresponding infinitesimal interval. The joint state of the qubit and all values of the observed homodyne current for s ∈ [0, t] then follows at the final time t. Hence, for any fixed values of the measured current J(s) = j_s/ds, we can find the (unnormalized) state of the qubit that is conditioned on these outcomes, |ψ̃_t⟩ ∝ D̂_t |ψ_0⟩, where the proportionality constant c_J = ∏_s ⟨j_s|Φ_s⟩ is a scalar that is independent of the input state |ψ_0⟩. Thus, the state of the qubit at time t conditioned on the homodyne detection record reads |ψ_t⟩ = D̂_t|ψ_0⟩/‖D̂_t|ψ_0⟩‖, and, for mixed initial states, ρ̂_t ∝ D̂_t ρ̂_0 D̂_t†. The operator D̂_t can be thought of as a filter relating the record of the drive fields Ω_t and the values of the measured homodyne current J_t to the map between the states of the qubit at times 0 and t, represented here by a two-by-two complex matrix. In a real experiment, D̂_t can be computed by, e.g., discretizing the integral. Notably, even if the initial state of the qubit is unknown, e.g., ρ̂_0 = (1/2)𝟙, after a certain characteristic time the state ρ̂_t becomes pure. This is because the stochastic dynamics essentially decouples the state of the system at time t from its state in the remote past. On the other hand, the measured homodyne current J_t reveals information about the unknown initial state, and we have just shown how to filter this information; the same filter D̂_t applies for any two initial states |ψ_0⟩ and |ϕ_0⟩.
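The purification of the filtered state can be illustrated with a short simulation. The sketch below (pure Python; it assumes a textbook homodyne Kraus operator M(δj) = 1 − (iĤ + (κ/2)σ̂_+σ̂_-)δt + √κ σ̂_- δj, consistent with the record statistics δj = √κ⟨σ̂_x⟩δt + δW, although prefactor conventions vary between references; all parameter values are illustrative) generates a measurement record from a 'true' pure state, builds the filter D̂_t as a product of per-step Kraus operators, applies it to the maximally mixed state ρ̂_0 = 𝟙/2, and checks that the purity of the filtered state approaches unity:

```python
import math
import random

random.seed(3)

kappa, Delta, dt = 1.0, 1.0, 0.01   # decay rate, detuning, time step (illustrative)
steps = 1000                        # total time T = 10 / kappa

I2 = [[1 + 0j, 0j], [0j, 1 + 0j]]
sm = [[0j, 0j], [1 + 0j, 0j]]       # sigma_minus in the (|e>, |g>) basis
Pe = [[1 + 0j, 0j], [0j, 0j]]       # sigma_plus sigma_minus = |e><e|
H = [[Delta / 2 + 0j, 0j], [0j, -Delta / 2 + 0j]]  # qubit Hamiltonian, Omega = 0

def mm(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def dag(A):
    return [[A[j][i].conjugate() for j in range(2)] for i in range(2)]

def kraus(dj):
    # assumed textbook homodyne Kraus operator (conventions vary)
    return [[I2[i][j] - (1j * H[i][j] + 0.5 * kappa * Pe[i][j]) * dt
             + math.sqrt(kappa) * sm[i][j] * dj for j in range(2)] for i in range(2)]

psi = [1 / math.sqrt(2) + 0j, 1 / math.sqrt(2) + 0j]  # 'true' initial state
D = [row[:] for row in I2]                            # filter operator D_t

for _ in range(steps):
    sx = 2.0 * (psi[0].conjugate() * psi[1]).real     # <sigma_x> of the true state
    dj = math.sqrt(kappa) * sx * dt + random.gauss(0.0, math.sqrt(dt))
    M = kraus(dj)
    psi = [M[0][0] * psi[0] + M[0][1] * psi[1],
           M[1][0] * psi[0] + M[1][1] * psi[1]]
    n = math.sqrt(abs(psi[0]) ** 2 + abs(psi[1]) ** 2)
    psi = [psi[0] / n, psi[1] / n]                    # normalized conditional state
    D = mm(M, D)
    f = math.sqrt(sum(abs(D[i][j]) ** 2 for i in range(2) for j in range(2)))
    D = [[D[i][j] / f for j in range(2)] for i in range(2)]  # rescale to avoid under/overflow

rho = mm(D, dag(D))                 # filtered state for rho_0 = 1/2, up to normalization
tr = (rho[0][0] + rho[1][1]).real
rho2 = mm(rho, rho)
purity = (rho2[0][0] + rho2[1][1]).real / tr ** 2
```

Starting from purity 1/2 for ρ̂_0 = 𝟙/2, the filtered state purifies after a few lifetimes 1/κ, illustrating the decoupling from the remote past described above.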

A.5. Derivation of the norm-preserving stochastic Schrödinger equation
We start with the stochastic Schrödinger equation (A.28), conditioned on the homodyne signal J(t), which does not preserve the norm of the state during the time evolution. This can be compensated for by adding a correction term which cancels the last term in equation (A.37) without introducing any new norm-nonconserving terms. On the level of the stochastic Schrödinger equation, this term can be generated by adding a corresponding contribution, which gives rise to equation (8) of the main text.

Table B1. Hyperparameters employed to train the neural networks in the case of continuously updated control parameters (section 3.2), piecewise-constant control parameters with knowledge of the state |ψ⟩ (section 3.3), piecewise-constant control parameters with knowledge of the homodyne current (section 4), and the closed-system dynamics described by the Schrödinger equation (appendix C.1). For all simulations, the number of checkpoints is N = 150. The value of N_sub gives the number of solver substeps between checkpoints. The coefficient c_F50 refers to the modification of the loss function (13) which is limited to the last 50 steps of the control interval. The specifications of the solvers are provided in the SciML documentation [40]. 'LLS n', 'LLA n' and 'LLC n' denote the nth linear layer of the state-aware, action-aware, and combination-aware networks, respectively. The number of input and output channels of each layer is specified in brackets.

Appendix B. Neural network architectures and hyperparameters
The hyperparameters used in the control tasks are summarized in table B1. The architecture of the NN used in section 4 is shown in figure B1. In all NNs, we use ReLUs as activation functions for the hidden layers and the softsign activation function for the last layer. Modifications of this simple architecture, e.g., through the application of recurrent neural networks to capture temporal correlations in the homodyne current for the SDE control in section 4, might be essential for the control of complex many-body quantum systems.
In this work, we used NNs as universal function approximators. While this approach is very general, other controller choices, e.g., Chebyshev polynomials or Fourier basis expansions, could boost the performance for low-dimensional inputs, as in the case of a qubit, because they can be tailored to the problem at hand. The use of the sparse identification of nonlinear dynamics method [67][68][69] or symbolic regression tools [70] could further allow one to replace the trained NNs by a symbolic description based on a pre-defined library of operators.

Figure B1. Scheme of the NN used in section 4, which consists of three segments of fully connected networks. The first one, the state-aware NN_sa, obtains the vector J_τ(t_i), which bears information about the state of the controlled system in the form of its quadrature ⟨σx⟩ over the time interval [t_i − τ, t_i]. Similarly, the action-aware NN_aa takes as input a vector Ω_m(t_i) of the last m actions taken prior to time t_i. The outputs of NN_sa and NN_aa are concatenated into a vector which enters the last segment, which we refer to as the combination-aware NN_ca. The final output is the next drive value Ω(t_{i+1}) to be applied. The parameters of the network can be found in table B1.
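A minimal forward pass through such a three-segment architecture can be sketched as follows (pure Python with randomly initialized weights; the layer widths and input lengths are placeholders and do not correspond to the hyperparameters in table B1 — a real controller would, of course, train these weights):

```python
import math
import random

random.seed(4)

def dense(x, n_out):
    """A linear layer with randomly initialized weights (illustration only)."""
    n_in = len(x)
    w = [[random.uniform(-1, 1) / math.sqrt(n_in) for _ in range(n_in)]
         for _ in range(n_out)]
    return [sum(wi[j] * x[j] for j in range(n_in)) for wi in w]

def relu(v):
    return [max(0.0, x) for x in v]

def softsign(v):
    return [x / (1.0 + abs(x)) for x in v]

def controller(J_tau, Omega_m):
    h_state = relu(dense(J_tau, 32))         # state-aware segment: homodyne record
    h_action = relu(dense(Omega_m, 8))       # action-aware segment: past drives
    h = relu(dense(h_state + h_action, 16))  # combination-aware segment on the concatenation
    return softsign(dense(h, 1))[0]          # next drive value, bounded to (-1, 1)

omega_next = controller([random.gauss(0, 0.1) for _ in range(100)],  # mock J_tau
                        [0.0] * 10)                                  # mock Omega_m
```

The softsign output layer naturally bounds the predicted drive amplitude, mirroring the activation choices described in the text.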

Appendix C. Continuous adjoint sensitivity method for ODEs and SDEs
In this appendix, we discuss the continuous adjoint sensitivity method for ordinary differential equations (ODEs) and its generalization to the SDEs used in the main text. In appendix C.1, the continuous adjoint sensitivity method for ODEs is used to compare the stochastic control scenario discussed in section 3.2 with the associated unitary control scenario in the case of a closed system. In appendix C.2, we provide an intuitive understanding of the stochastic adjoint process and discuss technical details regarding the continuous adjoint sensitivity method for SDEs and its implementation [55]. In all implementations, we use an isomorphism to map the complex amplitudes of the quantum state to real numbers, as required by the AD backend [71].

C.1. Quantum control of a qubit in a closed system
In this section, we aim to control the closed-system dynamics of a qubit given by the Schrödinger equation

i d|ψ(t)⟩/dt = H |ψ(t)⟩, (C.1)

with the Hamiltonian

H = (1/2)(∆σ_z + Ω(t)σ_x), (C.2)

where ∆ is the qubit transition frequency [10]. As in section 3.2, we choose an NN with parameters θ as the controller ansatz and allow the NN to change the control drive Ω(t) at every time step, based on the state |ψ(t)⟩. The initial states are uniformly distributed on the Bloch sphere; see equation (11). We compute the forward pass, equation (C.1), with the adaptive Tsitouras 5/4 embedded Runge-Kutta (Tsit5) scheme as implemented in the DifferentialEquations.jl package [40]. Given this solution, we use the loss function (12) with weights c_F = 1, c_Ω = 0. As discussed in section 2.3, it depends on the checkpoints at times t_0 < t_1 < ... < t_N.

As discussed in section 2.4, the continuous adjoint sensitivity method circumvents the memory issues of discrete reverse-mode AD and scales better with the number of parameters than forward-mode AD. To derive this adjoint method, one first adds a zero to the loss function, equation (12), and rewrites it as a time integral,

I(θ) = L(θ) − ∫_{t_0}^{t_N} λ†(t) [ d|ψ(t)⟩/dt − f(|ψ(t)⟩, θ, t) ] dt,

where we inserted L_F, as defined in equation (13), and introduced the Lagrange multiplier λ, such that I(θ) = L(θ) and ∇_θ I(θ) = ∇_θ L(θ). After an integration by parts and a rearrangement of terms in ∇_θ I(θ), one finds that computing the gradients ∇_θ L requires the evaluation of the time evolution of the quantity a_ψ(t) = λ†(t). This quantity equals the gradient ∇_{ψ(t)} L of the loss function with respect to the state |ψ(t)⟩ for all times t and is called the adjoint process. The associated adjoint ODE problem satisfies the differential equation [11,51,[72][73][74]]

d a_ψ(t)/dt = −a_ψ(t) ∂f(|ψ(t)⟩, θ, t)/∂ψ(t), (C.5)

with the initial condition a_ψ(t_N) = ∇_{ψ(t_N)} L. This adjoint ODE is defined backwards in time from t_N to t_0. To compute the vector-Jacobian products in equation (C.5), one needs to know the value of the state |ψ(t)⟩ along its entire trajectory, which was computed in the forward pass.

Thus, we must either store these states or recompute them by solving the ODE backwards in time, starting from the final value |ψ(t_N)⟩, together with the adjoint process; this does not introduce a significant memory overhead. Computing the gradients ∇_θ L requires yet another integral, which depends on the original and the adjoint process, |ψ(t)⟩ and a_ψ(t), respectively,

d a_θ(t)/dt = −a_ψ(t) ∂f(|ψ(t)⟩, θ, t)/∂θ,

with the initial condition a_θ(t_N) = 0, such that a_θ(t_0) = ∇_θ L. Therefore, the gradients of the loss function with respect to the neural-network parameters θ can be obtained from a single (adjoint) ODE with an augmented state given by |ψ(t)⟩, a_ψ(t), and a_θ(t).
Because of the reversion of the ODE, equation (C.7) may be numerically unstable. Therefore, we use an interpolating adjoint algorithm to increase stability [11]. When solving the augmented state backwards in time, we recompute the forward pass sequentially for all time intervals [t i , t i+1 ] between two checkpoints. Then, fourth-order interpolations of the recomputed forward pass are used to compute the vector-Jacobian products for the reverse pass.
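The mechanics of this adjoint method can be illustrated on a toy scalar ODE, dψ/dt = −θψ with loss L = ψ(T)²/2, for which the gradient dL/dθ = −Tψ(T)² is known in closed form. The following sketch (pure Python, forward and backward Euler; the step size and parameter values are arbitrary, and this is not the paper's stochastic Schrödinger equation) stores the forward states, integrates the adjoint backwards, accumulates the θ-gradient, and compares it with the analytic result:

```python
import math

theta, psi0, T, dt = 0.5, 1.0, 1.0, 1e-3
N = int(T / dt)

# Forward pass: d(psi)/dt = f(psi, theta) = -theta * psi, states stored at every step.
psis = [psi0]
for _ in range(N):
    psis.append(psis[-1] + dt * (-theta * psis[-1]))

psiT = psis[-1]
L = 0.5 * psiT ** 2                      # toy loss on the final state

# Backward pass: da/dt = -a * (df/dpsi) = theta * a, from t_N down to t_0,
# accumulating dL/dtheta = integral of a(t) * (df/dtheta) dt with df/dtheta = -psi.
a = psiT                                  # a(t_N) = dL/dpsi(t_N)
grad = 0.0
for k in range(N, 0, -1):
    grad += dt * a * (-psis[k])           # accumulate the theta-gradient integral
    a -= dt * (theta * a)                 # Euler step of the adjoint, backwards in time

# Closed-form reference gradient: dL/dtheta = -T * psi(T)^2.
grad_exact = -T * (psi0 * math.exp(-theta * T)) ** 2
```

Up to the O(dt) discretization error of the Euler scheme, the adjoint gradient matches the analytic value, which is the consistency check one would also perform against finite differences in practice.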
The results of the closed-system control task are shown in figure C1. We observe a very fast and smooth learning process with our physics-informed RL framework. The target state |e is an eigenstate of the Hamiltonian (C.2) for Ω = 0, therefore, the control drive Ω is switched off once the target state |e is reached. Figures C1(d) and (e) show that the control strategy, illustrated on a Bloch sphere and using a stereographic projection, resembles the stochastic case discussed in the main text.

C.2. Technical details of the continuous adjoint sensitivity method for SDEs
To generalize the continuous adjoint sensitivity method from ODEs to Itô SDEs, one first needs to determine how the sample path of an SDE can be reversed, i.e., how one can reconstruct the forward pass of the state |ψ(t)⟩ from time t_0 to t_N by a reversed time evolution from t_N to t_0 launched at |ψ(t_N)⟩. The reversion of an SDE

d|ψ⟩ = K(ψ(t)) dt + M(ψ(t)) ∘ dW(t), (C.10)

defined in the Stratonovich sense, as indicated by the ∘, is obtained by integrating the same equation backwards in time [52,53], with noise values W(t) identical to those sampled in the forward pass. This closely resembles the reversion in the case of an ODE shown in equation (C.7). To restore the noise, we use a dense grid of the noise values used in the forward pass. The memory overhead caused by this noise grid could be traded against speed by using a virtual Brownian tree [52] or a Brownian interval [53], which enable the reconstruction of W(t) while requiring the storage of only very little information, such as the seed of the pseudo-random number generator. Despite the allocation of the noise values, the continuous stochastic adjoint sensitivity method is still much more memory efficient than discrete reverse-mode AD backpropagation through the solver operations. From the reverse Stratonovich SDE, equation (C.11), we can straightforwardly obtain the reverse Itô SDE (C.12) of the monitored qubit, equation (8), where the standard conversion rule (C.13) accounts for the required transformation from the Itô to the Stratonovich sense and vice versa [54].
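The idea of reversing a sample path with stored noise can be demonstrated on a toy scalar SDE with additive noise, for which the Itô and Stratonovich forms coincide. The sketch below (pure Python; the drift, noise amplitude, and step count are arbitrary choices) stores the Wiener increments of the forward pass on a dense grid and reruns the same scheme backwards, recovering the initial condition up to discretization error:

```python
import math
import random

random.seed(5)

dt, N, g = 1e-3, 1000, 0.1   # time step, number of steps, noise amplitude

def f(x):
    return -x                # drift; additive noise, so Ito = Stratonovich here

# Forward pass, storing the Wiener increments on a dense grid.
x0 = 1.0
x = x0
dWs = []
for _ in range(N):
    dW = random.gauss(0.0, math.sqrt(dt))
    dWs.append(dW)
    x = x + f(x) * dt + g * dW

xT = x

# Reverse pass: run the same scheme backwards with the SAME stored noise values.
y = xT
for dW in reversed(dWs):
    y = y - f(y) * dt - g * dW

err = abs(y - x0)            # reconstruction error of the initial condition
```

Storing the increments on a dense grid is the simple option used here; as noted above, a virtual Brownian tree or Brownian interval would trade this memory for recomputation from a stored random seed.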
Analogously to the ODE case, for the scalar loss function L [equation (12)], the adjoint process a_ψ(t) = ∇_{ψ(t)} L used to compute the gradients of L with respect to the state |ψ(t)⟩ is now a strong solution [54] of an adjoint Itô SDE. Again, the value of the state |ψ(t)⟩ from the forward pass enters the vector-Jacobian products within the backward pass and, therefore, knowledge of the state |ψ(t)⟩ along its trajectory is necessary. Using equation (C.12), we can recompute the state |ψ(t)⟩ without having to store the full trajectory of the forward pass.