Efficient Algorithms for Optimal Control of Quantum Dynamics: The "Krotov" Method Unencumbered

Efficient algorithms for the discovery of optimal control designs for coherent control of quantum processes are of fundamental importance. One important class of algorithms is the sequential update algorithms generally attributed to Krotov. Although widely and often successfully used, the associated theory is often involved and leaves many crucial questions unanswered, from the monotonicity and convergence of the algorithm to discretization effects, leading to the introduction of ad hoc penalty terms and suboptimal update schemes detrimental to the performance of the algorithm. We present a general framework for sequential update algorithms, including specific prescriptions for efficient update rules with inexpensive dynamic search-length control, which takes discretization effects into account and eliminates the need for ad hoc penalty terms. The latter, while necessary to regularize the problem in the limit of infinite time resolution, i.e., the continuum limit, are shown to be undesirable and unnecessary in the practically relevant case of finite time resolution. Numerical examples show that the ideas underlying many of these results extend even beyond what can be rigorously proved.

I. INTRODUCTION

The key to unlocking the potential of quantum systems is coherent control of the dynamics, and in particular optimal control design. The latter involves reformulating a given task to be accomplished in terms of a control optimization problem for a suitable objective functional.
One approach to solving the resulting control optimization problem is direct closed-loop laboratory optimization, which involves experimental evaluation of the objective function (see e.g. [36,37]). This approach has been applied to a range of problems, and in some settings, e.g., in quantum chemistry, where high fidelities are not critical and the estimation of expectation values for large ensembles is fast and inexpensive, it is both feasible and effective. In other situations, however, in particular for complex state engineering and process optimization problems, each experimental evaluation of the objective function may require many experiments and expensive tomographic reconstruction, making it highly desirable to have pre-optimized control designs based on a model of the system, even if imperfections in the model might necessitate a second stage of adaptive closed-loop optimization to fine-tune the controls. Model-based optimal control relies on solving the resulting control optimization problems using computer simulations of the dynamics and numerical optimization algorithms. Efficiency is crucial, as solving the optimization problem requires the numerical solution of the Schrödinger equation many times over, which is generally expensive for realistic models and practical problems; the amount of simulation required is therefore a main limiting factor determining which physical systems the technique can be applied to. In this context two main strands of optimization algorithms can be distinguished, namely, concurrent-in-time and sequential-in-time. The first can be readily understood using general results from numerical analysis and optimization theory. The second, motivated by control of dynamical systems, is often formulated in terms of iterative solution of a set of Euler-Lagrange equations.
Despite being widely used to solve control optimization problems for quantum dynamics in the aforementioned applications, the sequential approach does not have as well-established a theory of algorithmic performance, and many key issues, such as its convergence behavior, the effect of discretization errors, and optimal update formulas, have not been extensively studied. This motivates us to address these issues in this article.
Although the theory can be generalized to open systems governed by both Markovian [38] and non-Markovian dynamics [39], our analysis in this article shall focus on Hamiltonian control systems, i.e., systems whose evolution is governed by a Hamiltonian $H_f$ dependent on external control fields $f = f(t)$, which determines the evolution of the system via the Schrödinger equation (in units with $\hbar = 1$)
$$\frac{d}{dt} U_f(t) = -i H_f(t)\, U_f(t), \qquad U_f(0) = I,$$
where $U_f(t)$ is a unitary evolution operator and $I$ the identity operator on the system's Hilbert space $\mathcal{H}$, which we assume to have dimension $N < \infty$ here. The evolution of pure-state wavefunctions $|\Psi_0\rangle \in \mathcal{H}$ is given by $|\Psi_f(t)\rangle = U_f(t)|\Psi_0\rangle$, and for density operators $\rho_0$ (unit-trace positive operators on $\mathcal{H}$ representing mixed states or quantum ensembles) by $\rho_f(t) = U_f(t)\,\rho_0\, U_f^\dagger(t)$, $U_f^\dagger(t)$ being the adjoint of $U_f(t)$. In this semi-classical framework the control $f(t)$ is a classical variable representing an external field or potential, such as a laser, microwave or radio-frequency (RF) pulse, or an electric field created by voltages applied to control electrodes, for example. The dependence of the Hamiltonian $H_f$ on the control can be complicated, but often it suffices to consider a simple form such as a linear perturbation of the Hamiltonian, $H_f = H_0 + f(t) H_1$, for example. Although at any fixed time $t$ the Hamiltonian depends only on a single parameter $f(t)$, if $H_0$ and $H_1$ do not commute, i.e., $[H_0, H_1] \neq 0$, varying a single control over time enables us to generate remarkably rich dynamics and effectively provides us with an unlimited number of parameters, a powerful idea which is developed within the theory of non-linear open-loop control. We can exploit this freedom to manipulate the dynamical evolution of the system to suit our needs and achieve a desired goal. The specific control objectives are as varied as the applications, and many different types of control objectives have been considered, but most fall into one of the following categories.
State-transfer problems involve designing a control $f$ to steer a system from an initial state to a target state and come in two flavors, pure- and mixed-state transfer problems, formulated in terms of wavefunctions $|\Psi\rangle$ and density operators $\rho$, respectively. For optimal control purposes they are usually formulated in terms of maximizing the overlap with a desired state $|\Phi\rangle$ or $\sigma$, the so-called transfer fidelity
$$F_2(f) = \mathrm{Tr}\bigl(\sigma\, \rho_f(T)\bigr),$$
which for wavefunctions can be expressed as $|\langle \Phi | \Psi_f(T) \rangle|^2$ in terms of the standard inner product. Maximizing the transfer fidelity is equivalent to minimizing the distance $d(A,B) = \|A - B\|_S$ induced by the real Hilbert-Schmidt inner product $\langle A | B \rangle = \mathrm{Tr}(A^\dagger B)$ for operators $A, B$ on $\mathcal{H}$, and we can equivalently express the problem in terms of minimization of an error functional
$$E(f) = \tfrac{1}{2} \|\rho_f(T) - \sigma\|_S^2 = \tfrac{1}{2} E_0 - \mathrm{Tr}\bigl(\sigma\, \rho_f(T)\bigr),$$
where the constant $E_0 = \mathrm{Tr}(\rho_0^2) + \mathrm{Tr}(\sigma^2)$ takes into account the conservation of the linear entropy under unitary evolution.
Trajectory optimization problems have a similar flavor, but instead of minimizing the distance from a desired state at a final time $T$, we aim to minimize the distance of the system's trajectory from a target trajectory $|\Phi_d(t)\rangle$ or $\sigma_d(t)$ over time, leading to an objective functional of the form
$$E(f) = \tfrac{1}{2} \int_0^T \|\rho_f(t) - \sigma_d(t)\|_S^2 \, dt$$
for density operator trajectories.
Observable optimization problems involve optimizing the expectation value of an observable $Q$ (a Hermitian operator on $\mathcal{H}$) either at a final time $T$ or over time $[0, T]$, and also come in pure- and mixed-state versions, leading to the respective target functionals to be maximized,
$$\langle \Psi_f(T) | Q | \Psi_f(T) \rangle, \qquad \int_0^T \langle \Psi_f(t) | Q(t) | \Psi_f(t) \rangle \, dt$$
for pure states, and $\mathrm{Tr}\bigl(Q\, \rho_f(T)\bigr)$ and $\int_0^T \mathrm{Tr}\bigl(Q(t)\, \rho_f(t)\bigr)\, dt$ for mixed states. The last two of these are just simple generalizations of $F_{2a}$ and $F_{4a}$, since density matrices are Hermitian operators on $\mathcal{H}$. Observable and trajectory optimization problems involving linear combinations of various objectives can obviously also be considered.
Unitary gate optimization problems involve minimizing the distance from a target gate $V \in U(N)$ or, equivalently, maximizing the gate fidelity
$$F_7(f) = \tfrac{1}{N}\, \mathrm{Re}\, \mathrm{Tr}\bigl(V^\dagger U_f(T)\bigr).$$
Unitary gate optimization problems and pure-state transfer problems can also be formulated using absolute values instead of the real part, e.g., $F_{7b}(f) = \tfrac{1}{N}\, |\mathrm{Tr}(V^\dagger U_f(T))|$, which corresponds to optimizing the overlap with the target state or gate modulo a global phase factor, which is usually irrelevant. Mathematically, this corresponds to restricting the state space to $\mathbb{CP}^{N-1}$ instead of unit vectors in $\mathcal{H}$, or to the projective unitary group $PU(N)$ instead of $U(N)$. Historically, the first objective considered in this context was $F_5$ [40], with further developments for this case carried out in [41] and [42]. In the same series of papers, [43] considers $F_{1b}$ and [44] introduces $F_2$ in the context of general dissipative evolution. The method was applied to gate problems using $F_7$ in [45] and extending this to (the square of) $F_{7b}$ in [46]. Later, [47] considered the two quite general types of objectives $F_1 + F_3$ and $F_5 + F_6$.
While the objective functionals defined above correspond to many different control problems, a commonality between all of them is that they are simple functionals of the evolution operator $U_f(t)$; in fact, with the exception of $F_{7b}$, all target functionals are linear or bilinear functions of $U_f(t)$, and therefore many properties can be derived from an analysis of the latter.
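Since every objective above is a simple functional of $U_f(T)$, all of them can be evaluated from a single propagation. The following is a minimal single-qubit sketch; the drift $H_0 = \sigma_z$, the control $H_1 = \sigma_x$, the target gate $V = I$, and the field values are illustrative assumptions of this sketch, not the paper's model:

```python
import numpy as np
from scipy.linalg import expm

# Minimal single-qubit illustration (H0, H1, V chosen for this sketch):
# H_f = H0 + f(t) H1 with a piecewise-constant control f; all objective
# functionals are then simple functions of U_f(T).
H0 = np.diag([1.0, -1.0]).astype(complex)          # sigma_z drift
H1 = np.array([[0, 1], [1, 0]], dtype=complex)     # sigma_x control

def U_of_f(f, dt):
    """Evolution operator U_f(T) for piecewise-constant field values f."""
    U = np.eye(2, dtype=complex)
    for fk in f:                                   # one exponential per slice
        U = expm(-1j * (H0 + fk * H1) * dt) @ U
    return U

f = 0.8 * np.ones(50)                              # constant trial field
U = U_of_f(f, dt=0.05)

# Pure-state transfer fidelity and the two flavors of gate fidelity:
psi0 = np.array([1, 0], dtype=complex)
phi  = np.array([0, 1], dtype=complex)
F_transfer = abs(phi.conj() @ U @ psi0) ** 2       # |<Phi|U_f(T)|Psi_0>|^2
V, N = np.eye(2, dtype=complex), 2
F_re  = np.trace(V.conj().T @ U).real / N          # phase-sensitive F_7
F_abs = abs(np.trace(V.conj().T @ U)) / N          # phase-invariant F_7b
print(F_transfer, F_re, F_abs)
```

Note that $F_{7b} \ge F_7$ always holds, since $|z| \ge \mathrm{Re}\, z$ for any complex trace value.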

II. ITERATIVE SOLUTION SCHEMES FOR CONTINUOUS-TIME CONTROLS
The optimization problems defined in the previous section usually cannot be solved directly by analytical means. Practical schemes for finding solutions therefore generally involve iterative algorithms. A large class of iterative solution schemes that have been proposed for problems of the types above can be described in simple terms by considering a pointer $p$ moving back and forth within the time interval $[0, T]$, overwriting the value of the field at that point. More specifically, one usually starts with an initial trial field $f(t) = (f_1(t), \ldots, f_M(t))$, and then sweeps $p$ forward through the whole time interval, while updating the value of $f_m(p)$ to
$$f_m(p) \leftarrow (1 - \eta) \lim_{t \to p^+} f_m(t) + \frac{\eta}{w_m(p)} \frac{\delta F(f)}{\delta f_m(p)},$$
where $\eta$ is a suitable real parameter and $w_m(p)$ a suitable real weight function. Here the limit in question is taken from the right, and the functional derivative of $F$ with respect to the field, $\delta F(f)/\delta f_m(p)$, is evaluated at the current field $f$ and in the direction of the delta mass at $p$. Then $p$ is swept backwards through the time interval, while updating $f_m(p)$ to the analogous expression with parameter $\bar\eta$ and with the limit now taken from the left. This forward and backward sweeping is repeated until some form of convergence is achieved. Of course, we can equally well start with the backward sweep. The notion of "overwriting the field" is made mathematically rigorous by introducing two fields $g$ and $\tilde g$, which are solutions to initial and final value operator differential equation problems, respectively, and taking the actual field $f$ to be equal to $g$ on the interval $[0, p)$ and $\tilde g$ on $(p, T]$. For example, in the gate optimization case we iteratively solve an initial value problem for $U_g$ and a final value problem for $U_{\tilde g}$, starting with some initial trial field $\tilde g^{(0)}$. There is a subtlety in the cases involving pure-state observables, which are not immediately reducible to an ODE formulation.
These cases are usually dealt with by reducing them to pure-state transfer or trajectory-tracking problems with a target state $\langle \Phi | = \langle \Psi_0 | U_g(T)^\dagger Q$ or trajectory $\langle \Phi(t) | = \langle \Psi_0 | U_g(t)^\dagger Q(t)$, where $g$ here is the state of the field the last time the pointer hit $p = T$. The updates of $\langle \Phi |$ or $\langle \Phi(t) |$ when updating $g$ at $p = T$ can only increase the fidelity in the pure-state observable case, owing to the convexity of the map $|\Psi\rangle \mapsto \langle \Psi | Q | \Psi \rangle$ for positive $Q$, and similarly for the tracking problem, where the current field $f$ is used to calculate $\langle \Phi_{\rm new} |$ and $g$ is used for $\langle \Phi_{\rm old} |$.
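The pointer-sweep structure can be sketched in a few lines. For brevity this toy uses forward sweeps only and estimates the local functional derivative by a finite difference rather than by the forward/backward propagated states of the actual scheme; the single-qubit transfer problem and all parameters are assumptions of this sketch:

```python
import numpy as np
from scipy.linalg import expm

# Toy pure-state transfer |0> -> |1> on one qubit, H = sz + f(t) sx.
# A pointer k sweeps forward through the time slices, overwriting the
# field at each slice with a local gradient step (finite-difference
# gradient here; the real scheme uses co-state propagation instead).
sz = np.diag([1.0, -1.0]).astype(complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
psi0 = np.array([1, 0], dtype=complex)
phi  = np.array([0, 1], dtype=complex)
K, dt, eta = 20, 0.15, 0.5

def fidelity(f):
    psi = psi0.copy()
    for fk in f:
        psi = expm(-1j * (sz + fk * sx) * dt) @ psi
    return abs(phi.conj() @ psi) ** 2

f = 0.1 * np.ones(K)                     # initial trial field
fids = [fidelity(f)]
for sweep in range(10):                  # repeated forward sweeps
    for k in range(K):                   # pointer moves through [0, T]
        eps = 1e-6                       # finite-difference gradient at slice k
        fp = f.copy(); fp[k] += eps
        grad = (fidelity(fp) - fidelity(f)) / eps
        f[k] += eta * grad               # overwrite the field at the pointer
    fids.append(fidelity(f))
print(fids[0], fids[-1])                 # fidelity improves over the sweeps
```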
For the typical choices of the objective functional $F$ listed above, this scheme can be shown to yield a monotonic increase of the regularized objective function
$$J(f) = F(f) - \frac{1}{2} \sum_m \int_0^T w_m(t)\, f_m(t)^2 \, dt$$
between sweeps (and in fact throughout the progression of the pointer $p$) for any positive functions $w_m(t)$ and any choice of $\eta, \bar\eta \in [0, 2]$ [42]. As $F(f)$ is uniformly bounded, the sequence $J^{(n)}$ is bounded and must converge to some value $J^*$. However, the best we can hope for is convergence to a critical point of $J(f)$, i.e., a point for which the variation of $J(f)$ with regard to independent variations of the fields $f_m(t)$ vanishes, which happens if
$$f_m(t) = \frac{1}{w_m(t)} \frac{\delta F(f)}{\delta f_m(t)}. \quad (12)$$
On the other hand, a necessary condition for the functional $F(f)$ to have an extremum is
$$\frac{\delta F(f)}{\delta f_m(t)} = 0. \quad (13)$$
This shows that this update scheme generally will not converge to a critical point, much less an extremum, of the actual objective $F(f)$, as the only critical point shared by the actual objective $F$ and the regularized objective $J$ is the trivial solution $f = 0$. In fact, convergence in the stronger sense of convergence of the field iterates $f^{(n)}$ to some field $f^*$ that satisfies the critical point condition (12) does not follow trivially, and has only been shown under certain conditions such as sufficiently large penalty terms [48], for which the resulting converged fields tend to be far from the global optimum of the unpenalized objective function, as can be seen by comparing the respective critical point conditions (12) and (13). Furthermore, numerical solution of the initial and final value problems generally requires some form of time discretization. For a fixed time step $\Delta t$ the change of the regularized objective $J$ depends on the choice of the weights $w_m$ of the penalty terms, and monotonic increase is not guaranteed. When the weights are too small relative to the time step $\Delta t$, the changes in the field amplitudes can become too large, and as a result $J$ may decrease. An example of this behavior is shown in Fig. 1 for a model system consisting of a linear arrangement of five qubits with uniform nearest-neighbor Heisenberg coupling, subject to a fixed global magnetic field in the x-direction and five local Z-controls, one for each qubit, leading to a Hamiltonian of the form
$$H[f] = \Omega \sum_{n=1}^{5} \hat\sigma_x^{(n)} + J \sum_{n=1}^{4} \bigl( \hat\sigma_x^{(n)} \hat\sigma_x^{(n+1)} + \hat\sigma_y^{(n)} \hat\sigma_y^{(n+1)} + \hat\sigma_z^{(n)} \hat\sigma_z^{(n+1)} \bigr) + \sum_{n=1}^{5} f_n(t)\, \hat\sigma_z^{(n)},$$
where $\hat\sigma_x^{(n)}$ etc. are the usual tensor products of Pauli matrices, e.g., $\hat\sigma_x^{(1)} = \hat\sigma_x \otimes I \otimes I \otimes I \otimes I$, with $\hat\sigma_x$, $\hat\sigma_y$ and $\hat\sigma_z$ the standard Pauli matrices and $I$ the $2 \times 2$ identity matrix. For the following simulations we choose the constants $\Omega = 10$ and $J = 1$.
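The model described above can be assembled directly from Kronecker products. The sketch below fixes only the structure stated in the text (a global x-field of strength Ω, uniform nearest-neighbor Heisenberg coupling J, and one local Z control per qubit); any factor-of-two normalization conventions are assumptions of this sketch:

```python
import numpy as np
from functools import reduce

# Five-qubit chain: global x-field, uniform nearest-neighbour Heisenberg
# coupling, and local Z controls (normalization conventions assumed).
I2 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.diag([1.0, -1.0]).astype(complex)

def op(single, site, n=5):
    """Embed a single-qubit operator at `site` (0-based) in an n-qubit chain."""
    return reduce(np.kron, [single if k == site else I2 for k in range(n)])

Omega, J, n = 10.0, 1.0, 5
H0 = Omega * sum(op(sx, k) for k in range(n))      # global x-field
for k in range(n - 1):                             # Heisenberg coupling terms
    H0 += J * (op(sx, k) @ op(sx, k + 1)
               + op(sy, k) @ op(sy, k + 1)
               + op(sz, k) @ op(sz, k + 1))
Hc = [op(sz, k) for k in range(n)]                 # local Z control Hamiltonians
print(H0.shape)                                    # 32 x 32, i.e. acting on U(32)
```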
The convergence plot shows the value of the actual objective $F^{(n)} = \frac{1}{32}\, \mathrm{Re}\, \mathrm{Tr}(V^\dagger U^{(n)}(T))$, the normalized gate fidelity, and the regularized objective $J^{(n)} = F^{(n)} - C^{(n)}$ with the penalty term $C^{(n)} = \frac{1}{2} \sum_m \|f_m^{(n)}\|_w^2$ with uniform weights $w_m(t) = \lambda$, as a function of the iteration number $n$ for $\Delta t = 0.01$ and different values of $\lambda$. For the smaller values of $\lambda$, $J^{(n)}$, after increasing monotonically for a few dozen iterations, starts to decrease as the cost term begins to dominate. One would usually terminate the algorithm at this point, but forcing it to continue shows that despite the decrease in $J^{(n)}$, the value of the actual objective functional $F^{(n)}$ continues to increase. Monotonicity of $J^{(n)}$ can be recovered by increasing the weight of the penalty, but at the expense of very low final fidelities, which casts doubt on whether optimizing a regularized objective is sensible at all if the real objective is to maximize the fidelity. This problem has also been observed in the literature, and as a remedy a modified version of the regularized objective function has been employed [49]. It can be shown that if the "reference functions" $a_m(t)$ are chosen to be the values of the fields $f_m(t)$ in the previous iteration of the iterative update scheme above, giving rise to the dynamic cost term
$$C^{(n)} = \frac{1}{2} \sum_m \|f_m^{(n)} - f_m^{(n-1)}\|_w^2,$$
then we obtain the modified update rule
$$f_m(p) \leftarrow \lim_{t \to p^+} f_m(t) + \frac{\eta}{w_m(p)} \frac{\delta F(f)}{\delta f_m(p)}$$
for the forward sweep (and the analogous rule with $\bar\eta$ and the left limit for the backward sweep), which leads to a monotonic increase in the actual objective $F$ for $\eta, \bar\eta > 0$. If the sequence of field iterates $f^{(n)}$ converges to some field $f^*$, then we further have $f_m^{(n)} - f_m^{(n-1)} \to 0$, i.e., the cost term vanishes in the limit, $\lim_{n\to\infty} C^{(n)} = 0$, and thus the resulting field $f^*$ approaches a critical point of the actual objective $F(f)$. However, such convergence is not guaranteed.
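A scalar caricature (an assumption of this sketch, not the operator-valued scheme) illustrates why penalizing the *change* from the previous iterate, rather than the field itself, leaves the critical points of the actual objective intact:

```python
# Maximize F(x) = -(x - 2)^2 with dynamic cost lam*(x - x_prev)^2
# penalizing the change from the previous iterate.  Each step solves
# the stationarity condition of the penalized objective exactly; the
# iterates approach the critical point x* = 2 of the *actual* objective
# and the cost term C_n vanishes in the limit.
lam = 4.0
x = 0.0
costs = []
for n in range(60):
    x_new = (lam * x + 2.0) / (lam + 1.0)   # argmax of F(y) - lam*(y - x)^2
    costs.append(lam * (x_new - x) ** 2)    # dynamic cost term C_n
    x = x_new
print(x, costs[-1])                         # x near 2, cost near 0
```

By contrast, a fixed penalty $\lambda x^2$ would shift the maximizer away from $x^* = 2$, mirroring the discrepancy between the critical point conditions (12) and (13).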

III. OPTIMIZATION OF OBJECTIVE WITHOUT PENALTY
The previous section shows that optimizing a regularized objective with a fixed cost term is problematic in that (i) the critical points of the regularized objective differ from those of the actual objective, and (ii) in the discrete-time setting even monotonicity cannot be guaranteed. The former issue can be addressed by introducing a variable penalty term, but at the expense of compounding technical problems concerning monotonicity and convergence. We shall now show that the introduction of a regularized objective is unnecessary, and in fact undesirable, particularly in the discrete-time setting.
As observed above, a necessary condition for the fidelity $F(f)$ (infidelity or error $E(f)$) to assume a maximum (minimum) at some $f = f^*$ is that $f^*$ be a critical point of $F(f)$, i.e., that the variation of $F(f)$ with respect to $f$ vanish for $f = f^*$. As all the objective functionals defined above are simple functionals of the evolution operator $U_f(t)$, their respective critical points are defined in terms of the critical points of $U_f(t)$, which can be found by rewriting the Schrödinger equation in integral form,
$$U_f(t) = I - i \int_0^t H_f(\tau)\, U_f(\tau)\, d\tau, \quad (18)$$
as can be verified simply by differentiating both sides. Iteratively substituting (18) into itself yields a perturbative series expansion (19), similar to a Taylor expansion for ordinary functions. The $(n+1)$th term in the expansion (19) can be bounded by $\frac{1}{n!} \|\Delta H\|_1^n$, where $\|\Delta H\|_1$ refers to $\int_0^t \|\Delta H(\tau)\| \, d\tau$ for any chosen unitarily invariant norm. This shows that the maps $f \mapsto U_f(t)$ have functional derivatives of all orders for all $t$ and are, in a well-defined sense, real analytic. By the polarization method, these expansions determine all mixed (functional) derivatives of both $U_f(T)$ and $U_f$, with the $n$th-order derivative in directions $f_1, \ldots, f_n$ bounded in norm by $C^n \prod_{k=1}^n \|f_k\|_1$ in both cases. Assuming $f$ is a vector of independent controls $f_m(t)$ in a suitable function space such as $L^q[0,T]$ for $q > 1$, for example, and defining the local variations of the Hamiltonian with respect to the controls, $H_m(p) = \delta H_f(p)/\delta f_m(p)$, we see that the variation of the evolution operator with respect to local variations of the controls is given by
$$\frac{\delta U_f(t)}{\delta f_m(p)} = -i\, U_f(t)\, U_f(p)^\dagger\, H_m(p)\, U_f(p) \qquad (t \ge p),$$
from which we can effectively compute the variations of all of the objective functionals defined in the previous section, giving, for instance,
$$\frac{\delta F_7(f)}{\delta f_m(p)} = \frac{1}{N}\, \mathrm{Re}\, \mathrm{Tr}\bigl(-i\, V^\dagger U_f(T)\, U_f(p)^\dagger H_m(p)\, U_f(p)\bigr).$$
Given any $F$ that is a sum of multi-linear terms in $U_f(T)$ and $U_f$ that are bounded in all entries of $U_f$, we can easily devise schemes that lead to a monotonic increase of $F$.
Let $b$ be a positive function that vanishes outside the interval $[-1, 1]$ and integrates to 1, and let $b_{s,p}(t) = s^{-1} b((t - p)/s)$ denote its rescaling to width $s$ centered at $p$, which approximates a delta mass at $p$ up to an $O(s^2)$ error, as measured in any $L^q$ norm with $q < \infty$. Applying the product rule to our general $F$ with the approximations (23) and (24), dividing by $s$, then letting $s \to 0$, shows that the value $\delta F(f)/\delta f_m(p)$ is always well-defined and yields an expression for it. The exact sense in which this quantity matches the derivative of $F$ evaluated at a $\delta$-mass is that $b$ is a positive mollifier. Denoting the addition of $\alpha b_{s,p}$ to the $m$th field $f_m$ by $f + \alpha b_{s,p} I_m$, a Taylor expansion to first order gives
$$\lim_{s \to 0} F(f + \alpha b_{s,p} I_m) = F(f) + \alpha\, \frac{\delta F(f)}{\delta f_m(p)} + O(\alpha^2)$$
for each $\alpha$. This shows that for any $\alpha$ of the same sign as $\delta F(f)/\delta f_m(p)$, assuming the latter is non-zero, there is a sufficiently small scale $s$ for which adding $\alpha b_{s,p}$ to the field leads to an increase in $F$. Thus, if we add a sequence of such increments $\alpha b_{s,p_k}$ centered at different times $p_k$, the corresponding sequence $F^{(k)}$ will be monotonically increasing. For a fidelity function $F$, which is a continuous function from the compact domain $U(N)$ to $\mathbb{R}$, the range of $F$ is bounded. Thus $F^{(k)}$ is a uniformly bounded, monotonically increasing sequence, i.e., convergent. Convergence of $F^{(k)}$ to some value $F^*$ is not a very strong property, however; what is more interesting is whether we converge to a global maximum of the fidelity, or equivalently, whether the infidelity/error $E$ goes to 0, and what the rate of convergence is. Using this update rule, making bigger changes to each $f_m$ in its gradient direction should in principle offer larger increases in the fidelity $F(f)$ for the same displacement incurred by the pointer $p$. On the other hand, we can expect larger choices of $\alpha$ to directly lead to higher amplitudes of $f$, which in turn induce more oscillation in $U_f$, in the gradients $\delta F(f)/\delta f_m(p)$, and therefore in the fields. So it would seem that in choosing $\alpha$ we are making a trade-off between high amplitude and oscillation in the fields, which are generally undesirable features, and fast convergence of the algorithm. To make this intuition about speed rigorous, we can prove that there is an upper bound on the instantaneous rate of convergence which increases with the search length $\alpha$ and tends to zero as $\alpha$ vanishes. However, this is a static analysis that cannot take into account the evolution of the algorithm. Once we move to a more sophisticated 'stateful' analysis, we can derive a different bound on the asymptotic rate of convergence which is actually decreasing in the search length $\alpha$. The resulting bounds on the asymptotic convergence rates in the continuous limit are shown in Fig. 2. The non-trivial stateful bound in particular shows that larger search lengths, which are bound to result in greater speeds towards the start of the algorithm, will tend to slow down optimization in the long run, at least past a certain value of $\alpha$.

FIG. 2: Static (red, increasing) and stateful (blue, decreasing) upper bounds on the asymptotic rate of convergence for back-and-forth sweeping in the continuous limit $s \to 0$ for a single control field show that there is an asymptotic rate that cannot be exceeded no matter what search length $\alpha$ is used.

IV. FIELD DISCRETIZATION AND PRACTICAL OPTIMIZATION SCHEMES
The previous section shows that we can in principle easily monotonically increase the objective function by iteratively updating the controls in a sequence of forward and backward sweeps. In practice, solving the optimal control problem numerically requires discretization of time. This is often done implicitly, but we shall see that the choice of discretization restricts the class of monotonic schemes that are available compared to the continuous case. It is therefore desirable to explicitly discretize the problem and derive optimization schemes for this case. We can do this by choosing a basis function $b$, a step size $s$, and a set of positions $p_k$. The most common choice is to partition the interval $[0, T]$ into a fixed number of intervals of size $\Delta t = T/K$ and restrict the controls $f_m(t)$ to be piecewise-constant functions
$$f_m(t) = \sum_{k=1}^{K} f_{mk}\, \chi_k(t),$$
where $\chi_k$ is the characteristic function of the interval $[t_{k-1}, t_k)$ with $t_k = k \Delta t$. In this case we have
$$F(f + \alpha \chi_k I_m) = F(f) + \alpha\, \frac{\partial F}{\partial f_{mk}} + O(\alpha^2 \Delta t^2), \quad (27)$$
where $I_m$ indicates that we are adding the basis function to the field $f_m(t)$. For a given fixed $\Delta t$, the $O(\alpha^2 \Delta t^2)$ term in Eq. (27) need not be negligible, and $\alpha$ must be chosen carefully to ensure $F(f + \alpha \chi_k I_m) \ge F(f)$. In theory, choosing the time step $\Delta t$ and search length $\alpha$ small will ensure $F(f + \alpha \chi_k I_m) \ge F(f)$, i.e., a monotonic increase in the objective functional, but this will result in tiny increases in the objective function in each step and thus slow convergence. For the discretized version of the problem one can actually prove stronger convergence results, in the sense that under relatively mild conditions the sequence of field iterates $f^{(n)}$ must either converge to a critical point $f^*$ of the fidelity function or diverge to infinity (see Appendix B).
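A direct way to guard against the $O(\alpha^2 \Delta t^2)$ term is to test the monotonicity condition $F(f + \alpha \chi_k I_m) \ge F(f)$ explicitly and shrink $\alpha$ until it holds. A toy single-qubit sketch (model and all parameters are assumptions of this sketch):

```python
import numpy as np
from scipy.linalg import expm

# Per-slice update with an explicit monotonicity check: a candidate
# change alpha*chi_k is accepted only if it does not decrease F,
# halving alpha otherwise.
sz = np.diag([1.0, -1.0]).astype(complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
psi0 = np.array([1, 0], dtype=complex)
phi  = np.array([0, 1], dtype=complex)
K, dt = 10, 0.3

def F(f):
    psi = psi0.copy()
    for fk in f:
        psi = expm(-1j * (sz + fk * sx) * dt) @ psi
    return abs(phi.conj() @ psi) ** 2

f = 0.2 * np.ones(K)
history = [F(f)]
for sweep in range(6):
    for k in range(K):
        F0 = F(f)
        eps = 1e-6
        fp = f.copy(); fp[k] += eps
        d = (F(fp) - F0) / eps            # local gradient estimate
        alpha = 1.0
        while alpha > 1e-10:              # backtrack until F does not decrease
            trial = f.copy(); trial[k] += alpha * np.sign(d)
            if F(trial) >= F0:            # accept only monotone steps
                f = trial
                break
            alpha *= 0.5                  # O(alpha^2 dt^2) term too large: shrink
    history.append(F(f))
print(history[0], history[-1])
```

By construction the recorded sequence is non-decreasing, regardless of how large the initial trial step is.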

A. Time Resolution and Gradient Accuracy
The first critical choice we have to make is the time resolution or time step $\Delta t$, which effectively determines the finite-dimensional subspace of the infinite-dimensional function space $L^q[0,T]$ to which we restrict the optimization. Considering that we started with a continuous-time optimal control problem, it may seem natural to choose small time steps $\Delta t$ to approximate the continuous case, and it is certainly true that choosing $\Delta t$ too large may prevent us from being able to reach high fidelities by restricting the search space too much. In general, a minimum requirement for controllability is that the dimensionality of the search space $M \times K$ be no less than the dimension of the state space. For optimization of five-qubit unitary gates, for example, the state space is the unitary group $U(32)$, which has dimension $32^2 = 1024$. Thus we will require at least $MK \ge 1024$, and higher time resolutions may be necessary to achieve high fidelities, although the fidelities considered satisfactory in practice may be low in this context. But aside from such considerations, small time steps are not necessarily a good idea. Simple analysis shows that the computational cost per iteration for a fixed problem and system size $N$ is determined by the number of time steps $K$, and thus by $\Delta t = T/K$ if the target time $T$ is fixed. Higher time resolutions, i.e., smaller $\Delta t$, are therefore computationally more expensive. Of course, this cost may be offset by larger gains per iteration. However, we shall see that the discrete rate of convergence is always of order $\Delta t$, and for unitary gate optimization problems with fixed proportional search length the upper bound on the rate of convergence is of order $\Delta t^2$, for example, suggesting that larger $\Delta t$ may actually facilitate faster convergence. Another reason why very small time steps are often undesirable concerns the characteristics of the resulting fields, e.g., the need to avoid excessively complex and noisy solutions.
Choosing larger $\Delta t$ requires careful reconsideration of the gradient computation, however. Assuming for simplicity that the total Hamiltonian is linear in the controls with time-independent $H_m$, we have $\delta H[f]/\delta f_m = H_m$, and Eq. (19) immediately gives
$$\frac{\partial U_f(t_k, t_{k-1})}{\partial f_{mk}} = J_{mk} = \int_{t_{k-1}}^{t_k} U_f(t_k, \tau)\, (-i H_m)\, U_f(\tau, t_{k-1}) \, d\tau,$$
from which we can calculate the various gradients. For small $\Delta t$ the integral defining $J_{mk}$ can be approximated by replacing the integrand by its value at some point, e.g., the right endpoint,
$$J_{mk} \approx -i\, \Delta t\, H_m\, U_f(t_k, t_{k-1}).$$
The accuracy of this first-order approximation depends on $\Delta t$ and the eigenvalues of the total Hamiltonian: the exact expression can be written
$$J_{mk} = \Delta t\, \gamma(\Delta t\, \mathrm{ad}_B)(-i H_m)\, e^{B \Delta t}, \qquad \gamma(z) = \frac{e^z - 1}{z},$$
where $B = -i H(f(t_k))$ and $\mathrm{ad}_B(X) = [B, X]$. For Hamiltonian systems $\gamma(z)$ can be evaluated via the eigendecomposition of the skew-Hermitian matrix $B = V \Lambda V^\dagger$, where $\Lambda = \mathrm{diag}(\lambda_r)$ with $\lambda_r$ purely imaginary. This allows us to evaluate $\gamma(\Delta t\, \mathrm{ad}_B)(-i H_m)$, noting that the matrix elements $\langle v_r | J_{mk} | v_s \rangle$ are determined by $\gamma(z)$ evaluated at the eigenvalues of $\mathrm{ad}_B$ times $\Delta t$, which are the differences $\omega_{rs} = \lambda_r - \lambda_s$ of the eigenvalues $\lambda_r$ of $B$. The first-order approximation $\gamma(\omega_{rs} \Delta t) = 1$ is off by close to $\min\{\Delta t |\omega_{rs}|/2,\, 1\}$. Thus, to ensure that the standard approximation is reasonably accurate, we require $\Delta t < \|\mathrm{ad}_B\|^{-1}$, where $\|A\|$ is the operator norm, $\max |\mathrm{eig}(A)|$. In problem 1 it can be shown that $\|\mathrm{ad}_B\| \approx 100$, with the dominant contribution coming from $H_0$, which shows that for $\Delta t = 0.01$ the standard approximation is over 95% accurate, while for $\Delta t = 0.1$ the argument $x = \Delta t |\omega_{rs}|$ can be as large as 10, i.e., $\gamma(ix)$ is far from 1 and the error in the gradients is huge; we are in fact performing more of a random walk than a gradient search! A histogram of the actual distribution of the gradient overlaps for problem 1 is shown in Fig. 4. One possibility for evaluating the series expansion of $J_{mk}$ is to truncate the infinite sum at some $N - 1 \ge 1$ and invert the order of summation to reduce the number of required matrix products to $3N - 4$. Any such sum of the first $N$ terms of the series yields an approximation to the exact gradient expression of order $N$ in $\Delta t$.
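The gap between the exact expression and the first-order right-endpoint rule can be checked numerically. The sketch below evaluates the slice-propagator derivative via the eigendecomposition formula (random Hermitian matrices stand in for the actual Hamiltonians, an assumption of this sketch) and compares both against a central finite difference:

```python
import numpy as np
from scipy.linalg import expm

# Exact derivative J = d/df expm(-i (H + f Hm) dt) at f = 0 via the
# eigendecomposition of B = -iH, compared with the first-order rule
# J ~ -i dt Hm U_k and a finite-difference reference.
rng = np.random.default_rng(2)
def rand_herm(n):
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (A + A.conj().T) / 2

n, dt = 4, 0.1
H, Hm = rand_herm(n), rand_herm(n)       # slice Hamiltonian and control operator

E, V = np.linalg.eigh(H)
lam = -1j * E                            # eigenvalues of B = -iH
num = np.exp(lam[None, :] * dt) - np.exp(lam[:, None] * dt)
den = lam[None, :] - lam[:, None]
mask = np.abs(den) < 1e-12               # diagonal (degenerate) entries
G = np.where(mask, dt * np.exp(lam[:, None] * dt),
             num / np.where(mask, 1.0, den))       # gamma factors per (r, s)
M = V.conj().T @ (-1j * Hm) @ V
J_exact = V @ (M * G) @ V.conj().T

Uk = expm(-1j * H * dt)
J_first = -1j * dt * Hm @ Uk             # first-order (right-endpoint) rule

eps = 1e-6                               # central finite-difference reference
J_fd = (expm(-1j * (H + eps * Hm) * dt)
        - expm(-1j * (H - eps * Hm) * dt)) / (2 * eps)
print(np.linalg.norm(J_exact - J_fd), np.linalg.norm(J_first - J_fd))
```

The eigendecomposition result matches the finite difference to numerical precision, while the first-order rule carries the $O(\Delta t^2)$ commutator error discussed above.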
For $N$ even we can reduce the number of required products to $2N - 2$ by summing $N/2$ terms of the form
$$a_j\, e^{(1 - r_j) B \Delta t}\, (-i H_m)\, e^{r_j B \Delta t}$$
for suitable choices of $a_j, r_j$. To recover the first $N$ terms in the series, the coefficients $a_j, r_j$ must satisfy
$$\sum_{j=1}^{N/2} a_j\, r_j^n = \frac{1}{n + 1}$$
for $n = 0, \ldots, N - 1$. This relation uniquely determines the $r_j$ as the nodes, and the $a_j$ as the corresponding weights, of Gauss-Legendre quadrature over the interval $[0, 1]$.
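This moment condition is easy to verify numerically: Gauss-Legendre nodes and weights mapped from $[-1, 1]$ to $[0, 1]$ reproduce the moments $1/(n+1)$ for $n = 0, \ldots, N-1$:

```python
import numpy as np

# With N/2 Gauss-Legendre nodes, the quadrature rule integrates x^n
# exactly for n = 0..N-1, i.e. it satisfies the moment conditions
# sum_j a_j r_j^n = 1/(n+1) that recover the first N series terms.
N = 8                                     # order of the series approximation
x, w = np.polynomial.legendre.leggauss(N // 2)
r = (x + 1) / 2                           # nodes r_j mapped to [0, 1]
a = w / 2                                 # corresponding weights a_j
moments = [a @ r**n for n in range(N)]
print(moments)                            # 1, 1/2, 1/3, ..., 1/N
```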
The series expansion is preferable to the eigendecomposition for pure-state or state-vector optimization problems, in particular for a fidelity function such as $F_1$, $F_{1b}$ or $F_5$, whose gradient in step $k$ is expressible in terms of the real inner product of the state $|\Psi_f(t_k)\rangle$ and some vector $\langle \Phi |$, as $\langle \Phi | J_{mk} | \Psi_f(t_k) \rangle$. This expression can be approximated to order $N$ by first computing $B_k |\Psi_f(t_{k-1})\rangle$ and $\langle \Phi_f(t_{k-1}) | B_k$ for $k = 1, \ldots, N - 1$ and then applying either of (38) or (39). These procedures use only matrix-vector or vector-vector operations, avoiding the need for relatively expensive matrix multiplications and an eigendecomposition. Specifically, Eq. (38) needs $MN + 2N - 2$ matrix-vector products and $N((N+1)/2 + M)$ vector operations to evaluate for all controls, while Eq. (39) achieves the same order of approximation with only $MN/2 + 2N - 2$ and $(N + M)N/2$ operations, respectively. The series expansion can therefore be applied to state-transfer problems for high-dimensional systems with sparse Hamiltonians, where the eigendecomposition is infeasible. However, additional measures such as scaling and squaring may be needed for larger time steps to ensure that the series approximation can be truncated at small $N$. Another alternative is finite-difference approximation of the gradients. These issues are further explored in the context of spin dynamics for large-scale systems in [50].
We can also derive an efficient series approximation for density matrix and unitary gate optimization problems with fidelity functions $F_2$, $F_7$ or $F_{7b}$. The gradients at step $k$ of these fidelities can all be expressed as $\mathrm{Tr}(A J_{mk})$ for some skew-Hermitian matrix $A$, which is $[\rho_f(t_k), Q_f(t_k)]$ for $F_2$, the skew-Hermitian part $(X - X^\dagger)/2$ of $X = U_f(t_k) V^\dagger U_f(T, t_k)$ for $F_7$, or the appropriate phase multiple of $X$ for $F_{7b}$. Using the fact $\mathrm{Tr}(W [X, Y]) = \mathrm{Tr}([W, X] Y)$ iteratively, we can rewrite $\mathrm{Tr}(A J_{mk})$ as
$$\mathrm{Tr}(A\, J_{mk}) = \Delta t \sum_{n=0}^{\infty} \frac{(-\Delta t)^n}{(n+1)!}\, \mathrm{Tr}\bigl(\mathrm{ad}_B^n\bigl(e^{B \Delta t} A\bigr)\, (-i H_m)\bigr),$$
which, when truncated to its first $N$ terms, requires $N - 1$ matrix multiplications (plus $N + M$ negligible element-wise matrix operations) to evaluate for all controls, noting that the commutator $[X, Y]$ of skew-Hermitian matrices $X, Y$ is the skew-Hermitian part of $2XY$. Formula (35), however, has the advantage of being exact, and its numerical cost is independent of $\Delta t$. Furthermore, once the eigendecomposition of $H[f(t_k)]$ is known, the computation of the evolution operators $U_f(t_k, t_{k-1})$ becomes trivial. This makes it suitable, and in many cases probably preferable, for use in density matrix and unitary gate optimization problems.
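Both algebraic facts used here can be checked numerically on random matrices:

```python
import numpy as np

# Check of the trace identity Tr(W [X, Y]) = Tr([W, X] Y) used to push
# commutators off J_mk, and of the remark that for skew-Hermitian X, Y
# the commutator [X, Y] equals the skew-Hermitian part of 2 X Y.
rng = np.random.default_rng(3)
def rand_c(n):
    return rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
def comm(X, Y):
    return X @ Y - Y @ X

n = 5
W, X, Y = rand_c(n), rand_c(n), rand_c(n)
lhs = np.trace(W @ comm(X, Y))
rhs = np.trace(comm(W, X) @ Y)
print(np.isclose(lhs, rhs))

A = rand_c(n); A = A - A.conj().T        # skew-Hermitian
B = rand_c(n); B = B - B.conj().T
P = 2 * A @ B
print(np.allclose(comm(A, B), (P - P.conj().T) / 2))
```

The second identity is what reduces each commutator in the truncated series to a single matrix multiplication.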

B. Variants of Sequential Update
The straightforward discrete analogue of the continuous forward and/or backward sweeping is to sequentially update the fields $f(t_k)$ for $k = 1$ to $k = K$ (or $k = K$ to $k = 1$) according to the rule
$$f(t_k) \leftarrow f(t_k) + \alpha\, \nabla_k F,$$
where $\nabla_k F = (\partial F / \partial f_{mk})_{m=1}^M$ is the gradient vector at time $t_k$, and to iterate the procedure, starting with an initial trial field $f^{(0)}$ as before. As in the continuous case, we have the option of sweeping forward and backward, continually updating the fields in both directions, as e.g. in [41] and the generalized scheme proposed by [42], or updating the fields only in one direction, usually the forward sweep, as in the PK-Krotov version of the iterative update scheme discussed in Sec. II, where $\eta = 1$ and $\bar\eta = 0$. Intuitively one might expect that updating the fields in both directions should accelerate convergence, but convergence analysis shows that this is not the case. Continually updating the fields in both directions in fact reduces the asymptotic rate of convergence. This is illustrated in Fig. 6, which shows that the asymptotic rate of convergence is substantially greater for the forward-only update than for the forward-backward update, in particular for small $\Delta t$; i.e., updating on the forward sweep only is preferable to updating on both sweeps. This can intuitively be explained as follows. The sequential update scheme aims to minimize the gradient at each time step. By continuity, minimizing the gradient at time $t_k$ tends to decrease the gradient at the subsequent time step $t_{k+1}$. As the magnitude of the gradient is the main factor limiting the increments in the fidelity at each time step, we quickly reach an asymptotic regime. If we only update in one direction, however, there is a step jump at every iteration, whenever we switch from $t = t_K$ to $t = t_1$, which allows the gradient to rebound, facilitating larger gains in subsequent steps.
These observations suggest that updating the fields in a strictly sequential manner is not the optimal strategy, and that convergence might be accelerated by introducing step jumps. For example, instead of updating the fields sequentially in a single forward or backward sweep, we could update half the fields $f(t_k)$ in the forward sweep and the other half in the backward sweep. We could choose the fields to be updated in the forward sweep randomly, or choose to update all odd time slices $f(t_{2k-1})$ in the forward sweep and the even ones in the backward sweep. The gains of such schemes per single time-slice update must be offset against the increased computational overheads associated with such non-sequential updates, chiefly additional matrix multiplications to propagate the changes. However, there is one simple variation of strictly sequential update that can be implemented without increasing the total number of operations (matrix exponentials, matrix multiplications and gradient evaluations) per iteration: we update the first $K/2$ time slices consecutively in the forward sweep, propagate without updating to $k = K$, update the second half of the field consecutively in the backward sweep, and then back-propagate from $k = K/2$ to $k = 1$ without updating. Fig. 6 shows that this split update strategy outperforms the forward-only update, although the gains are smaller than those of the forward-only update over the back-and-forth update, and the advantage of the split update scheme diminishes for larger time steps. This is not too surprising, as for fewer and larger time steps the local gradients do not die off as fast, and therefore the rebound effect induced by the switch-overs is reduced. The cost per iteration for all update schemes was approximately 3.05 seconds for $\Delta t = 0.01$ ($K = 3000$) using the first-order gradient approximation, which is accurate for this time step, and approximately 0.34 seconds for $\Delta t = 0.1$ ($K = 300$) using the exact gradient formula [55].
The total operation count (matrix multiplications, propagation steps, gradient evaluations) per iteration for K = 300 is approximately 1/10 of the number of operations for K = 3000. The time per iteration for K = 300 is slightly greater than one tenth of the time per iteration for K = 3000, 3.05/10 = 0.305 < 0.34. The additional cost reflects the fact that the exact gradient evaluations are slightly more computationally expensive than the first-order approximation.
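As an illustration, the sweep strategies discussed above differ only in the order in which the time-slice indices are visited. The following sketch uses a toy separable objective as a stand-in for the fidelity (the real F would come from propagating the Schrödinger equation, and a separable toy cannot reproduce the convergence-rate differences between the orderings; all function names here are our own):

```python
import numpy as np

def sweep(f, grad, alpha, order):
    """One sequential sweep: a gradient step on each time slice k,
    visited in the given order."""
    for k in order:
        f[k] += alpha * grad(f, k)
    return f

def forward_order(K):
    return list(range(K))

def back_and_forth_order(K):
    return list(range(K)) + list(range(K - 1, -1, -1))

def split_order(K):
    # First half updated on the forward sweep, second half on the
    # backward sweep; the propagations in between carry no updates.
    return list(range(K // 2)) + list(range(K - 1, K // 2 - 1, -1))

# Toy separable stand-in for the fidelity (hypothetical, for
# illustration only; maximized at f = target).
K = 40
target = np.sin(np.linspace(0.0, np.pi, K))
F = lambda f: -np.sum((f - target) ** 2)
grad = lambda f, k: -2.0 * (f[k] - target[k])   # dF/df_k

f = np.zeros(K)
for _ in range(10):
    f = sweep(f, grad, 0.4, split_order(K))
```

Each index appears exactly once in `split_order`, so a split sweep performs the same number of slice updates as a forward-only sweep, matching the operation-count argument above.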

C. Dynamic Search-length Adjustment
Another crucial parameter in gradient-based sequential optimization is the choice of the search length α. Fig. 7 shows that the rate of convergence depends strongly on α even if all other parameters are the same, and we therefore require a way to choose a suitable search length based on a simplified local model of the function to be optimized. As previously explained, a quadratic approximation, say F̃(α), to the fidelity change in some direction ∆f_k is often sufficient for the purposes of our optimization problems. Such a second-order model F̃ is determined by matching two quantities in addition to the obvious F̃(0) = 0 = F(0), for which it is natural to choose the derivative F̃′(0) = F′(0) and the value F̃(α_0) = F(α_0) at some α_0. Assuming that F′(0) = ∆f_k^T · ∇_k F(f) is strictly positive, as is the case when ∆f_k = ∇_k F(f) is the nonzero gradient, all possible F̃ are equivalent up to scale (and a sign), so that we can consider an appropriately invariant quantity, e.g., ξ = 1 − F(α_0)/(F′(0)α_0). Since F̃(α) = F′(0)(α − (ξ/α_0)α²), for ξ > 0 the model F̃ has its second root at α = α_0/ξ and hence its maximum at α* := α_0/(2ξ), while otherwise F̃ is simply unbounded. If the quadratic approximation is accurate, (2ξ)^{−1} is then close to the factor by which we must multiply the current search length α_0 in order to maximize the fidelity gain F in the current step. Note that this choice of new search length α* makes sense even in general, since ξ measures the relative error in the linear expansion F′(0)α of F(α) at α_0, and α* shrinks or grows α_0 according to whether and how much this error is above or below one half. We are then aiming for the largest α for which the gradient-based model is still relevant, so again the largest reliable fidelity gain, although in the general case, when ξ^{−1} is large or negative, it is safer to let α* be some fixed multiple, say 2, of α_0. Fig. 8, with F(α) and F̃(α) computed for a randomly chosen time step of problem 1, shows the quadratic model to be an excellent approximation here, as the theory would lead us to expect.
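The search-length update implied by the quadratic model can be sketched as follows (a hedged illustration; the function name and the cap on the growth factor are our own choices, with the cap playing the role of the "fixed multiple, say 2" mentioned above):

```python
def new_search_length(dF0, F_alpha0, alpha0, cap=2.0):
    """Quadratic-model search length: given the slope dF0 = F'(0) > 0
    and the measured value F(alpha0), the model
        F~(a) = F'(0) * (a - (xi / alpha0) * a**2),
    with xi = 1 - F(alpha0) / (F'(0) * alpha0),
    peaks at alpha* = alpha0 / (2 * xi) when xi > 0."""
    xi = 1.0 - F_alpha0 / (dF0 * alpha0)
    if xi <= 0.0:                    # model unbounded: grow conservatively
        return cap * alpha0
    return min(alpha0 / (2.0 * xi), cap * alpha0)
```

For a genuinely quadratic F(α) = α − α², probing at α_0 = 1 gives ξ = 1 and recovers the true maximizer α* = 1/2.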
These considerations suggest that the locally optimal update strategy is to set the search length α to α* in each step. However, this strategy has one problem: at the local maximum α* the derivative of the second-order approximation F̃(α) vanishes, and thus the derivative of F(α) will be approximately zero as well. By continuity of the gradient, this tends to ensure that the gradient at the next time step t_{k+1} will be small, especially for small ∆t. Therefore, the most greedy local update strategy may not be the best in the long run, as it rapidly kills off the gradients, thus reducing future gains and decreasing the rate of convergence in the longer run. This effect is most pronounced for the back-and-forth update, the most greedy update strategy, but it is still significant for forward-only and even split update. This suggests an alternative strategy for choosing α. Instead of choosing α = α*, we may wish to choose α to be slightly larger or smaller than α*. Fig. 8 shows that this does not reduce the local gain very much but has a significant impact on the gradients. Indeed, Fig. 9 shows that the median fidelity over 100 runs using the split-update strategy with the most greedy choice of search length α = α* grows more slowly in the long run than under two variants of search-length adjustment motivated by the above considerations. Variant 3 in the figure is based on a deliberate overshooting strategy, choosing α = 1.25α* in each time step. Variant 1 aims to slowly change the search length until it falls within a desirable range [r_1, r_2]α* around the maximum, and to keep it in this range. If the current search length α < r_1 α*, we increase α by a fixed factor α_1; if α > r_2 α*, we decrease α by a factor α_2. This ensures that whenever α falls outside the desirable range, it is increased or decreased in each step until it falls back into the desirable range. We chose r_1 = 2/3, r_2 = 4/3, α_1 = 1.01 and α_2 = 0.99.
The r values are motivated by the quadratic model, while there is no simple rule for choosing α_1 and α_2. Our choice of factors very close to 1 may appear strange, and it does reduce gains during the first few hundred time steps if the initial choice of α is far from the optimum value, but even such a small factor allows α to increase 20-fold in a single iteration with K = 300 steps (1.01^300 ≈ 20), and gradual changes can facilitate convergence of α to a near-optimal value, especially for gate optimization problems. Fig. 9 suggests that this strategy is clearly better than the most greedy one, though slightly worse than the systematic overshoot strategy when considering only the median fidelity over 100 runs. Systematic overshoot can, however, result in large changes in the search lengths, especially initially, leading to instabilities and occasional failure. This is evident in Fig. 10, which shows the evolution of the search lengths as a function of the total number of time steps executed for three different search-length strategies. In all three cases the search length α quickly approaches a limiting value. The limiting values of α for variant 1 and the systematic overshoot strategy (variant 3) above are very close, and about 30% larger than the limiting value for the most greedy strategy. Another reason why avoiding large changes from one time step to the next is desirable is computational efficiency. Usually, in optimization problems one would apply any change in the search length immediately, but this is computationally expensive, as it requires the computation of a matrix exponential exp(−i∆tH[f(t_k)]) for the new field f(t_k), and the gain of increasing the step length for any given time interval is usually small, especially when the search length is close to its optimum. To avoid such overheads, one may choose not to apply the search-length change at the current step, but only at the next time step.
This is not too unreasonable, as continuity considerations suggest that the optimal search length should not vary too much from one time step to the next, and it ensures that the computational overhead of the dynamic search-length adjustment is negligible and the computational cost per iteration is constant, as it would be for a fixed search length. For sequential optimization over many time slices, avoiding the computational overhead of multiple fidelity re-evaluations at each time step is usually preferable to the small gain achieved by applying the search-length change to the current time interval, and for unitary gate optimization problems in particular, the search length usually quickly approaches an optimal value and varies relatively little thereafter, as shown in Fig. 10.
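The range-based adjustment (variant 1 above) can be written as a simple per-step rule; the parameter names are our own:

```python
def adjust_alpha(alpha, alpha_star, r1=2/3, r2=4/3, up=1.01, down=0.99):
    """Variant 1: nudge the search length toward the band
    [r1, r2] * alpha_star by a small fixed factor per time step.
    Even with factors this close to 1, 1.01**300 ~ 20, so alpha can
    still change roughly 20-fold over one iteration of K = 300 steps."""
    if alpha < r1 * alpha_star:
        return alpha * up        # too small: grow slowly
    if alpha > r2 * alpha_star:
        return alpha * down      # too large: shrink slowly
    return alpha                 # inside the band: leave unchanged
```

Deferring the application of the returned value to the next time slice, as described above, keeps the cost per iteration identical to that of a fixed search length.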

D. Higher-Order Methods
In the previous two sections we considered changing the fields locally in the direction ∆f_k of the gradient ∇_k F(f), using a quadratic model for the fidelity change F along that line as a function of the search length α. In general, though, there is nothing special about the gradient direction unless the local Hessian is a scalar multiple of the identity. Thus we should do better if we replace the simple gradient update by a Newton update step ∆f_k = −(H^(k))^{−1} ∇_k F, using the full matrix H^(k) of second-order partial derivatives (the Hessian), provided that H^(k) is strictly negative definite, given that we are maximizing. If the Hessian H^(k) is indefinite or positive definite, the Newton step should never be used, since it would take us to some irrelevant saddle point, or worse the minimum, rather than the desired maximum of the quadratic model. In these cases the quadratic model has no unconstrained maximum, and as it is in any case only accurate in a neighborhood of the original f_k, a trust-region method (see Appendix A), which restricts attention to suitably small changes in f_k, should be used. For the sequential update, the Hessian matrix for the kth time interval is H^(k)_mn = ∂²F/(∂f_mk ∂f_nk), an M × M matrix, M being the number of controls. We can derive an analytic expression for it from Eq. (19); taking F = (1/N) Tr(W† U_f(T)), for example, the result involves a double integral J^(k)_mn over the kth time interval. Assuming that we have already computed the exact gradient, the eigendecomposition of iH[f(t_k)] and the matrix elements ⟨v_r|H_m|v_s⟩ are known for all H_m, so we can in principle evaluate the exact Hessian without much additional computational effort, but the computational overhead of evaluating the Hessian and inverting it still needs to be carefully weighed against the potential gains.
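A minimal sketch of the safeguarded Newton step for one time slice (for illustration; the fallback to a plain gradient step stands in for the full trust-region machinery of Appendix A, and the function name and fallback length are our own):

```python
import numpy as np

def newton_step(grad_k, hess_k, alpha_fallback=0.1):
    """For maximization, take -H^{-1} g only when the Hessian is
    strictly negative definite; otherwise the quadratic model has no
    unconstrained maximum and we fall back to a small gradient step."""
    if np.linalg.eigvalsh(hess_k).max() < 0.0:
        return -np.linalg.solve(hess_k, grad_k)
    return alpha_fallback * np.asarray(grad_k)
```

When H^(k) = −βI with β > 0, the Newton step reduces to (1/β) times the gradient, in line with the scalar-Hessian limit discussed below.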
For small ∆t we can approximate the double integral J^(k)_mn by its value at the interval midpoint τ_k = t_k − ∆t/2 = t_{k−1} + ∆t/2, where {H_m, H_n} = H_m H_n + H_n H_m is the anti-commutator. Cycling products under the trace, we find that if the control Hamiltonians H_m are orthonormal with respect to the usual Hilbert–Schmidt inner product, i.e., Tr(H_m H_n) = δ_mn, then we expect the Hessian H^(k) to approach the scalar matrix −(1/N)I, at least for sufficiently small ∆t and fidelity sufficiently close to its maximum of 1. In this limit the Newton update reduces to the greedy gradient update with the search length α* based on a quadratic model of F(α) discussed in the previous section, since the gradient and Newton directions then coincide. Thus, if the local Hessian is close to a scalar matrix, evaluation and inversion of the Hessian simply adds computational overhead over the greedy gradient update. Furthermore, considering that the greedy gradient update is globally suboptimal, the Newton method may actually achieve worse convergence per iteration in the long run than the modified gradient update with overshoot, although the Newton method could be adapted to incorporate a scaling factor γ to achieve a similar deliberate over- or undershoot effect. These observations are confirmed by Fig. 9, which shows that the sequential Newton update does not perform well in the long run for problem 1, which clearly satisfies Tr(H_m H_n) = δ_mn. Close inspection shows that the Newton update outperforms the gradient update for the first few iterations, consistent with Fig. 11, which shows the relative error ‖H^(k) − βI‖/‖H^(k)‖ of the scalar Hessian approximation, where β is the average of the diagonal elements of H^(k), as a function of the total number of time steps executed.
As expected, the scalar matrix approximation is a very poor fit initially, but after approximately 3,000 time steps (10 iterations with K = 300 time steps) the error of the scalar approximation is approximately 5%. Despite the fact that our time steps ∆t = 0.1 are not small (max |ω_nm|∆t ≤ 1), so that (48) is not a very good approximation and the limiting Hessian is not exactly diagonal, after a few iterations the off-diagonal elements are still sufficiently small to be negligible. This illustrates an important difference between approximating the gradient and the Hessian: for both, it is the relative error which determines how accurate the step (44) will be, but contrary to the gradient, the Hessian does not tend to vanish at high fidelities, so cruder approximations suffice to usefully estimate it.
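The relative error of the scalar approximation plotted in Fig. 11 can be computed as follows (a sketch; we use the Frobenius norm, which the text does not specify):

```python
import numpy as np

def scalar_hessian_error(H):
    """Relative error ||H - beta*I|| / ||H|| of the scalar Hessian
    approximation, with beta the average diagonal element of H."""
    beta = np.trace(H) / H.shape[0]
    return np.linalg.norm(H - beta * np.eye(H.shape[0])) / np.linalg.norm(H)
```

A Hessian that is exactly a multiple of the identity, such as the limiting −(1/N)I, gives an error of zero.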
The condition Tr(H_m H_n) ∝ δ_mn, implying H^(k) → βI for gate optimization problems, is clearly satisfied for problems involving qubits or spin-1/2 particles with multiple independent local controls such as X^(n), Y^(n) or Z^(n), and it can always be made to hold by applying the Gram–Schmidt orthonormalization process to the control matrices. This argument does not apply, however, to pure-state transfer, tracking or observable optimization problems, for which the limiting Hessian can be arbitrary. For instance, if we consider the simplest case, F_1(f) = ⟨Φ|Ψ_f(T)⟩, the local Hessian for the kth time slice, assuming ∆t not too large, involves the anti-commutators {H_m, H_n}. This expression vanishes for all τ_k, regardless of the initial and target state, if H_m and H_n anti-commute, but in general it need not vanish even at the global maximum, and {H_m, H_n} = 0 for all m, n > 0 is a much stronger condition than orthogonality, which is generally not even satisfied for spin systems with independent local controls on different qubits: controls acting on different qubits commute, so their anti-commutator equals twice their product, which is nonzero. This suggests a dynamic choice of update rule depending on the type of problem considered. For gate optimization problems it may be useful to perform a few trust-region update steps initially, possibly switching to a simplified Newton update as the Hessian approaches a diagonal matrix, before finally switching to a gradient update with a search length of 1/N, or one determined as described in the previous section. For other optimization problems, such as state transfer or observable optimization, the trust-region update is likely to be advantageous, although the added computational cost of evaluating the Hessian must be taken into account. This cost can be amortized, however, by exploiting the similarity between H^(k) at adjacent k for a given field along a sweep, and the fact that each H^(k) individually converges as the field converges. Fig. 9 also suggests that sequential update algorithms initially lead to much larger gains in the fidelity than the most common concurrent update strategy based on a quasi-Newton algorithm with BFGS Hessian update [50]. However, the initial advantage of the sequential update diminishes as the optimization proceeds, and the concurrent update overtakes the sequential varieties at high fidelities. The comparison of sequential and concurrent update algorithms is explored in detail in [51], which confirms this observation for a range of gate optimization problems and suggests the development of hybrid strategies.
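The Gram–Schmidt orthonormalization with respect to the Hilbert–Schmidt inner product mentioned above can be sketched as (function names are our own):

```python
import numpy as np

def hs_inner(A, B):
    """Hilbert-Schmidt inner product Tr(A^dagger B) (real part,
    sufficient for Hermitian control Hamiltonians)."""
    return np.trace(A.conj().T @ B).real

def hs_gram_schmidt(hams):
    """Orthonormalize a list of control Hamiltonians so that
    Tr(Hm Hn) = delta_mn afterwards."""
    out = []
    for H in hams:
        for Q in out:
            H = H - hs_inner(Q, H) * Q   # remove components along earlier ones
        H = H / np.sqrt(hs_inner(H, H))  # normalize
        out.append(H)
    return out
```

For the Pauli matrices X and Z, which are already orthogonal, this simply rescales each by 1/√2.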

V. MONOTONICITY AND RATE OF CONVERGENCE
If we allow the step size s to be arbitrarily small, or in the extreme case let it vanish so that we are considering the instantaneous or continuous method, we have seen that the analysis of the sequential scheme reduces to that of δF(f)/δf_m(p), because for all the fidelity functions we are considering, and even more generally, the fidelity varies linearly with respect to such local changes in the fields. Once we move to a fixed step size, however, an accurate approximation to F(f + α b_s I_m) can require arbitrarily high-order derivatives of F, and the most we can say in general regarding the s → 0 regime is that the number of derivatives required to achieve a given accuracy for α scales as 1/s. This is a weak result, however, which does not reveal much more than we had already established about different choices of α. To get stronger results we need to distinguish at least two cases. The first is the unitary gate problem of F_7 or F_7b, and the second the pure-state problem of F_1, which, using the adjoint representation, encompasses F_2 and therefore also F_5 and F_1b as special cases. In contrast to the vanishing-s situation, the analysis for finite step size s is quite context sensitive, and it is convenient to describe our results for the single-control case before generalizing to several controls.

A. Quadratic Structure
Let us consider, at a given value of the single field f, how the fidelity F|_α := F(f + α b_{s,p_k}) varies as the f_k component of the field, corresponding to the basis function b_{s,p_k}, is changed by an amount α. For all fidelity functions, integrating up the lower bound on their second derivative gives the quadratic lower bound F|_α ≥ F|_0 + (∂F(f)/∂f_k)α − (q/2)α² for q equal to some global constant q_b. This immediately means that any α between 0 and 2q_b^{−1} ∂F(f)/∂f_k must lead to an increase in F, so that we have already identified a whole class of schemes monotonically increasing in F. In the unitary cases, what is interesting is that we can also find an upper bound of this form, with q = q_a, and such that q_a → q_b as E, s → 0. The actual fidelity F|_α along this local change in the field is therefore increasingly constrained as we approach the asymptotic instantaneous regime, so that an increase in F can only happen for α between 0 and 2q_a^{−1} ∂F(f)/∂f_k. Over this α interval of interest, the difference between the bounds, and therefore also the error incurred by the second-order Taylor expansion of F|_α about 0, is E·O(√E + s). More generally, in the unitary multiple-control cases, analogous two-sided bounds hold for the entries of the local Hessian. This offers a classification of those α leading to monotonically increasing algorithms which coincides with the one of the previous section for fixed α and arbitrarily small s, but is also applicable to the practically relevant context with both α and s fixed. It is also interesting to note that as F|_α is linear for arbitrarily small s, it is natural to add to it a purely quadratic cost term C to ensure that J = F + C has a unique global maximum with respect to changes in f by b_s; for fixed s, however, such an alteration of the objective is inappropriate, as it interferes with the already quadratic nature of F|_α.
In practice we also find that its second-order Taylor expansion is a very good approximation to F|_α even when s, and especially E, are quite some distance away from the limiting value of 0, and it therefore does not seem worth considering higher-order expansions. Going beyond second order would also be quite problematic for more than one control, as we would not have a reliable way of finding even local maxima based on this information.

B. Asymptotic Rate of Convergence
One guiding principle of numerical optimization is to be greedy: we wish to update the variables whenever enough information can be obtained to do so intelligently, and we aim to induce the largest possible increase in the objective. This would point towards back-and-forth sweeping, going to the maximum in f_k at each step, as the most efficient strategy, since forward-only sweeping wastes the opportunity to increase the fidelity when propagating the backwards ODE. In the single-field and unitary cases of the previous subsection, the second-order Taylor expansion generally provides a good estimate for the location of the maximum of F|_α closest to 0, and using this α to update f_k gives a canonical optimization algorithm from the set of all possible strategies leading to a monotonic increase in the fidelity.
Unfortunately, as we have seen, this is not the best strategy, as it is susceptible to rather extreme slowdown in the long run. The problem is that the fidelity is effectively quadratic, and therefore going for the nearest maximum of the fidelity is equivalent to making the local gradient as close to zero as possible. Moreover, as the Hessian converges to a fixed value in the limit, the increase in fidelity achievable in this way is proportional to the gradient norm squared. Since the gradient ∂F(f)/∂f_k is the continuous function δF(f)/δf_k(t) integrated against b_{s,p_k}, it cannot change much as we step to the next basis function b_{s,p_{k+1}} centered at an adjacent point p_{k+1}, as is always the case for back-and-forth sweeping. Therefore taking the largest gains available on the current step, as with the canonical greedy algorithm, precludes large gains being made on subsequent steps. It is not immediately obvious, however, what the overall effect of this will be. To answer this question, we derive bounds on the rate of convergence, in particular the asymptotic rate, as we are already in the regime of small infidelity E throughout. This stateful bound on the asymptotic rate of convergence is compared to the static bound in Fig. 12 for an illustrative choice of the step size s and every valid search length α. The combined bound reproduces the bimodal profile of asymptotic rate versus search length observed in Fig. 7 (right): the rate must vanish towards the ends of the interval of search lengths for which the fidelity increases, but it must also be small for greedy search lengths in the middle of the interval.
In the remaining pure state, density matrix and observable cases, the situation is less decisive; in particular, the local Hessian need not converge to any predetermined value as E and s vanish. Naturally, the fidelity with respect to a local change in some f_k can only be accurately approximated up to the nearest local maximum by its second-order Taylor expansion when the Hessian is not too small, as otherwise this quadratic model does not even have a clear maximum. In the asymptotic instantaneous regime of small E and s, however, we do have that when the Hessian is small, the local gradient must be too, implying that substantial increases in fidelity require large changes to be made to f_k. Our lack of certainty in the asymptotic local Hessian values precludes rigorously extending the clean picture of Fig. 12 to these cases, but the intuition behind it is the same: we must choose at each step between maximizing the immediate fidelity gain and restricting future gains by having a small gradient, possibly also introducing undesirably large peaks in the field by making large changes to the field amplitude.

(Caption of Fig. 13.) The rate model is obtained by a two-parameter least-squares fit of (53) to the rate data, which itself comes from forward differences of the infidelity logarithm. The infidelity model is based on the same r* and r_0, combined with a fitted value for the additional scale factor involved. The plotted run is the one with the median sum-of-squares error in the infidelity model across a set of 100 runs.

C. General Rate Behavior
In the asymptotic E → 0 regime the behavior of discretized sequential optimization methods is determined by their linearization as discrete dynamical systems, independently of the specific objective or fidelity, the number of controls, or the type of sweep. This viewpoint motivates the following ansatz for the (instantaneous) rate of convergence at high fidelities, with n the iteration number, r* the asymptotic rate, and (r* + r_0)/2 the initial (n = 0) rate. Upon integration this provides a model for log(E) up to a scale factor, and hence a model for F with only three parameters. While this model is motivated by analysis of the asymptotic regime, it appears to approximate both the rate and the fidelity remarkably accurately down to the first iteration, as illustrated in Fig. 13.
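Eq. (53) is not reproduced in this excerpt; purely as an illustration, a stand-in with the stated boundary behavior (initial rate (r* + r_0)/2 at n = 0, asymptotic rate r*) is r(n) = r* + (r_0 − r*)·2^{−(n+1)}, whose geometric decay constant is our own assumption. A two-parameter least-squares fit to rates obtained from forward differences of log E might then look like:

```python
import numpy as np

def rate_model(n, r_star, r0):
    # Hypothetical stand-in for Eq. (53): equals (r_star + r0)/2 at
    # n = 0 and tends to r_star as n grows.
    return r_star + (r0 - r_star) * 0.5 ** (n + 1)

def fit_rates(log_infidelity):
    """Two-parameter least-squares fit of the rate model to the
    instantaneous rates (forward differences of log E)."""
    rates = -np.diff(log_infidelity)
    n = np.arange(len(rates))
    # The model is linear in (r_star, r0): each column multiplies one
    # of the two parameters.
    A = np.column_stack([1.0 - 0.5 ** (n + 1), 0.5 ** (n + 1)])
    (r_star, r0), *_ = np.linalg.lstsq(A, rates, rcond=None)
    return r_star, r0
```

Integrating the fitted rates (cumulative sums) and fixing the remaining scale factor then yields the three-parameter infidelity model described above.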

VI. CONCLUSIONS
In the context of quantum control, the method due to Krotov is usually presented in a continuous formulation and only discretized a posteriori for numerical use. However, we have seen how even the fundamental monotonicity property is jeopardized if the discretization scale ∆t is not taken into account when choosing the update rule. It is thus natural to view the discretized method, which is a sequentially updating optimization algorithm, as more fundamental and consider its continuous analogue as an instantaneous limit of this. The fidelity with respect to local changes in the field made in each step of such a sequential algorithm is essentially a quadratic function, which becomes linear in the instantaneous limit. Addition of a penalty in the continuous Krotov method can therefore be seen as a way of making the objective function quadratic, from which a canonical update formula emerges. Such a penalty is unnecessary for the discretized method, however, and in fact undesirable from both a theoretical and practical point of view.
At this stage a large class of monotonic update schemes with different sets of search lengths are available and in the literature the choice is typically left to the user. However, the search length is an important parameter that strongly influences the performance characteristics of the algorithm and guidance in selecting it is thus critical. The instrumental notion in doing so is the asymptotic rate of convergence, which had previously been shown [52] to be qualitatively at least linear, but for which no quantitative estimates were available. At first glance the natural choice of update scheme is the greedy one, maximizing immediate fidelity gains, but the asymptotic rate of convergence analysis shows that we cannot extrapolate from the initial rate of convergence, and proves this to be a poor strategy in the long run. Fortunately, the reason behind this failing also emerges from the analysis and we are able to offer modifications to the greedy search length or back-and-forth sweeping that enhance performance. The analysis explains why forward-only sweeping appears to be the preferred strategy in the literature, and suggests further improvements such as a split sweep that logically extend the advantage of forward-only over back-and-forth sweeping.
In discretizing the continuous method it is also common to use the approximation ∂F/∂f_k ≈ (δF/δf(t_k))∆t, where δF/δf(t) is the derivative in the instantaneous limit. However, this is only an approximation, which is liable to break down and corrupt the algorithm, especially towards low infidelities E or larger time steps ∆t. We address this issue by outlining the exact method and various series expansions appropriate for computing these gradients ∂F/∂f_k for the most common choice of piecewise constant controls, for each of the fidelity functionals under consideration. In selecting an update direction and search length one is naturally led to use second derivative (Hessian) information, for which an exact formula is also available. In contrast to the situation for concurrent update optimization algorithms, the analysis shows, however, that using the full local Hessian generally does not result in a performance improvement compared to a dynamic search-length adjustment based on a quadratic model, at least for unitary gate optimization problems.
Looking forward, the general formulation of sequential methods applied to a wide range of control optimization problems that arise in the context of quantum control, as well as the simplified convergence results, should enable a streamlined application of these methods to optimizing other fidelity functions. The fact that any parametrization of the fields in terms of localized functions can be used for the sequential optimization opens the way for more problem-adapted choices than the standard top-hat function. Finally, the study of the convergence rate initiated herein should enable further development of heuristics to make more efficient choices of search lengths and sweeping patterns, and the development of hybrid methods that take advantage of the rapid improvements attainable by sequential update in the initial phase of the optimization to find suitable candidate fields with moderately high fidelities, before switching to alternative strategies such as concurrent update to avoid the convergence slowdown of sequential update methods in the asymptotic regime. Taken together, such improvements in the efficiency of the algorithms employed for dynamic control optimization should facilitate the application of the method to a wide range of coherent control problems and more realistic systems with larger Hilbert space dimensions or systems involving many qubits.

Since the derivative of F along any coordinate varies at a rate of at most β, F is strictly increasing between x and x̃ = x + (1/β)(∂F/∂x_i)e_i, and the value of F at x̃ is at least F(x) + (1/(2β))(∂F/∂x_i)².
Next notice that updating x^n along any coordinate can alter the derivative ∂F/∂x_i by at most β|x^{n+1}_n − x^n_n| ≤ βγ|∂_n F(x^n)|, so over several iterations we obtain a corresponding bound for any non-negative k. Letting k_i + 1 be the smallest index within 1, …, K such that ẽ_{n+k_i+1} = e_i, we obtain exactly the primary descent condition of [54], and by the main result of that paper, the sequence x^n either diverges, i.e., ‖x^n‖ → ∞, or converges to some point x*. In any case, if F(x^n) remains bounded, we have ∂_n F(x^n) → 0, and by the earlier bound dF(x^n) → 0 as n → ∞; in particular, when x* exists, dF(x*) = 0.
The same argument holds for objectives with or without a penalty term. However, if we add a standard weighted norm-squared penalty term C, the norm of the controls is guaranteed to be uniformly bounded, so the controls cannot fail to converge.
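The coordinate-wise increase guarantee invoked at the start of this appendix — a step of (1/β)∂F/∂x_i along coordinate i raises F by at least (1/(2β))(∂F/∂x_i)² when the coordinate derivatives vary at a rate of at most β — can be checked numerically on a toy quadratic, for which the bound is tight (the concrete F, β and x below are our own illustrative choices):

```python
import numpy as np

beta = 4.0
F = lambda x: -0.5 * beta * np.dot(x, x)   # coordinate derivatives vary at rate beta
x = np.array([1.0, -2.0])
g = -beta * x[0]                           # dF/dx_0 at x
x_new = x.copy()
x_new[0] += g / beta                       # step of (1/beta) * dF/dx_0
gain = F(x_new) - F(x)                     # guaranteed >= g**2 / (2 * beta)
```

For this quadratic the inequality holds with equality, since the second-order Taylor expansion is exact.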