Optimal control gradient precision trade-offs: application to fast generation of DeepControl libraries for MRI

We have recently demonstrated supervised deep learning methods for rapid generation of radiofrequency pulses in magnetic resonance imaging (https://doi.org/10.1002/mrm.27740, https://doi.org/10.1002/mrm.28667). Unlike the previous iterative optimization approaches, deep learning methods generate a pulse using a fixed number of floating-point operations - this is important in MRI, where patient-specific pulses preferably must be produced in real time. However, deep learning requires vast training libraries, which must be generated using the traditional methods, e.g. iterative quantum optimal control methods. Those methods are usually variations of gradient descent, and the calculation of the fidelity gradient of the performance metric with respect to the pulse waveform can be the most numerically intensive step. In this communication, we explore various ways in which the calculation of fidelity gradients in quantum optimal control theory may be accelerated. Four optimization avenues are explored: truncated commutator series expansions at zeroth and first order, a novel midpoint truncation scheme at first order, and the exact complex-step method. For the spin systems relevant to MRI, the first-order truncation is found to be sufficiently accurate, but also up to five times faster than the machine precision gradient. This makes the generation of training databases for the machine learning methods considerably more realistic.

One method, called gradient ascent pulse engineering (GRAPE), uses a piecewise-constant pulse approximation that is well suited for MRI 17 .Here, the term gradient is in relation to optimization and not to be confused with the magnetic field gradients.GRAPE was initially designed using the gradient ascent method (linear convergence), and then extended to quasi-Newton (superlinear convergence) methods 18 , and Newton-Raphson (quadratic convergence) methods 19 .The latter two require very accurate gradients [18][19][20][21] .Importantly, tailored RF pulses in MRI often need to be computed in real time -the patient is inside the scanner.However, GRAPE uses iterative optimization that has no wall clock time guarantees: for realistic systems, it can become impractically a) msv@cfin.au.dk; † These two authors contributed equally b) david.goodwin@partner.kit.eduslow.
We have recently proposed a neural network based method for generating RF pulses to alleviate the run time problem [22][23][24] .This DeepControl framework requires training libraries with hundreds of thousands of RF pulses generated, e.g., with optimal control, but a trained neural network predicts a pulse quickly, and with a hard guarantee on the wall clock time.
These vast libraries require an improvement in our ability to run optimal control simulations and calls for optimization of the numerical efficiency of the GRAPE algorithm, in particular, in the part that deals with the gradient calculation.
In this paper, we report some efficiency improvements in the GRAPE gradient calculation process.We analyze the trade-offs between CPU time and convergence rate for four different levels of approximation to the GRAPE gradient: standard zero and first order approximation, a novel midpoint first order approximation proposed here, and the exact, complex-step gradient.

II. OPTIMAL CONTROL THEORY
The equation of motion of a system of uncoupled spin-1/2 particles, ignoring relaxation, can be formulated as: Ṁ(r, t) = Ω(r, t)M(r, t) arXiv:2107.00933v1 [physics.med-ph] 2 Jul 2021 where M(r, t) is an instantaneous magnetization vector at a spatial location r, and Ω(r, t) is the dynamics matrix containing external magnetic fields, defined below.
By the nature of this study, we introduce discrete notation from this point.The spacial dimension is separated into P different locations, with 1 < p < P , and for the p th location r (p) = [x (p) , y (p) , z (p) ] ( is the vector transpose operation).The control of these dynamics is in the form of a piecewise-constant control waveform with a fixed total duration T , divided into N time steps of duration ∆t = T /N , denoted a time slice.The magnetization state we are attempting to control, M Hence, for the state vectors 0 < n < N , or in other words the n th control propagates the system from the (n − 1) th state to the n th state.The task for an optimal control problem is to bring this final magnetization state as close as possible to a desired target state by maximizing the overlap of a target magnetization state, M  1) are governed by Ω, and the piecewise-constant formulation of this is z , and the gyromagnetic ratio, γ.Xu et al. 15 have shown how to extend these equations to multiple RF channels, 1 < l < N Tx , i.e., parallel transmit (pTx) with complex coil sensitivity patterns, s (p) l : Here, c j,l,n ∈ {c u,l,n , c v,l,n } are the RF control sequences to be optimized to achieve the aim, and all other parameters are considered fixed.For generality, we introduce the control c j,l,n , with the subscript j as reference to either u or v, when it is possible to imply to either.Discrete solutions to Eq. ( 1), in the piecewise-constant approximation, can be written as where is a given initial state of the system and U (p) n is a time-propagator describing a rotation on the Bloch sphere, which can be calculated with the matrix exponential: Note, the magnetic field gradient waveform is specified for the entire pulse duration and kept fixed, although it could also be included as a control to be optimized.
Our performance functional, describing our aim, is the projection of M (p) N , onto the target state M (p) target , summed and normalized over all spatial points: Thus, J is the quantity we want to maximize and is termed the fidelity.Further to this fidelity measure, which is dependent on only the final state of the system and termed the terminal cost, an additional term can be included which is termed a running cost and depends on the state of the system over all of the pulse duration.The running cost can be regularization terms, to penalize untargeted control behavior, e.g., excess power or jagged waveforms 19 , or boundary constraints.In this work we use simple boundary constraints, which are detailed in 14 .

III. FIDELITY GRADIENT CALCULATIONS
The formal theory of optimal control introduces an adjoint state of the system, L (p) n , with the detailed derivation published elsewhere 25 .This adjoint state of the system can be interpreted as the propagation of the target, backwards in time from t = t N to t = t 0 , and for a particular time slice this is where target .The adjoint state is used in calculation of the fidelity gradients, ∇J u,l,n and ∇J v,l,n , with respect to c u,l,n and c v,l,n , enabling the use of gradientfollowing numerical optimization methods to maximize J.
The scope of this study is to cast light on a number of different strategies for calculating ∇J u,l,n and ∇J v,l,n .Common prerequisites for the different gradient strategies are a forward time-propagation of the magnetization from the initial to the final state for all positions by Eq. (5), and a backward time-propagation of the adjoint state for all positions by Eq. (8).

A. Standard approximate gradients
Derivatives of functions of matrices, f (Ω), are called directional derivatives, or Gâteaux derivatives, which are defined in Ref. 26 as where Θ is the operator matrix for the control in question and elaborated below.Eq. ( 9) is a similar form to the finite difference equation and the Taylor series truncated to first order and should be considered the formal definition of the derivative of the function of a matrix.
For the dynamic system of Eq. ( 6) we are considering here, the directional derivative here can be written as 27 where {A, B} denoted the anti-commutator of the matrices A and B. The first order approximation, r = 1, corresponds to equation ( 12) in the original GRAPE paper 17 .For the controls c u,l,n and c v,l,n 15 , the corresponding control operators are In addition, if we optimized the magnetic field gradient waveform, e.g.G g,n with g = x, y, z, its control operator is In our vectorized model, Eqs. ( 2) to (8), we consider the zeroth, r = 0, and first order, r = 1, gradient approximations of Eq. ( 11): where the big-O-notation qualifies the error, which depends mainly on the size of the time step 18 .
Had we included regularization in Eq. ( 7) this would also be present in the gradient terms above 19 .
The zeroth and first-order gradient approximations are widely used and fast to compute 28 , but for a quasi-Newton method they would quickly corrupt the Hessian estimation and slows the super-linear convergence to linear, especially for long time increments 18 .De Fouquieres et al. 18 show how gradients computed to increasing precision improve the performance of quasi-Newton methods.

B. Midpoint approximate gradients
In this paper, we propose a midpoint variant of the standard approximate gradient to partly overcome the convergence problem of the standard approximate gradients.The midpoint variant was inspired by the central finite-difference gradient, although it has a deep theoretical basis in the exponential midpoint method 27,[29][30][31][32] The midpoint method for this manuscript is which is an average of the (n − 1) th and n th gradient elements; centralizing the derivative to the midpoint of the (n − 1) th and n th time point.The propagator at t 0 is the identity matrix so, when expanded, 0 .Although this may seem a simple extension to Eq. ( 16) without obvious benefit, since both have an error O(∆t 2 ), a deeper read to the literature shows the O(∆t 2 ) error term in Eq. ( 16) has a constant hidden in the O-notation.The hidden constant depends on the time derivatives of M and the bounds of the Ω [30][31][32] .The midpoint method in Eq. ( 17) does not suffer from this error dependency 29 and can be important when space discretization is used 27 .Furthermore, Eq. ( 15) and Eq. ( 16) are good approximations only when solutions are mildly oscillating 33 , which is not expected to be the case for MRI.Although not included in the analysis that follows, the inclusion of gradient control will accentuate this last point, as the gradient controls introduce another highly oscillating part to the Bloch equations -particularly during the start of the optimization.It should be emphasized that the exponential midpoint method 27,29-32 is designed for approximation of time propagators, which is not the case in this study; time propagators in this study are exact, and the derivatives of time propagators, D (p) j,l,n defined by Eq. ( 11), are approximated using a midpoint method.

C. Exact Gradient
Floether et al. 34 presented a method to compute exact control gradients using an auxiliary matrix approach 35 , and Goodwin et al. 20 extended this to the second order derivatives needed for a Hessian.We recently adopted the exact gradient calculation in our optimal control framework, and described it in Ref. 14 .Briefly, The gradient in Eq. ( 18) is exact to machine precision 20 .The essence of this exact approach is the derivation of D (p) j,l,n in Eq. ( 18).This can be obtained from the upper right corner of an auxiliary matrix 20,34,35 but for real matrices Ω (p) n and Θ (p) j,l , the compact method of complex-step approximation 36 can be used to calculate the directional-derivative, D (p) j,l,n .This method gives the propagator and derivative as the real and imaginary parts of a complex matrix, respectively: where n and Θ (p) j,l must be real matrices, which can be the case in the single-spin model.
Considering any one given spatial point, p, the complex-step differentiation involves 3 × 3 complex matrices instead of 6 × 6 real matrices.With that, there is potential to spare memory allocation.Complex-step differentiation is also known to have better roundoff error tolerance in finite precision arithmetic 36 .

D. Accuracy tests
The relative accuracy of gradients calculated with the methods in Eq.( 15), Eq.( 16), and Eq.( 17), is shown in Figure 1.The error in the accuracy is calculated with and averaged over 20 instances of random pulses.Each pulse has a duration of T = 6.39 ms, with the random distribution between ±1 kHz for RF controls, c u and c v , and ±160 kHz for magnetic field gradient controls, c G .
The tests were done firstly (top, Fig. 1) with time slice, ∆t, ranging logarithmically in 20 steps from 63.9 ns to 0.639 ms for an offset of B (p) z = 0 Hz.Secondly (bottom, Fig. 1), the offset was picked randomly in a bandwidth of 160 kHz for a fixed ∆t = 10 µs.The number of offset points is approximately one point per 2.1 kHz of bandwidth, and they are spread randomly within the given bandwidth.The initial state of the system is all spins having positive z-magnetization, and the target state is Relative error Relative accuracy of approximate gradients compared to the exact gradient in Eq.( 18).Both plots show the relative accuracy of the standard zeroth order (STD-0), Eq.( 15), first order (STD-1), Eq.( 16), and first order midpoint (MID), Eq.( 17 all spins having positive x-magnetization i.e. a 90 • rotation around the y-axis of the Bloch sphere.The accuracy comparison in Figure 1 shows that the MID gradient gives an increase in accuracy, even though the method of calculation in Eq.( 17) requires only the insignificant computational cost of a matrix addition per gradient element.The dashed vertical line in the top panel of Figure 1 indicates the time slice, ∆t = 10 µs, used in the results that follow.This time slice is chosen as it is physically relevant to the MRI hardware, but also from approximately this time slice the MID gradient becomes increasingly more effective than the STD-0 and STD-1 gradients for decreasing time slices.The pulse duration, bandwidths, and control amplitudes were based on the settings described in the next section.

IV. METHODS
An optimal control theory framework was implemented in MATLAB (Mathworks, Natick, MA, USA), with hard constraints on the RF pulse amplitudes using the fmincon optimization function supplemented with a performance gradient function and the necessary housekeep-ing procedures.We focus here on the gradients required by the quasi-Newton option of fmincon, although we are not limited to GRAPE.
A common feature for these calculations is that each direct or adjoint state must be known at a given time slice, n, and/or at the adjacent time slice, n − 1. Conducting full magnetization forward propagation, and adjoint state backward propagation solves this issue.In principle, however, the magnetization state can be calculated as gradient element calculations proceed, i.e., U (p) n can also be extracted from Eq. ( 19), in the same way as propagator recycling implemented for Hessian calculations 38 , possibly with efficient matrix caching 19 .
The most efficient code for parallelizing the propagation and gradient elements is specific to computing architecture.The two most obvious parameters considered are the amount of CPU cores available, and the storage size needed for all matrices from forward and backward propagation.Our computations were performed on a 28-CPU core, Intel Xeon Gold 5120, 2.2 GHz workstation with 384 GB of RAM.
The EXACT gradient of Eq. ( 19) was assessed regarding parallelization efficiency.An L-curve analysis is shown in Figure S1 in the Supporting Information.The EXACT gradient was on the present workstation most efficient with parallelization on around 23 out of the 28 available workers.Accordingly, the reported computation times for the EXACT gradient are with 23-fold parallelization.The STD-0, STD-1 and MID gradients are efficient without parallelization, and we report computation times for the fastest (non-parallelized) implementations we could develop.
We evaluated optimization performance with the four different gradients by computing single-channel 2DRF pulses with a library of total 150 different flip-angle (FA) maps of three different target categories: 1. 50 binary images, where ten randomly placed points were dilated and merged to a single binary shape with no holes.These are denoted BW (blackwhite), see Figure S2.
Images were taken from the ImageNet database 39 , and the nominal target FA was 30 • .The target magnetization x-and z-component levels were dictated by the image intensity levels multiplied with the nominal FA.The target y-component was 0. The initial magnetization was longitudinal.The Gr category represents our ability to induce spatially variant FAs, while the binary categories are two random ad hoc ways to obtain shaped excitation patterns, mimicking anatomic regions of interest.
The gradient waveform accommodating the spatial selectivity of the 2DRF pulses formed an inward 16-turn spiral in the excitation k-space (full Nyquist sampling) with a duration of 6.39 ms, a field of excitation of 25 cm, and limited to 40 mT/m amplitude and 180 T/m/s slew rate 11 .The common time slice for RF and field gradients was 10 µs.RF pulses were constrained below 1 kHz (nutation rate) amplitude using the boundary option implemented in fmincon.The spatial grid was 64×64 with a mask formed by a centrally placed super ellipse, which reduced the number of spatial points to 2919.The quasi-Newton method with STD-0, STD-1, MID and EXACT gradients was limited with the algorithm termination conditions of reaching 100 iterations or a minimum functional/step-size convergence tolerance of 1 • 10 −6 .

V. RESULTS
For all three categories (BW, GrBW and Gr), all optimizations with approximate gradients converged.The optimizations with EXACT gradients were stopped in 32% of all cases by the iteration limit.Table I lists iteration and computation time statistics (mean and standard deviation) for the three target categories (BW, GrBW and Gr) pooled together.Of interest (see the diagonal entries), the MID and STD-1 gradients spend on average roughly the same amount of iterations and time to finish, around 14 iterations in 15 seconds, respectively.The EXACT method spends on average 74 iterations and nearly 300 seconds to finish.
In comparison (see the off diagonal entries), the MID gradient is not outperformed by the EXACT gradient until it reaches 8 iterations or 33 seconds, which is equivalent to around 9 seconds for the MID gradient.However, as shown in Figure 2 the MID and EXACT gradients have near identical performance.This is supported FA maps displayed in Figure 3.
Judged from the NRMSE of FA maps (MID against EXACT), we compared the best, closest-to-mean, and worst cases for converged pulses.As reference, we also show the results from the corresponding STD-0 and STD-1 gradients, as well as the difference maps, highlighting the increasing errors using less precise gradients.
Figure 4 shows the NRMSE histograms of all pulses, where the right skew reveals that the mean NRMSE is close to the best cases, and the worst cases are rare.The case numbers are sorted based on the NRMSE values from the optimization, i.e., the error of the Bloch simulated magnetization with respect to the target magnetization, based on the EXACT gradient results.For the Gr target category, the entire pattern is considered as inside the ROI, thus there are no plots for outside areas.In the same category, the expected mean and standard deviation FA inside the ROI is not expected or supposed to be 30 • and 0 • , respectively, but may vary according to the gray levels.

VI. DISCUSSION
We have presented a midpoint first order gradient approximation, denoted MID, with a similar computation time to a standard first order gradient approximation computation, and the accuracy similar to the exact, complex-step gradient for typical MRI settings.
The average time per iteration for the present settings increase by a factor of around 1.01 going from the standard first order to the midpoint gradient method, however by a factor of five going from the standard first order to the exact, complex-step gradient bearing in mind the exact, complex-step gradient exploit 23-fold paralleliza-tion.With four CPU cores, typically available on a laptop, we estimate the complex-step gradient will further require a factor of two of computation time.
The standard and the midpoint gradients are efficient without parallelization, and easily vectorized.Several designs for the approximate gradients have been tested for the RF pulse design.However, we could not find any implementation, where either gradient becomes faster from parallelization for the present application: the bottleneck of transferring data to / from the workers dominates the workload.For single-channel RF optimization and the spatial grid size (P ) and number of time-steps (N ) as reported above, the standard gradients are slightly faster Table I.Iterations and computation times (mean ± standard deviation).Diagonal entries with two mean ± standard deviation pairs correspond to total use of a given method, i.e., computation times include gradient and approximate Hessian estimation, step size search, hard-constraints, and all other internal routines of the interior-point algorithm used within MATLAB's fmincon function.The upper triangular entries include three mean ± standard deviation pairs (from top to bottom): 1) the number of iterations it takes for the method in the column to outperform the method in the row considering the NRMSE value; 2) the approximate time it takes the column method to overtake the row method; 3) the approximate time the row method was overtaken by the column method.The approximate times are estimated by the ratio of iterations needed for crossing and the total number of iterations multiplied by the total computations times for all iterations.Target categories (BW, GrBW and Gr) are pooled for the measures in this being run in a for-loop over time steps, rather than as a more vectorized version.There was no significant difference between the for-loop and vectorized versions of the midpoint gradient.As a side-note, when running multichannel RF optimization (not the case herein), both the standard and midpoint gradients benefit from the vectorized version and the time difference between them diminishes.However, there remains a for-loop over the number of RF channels currently.Accordingly, approximate gradient computation times were reported herein for the fastest, non-parallelized implementations we have found.
Optimization speed is inherently important for in vivo experiments, but also for generating AI training libraries.
Considering the vast training libraries required for DeepControl 22,23 , the non-parallelized, vectorized computation of the standard and midpoint approximate gradients is not a problem, when we exploit all available CPU cores with an outer parallelization over many independent pulses.
It is beyond the scope of this work to investigate how the qualities of the different gradients influence the Deep-Control framework, but we do expect infidelities of a given library to propagate through to the final trained neural network, and that predicted pulses never perform better than the representative training library.This study leaves the question of what is more important, fast computation time or pulse perfection, as an additional option for the user.We also propose the midpoint gradient method for rapidly producing a good initial guess that may be handed over to a slower, exact gradient method for finalizing.
While it is possible to further improve the accuracy of the standard or midpoint approximate gradient by lowering the time slice, this may not be feasible on realistic hardware.The accuracy of the midpoint gradient corresponds roughly to that of the standard 0 th order gradient, STD-0, with a halved time slice.
This was found by running the same experiments (data not shown) with the STD-0 for a time slice of 5 µs instead of 10 µs.
The choice for target categories reflect our previous [22][23][24] and future DeepControl experiments, which will be described in a subsequent publication.We acknowledge that routines operating in the so-called small-flip-angle regime pose a robust alternative to optimal control in terms of speed and accuracy due to the linear Fourier relation existing between the pulse waveform and the excitation pattern.As shown in Ref. 22 , we trained networks for both small and large flip angles, with Fourier and optimal control based algorithms generating the libraries.However, we note that the optimal control framework we use enables any arbitrary FA (and flip phase) for each individual spatial and spectral position.It is due to our DeepControl experiments that we chose in this study to target a nominal FA of 30 • (fair to say in the small-flip-angle regime) together with optimal control, which was not strictly necessary for this FA, but chosen for several other reasons, one being because our DeepControl experiments benefit from hard pulse constraints, which we have so far only implemented in our optimal control framework.Yet, hard constraints do exist in various other pulse designs 10,40 .It is however important to mention that the individual gradient computation times do not change for other, say, large flip angles.

VII. CONCLUSION
Pulse waveform optimization gradient calculation using a midpoint first order approximation was evaluated and found to provide a significant efficiency improvement.This illustrates clearly that the current trend of using exact gradients in all situations should be revised -tailored approximations are much cheaper and can be just as accurate.Approximate gradients have been observed to make quasi-Newton pseudo-Hessian less efficient at accelerating convergence, but we have not encountered this phenomenon here, likely because the terminal fidelity requirements were less stringent than they are in quantum technology applications.
The midpoint gradient method has been tested herein for 2DRF pulses, with the intent to create a large neural network training library, but it is also applicable to other pulse types, e.g.multidimensional parallel transmit pulses of any target flip angle, and other control types, e.g.field gradients and novel multi-channel shim waveforms 41 .These applications have a large number of channels, and would benefit significantly from the fast gradient computation and the relatively high accuracy of the midpoint approximate gradient method outlined in this paper.plementary material.
Our code repository at https://github.com/madssakrewill contain the proposed gradient implementations including capabilities for parallel transmit.
] , will have N + 1 temporal instances for each spatial position, with an initial state M which includes the RF fields in the Zeeman rotating frame, B (p) x,n and B (p) y,n , the magnetic field gradient, G n = [G x,n , G y,n , G z,n ] , the magnetic field inhomogeneity, B (p)

D Θ (e Ω∆t ) = ∆t 0 ee
Ωs Θe −Ωs ds e Ω∆t Ωs ds e Ω∆t = Θ, Figure1.Relative accuracy of approximate gradients compared to the exact gradient in Eq.(18).Both plots show the relative accuracy of the standard zeroth order (STD-0), Eq.(15), first order (STD-1), Eq.(16), and first order midpoint (MID), Eq.(17), gradients.(Top) Errors as a function of time slice, ∆t, with the dashed vertical line being the ∆t used in the study.(Bottom) Errors as a function of offset for ∆t = 10 µs, corresponding to the dashed vertical line in the top plot.Both plots show the error when field gradient controls (cG) are used together with RF controls (cu, cv) with dotted lines, and when it is just RF controls (as in this study) with full lines.

Figure 2
Figure 2 shows the NRMSE and FA statistics from all Bloch simulations.The cases are sorted by the NRMSEs of the EXACT gradient) within each target category.All FA maps are shown in Figure S2-4 in the Supporting Information.

Figure 2 .
Figure2.Plots of NRMSE and FA statistics for the different target categories: BW (left column), GrBW (center column), Gr (right column).The case numbers are sorted based on the NRMSE values from the optimization, i.e., the error of the Bloch simulated magnetization with respect to the target magnetization, based on the EXACT gradient results.For the Gr target category, the entire pattern is considered as inside the ROI, thus there are no plots for outside areas.In the same category, the expected mean and standard deviation FA inside the ROI is not expected or supposed to be 30 • and 0 • , respectively, but may vary according to the gray levels.

Figure 3 .
Figure 3. FA maps of the best (left), closest-to-mean (middle) and worst (right) cases, when contrasting the NRMSE of the FA maps of the MID gradient with respect to the EXACT gradient.The corresponding STD-0 and STD-1 results are shown as well.Differences are shown in the lower part.Only pulses from converged solutions were considered for this Figure.

Figure 4 .
Figure 4. NRMSE histograms contrasting the MID gradient against the EXACT gradient, for the BW (top), GrBW (middle) and Gr (bottom) target categories.The best, closest-tomean and worst case examples shown in Figure 3 are herein signified by the dashed lines.The histograms contain all cases, i.e., also pulses terminated by reaching the maximum number of iterations, why the histogram edges extend below and beyond the best and worst case levels, respectively.

Figure 8 .
Figure 8. Flip-angle (FA) maps in the Gr category.From left to right: Target FA maps (Gray-scale images were normalized and scaled flip-angle wise by the nominal FA of 30 • .The super-ellipsis mask was then applied); FA maps produced by the EXACT, MID, STD-1 and STD-0 gradient methods, respectively.