Parallel time integration using Batched BLAS (Basic Linear Algebra Subprograms) routines

We present an approach for integrating the time evolution of quantum systems. We leverage the computation power of graphics processing units (GPUs) to perform the integration of all time steps in parallel. The performance boost is especially prominent for small to medium-sized quantum systems. The devised algorithm can largely be implemented using the recently-specified batched versions of the BLAS routines, and can therefore be easily ported to a variety of platforms. Our PARAllelized Matrix Exponentiation for Numerical Time evolution (PARAMENT) implementation runs on CUDA-enabled graphics processing units.


Introduction
The last decade has seen the advent of quantum technologies. Significant advances have paved the way for promising applications in computing, sensing or communication. The time evolution of quantum systems is governed by the Schrödinger equation with a time-dependent Hamiltonian Efficiently solving eq. (1) is key to understanding the systems at hand, and to their successful technological application. Substantial effort has been put into simulating large quantum systems (thousands of degrees of freedom), by exploiting specific properties that allow a reduction of the Hilbert space. In addition, significant effort has been put on porting such simulations to GPUs.
Nevertheless, a lot of research focusses on small to mediumsized quantum systems. Although a simulation of such systems can be easily done on modern CPUs, computational expense scales with the number of time steps. As an example, consider an electron spin simulated in the laboratory (non-rotating) frame. Here, GHz control fields are applied for multiple microseconds leading to the need for integrating over tens of thousands of time steps. Often, this problem can be tackled by a suitable approximation (e.g. the rotating frame [1]). However, such an approximation may not always be convenient or even possible: for instance, when studying non-secular effects like the Bloch-Siegert shift, the strong driving regime, complex modulated waveforms, or when performing a validation of the control software of an experiment. Accurate simulation rather than approximations (with sometimes elusive side effects) builds extra confidence in the correctness. Efficiency is key, too: when developing new pulse sequences for quantum control, it is desirable to run simulations at interactive speeds, such that the designer can quickly iterate different parameters. Furthermore, applications in the field of quantum optimal control rely on the fast evaluation of a time-dependent Hamiltonian.
With the recent advent of artificial intelligence research, new powerful tools have emerged. For suitable tasks, a modern GPU offers the computational power of a supercomputer from the mid 2000s, off-the-shelf and at a fraction of the cost. Like supercomputers, GPUs focus on massive parallelization. The implementation of matrix-matrix multiplication in suitable hardware structures, combined with fast memory access, allows for a fast and parallelized tackling of a variety of computational tasks.
We present an approach for solving the time-dependent Schrödinger equation in a form that is frequently encountered in experimental realizations of a variety of physical quantum systems. Our approach leverages the parallelization of GPUs for integrating the steps of the time propagation in parallel and relies on a fast memory connection. We use the recently standardized Batched BLAS routines [2] and a minimal set of custom functions. Therefore, our approach can be ported to a variety of platforms including, e.g., GPUs of other vendors or Field Programmable Gate Arrays (FPGAs). This paper is structured as follows: First, we describe the concept of slice-wise propagation and the parallelization of the calculation. Then, we focus on the two underlying problems to be solved. This is followed by details about the implementation using the Batched BLAS functions. Furthermore, we improve the convergence order by extending the approach to a Magnus integrator. Finally, we showcase the runtime and the convergence using a suitable example of a driven two-level system.
respectively. Here, |ψ denotes the system state of a pure quantum system and ρ the density matrix of a mixed state. One approach to treat arbitrary time-dependent Hamiltonians is to'slice' the Hamiltonian, known also as Euler's method. During the finite duration ∆t of each slice, the Hamiltonian is assumed to be stationary. For sufficiently small equidistant time steps, the sequence with converges to the true solution of the original problem. If the dimension of the Hamiltonian is large, Krylov methods can be employed to simplify the exponentiation. Here, we instead focus on small, dense matrices. The algorithm thus involves two steps: (i) calculating the matrix exponential efficiently for many matrices and (ii) executing the matrix multiplication of the individual slice-propagators. In the following subsections, we will discuss the chosen approaches for each of the two aspects.
We assume that the time-dependent Hamiltonian takes the form where H 0 is typically named drift Hamiltonian and H i and c i (t) denote control Hamiltonians and control coefficient array. The control terms represent the coupling of the quantum system, e.g., time-varying magnetic or electric fields (control fields). The decomposition in eq. 5 does not trade any generality: H i are simply a basis of the H(t) − H 0 space. Numerous experimental situations however conform very well to the form of eq. (5), with a number N of control fields that is much smaller than the total dimension of the Hamiltonian space. The overall structure of the approach is shown in Figure 1: We transfer a minimal amount of data to the GPU, the control arrays and the Hamiltonians. Then we expand the Hamiltonians for each time step in the GPU memory. Subsequently we exponentiate all slice Hamiltonians in parallel. For the final propagator, we reduce the slice propagators by repeated pair-wise matrix multiplications.

Matrix exponential
The numerical computation of the matrix exponential has been treated extensively by Moler and Van Loan in their famous '19 dubious ways' paper [3]. Moler Figure 1: Parallel time integration of the PARAMENT integrator. We upload a minimal amount of data to the GPU and propagate all time steps synchronously. The final propagator is obtained by iterative pair-wise multiplication of the slice propagators. In the case of the Magnus implementation, the expansion step also calculates the coefficients for the commutator terms.
on, e.g., Taylor or Padé approximations, (2) Methods relying on ODE solvers, (3) polynomial methods which are typically not very attractive due to their high computational cost, (4) matrix decomposition methods, (5) splitting methods like the powerful scaling-and-squaring technique and (6) Krylov methods, which are interesting for large matrices. On CPUs, matrix decomposition methods are particularly powerful due to well-established implementations like the Schur decomposition in LAPACK using the ZGEES or CGEES function [4]. This approach is especially competitive for Hermitian matrices like Hamiltonians, where the decomposed matrix is always diagonal. It is used e.g. in the MATLAB software package. In general, for non-Hermitian matrices, most modern implementations (e.g. in Python and MATLAB) combine the scaling and squaring technique with a series method, either Taylor or Padé. Typically the matrix is first scaled down until its norm is sufficiently small so that its exponential can be well approximated by a (reasonably) truncated Taylor or Padé approximation. The exponential of the original matrix can be recovered by raising the small-norm exponential to the correct power. For our implementation of the highly-parallelized calculation of the matrix exponential we choose a different series approach based on the Chebyshev polynomials. The expansion of the matrix exponential in a Chebyshev series in the context of time propagation of quantum systems goes back to Tal-Ezer and Kosloff [5]. The expansion of the complex exponential in a Chebyshev series can be obtained by starting with the expansion of e iωx on the unit interval x ∈ [−1, 1] where the coefficients are readily obtained from Abramovic Stegun (9.1.21) and T k (x) and J k (ω) denote the Chebyshev polynomials and the Bessel functions of the First Kind respectively. An approximation of e −iζ for the interval ζ ∈ [α, β] can be obtained by a straightforward affine-linear transformation of x. The scalar expansion (6) can be extended to matrix argu-ments and reads in our case for the slice propagators [6] e −iG ≈ e −i(α+β)/2 Here, G = H∆t is Hermitian and we suppose that the spectrum of this matrix lies between α = λ min (H∆t) and β = λ max (H∆t). For the transformed expansion, the coefficients a k read For the numerical evaluation of the matrix exponential, we will truncate the series in (7) at m max . Lubich [6] estimated the error for approximating the matrix exponential by using a Chebyshev series truncated after the m'th term to We use (9) to determine m max under the requirement that is smaller than the machine precision of the respective datatype. This gives us constrains for the maximum norm that the argument can take, c.f. Table 1.
The series in (7) can be efficiently computed by the Clenshaw algorithm [7].
The number of necessary steps (and thus the number of matrix multiplications) grows at least linearly with the matrix norm. For our application, this is not a major problem as the matrix norm per time slice ||H∆t|| is expected to be small if the time step is sufficiently short. In a general context, Chebyshev-series-based matrix exponentiation has been successfully combined with the powerful scaling-and-squaring approach [8]. This approach brings down the number of necessary matrix multiplications to a growth that is only logarithmic in the matrix norm. It is worth noting that the Chebyshev approach is also very appealing for problems with sparse Hamiltonians, as it relies only on matrix products of the form sparse × dense. This circumstance could be further leveraged for bigger Hamiltonians. Nonetheless, our approach requires a priori knowledge about the expected spectrum of the Hamiltonian. As the boundaries α and β of the Eigenvalue range are often not available, we estimate them by using a suitable matrix norm (Gershgorin's theorem).

Reduction by matrix multiplication
After obtaining all slice propagators, we multiply them together to obtain the full evolution, c.f. eq. (3). We can again parallelize this step by leveraging the associativity of the matrix multiplication. Our slice propagators U (i) are all quadratic and have the same dimensions. Therefore, we simply compute pairwise products of consecutive slice propagators: .., U (n−1) · U (n) . By repeating this scheme log 2 (n) times, where we halve the number of remaining time slices with every round, we compute the overall product. As we will see in the next section, the pair-wise matrix multiplications can be easily parallelized using the Batched BLAS functions. Although there exist more sophisticated approaches to this problem in the literature [9,10], during our numerical tests it turned out that this simple approach is sufficient as the major computational cost is given by the exponentiation step.

Implementation
The main idea of our implementation is that the computationally expensive exponentiation performed by algorithm 1 is run in parallel for all time steps. As all time steps have approximately a similar spectrum we can choose global α, β and m max that are suitable across all time steps. Together with the fact that all matrices have the same shape, this makes the algorithm ideally suitable for a Single Instruction Multiple Data (SIMD) platform, such as a GPU. Furthermore, algorithm 1 has been designed in a way that it can be implemented by only a small subset of the BLAS functions (batched and non-batched) which will bear the majority of the computational workload.

Batched BLAS routines
The Batched BLAS routines have been standardized in 2017 [2] and were created with the intend to maintain high compute resource utilization when dealing with small-to-medium-sized matrices. Looking at the ubiquitous General Matrix Multiply (GEMM) routine, the batched version performs the operation for a batch of length k in parallel. It is obvious that the Batched BLAS functions can be used to parallelize the matrix exponentiation step. However, the flexibility of the GEMM routine allows us to reuse the GEMM function for nearly all steps in Figure 1. As A and B are pointers, the strided batched GEMM routine can also be reused for the pairwise reduction operation, where we double the memory stride to twice the propagator size. A and B can point to the same region in memory with only an offset of one matrix between them. Depending on the implementation, special attention may have to be brought to possible bottlenecks induced by memory missalignments. However, our tests with the NVIDIA cuBLAS implementation (see below) did not reveal major negative effects.
Similarly, the addition of the Bessel coefficients to the diagonal elements of the propagators during the exponentiation step (lines 5 and 6 in algorithm 1) can be implemented using  Table 1: Maximum norms of the exponent of a matrix exponential that guarantees machine precision when calculating the matrix exponential by using the Chebyshev expansion (7) truncated at m max . Here, we approximated the spectral range of G with α = −||G|| and β = ||G|| and solved (9) for ||G||. For calculations in singleprecision floating-point format (FP32) we required < 2 −24 , for the double-precision floating-point format (FP64) < 2 −53 . the batched version of AXPY (scalar × vector multiplication, "a · x plus y"). By this, the full integration can be done solely with BLAS routines, without any custom compute kernels required. This makes the approach easily portable to a variety of systems.

GPU implementation using NVIDIA cuBLAS
We implemented the proposed integration scheme using the CUDA platform and by using the cuBLAS library. The batched version of the AXPY routine had to be implemented by a custom CUDA kernel as it is not yet part of the cuBLAS library (as opposed to other GPU frameworks like AMD's rocm). To reduce the memory requirements for working arrays, we implemented two Chebyshev iterations per loop iteration in algorithm 1, c.f. Appendix.
The performance of the integrator benefits from the fast memory connection on the GPU and the sheer number of compute processors available. On a Quadro P2000 GPU, for 80'000 time steps and a 12 × 12 system, the exponentiation step is by far the most time consuming (∼ 80% of the total run time). The NVIDIA visual profiler (NVVP) reports that we make good use of the available resources (a memory bus utilization of 55%, a compute resource utilization of 85%, and an occupancy of 90%). During the subsequent reduction step, these numbers are slightly reduced (55%, 75%, and 35%). This may indicate room for optimization, but any improvements here are unlikely to improve total runtime significantly, since time expended for the reduction step is short already. It is worth noting that memory misalignment problems can decrease the efficiency of the Batched BLAS functions when e.g. setting the memory stride to unusual values. However, for the cuBLAS library we do not observe this to be a major problem.
The user provides the control amplitudes as an array sampled at equidistant time points. From the control amplitude array we compute the actual exponents −iG, optionally by averaging the control vector over three points according to Simpson's quadrature rule. This makes sense in the context of a higher-order Magnus integrator (see section 4 below). Otherwise, we simply set G = H∆t with H as per equation (5).
The exact computation of the spectral boundaries α and β can be expensive. We instead use the operator norm ||G|| 1 as an inexpensive upper bound. We use a single value for α and β for all time steps, so we further bound the norm by the triangle inequality: ||G|| 1 /∆t < ||H 0 || 1 + N i=1 |c i (t)| ||H i || 1 . By requiring the user to keep the control amplitudes c i (t) ∈ [−1, 1], rescaling the control Hamiltonians if necessary, we hence define

Magnus integrators
It can be shown that the error of the above integrator in a single time step scales as O(∆t 3 ), making it a second order integrator over the full integration time T = N∆t. The order of the convergence can be improved significantly by Magnus expansion [11,12]. Magnus integrators have been successfully used for solving the Schrödinger equation in various implementations, including on GPUs [13]. However, these typically again focus on large Hamiltonians, rather than the small to mediumsized problems with long time horizon that we are considering.
Magnus proposed that the solution U(t) to any timedependent ODE (like the Schröedinger equation) can indeed be written as the exponential of some matrix Ω(t). He provided a series expansion of that exponent, i.e.
where [A, B] − := AB − BA is the commutator of the matrices A and B. The interpretation is obvious: The term Ω 1 (t) averages the Hamiltonian over the time slice. By replacing the integral with the mid-point rule, we recover the naive approach where we assume the Hamiltonian to be constant over the whole time slice. The higher-order terms Ω 2 (t), Ω 3 (t), ... describe the effects when the Hamiltonian at time t 1 does not commute with itself at a later time t 2 .
For constructing a fourth-order integrator, it is sufficient to only include terms up to Ω 2 , and then to approximate the integrals with quadrature rules of sufficiently high order [14]. The inclusion of this extra term neatly integrates with the method described above. As we will show next, including Ω 2 is equivalent to simply adding extra control terms in equation (5). The extra terms correspond exactly to the commutators of the existing control Hamiltonians.
Inserting equation 5, we find The matrix G is still of the same form as H in eq. (5). The matrices marked with an asterisk now correspond to the new effective control Hamiltonians. Also note that compared to eq. (5), we have now doubled the time step to 2∆t. This is because Simpson's rule requires an extra sample in the middle of the interval. By exponentiating −iG instead of −iH∆t we thus double the time step, yet also increase the order of convergence of the integrator from 2 to 4. The price to pay is that the transform introduces additional control Hamiltonians. It maps N original control Hamiltonians (and their corresponding control amplitude array) to 3/2N + N 2 /2 effective control Hamiltonians. This can be costly if the initial number of control Hamiltonians is already large.
Note that the transform H → G is entirely contained in the 'expand' step in figure 1. The time expended here is usually negligible though, as the computational bottleneck is the subsequent exponentiation. In practice, we find that the improved convergence allows for much fewer timesteps (for a given target accuracy), such that both the expansion and propagation steps are proportionally faster to compute. 2 Finally, an extension to higher-order Magnus terms is possible and integrators of 6 th or 8 th order can be obtained, although the number of necessary commutator terms grows very fast. Here, the recently developed commutator-free Magnus integrators [15] may provide an alternative.

Numerical experiments
We first test the performance by propagating several dense Hamiltonians as a function of matrix size. We used an NVIDIA V100 GPU. The results are shown in Figure 2. Next, we compare it against a fully parallelized CPU implementation, running on an Intel i9-9900 and linked against Intel MKL. The CPU implementation combines the expansion and propagation step in Fig. 1 into a single parallelized function. This improved the performance due to more efficient caching as a direct port of the GPU code would be an unfair comparison. For a 12 × 12 Hamiltonian system (encountered often in nitrogen vacancy research) and 80'000 time steps, we find that the GPU runs 50 times faster for single precision (FP32) and 25 times faster for double precision (FP64). In the regime of a small number of time steps and a small matrix dimensionality, the CPU is faster as the "outsourcing" of the computation to the GPU comes with several overheads. This is highlighted in Figure 3. For small matrices, we see an approximately logarithmic increase in the runtime, due to the extra kernel launches required in the "reduce" step.  Next, we test the rate of convergence. We used the model of a qubit driven with a circularly polarized excitation field, a problem for which an exact analytical solution exists. The studied system Hamiltonian is where the σ i denote the Pauli matrices. Figure 4 shows the resulting accuracy when increasing the number of time steps while keeping the total evolution time fixed. We clearly observe that for both, FP32 and FP64, we achieve a better convergence when implementing a Magnus integration scheme while the computational cost per time step only increases marginally. For the sinusoidal drive, our implementation reaches machine precision for ≈ 10 2 time steps per oscillation cycle in the case of FP32, and for ≈ 10 4 time steps per cycle in the FP64 case. Furthermore, we see that when truncating the Magnus expansion at Ω 1 , increasing the quadrature order alone does not improve the convergence. We observe that the error reaches a minimum after a certain number of timesteps. Finer timesteps do not improve the accuracy. Past this point, the error per time slice is limited by the machine precision, which accumulates over an increasing number of slices. With single-precision arithmetic, the achievable error is on the order of 10 −5 . This result might seem unsatisfying at first. However, this precision still by far exceeds the accuracy achieved in many experimental realizations of the quantum systems that the algorithm is designed to simulate. For instance, in the field of Nitrogen Vacancy (NV) center research, available Signal-to-Noise Ratio (SNR) frequently limits the practical experimental accuracy to ∼ 1%. In turn, this means that even consumer GPUs that throttle FP64 performance can provide acceptable performance.

The PARAMENT library
We provide the full-GPU integrator as a C library. We name the library PARAMENT (PARallelized Matrix Exponentiation for Numerical Time-evolution) and available for download 3   The usage model follows the steps of Figure 5. First, the user initializes the integrator by calling the Parament create() function. It returns a handle to a newly created context. The context stores the state of the integrator. Multiple contexts may exist at any time, and used independently; however they are not thread-safe nor reentrant. The context must eventually be released by calling Parament free().
Next, the Hamiltonians are uploaded to the GPU using the Parament setHamiltonian() function. Here, the user also decides on the use of the Magnus expansion. When enabled, PARAMENT will then calculate the necessary commutators (effective control Hamiltonians) and upload them to the GPU as well.
To obtain a propagator, the user calls the Parament equiprop() routine with the coefficient arrays. The Hamiltonians persist between propagations, so that repeated runs (e.g. with different control fields) are possible. Lastly, the implementation exposes several helper functions which allow tweaking the underlying numerics, e.g., the selection of a different m max . For a full description of available functions, the user is referred to the documentation of PARAMENT bundled together with the source code, or available on the project website.

Python
While written in C++, the PARAMENT library can easily be used together with other programming languages, including Matlab or LabView. For Python, we provide a reference binding, called pyparament. This use-case was the initial motivation for the development of PARAMENT: A fast lab-frame propagation in interactive compute sessions, e.g., during the development of new microwave control schemes, possibly in Jupyter notebooks. Applications include, for instance, testing of advanced control schemes e.g. by frequency-modulating the microwave pulses [16]. It truly embraces the mindset of 'interactive super-computing' [17].
The source code is available on the PARAMENT Github, or on the PyPI package repository.

Applications and outlook
The presented speed-up of lab-frame simulations will facilitate testing new control schemes during the sequence design of quantum control experiments. Due to implementation with only a few BLAS functions, our scheme can be ported easily to a variety of platforms, including AMD GPUs and FPGA devices.
The applications of our integration scheme, however, go beyond what it was initially designed for. It is generally suitable for studying the evolution of small-to-medium-sized quantum system under complex-modulated control fields. It can be used to quickly determine the Floquet states and quasi energies of strongly periodically driven system, by diagonalizing the propagator [19].
As the size (degrees of freedom) of the simulated system increases, the presented approach requires vastly more memory. Fortunately, modern GPUs provide ample amounts thereof. If that is insufficient, PARAMENT could easily be adapted to scale across multiple GPUs. We hope to encourage the implementation of Batched BLAS routines that natively support multi-GPU calculations via fast GPU-to-GPU buses like the NVIDIA NVLink.
A more visionary application could be the verification of an experimental control software and hardware stack. The quantum system could be replaced by PARAMENT, a fast digitizer, and a fast arbitrary waveform generator (AWG). If the latter two have direct access to GPU memory, performance may be sufficient to fully emulate the quantum system in near-real time.
Our approach is not limited to quantum systems. It can be applied to any differential equation that can be cast into the form of equation (1). Of course, computational efficiency is best if the problem can be formulated with only a few 'control Hamiltonians', such that limited upload rate from host computer to GPU does not affect the overall computation time.
Finally, the fully GPU-based integration algorithm can be used as an essential building block for fully GPU-based optimal control optimization routines like the GRAPE algorithm [20]. The optimizer must evaluate the time-evolution operator in every step of the optimization routine. We are confident that GRAPE can be greatly accelerated if built around PARAMENT. Depending on how large the norms of the slice Hamiltonians are, here the introduction of a scaling and squaring approach might be needed.

Appendix A. Pseudo-code of the BLAS-implementation
Here, we describe the main integration routine presented in the main text. We do not include the Magnus expansion as this can be seen as adding effective control Hamiltonians and amplitudes to the problem. Those can be obtained according to eq. (14). The goal is to provide an easy way to port the integration parallelization to other platforms that offer Batched BLAS routines. // D1 c o n t a i n s as first matrix the p r o p a g a t o r Listing 1: Implementation of the central integration routine in a C-style pseudo code