Very accurate time propagation of coupled Schr\"odinger equations for femto- and attosecond physics and chemistry, with C++ source code

In this article, I present a very fast and high-precision (up to 33 decimal places) C++ implementation of the semi-global time propagation algorithm for a system of coupled Schr\"odinger equations with a time-dependent Hamiltonian. It can be used to describe time-dependent processes in molecular systems after excitation by femto- and attosecond laser pulses. It also works with an arbitrary user-supplied Hamiltonian and can be used for nonlinear problems. The semi-global algorithm is briefly presented, the C++ implementation is described and five sample simulations are shown. The accompanying C++ source code package is included. The high-precision benchmarks (long double and float128) show the estimated calculation costs. The presented method turns out to be faster and more accurate than the global Chebyshev propagator.

$$ i\hbar\,\frac{\partial}{\partial t}\begin{pmatrix}\psi_1\\ \psi_2\\ \vdots\\ \psi_n\end{pmatrix} = \begin{pmatrix}\hat H_1(t) & \hat V_{1,2}(t) & \cdots & \hat V_{1,n}(t)\\ \hat V_{2,1}(t) & \hat H_2(t) & \cdots & \hat V_{2,n}(t)\\ \vdots & \vdots & \ddots & \vdots\\ \hat V_{n,1}(t) & \hat V_{n,2}(t) & \cdots & \hat H_n(t)\end{pmatrix} \begin{pmatrix}\psi_1\\ \psi_2\\ \vdots\\ \psi_n\end{pmatrix} \tag{2} $$

Each ψ_n corresponds to a particular wavefunction at the n-th electronic potential, while the V̂_j,k correspond to the coupling elements between the corresponding levels and Ĥ_i(t) is the Hamiltonian at a given level. The solution to such a system provides us with an understanding of fundamental quantum processes. For almost all of these processes closed form solutions do not exist. Instead, numerical algorithms are developed to simulate the quantum processes from first principles. This paper describes a general semi-global algorithm for various classes of problems including time-dependent Hamiltonians, nonlinear problems, non-Hermitian problems and problems with an inhomogeneous source term [1,2,3]. The novelty is both the C++ implementation (10× faster than Octave) of this algorithm as well as the addition of the ability to calculate multiple coupled electronic levels in high precision (the long double and float128 types having 18 and 33 decimal places respectively).
The main advantage of the presented algorithm is the ability to significantly reduce the numerical error in time propagation, down to the level of the Unit in the Last Place (ULP) error 1 of the floating point numerical representation (see Sections 3 and 6, compare also with [4]). This is made possible by iteratively applying Duhamel's principle until the required convergence criterion is met for Ĥ(t). Additionally, the error tolerance can be used to control the calculation speed.
This ability to produce results with errors in the range of the numerical ULP error 1 of the precision used, together with the high-precision types long double (18 decimal places) and float128 (33 decimal places) and a fast C++ implementation, means that the code presented in this work can be used to obtain reference results in many difficult simulation cases.
This work is divided into the following sections. In the next section the theoretical introduction to the semi-global method [1,2,3] is presented. In Section 3 the high-precision calculations are briefly substantiated and described. In Section 4 the technical details of the C++ implementation are discussed. In Section 5 a validation of the current implementation is performed. In Section 6 the computation speed benchmarks are presented. In Section 7 the accompanying C++ source code package is described and finally, the conclusions are in Section 8.

The semi-global method
The time-dependent Hamiltonian is useful for ultrafast spectroscopy, high harmonic generation or coherent control problems. In some cases the Hamiltonian may become nonlinear, depending explicitly on the state ψ(t). Such a case occurs in the mean-field approximation, in the Gross-Pitaevskii approximation [5], time-dependent Hartree [6,7,8,9] and time-dependent DFT [10,11,12,13]. Another complication may arise when adding a source term to the Schrödinger equation, such as in scattering problems [14]. All these cases can be calculated with the semi-global [2,3,15] method ported from Matlab/Octave [2,3] to C++ and described in this paper.
The global scheme 2 described in [16,17] assumes the knowledge of the eigenvalue range of the Hamiltonian (i.e. E_min and E_max). Usually, such knowledge is missing, especially for time-dependent or non-Hermitian problems. To overcome this difficulty the method below is implemented with the Arnoldi approach. The main advantage of this approach is that the algorithm determines the energy range while constructing the Krylov space. A variation of the semi-global method also allows the Chebyshev approach, where the energy range is required [2]; this approach is available in the attached C++ source code but is not discussed in detail in this paper. For reference see [2,3].
The following several sections summarize the detailed derivations presented in [2] and thus are not a novelty of this work; however, they are a crucial prerequisite for the final, novel, Section 2.7, where the method discussed here is extended to several coupled time-dependent Schrödinger equations (such as Eq. 2). This final extension and its novel high-precision (up to 33 decimal digits) implementation in C++ makes it possible to simulate complex multilevel diatomic molecular systems. Another novelty is that the presented C++ code is 10× faster than the original Matlab/Octave code [2] (see Section 6).
In other, less sophisticated, algorithms the way to overcome the Hamiltonian time dependence is to use a small timestep and assume a constant Hamiltonian within the single step. This makes the method effectively first order in time, and the precision gained by using a higher order method is lost. More sophisticated methods such as the Magnus expansion [18] or high order splitting [19] do not assume a stationary Hamiltonian, but these methods are still local methods (they require a relatively small ∆t) with a limited radius of convergence.
The method presented here is a semi-global method, because it combines elements of local and global propagation methods [15]. A fully global time-dependent method was also developed, but it turned out to be too expensive computationally [20]. The semi-global algorithm described here is efficient in terms of accuracy relative to the numerical effort.

Establishing notation
The time-dependent Schrödinger equation Eq. 1 is rewritten in a discretized form: the time derivative of a discrete state vector ⃗ψ of finite size 3 is equal to a matrix (with time dependence) operating on the same vector:

$$ \frac{d\vec{\tilde\psi}(t)}{dt} = -\frac{i}{\hbar}\,\tilde H(t)\,\vec{\tilde\psi}(t) \tag{3} $$

Let us introduce a more general version of Eq. 3 by adding a source term ⃗s and by adding a dependence of H̃ on ⃗ψ(t):

$$ \frac{d\vec{\tilde\psi}(t)}{dt} = -\frac{i}{\hbar}\,\tilde H(\vec{\tilde\psi}(t), t)\,\vec{\tilde\psi}(t) + \vec{\tilde s}(t) \tag{4} $$

This is in fact a general set of ordinary differential equations (ODE). For convenience let us define G̃ by incorporating the imaginary unit and Planck's constant into it:

$$ \tilde G(\vec{\tilde\psi}(t), t) \overset{\text{def}}{=} -\frac{i}{\hbar}\,\tilde H(\vec{\tilde\psi}(t), t) \tag{5} $$

Thus we obtain the following set of ODEs to solve:

$$ \frac{d\vec{\tilde\psi}(t)}{dt} = \tilde G(\vec{\tilde\psi}(t), t)\,\vec{\tilde\psi}(t) + \vec{\tilde s}(t) \tag{6} $$

and given the initial condition ⃗ψ_0 def= ⃗ψ(t = 0) the method described in this section will allow propagation of the discrete state vector to the next timestep ⃗ψ(t + ∆t).

2 ,,global" refers to the algorithm being independent of the size of the timestep, contrary to typical Taylor based methods where the power n in ∆t^n refers to the order of the method. It treats ,,globally" the entire time span of the calculation.

3 In the notation of ⃗ψ the tilde •̃ indicates that it is discretized, while the vector ⃗• indicates that it is a state vector.

Short summary for time-independent Hamiltonian
The short summary in this subsection only serves the purpose of introducing simpler versions of the formulas, which are extended to the time-dependent Hamiltonian in the following subsections 4 . The topics discussed in Sections 2.2-2.4 all concern the same mathematical problem of approximating ⃗u = f(A)⃗v. It can be solved by polynomial expansion or by the Arnoldi approach, where the Arnoldi algorithm is just a special case of the polynomial approximation.
We shall emphasize the main perk of all global time propagation methods: the evolution operator is expressed as a function of a matrix Hamiltonian expanded over the energy spectrum using a chosen polynomial basis, thus allowing an arbitrarily large timestep. Let us for a brief moment consider again a simpler case without the time dependence and without the source term:

$$ \frac{d\vec{\tilde\psi}(t)}{dt} = \tilde G_0\,\vec{\tilde\psi}(t) \tag{7} $$

where G̃_0 def= G̃(t = 0) loses time dependence. The evolution operator is then expressed as an expansion in a truncated (to K number of terms, see Tab. 1 on page 6 for a summary of parameters in this method) polynomial series. Hence the function f(x) = e^{x∆t} (∆t is a parameter) is approximated as:

$$ f(x) \approx \sum_{n=0}^{K-1} a_n\,P_n(x) \tag{8} $$

where P_n(x) is a polynomial of degree n (e.g. a Chebyshev polynomial, like in [16], but other polynomials can also be used here) and a_n is the expansion coefficient. Then applying this evolution operator:

$$ \vec{\tilde\psi}(t + \Delta t) = f(\tilde G_0)\,\vec{\tilde\psi}(t) \approx \sum_{n=0}^{K-1} a_n\,P_n(\tilde G_0)\,\vec{\tilde\psi}(t) \tag{9} $$

4 The content of this subsection is not implemented in the C++ code.
We obtain the solution 5 at time t + ∆t. The emphasis here lies on the ,,global" property of the method: the error does not depend on the timestep ∆t. The solution is obtained directly at the final time t + ∆t, which can be arbitrarily large. We shall note however that the expansion Eq. 9 has to be accurate in the eigenvalue domain of G̃_0. We might consider the following polynomials P_n:
(a) Use the Taylor polynomials P_n(x) = x^n and expand the evolution operator in a Taylor series (a common approach in local time integration methods), but it is a poor choice: they are not orthogonal. On the contrary: as n increases they become more and more parallel in the function space.
(b) Use the Chebyshev polynomials P n (x) = T n (x). The fact that they are orthogonal to each other provides two useful properties: (1) the series converges fast and (2) the expansion coefficients a n are given by a scalar product of the P n (x) with f (x). This approach is used in [16].
(c) Use the Arnoldi approach which works well when the spectral range of the Hamiltonian is unknown. It uses the orthonormalized reduced Krylov basis representation. This method will be summarized in Section 2.6 and it is used in the presented here semi-global approach for timedependent Hamiltonian.
We shall note that in Eq. 9 we obtain the solution only at the chosen time t + ∆t. It is desirable to follow the evolution of the physical process at a smaller timestep, so that the time dependence of the Hamiltonian (which is introduced in the following subsections) can be more accurately captured. It is possible to obtain these intermediate time points ∆t_j ∈ [0, ∆t) at negligible additional cost: using the same matrix-vector operations P_n(G̃_0) (Hamiltonian acting on the wavefunction) but with different precomputed scalar coefficients a_{n,j} (where j corresponds to an intermediate time point in the evolution):

$$ \vec{\tilde\psi}(\Delta t_j) \approx \sum_{n=0}^{K-1} a_{n,j}\,P_n(\tilde G_0)\,\vec{\tilde\psi}_0, \qquad j = 0, \dots, M-1 \tag{10} $$

where M is the number of additional intermediate time points in the solution. It is possible with low computation cost because the expansion of the function in the P_n basis has different coefficients a_{n,j} but the same P_n(G̃_0)⃗ψ_0 evaluations. Eq. 10 is in fact an expression for the evolution operator acting on the wavefunction, which for the purpose of the next section we will denote as:

$$ \tilde U_0(\Delta t_j)\,\vec{\tilde\psi}_0 \tag{11} $$

5 It should be clarified here that Eq. 9 (in which the operation of the function of a matrix on a vector uses a polynomial expansion approximation of the stationary evolution operator) and the first term in the RHS of Equations 20 and 26 (in which the Arnoldi algorithm is used to approximate a different function of a matrix which operates on a vector) are actually both a special case of the solution of ⃗u = f(A)⃗v, which is a general problem in numerical analysis. The two problems can be solved by the Arnoldi approach (see also Restarted Arnoldi [21]) or a polynomial expansion. The Arnoldi algorithm is a subtype of the polynomial expansion, because it is a polynomial interpolation at estimated eigenvalues.
Hence Eq. 10 can be also written as:

$$ \vec{\tilde\psi}(\Delta t_j) = \tilde U_0(\Delta t_j)\,\vec{\tilde\psi}_0 \tag{12} $$

Source term with time dependence

On our way towards the full time dependence in Eq. 6, we will now add the source term with time dependence to Eq. 7:

$$ \frac{d\vec{\tilde\psi}(t)}{dt} = \tilde G_0\,\vec{\tilde\psi}(t) + \vec{\tilde s}(t) \tag{13} $$

We can integrate this equation using the Duhamel principle, which provides a way to go from the solution (Eq. 10) of the homogeneous Eq. 7 to the solution of the inhomogeneous Eq. 13:

$$ \vec{\tilde\psi}(t) = \tilde U_0(t)\,\vec{\tilde\psi}_0 + \int_0^t \tilde U_0(t - \tau)\,\vec{\tilde s}(\tau)\,d\tau \tag{14} $$

Above we took advantage of being able to extract the Ũ_0(t) part from the Duhamel integral. So the only part which needs to be solved explicitly is the remaining integral. To do this we will assume that ⃗s(τ) (after discretization: ⃗s(∆t_j)) can be expressed as a polynomial function of time. It is a bit of a simplification, but later on we will be able to decide how many terms M (see Tab. 1) the algorithm will use in the series, thus being able to directly control the accuracy of the solution:

$$ \vec{\tilde s}(t) \approx \sum_{m=0}^{M-1} \vec{\tilde s}_m\,t^m \tag{15} $$

It shall be noted here that a Chebyshev approximation of ⃗s(t) is used, which is next converted to the Taylor form of Eq. 15. In this way a much faster convergence is achieved, which is an advantage of the local aspects of the semi-global method.
Together with the desired error tolerance (to be introduced in Section 2.4) and the parameter K (to be introduced in Section 2.6) we have all the parameters which govern the accuracy of the solution. The meaning of these parameters is summarized in Section 2.8.
Let us now go back to calculating the integral present in Eq. 14. Plugging Eq. 15 into Eq. 14 and following the derivation in [2] (which introduces the functions f_m(G̃_0, t) together with the auxiliary vectors ⃗w_m and ⃗v_j; in particular f_m(z, t) = m! z^{−m} (e^{zt} − Σ_{k=0}^{m−1} (zt)^k / k!)), we define the vectors ⃗v_j starting from:

$$ \vec{v}_0 \overset{\text{def}}{=} \vec{\tilde\psi}_0 \tag{18} $$

which satisfy the recurrence relation:

$$ \vec{v}_j = \frac{1}{j}\left(\tilde G_0\,\vec{v}_{j-1} + \vec{\tilde s}_{j-1}\right), \qquad j = 1, \dots, M \tag{19} $$

so that the solution can be written as:

$$ \vec{\tilde\psi}(t) = \sum_{j=0}^{M-1} t^j\,\vec{v}_j + f_M(\tilde G_0, t)\,\vec{v}_M \tag{20} $$

where f_M(G̃_0, t) is acting on ⃗v_M and the calculations are actually performed in the discretized spectrum z ∈ σ(G̃_0) (see Section 2.6). Thanks to the recurrence relation Eq. 19, the overall computational cost of Eq. 20 is reduced to M + K matrix-vector multiplications 6 .

Introducing time-dependent Hamiltonian
In the case of a time-dependent Hamiltonian:

$$ \frac{d\vec{\tilde\psi}(t)}{dt} = \tilde G(t)\,\vec{\tilde\psi}(t) + \vec{\tilde s}(t) \tag{22} $$

or rather:

$$ \frac{d\vec{\tilde\psi}(t)}{dt} = \tilde G(\vec{\tilde\psi}(t), t)\,\vec{\tilde\psi}(t) + \vec{\tilde s}(t) \tag{23} $$

the Duhamel principle does not yield a closed-form solution.
Instead, an iterative procedure can be used to obtain better and better approximations of the solution. First let us move the time dependence from G̃(t) to ⃗s by defining an extended source term 7 ⃗s_e:

$$ \vec{s}_e(\vec{\tilde\psi}(t), t) \overset{\text{def}}{=} \bar{\tilde G}(t)\,\vec{\tilde\psi}(t) + \vec{\tilde s}(t) \tag{24} $$

where Ḡ̃(t) def= G̃(t) − G̃_avg and G̃_avg is the average time-independent 8 component of G̃. The equation to be solved now has the following form:

$$ \frac{d\vec{\tilde\psi}(t)}{dt} = \tilde G_{avg}\,\vec{\tilde\psi}(t) + \vec{s}_e(\vec{\tilde\psi}(t), t) \tag{25} $$

We can use the previous solution Eq. 20 to approximate the time evolution of ⃗ψ(t):

$$ \vec{\tilde\psi}(t) \approx \sum_{j=0}^{M-1} t^j\,\vec{v}_j + f_M(\tilde G_{avg}, t)\,\vec{v}_M \tag{26} $$

where this time the ⃗v_j are computed from ⃗s_e. It means that the ⃗v_j depend on ⃗ψ(t), which is still unknown; however, the solution can be obtained via iterations. Upon first evaluation, we either extrapolate from the previous timestep ∆t (by putting t + ∆t into Eq. 26) or, when jump-starting the calculations, we use ⃗ψ_0. Next, in each successive evaluation, we use the approximation from the previous iteration within the timestep ∆t (which spans M time points). We repeat the iterative procedure until the convergence criterion at sub-step M is met:

$$ \frac{\left\| \vec{\tilde\psi}^{(I)}(t + \Delta t) - \vec{\tilde\psi}^{(I-1)}(t + \Delta t) \right\|}{\left\| \vec{\tilde\psi}^{(I)}(t + \Delta t) \right\|} < \varepsilon \tag{27} $$

where I is the iteration index. It means that this method has a radius of convergence that directly depends on the timestep ∆t covered in a single iteration, which contains M time points. Too large a ∆t will cause the successive iterations to diverge; this is the reason why this method is not a fully global method but a semi-global one. The useful result of this situation is that one can directly control the computation cost by setting an acceptable computation error ε. For reference solutions it can be set to the ULP numerical precision; for faster calculations it can be a larger value. The novelty in this work is that it also works for higher precision types such as long double or float128 with 33 decimal places (Table 8), thus enabling very accurate calculations.
Since we have put the dependence on ⃗ψ(t) into ⃗s_e(⃗ψ(t), t), it is also computationally inexpensive to put this dependence into the Hamiltonian, hence the method described above also works for Eq. 4. So, putting it all together, the solution to Eq. 4 is the following:

$$ \vec{\tilde\psi}_{new}(t) \approx \sum_{j=0}^{M-1} t^j\,\vec{v}_j(\vec{\tilde\psi}_{prev}) + f_M(\tilde G_{avg}(\vec{\tilde\psi}_{prev}), t)\,\vec{v}_M(\vec{\tilde\psi}_{prev}) \tag{28} $$

where the iterations are performed until the convergence condition Eq. 27 is met. Thanks to being able to extrapolate ⃗ψ_prev into the next timestep ∆t by putting t + ∆t into the above equation, and with a good choice of the M, K parameters, usually one iteration is enough to achieve the desired convergence.

Chebyshev time points spanning ∆t and Newton interpolation
When interpolating a function (in this case it is a vector approximation of ⃗s_e at the time points), an equidistant set of points is not a good choice: the closer to the boundary of the interpolation domain, the less accurate the interpolation. This effect is known as the Runge phenomenon. A much better set of points is one with points becoming denser closer to the edges of the domain. It is called Chebyshev sampling and it is required to perform the Chebyshev approximation of ⃗s_e in time. Such a set of points is chosen with the following formula:

$$ \Delta t_j = \frac{\Delta t}{2}\left(1 - \cos\frac{\pi j}{M - 1}\right), \qquad j = 0, 1, \dots, M-1 \tag{29} $$

see the example for M = 11 in Fig. 1. These points are used as the time points ∆t_j, to interpolate ⃗s_e, spanning the whole single ∆t discussed in the preceding sections. The Chebyshev approximation mentioned here is different from the one discussed in Section 2.2, which approximates a function of a matrix that operates on a vector. A Newton interpolation of a function f(t) at the points ∆t_j is defined as:

$$ f(t) \approx \sum_{n=0}^{M-1} a_n\,R_n(t) $$

where a_n are the coefficients of the expansion and R_n are the Newton basis polynomials defined as R_0(t) = 1 and $R_n(t) = \prod_{j=0}^{n-1}(t - \Delta t_j)$. During the process of calculating the Newton interpolation of f(t), the a_n coefficients of the expansion are calculated as the divided differences of the interpolated function f(t) (see [22] sections 25.1.4 and 25.2.26 on pages 877 and 880; also see [2] appendix A.1). It is used in Eq. 8.

Arnoldi approach
Calculation of the evolution operator in the method of [16] requires the knowledge of the spectral range of this operator. That method cannot be used when it is impossible to estimate the eigenvalue domain. The difficulty with such estimation grows when the eigenvalues are distributed on the complex plane, which is the case with absorbing boundary conditions (see Section 2.9). And almost all interesting use cases of time propagation (e.g. a multi-level diatomic molecular system evolving under a laser impulse) require absorbing boundary conditions. In such a, quite common, situation the Arnoldi approach comes to the rescue, because it works without prior knowledge of the eigenvalue domain.

Table 1: Summary of the parameters of the semi-global method.

K   The number of terms in the polynomial expansion used to compute a function of a matrix acting on a vector §. Eq. 9
M   The number of interior time points in each timestep, Figure 1 †. Eq. 15, Eq. 18
ε   Tolerance: the largest acceptable computation error. Eq. 27
∆t  The length of the timestep interval (it contains M time points spanning ∆t). Eq. 29

§ See footnotes 5,9 for details. K is defined as in Section 2.2, but in the code the Arnoldi algorithm from Section 2.6 is used (with the same meaning of K).
† ⃗s_e(⃗ψ(t), t) is expanded over M time points, hence it uses M polynomial terms in the expansion to cover all time points inside the timestep.
The Arnoldi approach calculates the f_M(G̃_avg, t)⃗v_M term in Eq. 28 in the following manner (see [2] and Section 4 for details):
(a) Construct the Krylov space of G̃_avg from the vector ⃗v_M and orthonormalize it with the Arnoldi iteration, obtaining the transformation matrix Υ and the Hessenberg matrix representation of G̃_avg.
(b) Find the eigenvalues of the Hessenberg matrix to estimate the energy spectrum at which f_M is interpolated.
(c) Rescale the eigenvalue domain during the process using the method of [23] by dividing by the capacity of the domain to reduce numerical errors.
(d) Perform the calculation in the reduced Krylov basis representation then transform the result back to original positional representation of ⃗ v j using the transformation matrix Υ.

Extension to coupled time-dependent Schrödinger equations
The critical observation that allows extending this semi-global algorithm to an arbitrary number of coupled Schrödinger equations (see Eq. 2) is that the method is independent of the Hamiltonian used. There is no requirement that this Hamiltonian describes a single-level system or several coupled levels. The method uses the iterative procedure to obtain the solution. The iterations are performed by invoking Ĥ(t) acting on ψ until a convergence criterion is met: the difference between the wavefunction from the previous iteration and the current iteration is smaller than the predetermined tolerance ε. This iterative procedure is executed inside a single timestep ∆t over the M time points.
During this calculation all the expansion coefficients, both with respect to K (Eq. 8) and with respect to M (Eq. 15), are stored inside matrices of size N × K and N × M respectively, where N is the number of grid points in the discretized wavefunction. The usual matrix algebra is used in the C++ implementation, hence preserving the original code of the implementation when moving to several coupled Schrödinger equations is desirable. And it is possible, because the algorithm itself is agnostic to the Hamiltonian: it only invokes H(t)ψ in the computation. The trick lies in the memory layout of the storage: the coupled wavefunctions are stored one after another inside a single-column vector. A system of k coupled wavefunctions uses k·N grid points. The system simply becomes larger and the semi-global algorithm uses matrices of sizes (k·N) × K and (k·N) × M respectively, without being informed how the information (stored in the column vector of size k·N) is used by the Hamiltonian. The technical implementation details are discussed in Section 4.1. Additionally, the same approach can be used when adapting the semi-global algorithm to solving higher dimensional systems with more spatial directions, such as a system of three atoms using Jacobi coordinates and with coupled Schrödinger equations therein.

Summary of parameters used by the semi-global method
The semi-global time integration method uses the following parameters: K, introduced in Eq. 9 to compute the evolution operator; M, introduced in Eq. 15 to help with the convergence of the iterative process; ε, the error tolerance (Eq. 27); and the ,,global" timestep ∆t (Eq. 29) over which the converging sub-iterations are computed. A short summary of these parameters is listed in Table 1.

Absorbing boundary conditions with a complex potential
An optimized complex absorbing potential is used [2,24,25]. In the damping band 10 a sequence of square complex barriers is placed, each of them having its own reflection and transmission amplitudes for a plane wave [25]. The parameters of each barrier are optimized with respect to the cumulative plane-wave survival probability of all barriers. The typical characteristic of such a barrier is that it damps momentum within a certain momentum range, and this range pertains to the current problem being calculated. After the optimization procedure is complete, the complex potential is added to the potential used in the time propagation. The Arnoldi approach works correctly with complex potentials (see Section 2.6) and this is why it is used by the semi-global time propagation method.

Calculations in higher numerical precision
There is a need for higher precision computations in quantum dynamics, especially for attosecond laser impulses where the simulated time span is very short and the timestep is very small [26]. A very high temporal resolution is necessary to describe such a system. It stems from the simple fact that with small timesteps (which are necessary to describe a quickly changing electromagnetic field) there are a lot of small contributions from each timestep. Upon adding these contributions many times, the errors will accumulate 11 . And only higher precision calculations can make the errors significantly smaller.
High-precision computation is an attractive solution in such situations, because even if a numerically better algorithm with smaller error or faster convergence is known for a given problem (e.g. Kahan summation [27] for avoiding accumulating errors 11 ), it is often easier and more efficient to increase the precision of an existing algorithm rather than to derive and implement a new one [28,29]. However, switching to high precision generally means longer run times [30,31], as shown in the benchmarks in Section 6.
Nowadays, high-precision computing is used in various fields, such as quantum dynamics for physical and chemical purposes [32,33,34,35], long-term stability analysis of the solar system [36,37,28], supernova simulations [38], climate modeling [39], Coulomb n-body atomic simulations [40,41], studies of the fine structure constant [42,43], identification of constants in quantum field theory [44,45], numerical integration in experimental mathematics [46,47], three-dimensional incompressible Euler flows [48], fluid undergoing vortex sheet roll-up [45], integer relation detection [49], finding sinks in the Hénon map [50] and iterating the Lorenz attractor [51]. There are many more yet unsolved high-precision problems in electrodynamics [52]. In quantum mechanics the extended Hylleraas three-electron integrals are calculated with 32 digits of precision in [34]. The long-range asymptotics of the exchange energy in a hydrogen molecule is calculated with 230 digits of precision in [35]. In quantum field theory, calculations of massive 3-loop Feynman diagrams are done with 10000 decimal digits of precision in [44]. Moreover, the experiments at CERN are being performed with higher and higher precision, as discussed in the "Welcome to the precision era" article [53]. It brings focus to precision calculations and measurements which are performed to test the Standard Model as thoroughly as possible, since any kind of deviation will indicate a sign of new physics.
Consequently, I believe that in the future, high-precision calculations will also become more popular and necessary for the development of femto- and attosecond chemistry. The following two use cases come to mind: (a) In order to properly describe a rapidly changing electric field of a laser impulse in a typical femtosecond and attosecond chemistry problem, a small timestep is necessary [54]. In such a situation the accumulation of machine errors can sometimes be a problem, which can be readily solved by high precision.
(b) High precision offers great resolution in the spectrum obtained from the autocorrelation function, which is commonly used to numerically identify the oscillation energy levels [55]. In cases where there are many overlapping levels due to multiple coupled potentials, a situation can occur in which two very close oscillation energy levels can only be resolved when more significant digits are available than in the typical double precision. The caveat is that a very small timestep is necessary, because it determines the resolution of the Fourier transform used on the autocorrelation function. A work exploring this possibility is currently in preparation for publication.
Therefore, I have implemented the algorithm presented here in high-precision. In Section 6 I show the performance benchmarks with additional details. See also my work [56] for highprecision classical dynamics.

Implementation of TDSE for time-dependent Hamiltonian
The C++ implementation of the iterative process in Equation 28 is shown in the Listing 1 of the function SemiGlobalODE ::propagateByDtSemiGlobal [57].
First the simulation parameters (see Tab. 1) are assigned to local variables. A helper class LocalData in line 6 contains variables local to this function as well as methods operating on them. It was introduced here in order to improve the clarity of the rest of SemiGlobalODE::propagateByDtSemiGlobal by splitting the calculation into several logical steps, described below. A matrix Unew (variable ⃗ψ_new in Eq. 28) in line 9 is created to store M (M is Nt_ts in the code) wavefunctions 12 , one for each time point (Figure 1). Additionally one extra, (M+1)-th, wavefunction is created for the purposes of estimating the interpolation in time. The self-convergent iteration process (Eq. 28) spans the whole ∆t time period divided into M time points (Figure 1). The guess wavefunction d.Ulast in line 12 (variable ⃗ψ_prev in Equation 28) takes its value from the extrapolated Uguess which was prepared at the end of the previous timestep (line 60 or 63) 13 . In the case when there was no previous iteration, ⃗ψ(t = 0) is used for all M time point wavefunctions; this causes the main convergence while loop to be executed about twice as many times. The value ⃗ψ_new(∆t_0) at the first time point is the final value from the previous timestep (line 13). The ⃗v_j vectors (Eq. 19) are created in line 14; their first component is just the wavefunction ⃗ψ_0 (Eq. 18). In the main convergence loop they are calculated in line 21. The remaining preparation lines concern counting the number of iterations it took Eq. 28 to converge, and tracking and estimating the calculation error. The tol+1 in line 16 ensures that the first execution of the while loop always takes place. In subsequent executions of this loop the value from Eq. 27 is used and compared against the ε tolerance.
In the main while loop (line 19, Eq. 28), first the extended inhomogeneous source term ⃗s_e(⃗ψ(t), t) from Eq. 24 is calculated in line 20, taking into account all of the time dependence of the Hamiltonian. Next the ⃗v_j vectors are calculated for all M. The Newton interpolation polynomial (Section 2.5) is used and the divided differences calculations are performed in the process. This is followed by a check for numerical divergence (line 22). Next a lambda function for G_avg (line 24) is created to be used in the calculation of the first term f_M(G_avg) in Eq. 28. In line 27 code branching occurs depending on the type of the calculation method. The Arnoldi method is described in this paper (Section 2.6); the Chebyshev method (line 41) is discussed in detail in [2] and is also possible to use here, although it is less useful because of the need to provide the energy range of the Hamiltonian. In line 29 the representation of G_avg in the orthogonalized Krylov space is stored in the Hessenberg matrix and its eigenvalues are found (line 30). This is the place in the Arnoldi algorithm which finds the range of the energy spectrum of the Hamiltonian; it makes it possible to calculate using a complex potential and greatly simplifies the usage of this algorithm. One additional point in the spectrum, named avgp (line 31), is used in order to track the calculation error and to compute the energy spectrum capacity [23]. Each pass of the convergence loop ends by evaluating the convergence criterion (Eq. 27) and assigning ⃗ψ_prev to ⃗ψ_new. As mentioned earlier, lines 41-47 perform the same calculation but with the Chebyshev approach [2]; notably the energy range min_ev and max_ev (used in line 44) has to be provided.

13 When jump-starting the calculations Uguess equals the initial wavefunction (see Eq. 18). The assignment to Uguess is performed in the class constructor, hence it is not shown in Listing 1.
When the iterative process is complete a check is performed whether the used number of iterations was enough for convergence (line 52) and maximum estimated errors are stored (line 53). Next, the total number of iterations is stored (line 55) in order to track the computational cost. Next, the solution is stored in a class variable fiSolution (line 56). Finally the wavefunction is extrapolated for the next timestep ∆t (line 60 and 63) and the estimated errors are checked and stored (line 67).
It shall be noted here that Listings 2, 3 and 4 are example implementations of a complete Hamiltonian together with coupled electronic levels and a custom kinetic energy operator which uses FFT. The semi-global algorithm in Listing 1 can be supplied by the user with an entirely different set of these functions, implementing a different Hamiltonian and kinetic energy operator. The possibilities include (1) curvilinear coordinates with more degrees of freedom than the single degree of freedom used here, e.g. Jacobi coordinates, (2) a grid with varying distance between the points. The only requirement is that the implemented Hamiltonian correctly ,,understands" the wavefunction and deals with it. The semi-global algorithm does not need to know the details of how the wavefunction is dealt with by the Hamiltonian, as long as it is converted to VectorXcr (,,flattened", see Sections 2.7, 4.1 and the file README.pdf in the accompanying source code package).
In the presented numerical implementation the damping absorbing boundary conditions from Section 2.9 are encoded inside the complex potential (lines 15 and 19 in Listing 2) during the calculations.
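The idea of encoding absorbing boundaries in the complex potential can be sketched as adding a negative imaginary term ramped up near both grid edges. The quartic ramp below is a common choice for illustration; the actual shape and parameters used in Listing 2 may differ:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Illustrative absorbing boundary: a negative imaginary potential ramped up
// near both grid edges and added to the physical potential, so the
// wavefunction is damped before it reaches the edge of the grid.
std::vector<std::complex<double>>
addAbsorbingBoundary(std::vector<std::complex<double>> V,
                     std::size_t rampPoints, double strength)
{
    const std::size_t n = V.size();
    for (std::size_t i = 0; i < rampPoints && i < n; ++i) {
        double s = double(rampPoints - i) / double(rampPoints); // 1 at edge, 0 inside
        std::complex<double> damp(0.0, -strength * std::pow(s, 4));
        V[i] += damp;            // left edge
        V[n - 1 - i] += damp;    // right edge
    }
    return V;
}
```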

Coupled time-dependent Schrödinger equations
In Section 2.7 I explained how this algorithm can be adapted to an arbitrary number of coupled electronic states by storing all levels inside a single table.
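The single-table layout can be sketched as follows; the index order is illustrative (level-major here), and the actual layout used in Section 2.7 should be checked in the source code:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Storing N coupled electronic levels, each with 'grid' points, in a single
// flattened vector: level l occupies indices [l*grid, (l+1)*grid).
struct CoupledWavefunction {
    std::size_t levels, grid;
    std::vector<std::complex<double>> data;   // size = levels * grid

    CoupledWavefunction(std::size_t nLevels, std::size_t nGrid)
        : levels(nLevels), grid(nGrid), data(nLevels * nGrid) {}

    std::complex<double>& at(std::size_t level, std::size_t i)
    { return data[level * grid + i]; }
};
```

With this layout the whole coupled system is a single vector, so the propagator needs no knowledge of how many levels it contains.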

Validation of the C++ code for semi-global algorithm
To validate the implementation of the semi-global time propagator I have reproduced the results both from [2] and from [58,59]. (Additionally I have compared the simulations of a two-level NaRb system with the WavePacket software [60,61,62].) The following tests are provided in the accompanying C++ code: (a) Atom in an intense laser field, Section 5.1 (accompanying file test_atom_laser_ABC.cpp). (b) Single avoided crossing, Section 5.2.1 (accompanying file test_single_dual_crossing.cpp). (c) Dual avoided crossing, Section 5.2.2 (accompanying file test_single_dual_crossing.cpp).
(d) Gaussian packet in a forced harmonic oscillator, supplementary materials of [2] (found in the accompanying file test_source_term.cpp).
(e) Forced harmonic oscillator with an arbitrary inhomogeneous source term, supplementary materials of [2] (accompanying file test_source_term.cpp).
The tests with a Gaussian packet in a forced harmonic oscillator and an oscillator with an inhomogeneous source term are example simulations found in the supplementary materials of [2]; I include them in the C++ code but do not discuss them here. Suffice it to say that I have reproduced them exactly.

Validation of time-dependent Hamiltonian using an atom in an intense laser field

Here I present the reproduction of results for a model atom in an intense laser field, which was presented in [2]. I shall note that these are not new results, since validation of an algorithm has to be run on an example for which the results are already known and verified. In this case I am comparing the electronic wavefunction of an atom in a laser field after evolution for 1000 a.u. (a time of about 24.2 fs) with the reference result found in the supplementary materials of [2]. In the calculations I have used the following parameters (Tab. 1): K = 9, M = 9, ε = 2 × 10⁻¹⁶ (for double precision; for long double ε = 1 × 10⁻¹⁹ and for float128 ε = 2 × 10⁻³⁴) and ∆t = 0.025 a.u.
In this test the central potential is represented by a simplified Coulomb potential without the singularity (hence it is a model one-dimensional atom); such a soft-core model is used, for example, in the context of intense-laser atomic physics. The electric field of the laser pulse is ζ(t) = ζmax sech²((t − 500)/170) cos(0.06(t − 500)) (Fig. 2). This pulse with carrier frequency ω = 0.06 a.u. corresponds to a wavelength of λ = 760 nm, similar to a Titanium-Sapphire laser. The sech² = 1/cosh² envelope is similar to the actual envelope found in laser pulses. The maximum amplitude ζmax = 0.1 a.u. represents an intensity of about Imax = 3.52 × 10¹⁴ W/cm². This field enters the Hamiltonian in the dipole approximation as the term x ζ(t). For the purposes of numerical simulation the Hamiltonian is further modified by adding complex absorbing boundary conditions (Section 2.9); their effect can be seen on the left and right edges of Fig. 3, where the wavefunction is diminished. The initial wavefunction is the ground state of the model atom. Figure 3 shows the time evolution of the wavefunction over 1000 a.u. The maximum intensity of the laser pulse is centered on 500 a.u. (Fig. 2), and it can be seen in Fig. 3 that this is when the wavefunction starts to undergo a rapid change and some parts of the wavefunction begin to dissociate. When the calculations reach the 1000 a.u. point in time I compare the results with the reference solution found in the supplementary materials of [2]; the maximum difference at a grid point is smaller than 8 × 10⁻¹⁵, well within the range of the numerical ULP error. It means that my implementation of the semi-global algorithm completely reproduces the reference results.
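The laser field can be written down directly from the parameters quoted above (ζmax = 0.1 a.u., carrier ω = 0.06 a.u., a sech² envelope of width 170 a.u. centred on t = 500 a.u.). This is a reconstruction from the text, so the envelope width placement should be verified against the test source file:

```cpp
#include <cmath>

// The laser field of the atom test, reconstructed from the parameters
// quoted in the text: zeta(t) = zeta_max sech^2((t-500)/170) cos(0.06(t-500)),
// everything in atomic units.
double laserField(double t)
{
    const double zetaMax = 0.1;                               // a.u.
    const double envelope = 1.0 / std::cosh((t - 500.0) / 170.0);
    return zetaMax * envelope * envelope * std::cos(0.06 * (t - 500.0));
}
```

At the pulse centre t = 500 a.u. the field equals ζmax = 0.1 a.u., and it decays towards the edges of the simulated time window.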

Validation of coupled Schrödinger equations using standard benchmarks
To validate the present implementation, in this section I reproduce the solution of two problems for a system of coupled Schrödinger equations featuring nonadiabatic quantum dynamics. They were originally introduced in [58] and afterward carefully analyzed in [59]; they are considered to be the standard benchmark for coupled systems. In these problems the diagonal diabatic potential energy surfaces V11(R) and V22(R) exhibit (a) a single crossing as in Fig. 4a and (b) a dual crossing as in Fig. 7a.
In the crossing region there is a strong coupling V 12 (R) between the two levels. In adiabatic representation the potential energy surfaces E 1 (R) and E 2 (R) undergo respectively a single (Section 5.2.1 and Fig. 4b) and dual (Section 5.2.2 and Fig. 7b) avoided crossing. The nonadiabatic coupling matrix elements D 12 (R) between the two levels are relatively large (Fig. 4b and  Fig. 7b).
The calculations are performed diabatically because the coupling elements are smaller and the Hamiltonian assumes a simpler form [63,54]. (Compare Eqs. 2.12 and 2.100 in [63] for the adiabatic TDSE, iℏ ∂ψ/∂t = −ℏ²/(2m) (∇ + τ)² ψ, where τ is the nonadiabatic coupling matrix, with Eqs. 2.22 and 2.115 for the diabatic TDSE, iℏ ∂χ/∂t = (−ℏ²/(2m) ∇² + V) χ; it can be seen that the position of τ is rather unfortunate in the adiabatic representation.) To gain more insight into the wave packet dynamics on each of these levels the results are presented here in the adiabatic representation: the two levels E1(R) and E2(R) do not cross, and the evolution of the wavefunction probability distributions (Figs. 5 and 8) on the lower and the higher electronic surface is easier to interpret. The transmission and reflection probabilities (Figs. 6 and 9) on E1(R) and E2(R) are calculated as integrals (on a discrete grid) of the single coupled wavefunction in the nuclear coordinate ranges R > 0 and R < 0 respectively. To obtain these results the wavefunctions are converted from the diabatic to the adiabatic representation by first diagonalizing the potential matrix V(R) for each value of R. The obtained eigenvalues are the adiabatic potential energy surfaces E1(R) and E2(R) [59]. Next, the matrix of the two eigenvectors ϕ1(R) and ϕ2(R) is used to compute the nonadiabatic coupling matrix elements D12(R), following the method presented in [59]; in a two-state basis the electronic integrals become a dot product, i.e. a sum.
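For a 2 × 2 symmetric diabatic potential matrix the diagonalization at each grid point R has a closed form, which is a convenient way to obtain the adiabatic surfaces described above (a sketch; the actual code may use a general eigensolver instead):

```cpp
#include <cmath>
#include <utility>

// Adiabatic surfaces E1(R) <= E2(R) from the 2x2 symmetric diabatic
// potential matrix at a single grid point R, via the closed-form
// eigenvalues of [[V11, V12], [V12, V22]].
std::pair<double, double> adiabaticSurfaces(double V11, double V22, double V12)
{
    const double mean = 0.5 * (V11 + V22);
    const double gap  = std::hypot(0.5 * (V11 - V22), V12);
    return { mean - gap, mean + gap };   // {E1, E2}
}
```

At a diabatic crossing point (V11 = V22) the adiabatic gap E2 − E1 equals 2|V12|, which is exactly the repulsion of the surfaces in the strong-coupling region.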
Finally, to obtain the adiabatic wavefunction ψ(R, t) from the diabatic one, χ(R, t), the transformation by the matrix of eigenvectors P(R) is used; since in these two examples we are dealing with a two-level system, it reduces to a 2 × 2 rotation. (An alternative method of obtaining the transformation matrix P(R) is to compute the mixing angle from Eq. 3.45 in [63], β(R) = −½ tan⁻¹[2V12(R)/(V11(R) − V22(R))], and then use the rotation matrix from Eq. 3.44 in [63].) The simulations begin with a single wavefunction placed on the lower energy surface E1(R), assuming the shape of a Gaussian wavepacket. (To be precise, the code uses the solution of a freely propagating Gaussian wave packet; in Eq. 38 it is assumed that t₀ = 0.) The initial values of the parameters assumed in each simulation are listed in Tab. 2. They were chosen so as to best represent the evolution of the coupled system dynamics and to completely reproduce the results in [58,59].

Single avoided crossing
In the first standard benchmark [58,59] the potential matrix elements in the diabatic representation Vij(R) are defined with the parameters A = 0.01, B = 1.6, C = 0.005 and D = 1.0 (Fig. 4a). In this example the diabatic surfaces cross at the nuclear coordinate R = 0 and a Gaussian off-diagonal potential is assumed, centered at this point. The adiabatic surfaces E1(R) and E2(R) (Fig. 4b) repel each other in the strong-coupling region and a large nonadiabatic coupling element D12 (Eq. 35) appears at the avoided crossing. Figure 5 shows the time evolution of the probability distributions (squared amplitude of the wavefunction |ψ(R, t)|²) on each of the adiabatic electronic surfaces, and Fig. 6 shows the transmission and reflection probabilities (integrals for R > 0 and R < 0) evolving over time.
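The diabatic matrix elements of this single-crossing benchmark, as they are commonly defined for it and matching the parameters quoted above, can be written directly in code. This is a reconstruction of the equations lost from the text, so the exact form should be verified against [58,59]:

```cpp
#include <cmath>

// Single avoided crossing benchmark potentials (diabatic representation),
// reconstructed from the standard definition of this benchmark with the
// parameters quoted in the text.
namespace single_crossing {
    const double A = 0.01, B = 1.6, C = 0.005, D = 1.0;

    double V11(double R)
    {
        return (R >= 0.0) ?  A * (1.0 - std::exp(-B * R))
                          : -A * (1.0 - std::exp( B * R));
    }
    double V22(double R) { return -V11(R); }                 // mirror surface
    double V12(double R) { return C * std::exp(-D * R * R); } // Gaussian coupling
}
```

The surfaces cross at R = 0, where V11 = V22 = 0 and the coupling V12 takes its maximum value C.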
In the high momentum case (Fig. 5a) the packet has enough energy to put about 32% of the population on the upper energy surface. Entering the higher level E2(R) caused the wave packet to lose energy: it has smaller momentum and propagates more slowly, as can be seen from the time labels placed beneath the center of each packet in Fig. 5a. Figure 6a shows the transmission and reflection probabilities for the high momentum case. It can be seen that the whole packet passes through the crossing point and the reflection vanishes over time. The final population is about twice as high on the lower electronic surface as on the upper one.
In the low momentum case (Fig. 5b) the packet does not have enough energy to populate the higher energy surface (note the vertical scale on the upper level ρ2 in Fig. 5b). After going up, the packet almost does not move forward and instead leaks back to the lower level in both directions. It can be seen in Fig. 6b that the reflection weakly increases over time while the transmission on both electronic surfaces slowly decreases. Most (91%) of the final population resides on the lower surface.
The obtained results are in very good agreement with [58,59]. Please note that I obtained these results using a different, arguably more precise, time integration algorithm than the one used in [58,59]. Also see Section 5.2.3 for a precision and performance comparison of these calculations against the global Chebyshev propagator [16,17].

Dual avoided crossing
In the second standard benchmark [58,59] the potential matrix elements in the diabatic representation Vij(R) are defined with the parameters A = 0.1, B = 0.28, C = 0.015, D = 0.06 and E₀ = 0.05 (Fig. 7a). In this example the diabatic potentials cross each other twice and a wide Gaussian off-diagonal potential is assumed. The adiabatic surfaces exhibit two avoided crossings (Fig. 7b) and the nonadiabatic coupling element D12 (Eq. 35) has two pronounced peaks. The time evolution of the probability distributions is shown in Fig. 8 and the respective transmission and reflection probabilities in Fig. 9.
Table 3: The parameters used in precision and performance tests of the semi-global method (see Table 1).
Table 4: Precision test of results from the global Chebyshev propagator [16] compared against itself, but with higher computation precision. Error is given in terms of ULP (see footnote 1); the size of ULP for each row is given in the last column (its precise value is in Table 3). There is no row for boost float128 because there is no result with higher precision against which it can be compared; hence it is used only as the reference against which lower precision results are compared.
The high momentum case in Fig. 8a demonstrates the effect of destructive interference between the first and second crossings. The wave packet populates the higher level around t = 150 (ρ2 in Fig. 8a), peaks at around t = 350 (transmission 2 in Fig. 9a) and arrives at the second crossing at about the same phase at which it entered the higher level, but this time the nonadiabatic coupling element D12(R) (Fig. 7b) has a negative sign. The same phase of the packet, in conjunction with the negative sign of D12(R), causes the wave packet to leave the higher electronic surface almost completely, despite having high momentum. This effect can be seen in Fig. 9a, where in the end about 98% of the population resides on the lower energy surface.
The low momentum case (Fig. 8b) shows constructive interference. First, at around t = 500, about 32% of the population enters the higher energy surface (ρ2 in Fig. 8b); then, at the second crossing, an additional 32% enters ρ2, so at t = 900 there is a significant population increase. Figure 9b shows that about 64% of the population was transmitted to the higher energy surface.
This phenomenon is also known as Stückelberg oscillations; it occurs when the time spent by the wave packet between the two coupling regions is an integer or half-integer multiple of the mean wavepacket oscillation period.
Again, the obtained results are in very good agreement with [58,59].

Precision and speed tests
The calculations from the two previous sections (single and dual avoided crossings) are used in this section to compare precision and performance with the global Chebyshev propagator [16,17]. This section also demonstrates the speed gain obtained by using the semi-global propagator.
The parameters of the semi-global method used in this comparison are shown in Table 3. The M parameter can be very low because there is no time-dependence in the Hamiltonian; the K parameter was adjusted to be the minimal value of K for which the warning about too small K is not printed (see Listing 1, line 48), i.e. for which the estimated error of the function of the matrix is smaller than ε. The ε parameter is set to the ULP error (see footnote 1) because the maximum possible precision is used here, and the timestep used is ∆t = 1 a.u.
Table 5: Precision test of results from the semi-global propagator compared against itself, but with higher computation precision. Error is given in terms of ULP (see footnote 1); the size of ULP for each row is given in the last column (its precise value is in Table 3). There is no row for boost float128 because there is no result with higher precision against which it can be compared; hence it is used only as the reference against which lower precision results are compared.
Table 6: Comparison of the semi-global propagator against the global Chebyshev propagator. Error is given in terms of ULP; the size of ULP for each row is given in the last column (its precise value is in Table 3).
The global Chebyshev propagator [16,17] has only one parameter, namely how many terms of the series are calculated, and its value is effectively fixed by the design of the algorithm to be 1.3 R (this recommendation is given in [17]), where R = ∆t(E_max − E_min)/(2ℏ). This value was found to provide full precision in nearly all of the calculations, even for the higher precision types. (To validate this I performed the same calculation using 10 R and found that the results are the same, except for two cases which are marked with ‡ in Table 6.)
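The recommended number of series terms can be computed directly from the formula above (in atomic units, so ℏ = 1; a sketch with illustrative names):

```cpp
#include <cmath>

// Number of Chebyshev series terms recommended in [17]: N = 1.3 * R,
// with R = dt * (Emax - Emin) / (2 hbar), here in atomic units (hbar = 1).
int chebyshevTerms(double dt, double eMin, double eMax)
{
    const double R = dt * (eMax - eMin) / 2.0;
    return static_cast<int>(std::ceil(1.3 * R));
}
```

Note that, unlike the Arnoldi branch of the semi-global method, this requires knowing the energy range of the Hamiltonian in advance.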
To compare the two algorithms I perform the same calculation of transmission and reflection probabilities for all four cases (Table 2 and Figures 6 and 9) using both the semi-global method and the global Chebyshev propagator [16,17]. The results of the transmission and reflection probabilities are stored in a text file with all significant digits (column 2 in Table 8). Since the timestep is ∆t = 1 a.u. and the transmission and reflection probabilities are stored for each timestep, the text files contain a number of lines equal to the total simulation duration ("simulation time" in Table 2) for each of the four cases. These numbers are then compared against the reference result and the maximum error in units of ULP found for each case is given in Tables 4-6.
Table 4 shows the results of comparing the global Chebyshev propagator [16,17] with itself but with higher numerical precision. The largest error found is for the dual crossing, low momentum case when comparing long double with float128 precision, and equals 20242 ULP units (where each ULP in this case equals 1.1 × 10⁻¹⁹). All ULP errors in the comparisons between long double and float128 are unusually high, which can be explained by the following phenomenon: several of the last terms in the series of the Chebyshev propagator are very small, since the Bessel function values decay exponentially; this results in the addition of very small values to the otherwise large values accumulated from the previous terms in the series. Hence the contribution of the last terms of the series vanishes and produces a slightly less accurate result than would be possible if higher precision were used. Very similar ULP error values will appear again in Table 6. Table 5 shows the results of comparing the semi-global method against itself, but with higher precision. The largest error is 126 ULP units (where each ULP in this case equals 1.1 × 10⁻¹⁹) and is significantly smaller than in Table 4, which means that the semi-global algorithm overall has higher accuracy than the global Chebyshev propagator.
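The ULP distance between two results, used throughout these tables, can be computed by reinterpreting the IEEE 754 bit patterns as integers. A sketch for double (the tables also use long double and float128, which need the analogous wider integer types):

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Distance between two doubles in units of ULP, computed by reinterpreting
// the IEEE 754 bit patterns as integers. Valid for finite values of the
// same sign, which is sufficient for comparing probabilities in [0, 1].
std::int64_t ulpDistance(double a, double b)
{
    std::int64_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);   // bit-exact reinterpretation
    std::memcpy(&ib, &b, sizeof b);
    return ia > ib ? ia - ib : ib - ia;
}
```

Adjacent representable doubles are exactly 1 ULP apart, so this measure counts how many floating point values lie between the two results.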
Finally, Table 6 shows the comparison of the semi-global algorithm against the global Chebyshev propagator. In this case it is also possible to compare the float128 results between the two algorithms. The largest ULP error is 20237, for the comparison between the long double semi-global and long double global Chebyshev propagators. This value is almost equal to the largest ULP error found in Table 4 and indicates that the global Chebyshev propagator indeed has lower accuracy: a more accurate semi-global result is being compared with a less accurate global Chebyshev one. (Why not the other way around? The answer lies in Table 5, which shows that the maximum possible errors of the semi-global method are much smaller.) The comparison between the float128 types of both methods is more favorable, with the largest ULP error equal to 3416.
All the calculations discussed in the previous paragraphs took a certain amount of time, which was measured to test the computation speed of both algorithms. These results are summarized in Table 7, separately for each precision, averaged over the four simulation cases. The semi-global method turns out to be 1.9× faster than the global Chebyshev propagator for double precision and even faster for higher precisions.
I would like to point out that the global Chebyshev propagator [16,17] is commonly used to validate the accuracy of other time propagation methods [64], while in my precision and speed tests it turned out to perform worse than the semi-global method in both precision and calculation speed.
The tests discussed in this section are available in the folder Section 5.2.3 PrecisionTest in the supplementary materials.

Benchmarks of high precision quantum dynamics
As mentioned in Section 3, I would like to emphasize that this algorithm is implemented in C++ for arbitrary floating point precision types, specified during compilation. It works just as well for types with 15, 18 or 33 decimal places. This is the reason why the Real type instead of double is used, e.g. in line 5 in Listing 1. The Real type is set during compilation to one of the following types: double, long double or float128. The list of all high precision types, the number of decimal places and their speed relative to double is given in Table 8. It follows my work on high precision in YADE, a software for classical dynamics calculations [56]. However, the arbitrary precision types boost mpfr and boost cpp_bin_float are currently not available in the quantum dynamics code because the Fast Fourier Transform (FFT) routines for them (used in Listing 4 on page 10) are currently unavailable in the Boost libraries [65]. To resolve this problem I took part in the Google Summer of Code 2021 [66] as a mentor, and the high precision FFT code is now in preparation for inclusion in the Boost libraries; the report from the high precision FFT implementation in Boost is available in [67]. After that work is complete, all types listed in Table 8 will be available for quantum dynamics calculations.
Table 8 shows the speed comparison between different high-precision types, relative to double, separately for classical dynamics and quantum dynamics. (It should be noted that the two types of problems used in the benchmark, quantum vs. classical dynamics, have entirely different characteristics and should not be compared per se; the only meaningful conclusion from comparing the two columns in Table 8 is that the slowdown factors are different, and this conclusion cannot be generalized.) The classical dynamics results are reproduced from the YADE benchmark [56] to serve as a reference. I have done the quantum dynamics benchmark using the example of the atom in an intense laser field from Section 5.1.
For the float128 type I did the benchmark twice, first with the same ε as for the double type, then with the significantly smaller ε matching the float128 type. We can see that using ε = 2 × 10⁻¹⁶ from double for float128 makes it about 50× slower, even though the semi-global algorithm then performs the same number of mathematical operations as if the double type were used. When using the full float128 precision, the decreased error tolerance ε forces more iterations in Eq. 28, consequently making it 170× slower. If one wished to calculate with a larger tolerance, say ε = 1 × 10⁻²⁵, then float128 still has to be used, and its speed will lie somewhere between the two values in Tab. 8.
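The compile-time selection of the floating point type can be sketched as follows, in the spirit of Real.hpp (the actual file also supports the boost float128 and arbitrary-precision types; this sketch shows only standard types, and the macro name is illustrative):

```cpp
#include <limits>

// Compile-time selection of the floating point type, in the spirit of
// Real.hpp. Build with -DUSE_LONG_DOUBLE to select long double.
#if defined(USE_LONG_DOUBLE)
using Real = long double;   // 18 decimal places on x86-64
#else
using Real = double;        // 15 decimal places
#endif

// The error tolerance epsilon is naturally tied to the ULP size of the
// chosen type, as done in the precision tests of Section 5.2.3.
constexpr Real defaultTolerance()
{
    return std::numeric_limits<Real>::epsilon();
}
```

All numerical code then uses Real exclusively, so a single recompilation switches the whole propagator to a different precision.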
I would like to mention that, from my experience [57], it is better to increase ∆t and allow more sub-iterations in Eq. 28, as this speeds up the calculation more than changing the K and M parameters. For example, this atom in the laser field calculation took 30 seconds with double and ∆t = 0.1 a.u., and 10 minutes with double and ∆t = 0.0041(6) a.u., both having the same ε = 2 × 10⁻¹⁶ error tolerance.
Moreover, the original code [2], which deals with a single Schrödinger equation, was written in Matlab [68], and the same simulation which took 30 seconds in C++ took roughly ten times as long in Octave, the open-source equivalent of Matlab. (The speed gain might be smaller in comparison to Matlab itself, since it is known to be faster than Octave; however, I have no access to the commercial Matlab software to quantify the difference. A single core was used in both cases, C++ and Octave.) So the 10× speed gain due to migrating from Octave to C++ can now be wisely spent on higher precision calculations or on simulating larger systems.

Accompanying C++ source code package
The C++ source code has been run and tested on Linux Ubuntu, Debian and Devuan (specifically Devuan Chimaera, Debian Bullseye, Debian Bookworm, Ubuntu 20.04 and Ubuntu 22.04; older Linux distributions do not work due to a too old version of the libeigen library), and it should run without significant tweaks on any modern GNU/Linux operating system. Some tweaks might be needed for other operating systems such as MacOS or Windows. In general the source code is very minimal, in the sense that it contains only the implemented semi-global algorithm in the files SemiGlobal.hpp and SemiGlobal.cpp, together with the tests listed in Section 5, which can be invoked by various makefile calls (see the makefile for the full list), such as make plot_Coker or make plot_Atom.
All available make targets can be readily found by reading the comments inside the makefile. One notable target is make testAllFast, which runs almost all the tests and takes about 20 minutes. The full test, make testAll, takes about 7 hours, due to the float128 precision being significantly slower. Of all the implemented tests, the file test_single_dual_crossing.cpp is the most interesting because it contains code for working with multiple electronic levels as well as placeholder code for time dependence in the potential, although the latter is not used there (the time dependent potential is used in test_atom_laser_ABC.cpp and test_source_term.cpp). A calculation of a full multi-level system with a time dependent potential is in preparation to be published in a separate paper, where we investigate the dynamics of a three level NaRb system subject to a laser excitation [69]. See the file README.pdf in the accompanying source code package for additional details about how to use the semi-global algorithm with a custom Hamiltonian in a custom coordinate representation, such as curvilinear coordinates. The section Usage in README.pdf will be maintained and improved as questions arise from users.
The SemiGlobal.hpp and SemiGlobal.cpp files are written to be self-contained and generic, adhering to C++ coding standards. This means that these files are readily usable in other C++ projects, with only minimal changes at the interface to "glue" the code to a different codebase. Depending on whether high precision calculations are required in the other software package, the file Real.hpp might be used as well.

Conclusions
In this paper, I present an implementation of the semi-global algorithm for coupled Schrödinger equations with a time-dependent Hamiltonian and a nonlinear inhomogeneous source term as a means to describe time-dependent processes in femto- and attosecond chemistry. The code works for multiple coupled electronic states and supports high precision computations with the types long double (18 decimal places) and float128 (33 decimal places). Higher, arbitrary precision types will become available in the future once the FFT algorithm for them becomes ready to use in the C++ Boost library.
The semi-global algorithm is verified to work correctly by comparing its results with five reference solutions: (1) atom in an intense laser field (2) single avoided crossing (3) dual avoided crossing (4) Gaussian packet in a forced harmonic oscillator and (5) forced harmonic oscillator with an inhomogeneous source term. All of them are available in the accompanying source code package.
The precision and performance tests revealed that the semi-global algorithm is more accurate than the global Chebyshev propagator, while being about two times faster for double precision and even faster for higher precisions.
This C++ code, which is 10× faster than the original Matlab/Octave code upon which it was based, can be used to produce reference high accuracy solutions of various problems, such as nonlinear problems, problems with an inhomogeneous source term, the mean field approximation, the Gross-Pitaevskii approximation or scattering problems.
The attached C++ source code is self-contained, which makes it possible to reuse the algorithm in different software packages.

Supplementary materials
This paper is accompanied by a file SemiGlobalCpp.zip which contains the C++ source code for GNU/Linux described above, licensed under the GNU GPL v.2 license.