Wave-packet continuum discretisation for nucleon-nucleon scattering predictions

In this paper we analyse the efficiency, precision, and accuracy of computing elastic nucleon-nucleon (NN) scattering amplitudes with the wave-packet continuum discretisation method (WPCD). This method provides approximate scattering solutions at multiple scattering energies simultaneously. We therefore utilise a graphics processing unit (GPU) to explore the benefits of this inherent parallelism. From a theoretical perspective, the WPCD method promises a speedup compared to a standard matrix-inversion method. We use the chiral NNLO$_{\rm opt}$ interaction to demonstrate that WPCD enables efficient computation of NN scattering amplitudes provided one can tolerate an averaged method error of $~1-5$ mb in the total cross section at scattering energies $0-350$ MeV in the laboratory frame of reference. Considering only scattering energies $\sim40-350$ MeV, we find a smaller method error of $\lesssim 1-2$ mb. By increasing the number of wave-packets we can further reduce the overall method error. However, the parallel leverage of the WPCD method will be offset by the increased size of the resulting discretisation mesh. In practice, a GPU-implementation is mainly advantageous for matrices that fit in the fast on-chip shared memory. We find that WPCD is a promising method for computationally efficient, statistical analyses of nuclear interactions from effective field theory, where we can utilise Bayesian inference methods to incorporate relevant uncertainties.


Introduction
A large portion of interaction-potential models currently applied in ab initio manynucleon calculations are constructed using ideas from chiral effective field theory (χEFT) [1,2,3,4]. Such potentials typically contain 10-30 low-energy constants (LECs), acting as physical calibration parameters, that must be inferred from data. Bayesian methods for parameter estimation offer several advantages in this regard, in particular in conjunction with EFTs, see e.g. Ref. [5]. However, making reliable inferences typically incur a high total computational cost due to the large amount of posterior samples. It is therefore important to establish an efficient computational framework for generating model predictions of physical observables.
In this paper we look to low-energy nucleon-nucleon (NN) scattering cross sections as it constitutes the bulk part of the standard dataset for inferring the most probable values of the LECs, see e.g. Refs. [6,7,8,9]. The most recent, and statistically consistent, database [10] of NN scattering cross sections contains data for thousands of measured proton-proton (pp) and neutron-proton (np) cross sections at hundreds of different laboratory scattering energies, mostly below the pion-production threshold at T lab ≈ 290 MeV. In most cases, the computational bottleneck when predicting NN scattering amplitudes for a given interaction potential comes from obtaining numerical solutions to the Lippmann-Schwinger (LS) equation.
There are essentially three approaches for improving computational efficiency and speed of a computational procedure: i) develop improved numerical methods and algorithms tailored to the physical model at hand and its application, ii) use specific hardware, e.g. a faster CPU, increased memory bandwidth, or parallel architectures such as in a graphics processing unit (GPU) to better handle some of the most dominant computational procedures of the model, or iii) replace any computationally expensive model evaluations with a fast, as well as sufficiently accurate and precise, surrogate model, i.e. an emulator, which mimics the original model output. This latter approach is very interesting and in particular eigenvector continuation (EC) [11,12] applied to emulate NN-scattering amplitudes [13,14] shows great promise, although uncertainty quantification is yet to be explored. Note, however, that EC emulation attains most of its speedup when applied to potential models that exhibit linear parameter dependencies. In cases where such dependencies are not present one might resort to other methods to handle the non-linear response, such as EC combined with Gaussian process emulation [15]. However, one should note that Gaussian process emulation [16] and other standard machine learning methods exhibit poor scaling with increasing dimensionality of the input parameter domain.
Here, we will focus on approaches i) and ii) by exploring both the standard matrix-inversion (MI) method, see e.g. Ref. [17], and the wave-packet continuum discretisation (WPCD) method [18] for solving the LS equation. The WPCD method basically corresponds to a bound-state approach that uses eigenfunctions of the full NN Hamiltonian to approximate scattering solutions at any on-shell energy. This method is particularly interesting since it provides approximate scattering amplitudes at multiple scattering energies simultaneously. We have therefore implemented this inherently parallel method on a GPU. We also note that WPCD places no constraint on the analytical form of the potential or its parametric dependence. As such, WPCD acceleration for NN scattering complements the EC approaches for emulation [13,14,15] mentioned above.
Speeding up computations very often comes at the expense of accuracy and/or precision, and the WPCD method is no exception to this principle. The magnitude of experimental errors in the calibration data and the estimated theoretical modeldiscrepancies [19] provide natural tolerances for the level of method error that is acceptable. Indeed it is undesirable to have a method error that dominates the error budget such that it obscures or even hampers the inference of useful information. We therefore analyse the WPCD method in detail and quantify realistic computational speedups and compare the method errors to recent estimates of the model discrepancy in χEFT [20]. We also analyse and compare the numerical complexities of the MI and WPCD methods.

Nucleon-Nucleon Scattering
The LS equation, in operator form, for the transition-matrix operatorT at some scattering energy E is given byT This is an inhomogeneous Fredholm integral equation wherev is some NN-potential operator,ĝ 0 (E) = (E −ĥ 0 + i ) −1 is the free Green's function, andĥ 0 is the free Hamiltonian, i.e. the kinetic energy operator, and i → 0 is the positive imaginary part of the complex energy. Most realistic NN potentials in nuclear ab initio calculations do not furnish analytical solutions for the T -matrix. It is however straightforward to numerically solve for the T -matrix for an on-shell energy E. There exists e.g. variational methods [21,22,23,24], as well as Neumann or Born series expansions for sufficiently weak potentials. One can also try Padé extrapolants [25] in cases where the integral kernel is not sufficiently perturbative to converge the resulting Neumann series. In this work we employ the standard MI method [17] which amounts to inverting a relatively small complex-valued matrix at each scattering energy. This is a trivial operation that can be straightforwardly carried out within milliseconds on a modern CPU. However, solving at multiple scattering energies to obtain all cross-sections present in the NN database amounts to at least a few seconds of computation. In a Bayesian analysis, where one repeatedly evaluates a likelihood function across a multi-dimensional parameter domain, any speedup in the solution of the LS equation will directly impact the total computation time.
The momentum-space partial-wave representation of the LS equation for the NN T -matrix, Eq. (1), is given by where q, q , and k are relative momenta, E = p 2 /m N is the centre-of-mass (c.m.) energy with momentum p, m N is the nucleon mass, and g 0 (k; In this work we only consider the canonical NN interaction potential, and therefore use the shorthand notation T sJ l l (q , q) ≡ q , l , s, J|T |q, l, s, J for partial-wave amplitudes. Furthermore we use the following normalisation of momentum states, k |k = δ(k −k) k k . We suppress isospin notation since the total isospin T is defined uniquely by the Pauli principle given the spin and angular momentum quantum numbers s and l, respectively, while the projected isospin T z is defined at the outset of, and conserved throughout, a scattering observable calculation. Methods for obtaining the partial-wave-projected potential v sJ l l (q , q) can be found in e.g. Ref. [26]. Integrated and differential scattering cross sections at a specific scattering energy E can be straightforwardly evaluated given the partial-wave T -matrix using the expressions presented in Appendix A.
Throughout this work we will compare the efficiency and accuracy of the WPCD method, presented in Sec. 3.2, to a set of numerically exact results obtained using the MI method presented below.

Matrix-inversion method for solving the Lippmann-Schwinger equation
For the resolventĝ 0 (E), the kernel in the LS equation (2) has a pole singularity at k = p. This can be handled via a principal-value decomposition ‡, i.e.
such that the remaining integral can be evaluated using e.g. Gauss-Legendre quadrature on some grid of momenta k i = p with corresponding weights w i . Following [17], the complex T -matrix in Eq. (2) can be solved via the inversion of a finite-dimensional matrix equation for the on-shell momentum We can introduce a basis q i ∈ {k 1 , k 2 , . . . , k N Q , p} such that the operators can be written in matrix form with row i and column j corresponding to q i and q j respectively, i.e.
(V sJ l l ) ij ≡ q i |v sJ l l |q j . The on-shell T -matrix element is then given by (T sJ l l ) N Q +1,N Q +1 . Introducing a vector D with elements defined as Eq. (4) can be rewritten as a linear system of equations, where we have introduced the wave matrix F sJ l l , Direct inversion of F sJ l l in Eq. (6) is usually discouraged in scientific computing due to the instability of matrix inversion algorithms [27]. A more advisable practice is to use LU-decomposition. Additionally, it is numerically more stable to first introduce the K-matrix § as the principal value part of the LS equation. Since the potential v is real, the K-matrix is also real. This leads to a set of purely real matrix-equations [17] based on a nonsingular integral.
The on-shell momentum dependence in F sJ l l demands the solution of an entire linear system of equations for every energy of interest. Formally, the T -matrix is defined by the potential operator via [28] T sJ l l (q , q) ≡ q |v sJ l l |ψ + q , where |ψ + q are eigenstates (outbound scattering states) of the full Hamiltonianĥ with momentum q. With this, we can instead express the LS equation using the full resolvent where v sJ l l (q , ψ + k ) = q |v sJ l l |ψ + k and g(k; E) = (E − k 2 /m N + i ) −1 . This is significantly easier to evaluate numerically using quadrature since it only amounts to matrix multiplications. However, the scattering states |ψ + q are not available at the outset. This is where the WPCD method enters to effectively approximate the scattering states using square-integrable eigenstates of the NN Hamiltonian.

Wave-packet continuum discretisation
The WPCD method [18] effectively eliminates the requirement to explicitly solve the LS equation at several scattering energies E using matrix inversion. Instead, one diagonalises the full Hamiltonian in a finite basis, and uses the resulting discrete set of eigenstates to approximate all scattering states of interest. Equipped with these states, this approach enables straightforward evaluation of the full resolvent at any value of the on-shell scattering energy E, which makes the WPCD method intrinsically parallel with respect to obtaining scattering solutions at different energies. This is one of several known bound-state techniques to solve the multi-particle scattering problem [29]. To provide a self-contained presentation, we devote this section to introduce the WPCD method for describing elastic NN scattering, starting with a definition of a finite wavepacket basis.

Scattering observables in a finite basis
Generally we can project some Hamiltonian state |Ψ(E) with positive energy E onto a complete basis (including both bound and free basis states). In this case, the expectation value of an operatorÔ(ĥ) depending purely on the full Hamiltonianĥ can be represented in the following form, where {|ψ b i } n b i=1 are bound states with energies b i and |ψ(E ) are free states with energy E , both of which are eigenstates ofĥ, and we have defined u( b i ) = ψ b i |Ô|ψ b i and u(E ) = ψ(E )|Ô|ψ(E ) . Naturally, we have |Ψ(E) = |ψ(E) (for E > 0) asĥ is the Hamiltonian and Ψ the eigenstate, collapsing the expansion above. However, this identity is not useful in numerical approaches since we do not know ψ(E), leaving us to solve the integral in a finite, approximative basis.
A computational routine for evaluating integrals, such as quadrature, uses a finite mesh of points where the integrand is evaluated. In scattering, this approach requires a finite basis of states | ψ i with corresponding positive energies E i . These states do not span the whole continuous momentum-space and are thus referred to as pseudostates. Below, we will demonstrate how to construct the pseudostates in a wave-packet basis. The pseudostates form a basis for a finite quadrature-prescription to express an operator where we introduce quadrature weights w i such that To determine the weights w i we can use an equivalent quadrature (EQ) technique [30,31,32,33,34] in which it can be shown that the weights do not depend on the state |Ψ . The weights represent a type of transformation coefficient between pseudostates | ψ i and the states |ψ(E ) . An approximate relation based on Eq. (12) can then be introduced, In the WPCD method we discretise the continuum of free states. A wave packet is defined as the energy integral over some energy "bin" of width ∆E with Hamiltonian eigenstates |ψ(E) with positive energy E as the integrand, It follows that lim ∆E→0 |ψ(E, ∆E) = |ψ(E) . We can define an orthogonal wave-packet basis by letting the bin boundaries E and widths ∆E lie on a mesh such that the bins do not overlap: It is straightforward to normalise this basis and show that the wave packets have eigenenergies e i = E i + 1 2 ∆E i . The expectation value of an operatorÔ in some energy bin can now be approximately represented in a wave-packet basis where E ∈ D i . As expected, the quality of this approximation is subject to the bin widths ∆E i . It can be reasoned [35,36,37,38], on behalf of Eqs. (13) and (15), that the EQ weights are approximately given by In short, we have a method for approximating the EQ weights in a wave-packet basis, with which we can express the spectrum of a scattering operator and thereby effectively solve scattering problems. We now proceed to approximately represent the pseudostates as Hamiltonian eigenstates in a wave-packet representation. To this end, we setup a wave-packet equivalent of a partial wave by generalising Eq. (14) and thus define a normalised free wave-packet (FWP) as where f (k) is a weighting function and N i is a normalisation constant. We obtain energy or momentum wave-packets using weighting functions f (k) = 1 or f (k) = k µ , respectively, where µ = m N 2 is the reduced mass. As above, these two types of wave packets have eigenvaluesĥ wherep is the momentum operator. The normalisation of the momentum and energy wave-packets is given by the bin widths, i.e. N i = ∆k i or N i = ∆E i , respectively. Operators represented in a basis of partial-wave plane-wave states are related to the FWP representation via where q ∈ D i , q ∈ D j . In this work we have implemented energy as well as momentum wave-packets, and have found no significant difference or advantage of either choice. The eigenstate wave-packets |z i of the full Hamiltonian with positive energy, or scattering wave packets (SWPs), are given by the eigenvalue This operation is the most time-consuming part in the WPCD method to solve the LS equation. It provides us with a matrix of transformation coefficients C ij ≡ x i |z j such that where D is a diagonal matrix of energies b i and i , and H ij = x i |ĥ|x j . In order to solve the LS equation on the form of Eq. (9), we must define a wave-packet representation of the outbound scattering states. To define SWPs, we must construct the bin boundaries of |z i such that the positive eigenenergies are given by This is no different than the wave-packet eigenvalues in Eqs. (18) and (19). This shift of the FWP energies are indicated in the left panel in Fig. 1. However, it is not possible to construct the bin-boundaries E i exactly since Eq. (22) only provides n equations while there are n + 1 boundaries. Therefore, the following scheme [39] can be used to approximate the bin boundaries of |z i , Here we change bound-state notation from Eq. (11) such that |ψ b k → |z b k . This seems natural since |z b k is expressed in terms of FWPs from the diagonalisation.
Right: Shift and splitting of degenerate FWP energies e i into energies i,1 and i,2 shown by solid and dashed lines, respectively. Note that a solid and a dashed line from two different energies e i and e j will not cross [18], [40].
such that they yield approximate eigenvalues¯ i , In the case of K coupled channels, the FWP energies are degenerate and will be split ¶ in the full Hamiltonian eigendecomposition. A FWP with energy e i = E i + 1 2 ∆E i will give rise to K SWPs |z i,κ with energies i,κ , corresponding to each coupled state κ = 1, 2, . . . , K. The energies are typically ordered such that i,κ < i,κ+1 ∀ i, κ [41]. Furthermore, this prevents mixing between levels with different energies [18], i.e. i,κ < j,κ+1 ∀ i < j. Therefore, we use the boundary construction scheme in Eq. (23) such that |z i,κ have boundaries given by the energies i,κ . Note it is very important to construct the Hamiltonian matrix with degenerate FWP bases representing each coupled state, i.e. |x i,1 = |x i,2 , see the right panel of Fig. 1.
The approach presented so far works fine for short-range NN potentials. Likewise, the eigenspectrum of the long-range Coulomb Hamiltonian can be straightforwardly described using wave packets, but these must then be constructed from Coulomb wave functions instead of free states |q [40] as in Eq. (17). In WPCD, we automatically "smooth" out the typical low-momentum singularities presented by the Coulomb Hamiltonian. This means that the formalism of WPCD works well for both the shortand long-range parts of the interaction, but it is necessary to treat them separately. In ¶ Naturally, K ≤ 2 for NN scattering. this work we have not studied Coulomb wave packets as we only consider neutron-proton scattering.

WPCD-method for solving the Lippmann-Schwinger equation
Following Eq. (20), we can relate elements of the T -matrix in a continuous partial-wave basis and a FWP basis via If we use the full resolventĝ(E), defined in connection with Eq. (9), we can writê such that in a partial-wave-projected wave-packet basis we obtain Note that for a realistic potential we should only have n b = 1 (the deuteron). The full resolvent is given by the full Hamiltonian, of which |z k are eigenstates. In such a basis we can derive a closed form expression forĝ(E), see Appendix B. In the WPCD representation of Eq. (27) all energy-dependence is straightforwardly evaluated via the resolvent. We can therefore find an on-shell T -matrix element via simple summation of the scattering wave-packets. This is an important advantage of using the WPCD method for simulating scattering processes. Note, however, that the resolvent has a logarithmic singularity for E = E k or E = E k+1 . We handle this by averaging with respect to the on-shell energy E, The nuclear potential, represented in a free wave-packet basis as usually vary mildly across a typical momentum-bin D i . It is therefore often sufficient to use a midpoint approximation to evaluate the integral. This offers a significant reduction in computational cost. In a momentum wave-packet basis, the midpoint approximation is simply given by wherek are the bin midpoints.
Here we summarise the necessary steps to implement the WPCD method for NNscattering calculations.
(i) Distribute the bin boundaries for the free wave-packets and choose a weighting function f (k), see Eq. (17). Recommended options are e.g. uniform (equidistant), Gauss-Legendre, or Chebyshev distributions. The results in this work are based on a Chebyshev distribution. The Chebyshev distribution for N WP points, {y j } N WP j=1 , is defined by [18] where t is a "sparseness degree" and α is a scaling parameter. Let y k be either the momentum or energy bin boundaries, and we then set the initial boundary y 0 = 0. For our simulations we have used momentum wave packets (f (k) = 1), t = 2, and α = 100 MeV.
(ii) Diagonalise the Hamiltonian in a free wave-packet basis. This yields a set of energy eigenvalues i and accompanying matrix of eigenvectors C as according to Eq. (21).
(iii) Construct bin-boundaries E i for the pseudostate wave-packets using the set of energy eigenvalues and Eq. (23).
(iv) Express the full resolvent g(E) in the scattering wave-packet basis, see Appendix B. In this work we use the energy-averaged resolvent from Eq. (28) to avoid singularities.
(v) Obtain approximate T -matrix elements in the free wave-packet basis by evaluating the LS-equation where The T -matrix can be transformed to a continuous plane-wave basis using Eq. (20). However, it is important to note that with energy averaging there is ambiguity in which value q , q ∈ D i to choose, as the wave-packet representation gives the same value for all choices. While we discuss this further in Section 5.1, we find it more beneficial to choose q , q such that they match the wave-packet eigenvalues, given in Eqs. (18)- (19).

Numerical complexities of the MI and WPCD methods
Here we present the minimum number of floating-point operations (FLOP) required for computing on-shell q|T sJ l l |q matrix elements at n E different values of the scattering energy. We focus on the MI and WPCD methods, presented in Secs. 2.1 and 3.2, respectively. We also assume that the potential matrix p |V sJ l l |p is pre-computed and available in memory at the outset. Typically, in both methods we use matrices of sizes n × n for n < 100. To avoid confusion, we let N Q denote the number of quadrature points in the case of the MI method, while N WP denotes the number of wave packets in the case of the WPCD method. We retain n to symbolise a basis size in general, regardless of method.

MI complexity
Given a quadrature grid with N Q points, the MI method first requires setting up the wave matrix (7) at each scattering energy. Naively, the complexity of the matrix construction is dominated by the matrix-matrix product of the potential V with the resolvent g 0 . Since the resolvent is diagonal, it is more efficient to multiply each row of V with the diagonal element of g 0 , yielding n + 1 scalar-vector multiplications in a single F -matrix construction with numerical complexity according to where the factor of 2 is due to g 0 being complex.
The on-shell T -matrix element is obtained from the last element of the T -matrix, i.e.
where we have expressed it using matrix inversion. Matrix inversion typically requires O(n 3 ) operations for an n × n matrix. However, solving the LS linear system, is more advisable, and can be done very efficiently in two steps: Firstly, we perform a lower-upper (LU) decomposition, where a square matrix A is expressed as the product of a lower-triangular matrix L with an upper-triangular matrix U , which requires O 2 3 n 3 operations. The decomposition allows for AX = B to be solved for each column of X using O(2n 2 ) operations. For complex matrices these numbers increase by a factor 4 for both routines.
In summary, the MI-method has a total computational complexity of (note the linear scaling with the number of on-shell energies n E ) where O MI (K) shows the cost of using a K-matrix approach instead. The difference is simply replacing the cost of doing complex calculations with real calculations. We see there is roughly a factor 4 speedup in using a K-matrix representation instead of a T -matrix representation.

WPCD complexity
The complexity of the WPCD-method is dominated by three kinds of linear algebra tasks i) the Hamiltonian matrix diagonalisation, ii) two matrix-matrix products, and iii) one matrix addition (see steps 1-5 in Sec. 3.2). We let N WP be the number of wave packets and n E the number of scattering energies as before. By "parallelism" we refer to the fact that WPCD can solve the scattering problem for several energies at once, following a single Hamiltonian diagonalisation.
For this reason we have fully implemented the WPCD method on a GPU utilising the CUDA interface with the cuBLAS [42] and cuSOLVER [43] libraries for linear operations. A more detailed outline of the GPU code is provided in Appendix C. Sequential algorithms are usually easier to handle, and not all hardware allow for optimal parallelisation due to, for example, slow computer memory transfer. Thus, we present complexity models for both the sequential and parallel energy-evaluations of the WPCD method. Note, however, that this is simply a comparison of FLOP models that do not account for memory transfer times or detailed processor architecture. We emphasise that all WPCD results presented in this paper were calculated using our parallel GPU-code, and that the sequential complexity is presented purely to provide further insight.
The MI method can also be implemented efficiently on a GPU. This approach will not exhibit the same inherent parallelism with respect to multiple on-shell calculations. However, there does exist efficient parallel routines for solving linear systems on the form of Eq. (6).

Sequential complexity
The sequential mode for the WPCD method means solving the LS equation for each on-shell energy in sequence rather than simultaneously. This suggests using a CPU, rather than a GPU, since CPUs typically have a faster clock frequency and can thus iterate through the on-shell energies faster than a GPU. Of course, CPUs today are multicore and with several processing units, but note that they might not have a number of cores equal to or greater than n E . The analysis below assumes a scenario where a processing unit is working with a single computational core.
(i) A real-valued NN Hamiltonian is efficiently diagonalised using QR factorisation or a divide-and-conquer algorithm, both of which are usually used together with Householder transformations. The divide-and-conquer algorithm has a complexity of O 8 3 n 3 for getting both eigenvectors and eigenvalues of an n × n matrix. (ii) Next, we evaluate the matrix-matrix product of the potential matrix V and the coefficient matrix C in the rightmost term in Eq. (32). Square matrix-matrix multiplication requires 2n 3 − n 2 FLOP in a straightforward, sequential approach.
(iii) The LS equation for T ii (E ∈ D i ) is evaluated via simple summation and multiplication. The sum, involves two multiplications per term, and the sum will require N WP − 1 additions. There is also the addition of the first term in Eq. (32). These operations must be done for every on-shell energy, resulting in a complexity of 6N WP × n E FLOP. Note that if we use energy averaging we only require a single evaluation for each on-shell wave packet, giving an upper limit n E ≤ N WP .
The WPCD sequential complexity is then given by To summarise, the overall complexity of the MI method scales linearly with the number of scattering energies n E , see Eq. (37). For the WPCD method we get all the on-shell energy scattering-solutions from a single Hamiltonian diagonalisation, giving a very "cheap" scaling with n E .

Parallel complexity
The parallel WPCD approach is based on simultaneous solutions of the LS equation for all on-shell energies. Contrary to the sequential approach, we assume a sufficiently large set of processors (like in a GPU) to handle all on-shell energies at once. This will remove the factor n E in the complexity model (40) such that the cubically-scaling term will clearly become the dominating one. It can be combatted by making use of hardware-specific, massively parallel algorithms for matrix diagonalisation and matrix-matrix multiplications. Here we present the parallel approach to the same steps as above, and in the same ordering. We assume we have p compute processors, or threads, available.
(i) The parallel cyclic-order Jacobi method is a parallel approach to the Jacobi eigenvalue algorithm: a method based on finding a similarity transformation of a matrix to its diagonal form by repeated Jacobi rotations (see e.g. Ref. [44]). This method has held major appeal for parallel computing due to its inherent parallelism compared to other eigenvalue-finding routines. However, parallel optimisation of the basic algorithm depends greatly on how a set of processors is organised with regards to communication and memory. Therefore, there is no general complexity model for the method-any algorithm should be written with a specific hardware in mind. One approach can be found in Ref. [45] with a demonstrated complexity O n 3 p log n for convergence for a symmetric and real n × n matrix. (ii) Parallel matrix-matrix multiplication algorithms is an ongoing field of research (see e.g. Ref [46]). While there exist very efficient methods, such as Cannon's algorithm [47], they depend highly on hardware and matrix characteristics (sparsity, symmetry, etc). The divide-and-conquer approach is a very general and straightforward way to parallelise matrix multiplications. In essence, we divide the matrices into M = n 2 p submatrices of sizes m × m = p. If combined with on-chip shared memory between the processors, this permits all processors to simultaneously work on calculating the product while minimising processor-toprocessor communication time. The method is explained excellently in literature such as [44]. This simple method has a theoretical minimum complexity of O (n × M ) when performed in parallel + .
(iii) There is limited parallel optimisation to be gained in Eq. (39). We can multiply each term of the summation in parallel, and do the whole summation sequentially for the sake of simplicity. The result is a complexity of O(2(N WP + 2)) FLOP. Note, however, that the inherent parallelism of the WPCD method allows us to do the summation simultaneously for all E and thereby effectively removing the factor n E seen so far in the complexity models.
In conclusion, assuming p > N WP and adequately large shared memory (see point 2 above), a somewhat optimal and parallel WPCD complexity model is given by We see from Fig. 2 that the efficiency of the parallel WPCD approach scales very well with the number of available processors. We also see that for a single processor (p = 1) the parallel and sequential approaches are roughly equal (while not taking into account any overhead in memory transfers), while for a realistic value p = 1024 the complexity model for the parallel approach demonstrates a clear advantage. The value p = 1024 corresponds to the current limit on threads with shared memory for typical Nvidia GPUs.

Continuum-discretised neutron-proton scattering computations
In this section we present a detailed analysis of the precision and accuracy of the WPCD method for computing neutron-proton (np) scattering observables and phase shifts. For all calculations, we employed the optimised next-to-next-to-leading-order chiral potential N2LO opt [48]. The primary goal is to analyse the trade-off between minimum computational cost and maximum method accuracy in the WPCD method, and contrast this to the conventional MI method. Note that there is no problem in obtaining highly accurate results from either method. We simply focus on the performance and accuracy + This is not the full picture. Often, a set of processors is divided in groups of processors with limited shared memory that cannot fit all p elements from each matrix in the matrix-matrix product, and one has to define more submatrices than given by M . This introduces yet another complication to the parallel complexity model that has to do with shared memory size, which we will not account for here.  of the WPCD method as we reduce the number of wave-packets such that all objects fit in the fast on-chip shared memory on the GPU. In our analysis we consider the MI method with N Q = 96 Gauss-Legendre points, see Eq. (4), to yield virtually exact results. For brevity, we will refer to such numerically converged calculations as exact.

Computing phase shifts
In Fig. 3 we compare np scattering phase shifts in the 1 S 0 , 3 P 0 , and 3 S 1 partial waves as well as the 1 mixing angle, as obtained from the WPCD method using N WP =32 and N WP =64 momentum wave-packets. The free wave-packet bin boundaries follow a Chebyshev distribution with scaling factor α = 100 MeV and sparseness degree t = 2, see Eq. (31), and the resolvent was energy-averaged according to Eq. (B.7).
For the WPCD-result, one immediately observes a discretised, step-like version of the otherwise smoothly varying exact description of the scattering phase shifts. This is entirely due to the momentum discretisation and finite number of wave packets. Indeed,  due to the energy-averaging we obtain only one T -matrix element per momentum bin. In the limit N WP → ∞, and everything else equal, we will recover the exact result obtained with the MI method. We observe this convergence trend already when going from N WP =32 to N WP =64 wave packets. For instance, the mixing angle moves closer to the exact results when increasing N WP . Overall, the WPCD method is performing as expected, but there are two features we would like to point out: (i) The values of the low-energy 1 S 0 phase shift are overestimated near the peak where the phase shift turns over. This is likely due to the momentum-averaging of operators. The potential matrix in the 1 S 0 channel ( q |V1 S 0 |q ) is shown in Fig. 4 for both a continuous momentum-basis and a N WP = 32 wave-packet basis with bins distributed according to Eq. (31). The potential is constant within each wave-packet bin as expected according to Eq. (29). This makes it challenging to reproduce finer details of the interaction. There is also a discrepancy between the continuous and wave-packet values when the chosen q -momentum is not near any bra-state bin midpoint. This is most distinctive in the green curves at q = 205.622 MeV, which is very close to a bin boundary. We see in Eq. (29) that we average  over momenta within two bins, and when a potential varies strongly within a bin such that the matrix elements at the bin boundaries are quite different from the bin mid-point values, this averaging is too coarse to mimic the potential accurately. This effect becomes less significant with increasing N WP since the grid will grow denser.
(ii) The WPCD-results for 3 S 1 show a distinctive drop at 90 degrees. This trend is consistent for all basis sizes and is simply due to the treatment of the deuteron bound state in the WPCD framework. We extract a phase shift δ via an inverse trigonometric function with values for δ ∈ [−90, 90] degrees. A bound state is characterised by a transition δ(E + ) − δ(E) = 180 degrees at some energy E for an infinitesimal step > 0, as dictated by Levinson's theorem [49]. It is apparent that this transition across 90 degrees is more difficult to reproduce for the WPCD method. The same effect also gives rise to inaccuracies in the 1 mixing angle. However, this is of no concern when computing a scattering observable.

Linearly interpolating phase shifts
Although the resolution of the WPCD method is limited by the energy-averaging across each momentum-bin-i.e., yields a step-like character of the predictions-the mid-pointsq i = 1 2 (q i−1 + q i ) of each bin in Fig. 3 are typically closest to the exact results, as expected from the wave-packet eigenvalues, see Eqs. (18) and (19). We can therefore linearly interpolate the phase shifts δ i ≡ δ(q i ) across several bins, i.e. across scattering energies, via for c.m. momenta q ∈ [q i−1 , q i ] where q i are the FWP bin boundaries. This simple and straightforward approach offers a rather precise prediction for any on-shell scattering energy. Of course, the phase shifts can be linearly interpolated using other points, i.e. we can letq for some n ∈ [0, m], such that n = m 2 is the bin mid-point again. Figure 5 shows the resulting predictions using linear (mid-point) interpolation of the 1 S 0 phases presented in Fig. 3. The bands estimate the effect of varying the interpolation point. The bands span the resulting δ(q) calculated with m = 10 and n = [0.1, 1, 2, . . . 8, 9, 9.9] in Eq. (43). As expected, mid-point interpolation yields results that are very close to the exact calculation. In the following we will therefore only use this interpolation choice.

Computing cross sections
We compute scattering cross sections from scattering amplitudes according to the method outlined in Appendix A. Figure 6 shows the total cross section obtained using the WPCD-method using momentum-space bins with N WP = 32 and N WP = 64. As can be seen in the figure, the linearly-interpolated results reproduce the exact result rather well. On a larger scale it is nearly impossible to tell any difference between results obtained using N WP = 32 bins and N WP = 64 bins. To emphasise the monotonically increasing accuracy of the WPCD method as we increase the number of momentumspace bins, we calculate the absolute value of the difference between the exact and the WPCD prediction for the total cross section for a range of bin resolutions, see Fig. 7. From this it is apparent that the WPCD method converges, although slowly, for a bin partition following a Chebyshev distribution.

WPCD accuracy
There are primarily two approximations that impact the accuracy of the WPCD method: the number of bins N WP and the distribution thereof. Also, the averaging of continuous   Figure 7. Absolute values of the relative difference between exact results and WPCD method for the total cross section, shown as a function of wave-packet basis size N WP and laboratory kinetic energy.
states into wave packets means we use momentum-averaged matrix representations of operators in the LS equation. This averaging should improve with reduced bin widths. We can easily control N WP , and in this section we analyse the performance of WPCD with respect to different choices of this method parameter.
To better quantify the convergence of the WPCD method we study the root-meansquare error (RMSE) with respect to the exact result for the total cross section across a range of scattering energies. We use the standard RMSE measure where σ exact,i denotes the exact results and σ method,i denotes either the WPCD-or MIcalculated total cross sections at some scattering energy E i for i = 1, . . . , n E .
We find that the RMSE for WPCD with scattering energies corresponding to laboratory kinetic energies 40 ≤ T lab ≤ 350 MeV remains fairly constant at ∼ 2.0 mb when using N WP = 16. This is interesting for three reasons: • The WPCD coupled channel Hamiltonian will be of size 2N WP × 2N WP , i.e. we diagonalise 32 × 32 Hamiltonian matrices. Most GPUs today, including the Nvidia Tesla V100 we have used here, have 64 kB shared memory (also called "on-chip" memory) which is significantly faster to access than the GPU's main memory. These matrices fit entirely on the GPU shared memory, allowing for a strong reduction in GPU memory read/write demand while performing the diagonalisation.
• A recent Bayesian analysis of chiral interactions suggests that the χEFT truncation error for scattering cross sections at next-to-next-to-leading-order is at least 2 mb, at 68% degree-of-belief (DoB), and at least 5 mb at 95% DoB [20].
• Regarding experimental uncertainties, the combined statistical and systematic uncertainties in measured np total cross sections are at the 1% level [50,51].
In absolute terms, this amounts to uncertainties of the order 1 mb at laboratory scattering energies 40 MeV. At lower energies the cross section increases of course, leading to slightly larger absolute experimental errors.
In summary, we find that the WPCD method error, when using very few wavepackets, is slightly larger than typical experimental uncertainties but smaller than the estimated model discrepancy up to next-to-next-to-leading-order in χEFT.
One can clearly reduce the WPCD method error by increasing N WP . However, this will also increase the computational cost and we recommend studying the actual wall time cost in relation to, e.g. the RMSE measure defined above. We perform such an analysis in Sec. 6.
Before ending this section on the method accuracy, we would like to emphasise that the maximum total angular momentum J max in the partial-wave expansion of the potential also impacts the accuracy of the description of the scattering amplitude. For the linearly-interpolated WPCD method with N WP = 16, where we observe a method error of ∼ 2 mb in the total cross section at T Lab 40 MeV, and find it unnecessary to go beyond J max = 6.

Method Performance
With an account of the precision and accuracy of the WPCD method we now profile its time performance. Typically, when using the MI method, we calculate every on-shell phase shift of interest by explicitly solving the LS equation, while in the WPCD approach we interpolate in-between bin mid-points, as discussed above. This will, unsurprisingly, induce a substantial time-performance penalty in using the MI method, due to the linear scaling with the number of energy solutions, as shown in Eq. (37). However, we can of course linearly interpolate the phase shifts calculated with the MI method, rather than invoking an explicit calculation at every on-shell energy. Therefore, to facilitate a balanced comparison between the two methods, we also employed linear interpolation to extract phase shifts when using the MI method. In our studies, this has turned out to be a highly efficient way to speed up the calculation with the MI method while maintaining precision and accuracy of the results.
To facilitate a comparison, when using the MI method we solve the LS equation at n E on-shell energies also following a Chebyshev distribution. We then linearly interpolate the phase shifts using these energies to calculate neutron-proton total cross sections in the T lab energy ranges (0, 350] MeV and (40,350] MeV for the calculation of RMSE values. For WPCD we can only vary the number of wave packets N WP while for MI we can vary both N Q and n E , i.e. the number quadrature points and the number of interpolation points or on-shell energies (at bin mid-points), respectively. Figure 8 shows the wall times for solving the LS equation to obtain cross sections, at different levels of method accuracy measured by the RMSE value. In Tab. 1 we show a few interpretations of the figure for a handful of relevant method parameters. As mentioned, the WPCD method is implemented on a GPU while the MI results were obtained using an optimised CPU implementation [7]. The time profiles were obtained using an Nvidia Tesla V100 32 GB SMX2 and an Intel Xeon Gold 6130 for the WPCD GPU and MI CPU results, respectively.
From our analysis we conclude that the WPCD method is faster than the MI method if one can tolerate ∼ 1 − 5 mb method RMSE in the prediction of total scattering cross sections. For such applications it would then be advisable to use N WP 48 bins, on the basis of Fig. 8. It is worth noting that the RMSE is dominated by contributions from scattering cross sections below laboratory kinetic energies T lab ∼ 40 MeV. Indeed, with N WP = 48 we obtain an RMSE value of ∼ 4 mb across an interval T lab ∈ (0, 350] MeV. The RMSE drops to ∼ 0.8 mb when considering only cross sections with T lab > 40 MeV. Irrespective of how the WP bins are distributed, it is therefore necessary to ensure a sufficiently high density of bins below T lab = 40 MeV to accurately reproduce the 1 S 0 phase-shift peak and the 3 S 1 bound state. With the WPCD method we can obtain increasingly faster solutions to the LS equation as we reduce the number of wave-packets.

Summary and outlook
In this study we have compared the WPCD method with the standard MI method, with an emphasis on their respective efficiency, precision, and accuracy when solving NN scattering problems. This study is done with an interest in reducing the computational costs of Bayesian inference analyses of nuclear interaction models.
We find that the WPCD method, with GPU-acceleration, is capable of providing scattering solutions to the NN LS equation much faster than a conventional MI method implemented on a CPU. This is largely due to the WPCD method being capable of delivering scattering amplitudes at several on-shell energies in an inherently parallel fashion, and that linear interpolation across several WP bins offers straightforward predictions for any scattering energy. However, compared to solutions obtained using the standard MI method, the WPCD method is less accurate, in particular in regions where the scattering amplitudes vary strongly with energy. This is also expected due to the momentum-space discretisation and approximation of the scattering states. Nevertheless, in applications where a certain method error can be tolerated-as for example in low-order EFT predictions of NN scattering cross sections-the GPUimplemented WPCD method presents a computational advantage. This finding makes it particularly promising for computational statistics analyses of EFT nuclear interactions utilising Bayesian inference methods.
We also find that the computational gain of using GPU hardware is limited by the amount of shared memory that is available. This constraint basically corresponds to an upper limit on the number of WP bins that can be used for efficient computations. We note, however, that GPU hardware is continuously improved, and that a fourfold increase in the size of the fast shared memory on the GPU would enable a twofold increase of both the WP-basis and method accuracy while incurring a very mild additional computational cost. The GPU global memory is typically not saturated in WPCD calculations of NN scattering calculations. The WPCD method can also be applied to compute three-nucleon scattering amplitudes [37,52]. However, the limited amount of global memory might become a constraining factor for such applications that involve a more complicated Hilbert space basis.
In this study we have focused on the inherent parallelism of the WPCD method and the opportunities that it offers to benefit from the use of GPU hardware. Of course, one could also explore parallelisation of the WPCD method on a CPU and make parallel use of the fast cores available in modern CPUs. The efficiency of the GPU WPCD approach is almost fully determined by the time needed to diagonalize the channel Hamiltonian and could therefore benefit from the faster CPU clock speed. Naturally, one could also consider a GPU implementation of the MI method wherein the LS equation is solved in a parallel fashion for multiple scattering energies simultaneously. Still, this would require multiple matrix inversions (one for each interpolation energy). A simple phase-shift interpolation applied to the MI method facilitates sufficiently accurate results for NN scattering observables, at a significantly lower computational cost compared to explicitly solving the LS equation at every scattering energy of interest.
The helicity-basis projection of the spin-scattering matrix is written as M s ms,m s , where s is the conserved total spin and m s is the corresponding projected spin. These matrices are usually expressed in partial-wave expansions, where the third and fourth rows are Clebsch-Gordan coefficients, J is the total angular momentum, l and l are the orbital angular momenta of the inbound and outbound states respectively, T Z is the azimuthal isospin projection, Y l m (θ, φ) is an azimuthal spherical harmonic, and S 1J l l (p , p) = p , l , s, J|S|p, l, s, J is the usual scattering matrix. The on-shell partial-wave-projected S-matrix is related to the on-shell T (p, p; E)matrix element via S sJ l l (p, p; E) = 1 − 2πiT sJ l l (p, p; E) .
Due to the conservation of probability, the S-matrix is unitary, and can therefore be parametrised by real parameters like phase shifts. In the Stapp convention [53], the coupled-channel S-matrix S 1J l l is then given by where δ ±,J refers to δ l,J for l = J ± 1 and J is the mixing angle. The uncoupled-channel S-matrix S 0J ll is given by (for l = J) The angle θ denotes the scattering angle between the c.m. inbound and outbound relative momenta p and p respectively, while φ is the rotation angle of p around the inbound momentum p, but cylindrical symmetry allows us to set φ = 0. Using Eq. (A.2) we can calculate scattering observables such as the differential cross section, However, this approach does not utilise symmetries to reduce the number of noncontributing terms of M s ms,m s . Instead, the M -matrix can be expressed in terms of non-vanishing spin-momentum products after some consideration of parity conservation, isospin and time-reversal symmetries, and the Pauli principle. We use the Saclay convention [54] for these terms, where a, b, c, d, and e are the Saclay amplitudes, σ i are the Pauli spin matrices acting on nucleon i = 1, 2, and where we define the following unit vectors The Saclay amplitudes are given by the spin-projected matrices as, In this parametrisation, the differential cross-section is given by For the results in this paper we used the following expression for the total cross section: See Ref. [54] for a complete account of scattering observables in the Saclay parametrisation.

Appendix B. The resolvent in a wave-packet basis
The resolventĝ(E) for the full Hamiltonianĥ ≡ĥ 0 +v, can be calculated analytically in a pseudostate wave-packet basis {|z i } n i=1 of the full Hamiltonian. The resolvent is represented in the basis by (see Eq. (17)) where µ = m N 2 . Note that we set the weight function f (k) = k µ and normalisation N i = ∆E i as these are energy wave-packets (from the diagonalisation ofĥ). Using ψ k k , this becomes where we have introduced the Kronecker delta δ ij since ψ If E / ∈ D i , we take the limit → 0 and solve the integral to find where p / ∈ D i , and E i and E i+1 is the lower and upper boundary of D i expressed in energy, respectively. If E ∈ D i , then we have a simple pole at p = k. The pole-integration is done using the infinitesimal complex rotation ±i together with the residue theorem, giving where θ is the Heaviside step-function. The derivation of the resolvent expressed in a momentum wave-packet representation follows a similar procedure. In that case, it is possible to use momentum wave-packets where f (p) = 1 and N i = k i+1 − k i , in which case the derivation above changes a little, see [18].
Energy averaging of the resolvent is done by integrating the resolvent with respect to E, in the bin E ∈ D k , divided by the bin width ∆E k . We introduce the denotation g k ij to reflect this. The derivation is straightforward: where we used N i = ∆E i , and as presented in [18].
The code for GPU-utilisation was written to make use of the CUDA interface, which is developed and maintained by the Nvidia Corporation. CUDA allows for efficient utilisation of a Nvidia GPU using high-level programming languages such as C and C++, Python, or Fortran. For numerically demanding linear algebra operations we made use of the CUDA-libraries cuBLAS [42] and cuSOLVER [43]. These libraries are very similar in use to BLAS and LAPACK-two standard libraries for linear algebra on the CPU. Diagonalising the full Hamiltonian and solving the LS equation are the most numerically demanding parts of the steps presented in Sec. 3.2. These steps are therefore solved on the GPU. The ability to work on several channels simultaneously-this being the advantage of the GPU-is made possible by using CUDA's batched -routines. These are pre-written routines that maximise the efficiency of the GPU to solve several linear algebra problems simultaneously for small matrix-sizes (typically less than 1000 × 1000 in matrix dimension). Such efficient parallelism is generally difficult to achieve "by hand"due to the massive load on GPU-memory read-write accesses.
The importance of efficient memory use is made apparent in Fig. B1 where we show the computation time spent by three different Nvidia GPUs (top panel) and the decomposition of time used for the Nvidia V100 GPU (bottom panel). We see that the majority of the total time is spent on the Hamiltonian diagonalisation. A large fraction of the diagonalisation task is spent on memory accesses as part of the Jacobi method. The three GPUs: V100, T4, and K40, have differences in memory technology [55] which is the main reason for the observed differences in the performance.
We usedcublas<t>gemmStridedBatched to calculate the V C-matrix product in Eq. (32). This calculates a matrix-product on the form C ← αAB + βC, where C is overwritten by the right-hand side. This setup is standard for BLAS gemm-routines. Here, α and β are scalars, while A, B, and C are sets of matrices stored congruently in three arrays, i.e. they are "batched".
To diagonalise the Hamiltonians we used cusolverDn<t>syevjBatched. This routine utilises the parallel cyclic-order Jacobi method, as was briefly discussed in Sec. 4.2, to diagonalise batches of matrices simultaneously.
Lastly, solving Eq. (39) was done using a custom-written function, referred to as a "kernel" in CUDA. The advantage of using the GPU for this task comes with the energy-dependence in the resolvent. We can calculate all on-shell T -matrix elements simultaneously following a single batched Hamiltonian diagonalisation.