nlchains : A fast and accurate time integration of 1-D nonlinear chains on GPUs

We present nlchains, a software for simulating ensembles of one-dimensional Hamiltonian systems with nearest neighbor interactions. The implemented models are the α-β Fermi–Pasta–Ulam–Tsingou model, the discrete nonlinear Klein–Gordon model with equal or site-specific masses, the Toda lattice and the discrete nonlinear Schrödinger equation. The integration algorithm in all cases is a symplectic sixth order integrator, hence very accurate and suited for long time simulations. The implementation is focused on performance, and the software runs on graphics processing unit hardware (CUDA). We show some illustrative simulations, estimate the runtime performance, and measure the effective scaling of the cumulative error during integration. Finally, we give some basic pointers to extending the software to specific needs. © 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction and motivation
Nonlinear chains, that is, systems of points with nonlinear nearest-neighbor interactions, have an important role in the understanding of the basic processes of nonlinear physics and statistical mechanics. Despite being in general toy models, or at best crude approximations of real physical systems, they show a rich phenomenology, and even basic questions such as the thermalization dynamics or the heat transport law often do not have a definitive answer. Not by chance, one of the first numerical experiments in physics was the renowned Fermi-Pasta-Ulam-Tsingou (FPUT) experiment [1,2]. The aim of the experiment was to observe the thermalization of a linear chain of masses and springs when a small nonlinearity is added. The apparent paradox of the FPUT experiment is that thermalization in the FPUT system cannot be attained in a short time, and so it could not be observed with the computers of the fifties.
Recently, we and collaborators have applied tools of Wave Turbulence [3,4], a statistical description of weakly interacting wave systems, to seek universal traits in the thermalization dynamics of nonlinear chains [5][6][7][8][9], and also [10,11]. To this end, we needed a software to run large simulations of these systems. These numerical experiments are often time-consuming, because thermalization oftentimes requires a long time compared to the wave periods of the corresponding linearized system, and also because large ensembles of realizations of the same chain are needed to extract meaningful statistics. For these reasons, the code had to be written with performance but also accuracy in mind, because the Hamiltonian structure of the system needs to be preserved for long times. We could not find a properly published software that could fit our needs. Recently, an implementation of the FPUT system (possibly extendible to other models) has been published [12]; however, it was not designed with performance as a strong point, but rather has a pedagogical value, with emphasis on fast prototyping and code readability. Our contribution here is the software that we used in our research, that is, the implementation on graphics processing unit hardware (GPUs) of the time integration of several one-dimensional nonlinear chains. The principal value of this software is the effort that we have spent in tuning its performance.

The implemented algorithms
In Table 1 we list the models that are already implemented in nlchains. The q_j and p_j variables are the conjugate coordinates and momenta, while ψ_j is a complex variable, for j an index that runs from 0 to N − 1, with N the length of the chain. We have included the original α-β Fermi-Pasta-Ulam-Tsingou model (FPUT), the discrete nonlinear Klein-Gordon model with equal masses (DNKG) and with disordered, site-specific masses (dDNKG), the Toda lattice and the discrete nonlinear Schrödinger equation (DNLS). The specific version of the Toda lattice potential has been chosen so that it is tangent to the α-FPUT model, that is, the potential energy between two adjacent masses agrees with the α-FPUT one up to cubic order in the relative displacement. Extending the software to other systems with similar potentials is expected to be easy. The parameters α, β, m and m_i can be set by the user.
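To give a concrete flavor of the implemented dynamics, the force evaluation for the α-β FPUT chain can be sketched in a few lines. Note that the potential convention V(r) = r²/2 + αr³/3 + βr⁴/4 and the periodic boundary conditions are assumptions of this sketch, not necessarily the exact conventions of nlchains, whose kernels are CUDA code.

```python
def fput_force(q, alpha, beta):
    """Force on each mass of an alpha-beta FPUT chain with periodic boundaries.

    Assumes the common convention V(r) = r^2/2 + alpha r^3/3 + beta r^4/4 for
    the inter-particle potential, so that V'(r) = r + alpha r^2 + beta r^3."""
    N = len(q)

    def dV(r):
        return r + alpha * r * r + beta * r ** 3

    # Force on mass j: pull from the right bond minus pull from the left bond.
    return [dV(q[(j + 1) % N] - q[j]) - dV(q[j] - q[j - 1]) for j in range(N)]
```

Because every bond term enters the force of its two end masses with opposite signs, the total momentum is conserved exactly, a useful sanity check for any variant of the potential.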

The integration scheme
As mentioned in the introduction, the choice of the algorithm was driven by the need for good performance but also for accuracy in the conservation of the Hamiltonian structure of the nonlinear chains. We chose the 6th order symplectic Yoshida integrator [13]. This algorithm suited our needs for the following reasons. It allows for very long time simulations since, being symplectic, it avoids secular growth of conserved quantities such as the Hamiltonian. It is explicit, which makes it direct and easy to implement. We chose the sixth order, as it results in an optimal trade-off between computational speed and accuracy in our field of research. Higher order schemes of the same type can be implemented with trivial modifications of the source code (details will be given in a later section).
For reference, we report here the principle of this integration scheme. Since all the systems considered are Hamiltonian, the time evolution of some initial state z = (q_0, . . . , q_{N−1}, p_0, . . . , p_{N−1}) (or z = (ψ_0, . . . , ψ_{N−1}) for the DNLS model) is given by the Poisson bracket

    ż = {z, H}.

The above equation can be formally solved by introducing the differential operator D_H = {·, H}, to obtain

    z(t + δt) = e^{δt D_H} z(t).

In general it is not possible to give an explicit form of e^{δt D_H}, as that coincides with integrating the system. However, since the Poisson brackets are bilinear, it is possible to take advantage of the fact that the Hamiltonians in consideration are the sum of different terms. If one splits the Hamiltonian in two contributions, H = H_A + H_B, then the differential operator also splits in two corresponding parts, D_H = D_A + D_B. In general, D_A and D_B do not commute, so it is not possible to write e^{δt D_H} = e^{δt D_A} e^{δt D_B}; however, it can be shown that it is possible to generate a scheme of the type

    e^{δt D_H} = ∏_k e^{c_k δt D_A} e^{d_k δt D_B} + O(δt^{x+1}),

where the coefficients c_k and d_k are real and x is a positive integer, the order of the integration scheme; for the sixth order scheme the coefficient values are given in [13]. Note that the coefficient d_7 is 0, hence it is possible to merge the last integration of one step, e^{δt c_7 D_A}, with the first integration of the next step, e^{δt c_0 D_A}. We expect that if the user needs higher order integrators (e.g. 8th), modifying the software to this end should be quite trivial.
Table 1
The list of implemented models in nlchains. For real models, q_j and p_j are the conjugate coordinate and momentum at index j ∈ [0, N), with N the length of the chain (q̇_j = p_j). For DNLS, ψ_j is a complex variable, and i is the imaginary unit.

Subprogram    Equation of motion and Hamiltonian density

In concluding this section, we remark that for all the models except DNLS the integration is performed in physical space. For the DNLS model, however, because the linear and nonlinear sub-Hamiltonians are diagonal in Fourier and physical space respectively, the integration scheme becomes essentially a refined split-step scheme [15], and the time evolution of the linear part is performed in Fourier space, as suggested (but not implemented) in [14]. This has the side effect that a large number of Fast Fourier Transforms (FFTs) are performed in integrating the DNLS model, namely 14 FFTs per step: the number of non-zero d_k coefficients, doubled to account for the direct and inverse FFT needed to go back and forth from physical space. As we will explain later, this has important consequences for the numerical accuracy and speed of the algorithm.
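The structure of the DNLS update can be sketched as follows. Each sub-flow is exactly solvable: the linear flow is a phase rotation in Fourier space, and the nonlinear flow a phase rotation in physical space, since |ψ_j| is conserved by it. The sketch uses a naive O(N²) DFT in place of cuFFT and, for brevity, a plain second-order Strang splitting instead of the sixth-order composition; the sign conventions and the unit nonlinear coupling are assumptions of this sketch.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT, standing in for cuFFT in this sketch."""
    N = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / N) for j in range(N))
            for k in range(N)]


def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * j / N) for k in range(N)) / N
            for j in range(N)]


def linear_step(psi, dt):
    """Exact linear flow, diagonal in Fourier space: each mode picks up the
    phase exp(-i omega_k dt) with omega_k = 4 sin^2(pi k / N)."""
    N = len(psi)
    X = dft(psi)
    X = [X[k] * cmath.exp(-1j * 4.0 * math.sin(math.pi * k / N) ** 2 * dt)
         for k in range(N)]
    return idft(X)


def nonlinear_step(psi, dt, g=1.0):
    """Exact nonlinear flow, diagonal in physical space (|psi_j| is conserved)."""
    return [z * cmath.exp(-1j * g * abs(z) ** 2 * dt) for z in psi]


def strang_step(psi, dt):
    # nlchains composes these same two flows with the 6th-order c_k, d_k instead.
    return linear_step(nonlinear_step(linear_step(psi, dt / 2), dt), dt / 2)
```

Both sub-flows are unitary, so the norm Σ|ψ_j|² is conserved to machine precision regardless of the step size, which is a convenient invariant to monitor.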

Other calculated quantities
In order to observe recurrence and equipartition, the software also calculates, at a user-settable interval, the average energy per eigenstate of the linearized system (that is, α = 0 and β = 0 in Table 1). For the DNKG, FPUT and Toda systems, it can be shown that the eigenstates of the linearized system are the so-called normal modes

    a_k = (p̂_k − i ω_k q̂_k) / √(2 ω_k),

where the q̂_k and p̂_k are the discrete Fourier transforms of the q_j and p_j, and ω_k is the dispersion relation of the system, that is ω_k = √(m + 4 sin²(π k/N)). For the DNLS model, the normal modes a_k are simply the discrete Fourier transform ψ̂_k of the physical space variables ψ_j, and the dispersion relation is ω_k = 4 sin²(π k/N). In all these cases the energy per mode is defined as

    e_k = ω_k ⟨|a_k|²⟩,

with ⟨·⟩ being the average over the ensemble. For the dDNKG model, the eigenstates v_k and corresponding eigenvalues ω_k² are calculated numerically from the matrix representation of the system of ordinary differential equations that corresponds to the equations of motion of the chain. The energy per mode is then obtained from an explicit projection of the coordinate and momentum vectors q = (q_0, . . . , q_{N−1}) and p = (p_0, . . . , p_{N−1}) over the eigenstates,

    e_k = ⟨((v_k · p)² + ω_k² (v_k · q)²) / 2⟩.

From the average energy per mode, the software calculates and outputs the associated information entropy

    S(t) = (1/N) Σ_k f_k ln f_k,   f_k = N e_k / E,   (10)

where N is the length of the chain and E = Σ_k e_k the total linear energy, and the Wave Turbulence entropy

    S_WT(t) = −(1/N) Σ_k ln f_k.   (11)

Traditionally, Eq. (10) has been used to monitor the route to thermalization of the FPUT problem, but in the framework of Wave Turbulence only Eq. (11) has the properties of an entropy function, that is, it can be proven that statistically it is a monotonically decreasing function of time. However, it can be shown [8] that both entropies are greater than zero out of thermal equilibrium and zero at perfect equipartition, and they are essentially equivalent for the purpose of monitoring the route to thermalization.
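A reference implementation of the linear energy per mode for the equal-mass DNKG chain might look like the following sketch, for a single realization (the ensemble average is left out). The unitary DFT normalization and the convention e_k = (|p̂_k|² + ω_k² |q̂_k|²)/2 are assumptions of this sketch; a naive O(N²) DFT keeps it dependency-free, whereas nlchains uses cuFFT.

```python
import cmath
import math


def mode_energies(q, p, m=1.0):
    """Linear energy per normal mode of the DNKG chain (alpha = beta = 0 limit):
    e_k = (|p_hat_k|^2 + omega_k^2 |q_hat_k|^2) / 2, with a unitary DFT and
    omega_k^2 = m + 4 sin^2(pi k / N)."""
    N = len(q)

    def dft(x):
        # unitary normalization: divide by sqrt(N)
        return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / N)
                    for j in range(N)) / math.sqrt(N) for k in range(N)]

    qh, ph = dft(q), dft(p)
    return [(abs(ph[k]) ** 2
             + (m + 4.0 * math.sin(math.pi * k / N) ** 2) * abs(qh[k]) ** 2) / 2
            for k in range(N)]
```

Exciting a single cosine profile puts all the linear energy into the conjugate pair of modes k and N − k, which makes for a quick correctness check.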

Software architecture
The software is packaged in a stand-alone Linux executable. Build requirements and compilation steps are documented in the readme (file README.md) that comes with the sources. An important detail of the building process is that all models except DNLS are optimized at compile time for a specific chain length. A warning is generated if nlchains is launched with a chain length that does not correspond to the value the build was optimized for, but the computation is carried out anyway. For more details we refer the reader to Section 5.
The simulation is set up with a number of command line arguments. The invocation in the shell is of the form nlchains <subprogram> [options]. The executable can be run as-is on a single GPU on the current host, or through MPI to split the ensemble of realizations over different GPUs. The user should assign to each compute node the same number of MPI processes as the number of GPUs attached to the node itself. When running through MPI, the software expects the presence of a shared filesystem, because full-state dumps are written to disk with the MPI shared I/O facilities. The GPU code is written in CUDA, and the host code is written in C++14, hence an NVIDIA GPU card (minimum compute capability 3.0) is required. Please note that the software is implemented in double precision floating point, and so a card of the Tesla line is suggested, because consumer-level cards are severely limited in double precision performance.

Table 2
List of the output files. The value of prefix is set with the command line option -p (see Table 3).

For the purpose of performance comparison, and as a fallback in the case that no GPU is available on the host, all the models also have a CPU-only implementation. To launch it, simply append -cpu to the model name, for example DNLS-cpu. The implementation is single-threaded, but multiple cores and nodes can be used for parallelization in the same way that the GPU implementation can be launched on multiple GPUs. The implementation also makes use of the advanced SIMD instructions present in modern processors (SSE, AVX and AVX-512), and as such it should be regarded as competitive in performance. This implementation is used here as a comparison point for the performance of the GPU implementation (see Section 5.2). However, presenting a CPU implementation is not the main message of this paper. The interested reader is referred to the implementation notes attached to this paper as supplementary material for further information.

File formats
The format of the inputs and outputs is raw binary. The list of output files is shown in Table 2.

Command line options
The first argument to nlchains is one of the subprograms listed in Table 1. If no other argument is given, the list of arguments applicable to the model is printed, together with a short description. In Table 3 we describe the most important command line options, which are common to all models and are necessary to run a simulation. Other common switches are available (such as for terminating when a given entropy threshold is reached), but for brevity we refer the reader to the readme available in the source tree.

Table 3
Command line options that are common to all models. The model parameters α, β, m and m_i (see Table 1) can be set respectively with --alpha, --beta, -m; for the DNKG model -m

Illustrative examples and accuracy
As an example of the use case of this software, we show in Fig. 1 the interesting case of the Toda lattice dynamics versus the α-FPUT dynamics, with the same initial state in terms of linear wave modes, and the corresponding entropy curves given by Eq. (11). The initial state of the ensemble (empty circles in the top half of Fig. 1) has been initialized with random values for the energy per linear mode, the same set for all realizations in the ensemble and rescaled to have E = 1, but each realization had a different phase of the normal modes a_k. This scheme ensures that all realizations have the same initial linear energy. The value of α is 0.5, so that the nonlinear terms of the Hamiltonians are small compared to the linear part, because in this regime it is easier to observe the equipartition of the linear modes. The timestep is set to δt = 0.1 and the total number of steps is 2×10⁷. We can see that for the Toda lattice the entropy (dashed line in the bottom half of Fig. 1) quickly settles to a value not far from the initial one, which clearly signals that the system is not thermalized, while the α-FPUT system (solid line) eventually reaches equipartition. This can be appreciated from direct inspection of the energy per mode in the final state: in the top half of Fig. 1, the filled circles are the energy per mode of the α-FPUT chain at equipartition (up to some statistical fluctuation due to the finite size of the ensemble), while the empty squares are the final state of the Toda lattice. For more elaborate examples we refer the interested reader to the papers [7,8].
In Fig. 2 we show the scaling of the accuracy of the simulation as a function of the step size. In order to measure the accuracy, we monitor the value of a known exact integral of motion, that is, the value of the Hamiltonian H(t), or total energy. The simulation is run for a fixed total time T = 100000 (arbitrary units), but with different time steps, and hence a different total number of steps. The initial state of each simulation is initialized in a similar way to the data of Fig. 1. The accuracy is then calculated as the average relative deviation from the initial value of the energy,

    ϵ = ⟨|H(t) − H(0)|⟩ / H(0),

with the average taken over the duration of the simulation. Since we use a sixth order symplectic integrator, the expected scaling of the error is ϵ = O(δt⁶). We see in Fig. 2 that we get the expected scaling of the error for the models FPUT, DNKG, Toda and dDNKG, up to a saturation around ϵ ∼ 10⁻¹³. For the DNLS model, we get a saturation much earlier, at around ϵ ∼ 10⁻⁹. This can be explained by the fact that the DNLS algorithm, as mentioned earlier, requires 14 Fast Fourier Transforms (FFTs) for each single time step, hence the data undergoes many more additions and multiplications compared to the other models, and numerical errors due to the finite accuracy of machine numbers accumulate faster. In fact, the behavior of ϵ when δt is small is of the type ϵ = O(δt⁻¹), that is, it is proportional to the number of FFTs. This is further corroborated by the fact that when the nonlinear parameter is set to zero, that is, when the symplectic integration scheme is no longer a source of error as it becomes exact, the scaling of the error is still ϵ = O(δt⁻¹) even for large values of δt.
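The measurement procedure behind Fig. 2 is easy to reproduce on a toy problem. The sketch below runs a second-order splitting (for brevity, instead of the sixth-order one) on a single harmonic oscillator at two step sizes and compares the maximum relative energy deviations, a close proxy for the averaged ϵ defined above; for an order-n scheme the deviation should shrink by a factor of about 2ⁿ when δt is halved, here with n = 2.

```python
def energy_error(dt, T):
    """Maximum relative energy deviation for a unit harmonic oscillator
    integrated with the second-order splitting drift(dt/2) kick(dt) drift(dt/2)."""
    q, p = 1.0, 0.0
    e0 = 0.5 * (q * q + p * p)
    worst = 0.0
    for _ in range(int(round(T / dt))):
        q += 0.5 * dt * p   # drift: e^{(dt/2) D_A}
        p -= dt * q         # kick:  e^{dt D_B}
        q += 0.5 * dt * p   # drift: e^{(dt/2) D_A}
        worst = max(worst, abs(0.5 * (q * q + p * p) - e0) / e0)
    return worst


# For a second-order scheme this ratio approaches 2^2 = 4.
ratio = energy_error(0.1, 20.0) / energy_error(0.05, 20.0)
```

The same experiment with the sixth-order composition reproduces the O(δt⁶) slopes of Fig. 2, until round-off saturation takes over.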
All the data shown in this Section is attached as supplementary material.

Implementation and software extendibility
The software does not have an interface for adding new models, and essentially all the available settings can be accessed through command line options. This is a design choice, because performance has been the top priority in developing this code: in the case of GPU programs, good performance is in general attained with tight coupling between host and GPU code, avoiding unnecessary generalization (such as allowing new models to be specified through virtual functions). In lieu of runtime flexibility, nlchains is designed to be easily modifiable and extendible. In addition to the source code comments, a manual with extensive implementation notes, a description of the internal utility interfaces and examples of their usage is provided in the supplementary material of this paper (also as a stand-alone document in the source code tree, supportingmaterial/documentation/implementation-notes.pdf).
We refer the user who needs to adapt nlchains to his or her needs to this manual, to avoid cluttering this paper with implementation details.

Notes on the implementation for the end user
We mention here only some implementation details that are useful even for the user who does not intend to modify the functionality of nlchains.
All models except DNLS have three different implementations of the GPU kernels. One implementation is optimized for chain lengths less than 32 (move_chain_in_thread), another one is generic for all chain lengths (move_split), and the last one (move_chain_in_warp) must be tuned at compile time for a specific chain length greater than or equal to 32. The instructions on how to select the target chain length optimization are detailed in the readme within the sources. This optimized implementation is much faster than the generic one (referred to as the ''split'' kernel, for how the q_j and p_j variables are kept in separate buffers), because the ensemble state is kept in registers for the whole duration of the kernel, rather than being read and written in global memory. Since register memory is limited, the maximum target chain length is around 1024, though it is not possible to give a precise upper bound as that depends on the GPU hardware and the compiler version. The user should always compare the runtime of the optimized version against the generic one by using the command line option --split_kernel.
We showed earlier how the integration of the DNLS model essentially turns into a refined split-step scheme with a large number of FFTs involved. We use the cuFFT library [17] for this purpose. In order to save some of the numerous round trips of the ensemble between registers and global memory, the linear and nonlinear operators have been implemented as cuFFT callbacks. In general this leads to a large performance gain; however, there might be circumstances where the non-callback version is faster. The use of callbacks can be suppressed selectively with the command line options --no_linear_callback and --no_nonlinear_callback, and we encourage the user to benchmark the various combinations with and without callbacks for a given ensemble size and hardware combination.

Table 4
Performance measurements on a single K40m card (clocked at 875 MHz and 3004 MHz for core and memory respectively) for a dataset with a chain length of 64 and 1024 copies, and a kernel batching size (option -b) of 100000 (1000 for the DNLS model). File dumping has been disabled in these runs.

Chain    Steps/second

Performance measures
Performance of the software depends on a large number of factors. For reference, in Table 4 we show some rough estimates of the number of steps per second for most of the implementations present in the software, for a fixed chain length of 64 and ensemble size of 1024. These performance figures should scale roughly linearly with the number of GPUs, and linearly with the inverse of the chain length and ensemble sizes. As mentioned earlier, the big performance bottleneck for the FPUT model is the memory access in the case of the ''split'' kernel, hence it is crucial to recompile the software for the desired chain length. Such a difference is expected also in the models DNKG, dDNKG and Toda. The DNLS model is the slowest of all, due to the large number of FFTs involved. However, it is possible to appreciate the speedup due to the use of cuFFT callbacks.

As previously mentioned, we have implemented the integrators as traditional CPU code, in order to assess the advantages of the GPU implementation. The code makes use of the vectorization (SIMD) capabilities of recent CPUs, in particular the AVX and AVX-512 instruction sets. The CPU implementation has been tested on an Intel Xeon Processor E5-2680 (which has a release date similar to the NVIDIA K40m card used in the GPU benchmarks), and as such it utilizes the AVX instruction set. It can be parallelized in the same way as the GPU implementation (see Section 3). The results are shown in Table 5, both for a single-thread run and a 24-thread run (all the cores of the CPU in use), with the same parameters as the data shown in Table 4. We observe the following. The relative order in terms of performance of the models is essentially the same as for the GPU implementation. The multithreaded performance is roughly linear in the number of threads, though it shows the typical signs of saturation: a 20-fold increase is observed when running the simulation with 24 threads. Finally, we see that the speedup of the (optimal) GPU implementation over the multithreaded implementation is significant. For the real models, we observe a speedup of 6.7x (DNKG), 5.9x (dDNKG), 4.9x (FPUT) and 9.0x (Toda). For the DNLS model, the speedup is much more modest (1.2x). This is due to the fact that the GPU implementation suffers from the high-latency global memory operations, while the CPU implementation operates on a single chain at a time, and hence the CPU cache is utilized.

Table 5
Performance measurements on an Intel Xeon Processor E5-2680, both single-threaded and with 24 threads, for a dataset with a chain length of 64 and 1024 copies (1008 for the multithreaded test), batching size (option -b) of 10000 (1000 for the DNLS model). File dumping has been disabled in these runs.

Chain    Steps/second (1 thread)    Steps/second (24 threads)
DNKG     1229                       24510
dDNKG    1280                       25623
FPUT     802                        17123
Toda     197                        4165
DNLS     90                         1965
Unfortunately it was not possible for us to test the code on more recent GPU hardware which should provide a much better memory technology.

Impact
The purpose of this software is to provide the scientific community with a specialized tool for simulating nonlinear chains (and possibly other simple Hamiltonian, nearest-neighbor systems). While the simulation algorithm is not new, and naive implementations of the Yoshida sixth order integration algorithm are not particularly difficult, to our knowledge there has been so far no effort to code such a simulation with a strong focus on performance, modernity and extendibility, nor software that makes use of heterogeneous hardware architectures such as GPUs.
This software has been essential in our research work [7,8]. Naive implementations of this kind of simulations can have a runtime of weeks. Exploiting the parallelization possibilities of GPU hardware, together with painstaking work of tuning and optimization of the code, made running a simulation of typical size a matter of at most hours.

Conclusions
In this paper we presented nlchains, a specialized software for simulating a number of one-dimensional Hamiltonian systems with nearest neighbor interactions on GPU hardware. The software has been coded during the study of the thermalization of these systems, but other applications are possible, as the speed and accuracy are very good. We have described briefly the usage of the software, and mentioned a few implementation details that can guide the interested user in adapting the software to his or her needs. We provided also a few simulation results, most importantly the scaling of the cumulative errors in the simulations, which matches very well the expected scaling O(δt 6 ) with δt being the step size. To our knowledge, this is also the first implementation of the Yoshida 6th order symplectic integrator for the discrete nonlinear Schrödinger equation as described in [14] with the suggested optimization of leveraging the Fast Fourier Transform for the linear operator of the algorithm.

Declaration of competing interest
The authors wish to confirm that there are no known conflicts of interest associated with this publication.