Fully Implicit Time Stepping Can Be Efficient on Parallel Computers


Introduction
Benchmarks in parallel computing are often micro-benchmarks or computationally expensive components of full applications, such as solvers for linear systems. However, the choice of numerical method can be as important as the optimization of the component subroutines, so full application benchmarks are also very useful for comparing different choices of numerical methods. The International Workshops on High Order Computational Fluid Dynamics methods [7,12] are a conference series with the aim of comparing the speed and accuracy of computational fluid dynamics software for solving particular well-defined problems. One of the considered cases is the Taylor-Green vortex at a Reynolds number of 1600. This is a case with a vortex instability which is difficult to compute correctly using low resolution or a poor numerical method [24]. While a wide range of spatial discretization methods are suitable for solving the Navier-Stokes equations, Fourier spectral methods are a class of methods in which the effect of the choice of time stepping procedure on the cost and accuracy of solving a partial differential equation can be easily isolated [20]. By comparing two different time stepping schemes, one can also determine whether discretization methods that have conserved discrete analogues of the continuum conserved quantities are well suited to approximating turbulent flows, and will be useful in engineering applications where time and cost to solution are important.
Following this introduction, Section 1 introduces the incompressible Navier-Stokes equations; Section 2 describes the numerical algorithms, which use the fast Fourier transform as a core component. Section 3 describes the massively parallel computational platform used. Section 4 presents the results of the computational experiments. This is then followed by the conclusions.

Incompressible Navier-Stokes Equations
The incompressible Navier-Stokes equations written in dimensionless form are

  ∂u/∂t + (u · ∇)u = −∇p + (1/Re)∇²u,
  ∇ · u = 0,

where u ∈ R³ is the velocity vector and p ∈ R is the pressure. Here Re denotes the Reynolds number that characterizes the flow, and it is defined as Re = V_ref L_ref/ν, where V_ref and L_ref are the reference velocity and length scales and ν is the kinematic viscosity.

The Numerical Method
A Fourier spectral method is used to perform a direct numerical simulation of the Taylor-Green vortex for the incompressible Navier-Stokes equations. For the Taylor-Green vortex flow, given the periodic cubic box [−mπ, mπ]³ (which defines the computational domain) and the initial velocity field, the representative length scale is set equal to the multiple m of the "standard" box width 2π (i.e., L_ref = m), and the reference velocity is set equal to the maximum velocity component of the initial flow field.
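As a concrete illustration, the standard Taylor-Green initial velocity field for m = 1 can be set up as follows. This is a NumPy sketch for illustration only; the benchmark codes [9,10] are Fortran programs, and the function name here is ours.

```python
import numpy as np

def taylor_green_initial(n, m=1):
    """Standard Taylor-Green initial velocity field sampled on an n^3 grid
    of the periodic box [-m*pi, m*pi]^3 (illustrative sketch; not taken
    from the Fortran benchmark codes)."""
    s = np.linspace(-m * np.pi, m * np.pi, n, endpoint=False)
    x, y, z = np.meshgrid(s, s, s, indexing="ij")
    u = np.sin(x) * np.cos(y) * np.cos(z)
    v = -np.cos(x) * np.sin(y) * np.cos(z)
    w = np.zeros_like(x)
    return u, v, w

u, v, w = taylor_green_initial(64)
# The maximum velocity component is 1, so the reference velocity is 1.
```

For this field the mean kinetic energy is exactly 1/8, which is a convenient check when verifying an implementation.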
The parallel codes are available at [9,10]. They use MPI and are similar to the example programs available at [4]. The library 2DECOMP&FFT is used for the domain decomposition and the parallel fast Fourier transforms [21].
In both the implicit midpoint rule (IMR) and the Carpenter-Kennedy [5] (CK) time stepping schemes, time advancement is done in Fourier space; the nonlinear terms are calculated by transforming to real space, multiplying pointwise, and then transforming back to Fourier space. Following [6], we do not de-alias because the simulations are fully resolved. Visualization is done using Octave [15], Matlab, Python (Matplotlib [19]), Paraview [2] and VisIt [8].
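The transform-multiply-transform evaluation of the nonlinear term can be sketched as follows. This is an illustrative NumPy version of the pattern, assuming a 2π-periodic grid with integer wavenumbers; the function and variable names are ours, and no de-aliasing is applied, matching the fully resolved setting described above.

```python
import numpy as np

def convective_term_hat(u_hat, v_hat, w_hat, kx, ky, kz):
    """Pseudo-spectral evaluation of the first component of (u . grad)u:
    transform to real space, multiply pointwise, transform back."""
    u = np.fft.ifftn(u_hat).real
    v = np.fft.ifftn(v_hat).real
    w = np.fft.ifftn(w_hat).real
    # Spectral derivatives of the first velocity component.
    ux = np.fft.ifftn(1j * kx * u_hat).real
    uy = np.fft.ifftn(1j * ky * u_hat).real
    uz = np.fft.ifftn(1j * kz * u_hat).real
    return np.fft.fftn(u * ux + v * uy + w * uz)
```

In the parallel codes the global transforms are supplied by 2DECOMP&FFT; here `np.fft` stands in for them.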

Implicit Midpoint Rule
Writing the semi-discrete equations as u_t = F(u), the implicit midpoint rule is

  (u^{n+1} − u^n)/δt = F((u^{n+1} + u^n)/2),

where n is the timestep. The implementation of the IMR in this study solves this with the fixed point iteration

  u^{n+1,j+1} = u^n + δt F((u^{n+1,j} + u^n)/2),

where j is the iterate, with the stopping criterion that the change between successive iterations is less than 10⁻¹⁰ in the sum of the l∞ norms of the individual flow field components. The method requires three levels of storage, for u^{n+1,j+1}, u^{n+1,j} and u^n, and uses 15 fast Fourier transforms per iteration. The IMR also has discrete analogues of the continuum energy and enstrophy dissipation identities.
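The fixed point iteration can be sketched for a generic ODE system u_t = F(u) as follows. This is an illustrative NumPy sketch, not the spectral Navier-Stokes update; the tolerance matches the 10⁻¹⁰ stopping criterion used in the study, and the previous time level serves as the initial iterate.

```python
import numpy as np

def imr_step(F, u_n, dt, tol=1e-10, max_iter=100):
    """One implicit midpoint step u^{n+1} = u^n + dt*F((u^n + u^{n+1})/2),
    solved by fixed point iteration with an l-infinity stopping criterion
    (illustrative sketch for a generic system u' = F(u))."""
    u_new = u_n.copy()  # good initial iterate: the previous time level
    for _ in range(max_iter):
        u_next = u_n + dt * F(0.5 * (u_new + u_n))
        if np.max(np.abs(u_next - u_new)) < tol:
            return u_next
        u_new = u_next
    return u_new
```

Because the initial iterate comes from the previous time step, only a few iterations are typically needed when the timestep is small.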

Carpenter-Kennedy Method
The CK method consists of splitting the equation into linear and nonlinear parts, solving the linear part implicitly and the nonlinear part explicitly within a five-stage low-storage Runge-Kutta loop. At the end of each stage the updated field v overwrites u (u ← v), and after the final stage the result is returned as u^{n+1}; the stage coefficients are tabulated in [5].
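The two-register, low-storage loop structure can be sketched as follows. For brevity this purely explicit sketch uses Williamson-style three-stage, third order coefficients as a stand-in for the five Carpenter-Kennedy stages, and it omits the per-stage implicit solve of the linear part; the actual coefficients are tabulated in [5].

```python
import numpy as np

# Williamson-style 2N-storage coefficients for a three-stage, third order
# explicit Runge-Kutta scheme -- an illustrative stand-in for the five-stage
# Carpenter-Kennedy coefficients used by the benchmark codes.
A = (0.0, -5.0 / 9.0, -153.0 / 128.0)
B = (1.0 / 3.0, 15.0 / 16.0, 8.0 / 15.0)

def low_storage_rk_step(F, u, dt):
    """One low-storage Runge-Kutta step for u' = F(u): a single work
    register w is recycled across stages and each stage result overwrites
    u (the "u <- v" step described above). In the benchmark codes the
    linear viscous part is additionally solved implicitly at each stage;
    that solve is omitted in this purely explicit sketch."""
    w = np.zeros_like(u)
    for a, b in zip(A, B):
        w = a * w + dt * F(u)  # accumulate stage derivative in one register
        u = u + b * w          # stage update overwrites u
    return u
```

Only two field-sized registers are needed regardless of the number of stages, which is the point of the low-storage formulation.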

Computational Platform
Numerical experiments have been performed on a variety of computational platforms, which are listed in the acknowledgements. The results reported here were obtained on Hazelhen [18], a Cray XC40 supercomputer with dual Intel Xeon E5-2680 v3 (30M cache, 2.50 GHz) processors per node. This supercomputer uses an Aries interconnect [17]. As this computer has a network where congestion effects can change runtime, reported results were run on 768 cores, for which repeated short runs showed that the typical standard deviation in runtime was 20% [3]; see also Fig. 10. Access to a computer system with a network that isolates communication for a particular job to a particular subnetwork (such as found on supercomputers with torus networks) was not available for this study.
The source codes [9,10] were compiled with the GCC GNU Fortran compiler using the Cray provided wrappers (ftn) and the optimization flag -O3. 2DECOMP&FFT [21] was used as the parallel Fourier transform and domain decomposition library, with FFTW 3.3.4.7 as the one-dimensional fast Fourier transform library.

Verification
To ensure verifiability of the computed results, the benchmark case requires production of data for the kinetic energy (Fig. 1), kinetic energy dissipation rate (Fig. 2) and enstrophy (Fig. 3) evolution for the Taylor-Green vortex between times 0 and 10. Also required is a plot of the midplane vorticity (Fig. 4).

Comparison between the Implicit Midpoint Rule and Carpenter-Kennedy Methods
The IMR requires 15 fast Fourier transforms per fixed point iteration, while the CK method requires 70 fast Fourier transforms per timestep. Thus, for a single timestep, if 4 or fewer iterations are needed the IMR requires less time to solution than the CK method. The IMR is a second order method, while the CK method is formally third order, though for the high Reynolds number flows considered here it behaves like a fourth order method. Nevertheless, the IMR should be good at capturing the statistical behavior of turbulent flows. Figure 5 shows the number of fixed point iterations required by the IMR during the computation, and Tab. 1 shows the average number of fixed point iterations as well as the total core hours required for a computation. The number of iterations is small when the timestep is small, and increases as the enstrophy (Fig. 3) and kinetic energy dissipation rate (Fig. 2) of the flow increase. Figures 6-9, which compare runs with 2000 to 16000 timesteps against a reference IMR computation with 16000 timesteps, show that the IMR gives accuracy levels of 10⁻³ for enstrophy and 10⁻⁶ for the kinetic energy dissipation rate, while the CK method gives 10⁻⁶ for enstrophy and 10⁻⁹ for the kinetic energy dissipation rate. Finally, Fig. 10 shows that for the first 20 timesteps of a 2000 timestep run, the IMR and CK methods have the same runtime.
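The break-even point follows directly from the FFT counts given above:

```python
# FFT counts per timestep, taken from the text: the IMR costs 15 FFTs per
# fixed point iteration, the CK method 70 FFTs per step.
FFTS_PER_IMR_ITERATION = 15
FFTS_PER_CK_STEP = 70

# Largest iteration count for which an IMR step is still cheaper in FFTs.
break_even_iterations = FFTS_PER_CK_STEP // FFTS_PER_IMR_ITERATION
print(break_even_iterations)  # 4
```

With 5 or more iterations (75 or more FFTs), the IMR step costs more than a CK step.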

Conclusions
In the initial phase of the Taylor-Green vortex flow, the second order implicit midpoint rule is efficient for moderate accuracy simulations because it requires only a few fixed point iterations to converge. Despite the fact that the implicit midpoint rule preserves the energy dissipation structure of the equations, the current results show that, for the time scale and Reynolds number considered here, the higher order Carpenter-Kennedy method is more accurate at capturing the global kinetic energy and enstrophy evolutions. Structure preserving schemes such as the implicit midpoint rule are often used in computer graphics simulations [16]. These can be coupled with spatial discretization methods that have lower communication requirements than the fast Fourier transform. Relying on classical schemes may be inappropriate, especially for computationally costly simulations where statistical reproducibility, but not pointwise accuracy, is required. Thus, in addition to semi-implicit schemes such as the Carpenter-Kennedy method, fully implicit schemes should also be considered, as they may be computationally efficient [23]. This is because for time evolutionary problems (as opposed to the stationary problems considered in other studies [1,13,22]), a good initial iterate obtained from the numerical approximation at the previous time step can make the number of iterations required for convergence of the iterative scheme small. In atmospheric simulations [25] it has also been observed that implicit schemes may give a faster time to solution than explicit schemes, despite having lower scalability and a lower floating point efficiency.
For many computer scientists, algorithm/subroutine optimization is a common task, but full application optimization typically also requires domain specific knowledge. At present, it is challenging to find studies that combine both domain specific and high performance computing knowledge. Domain specific community benchmarking efforts, such as the International Workshop on High Order Computational Fluid Dynamics methods, serve as a very useful complement to traditional high performance computing benchmarks such as Linpack [14]. Such efforts will become much more relevant in the future as most scientific computing will necessarily be parallel.

Acknowledgements
We thank [...] for the suggestion to use the implicit midpoint rule, and Charles Doering, José Gracia, Hans Johnston, Ning Li, Peter Van Keken, Divakar Viswanath and Jared Whitehead for helpful discussion. BKM was partially supported by HPC Europa 3 (INFRAIA-2016-1-730897). The computational resources used to build and test the programs were:
• Flux, operated at the University of Michigan.
• The Keeneland initial delivery system, operated by Georgia Tech and the National Institute for Computational Sciences.
• Noor, operated by KAUST information technology services.
• Shaheen and Shaheen II, operated by the KAUST Supercomputing Laboratory.
• Hazelhen and Kabuki at HLRS.
Flow field visualization was aided by an extended collaborative support allocation from the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. We thank Mark Vanmoer for help with this.
This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited.