Advancing parabolic operators in thermodynamic MHD models: Explicit super time-stepping versus implicit schemes with Krylov solvers

We explore the performance and advantages/disadvantages of using unconditionally stable explicit super time-stepping (STS) algorithms versus implicit schemes with Krylov solvers for integrating parabolic operators in thermodynamic MHD models of the solar corona. Specifically, we compare the second-order Runge-Kutta Legendre (RKL2) STS method with the implicit backward Euler scheme computed using the preconditioned conjugate gradient (PCG) solver with both a point-Jacobi and a non-overlapping domain decomposition ILU0 preconditioner. The algorithms are used to integrate anisotropic Spitzer thermal conduction and artificial kinematic viscosity at time-steps much larger than classic explicit stability criteria allow. A key component of the comparison is the use of an established MHD model (MAS) to compute a real-world simulation on a large HPC cluster. Special attention is placed on the parallel scaling of the algorithms. It is shown that, for a specific problem and model, the RKL2 method is comparable or surpasses the implicit method with PCG solvers in performance and scaling, but suffers from some accuracy limitations. These limitations, and the applicability of RKL methods are briefly discussed.


Introduction
Complex physical systems often have processes that act on widely different time-scales. For magnetohydrodynamic (MHD) models of the Solar corona, these processes include plasma flow, slow and fast magneto-sonic waves, resistivity, kinematic viscosity, and thermal conduction [1]. When implementing the models with numerical methods, accurately capturing the dynamics of these wide-ranging time-scales can make the simulations computationally intractable due to the fine time-resolutions required. Also, when using classic explicit finite difference algorithms (such as the Euler method), there exists numerical time-step stability restrictions (typically a function of grid cell size) that can limit the time-step further.
In order to make the simulations computationally feasible, they need to be integrated at time-steps which exceed the fastest time-scales. This is possible because often, a system can be integrated without accurately capturing the detailed evolution of the faster time-scales and still obtain useful solutions. However, integrating at large time-steps can violate the explicit numerical stability requirements, and if too large, can cause the model to loose convergence and become very inaccurate. Therefore, accuracy considerations must be taken into account when integrating past the fast time-scales, and numerical methods that can overcome the explicit limits are required.
In our model, we adhere to the flow speed time-scale, but exceed the remaining time-scales through a variety of methodologies (see Sec. 4). The focus of this paper is comparing methods to exceed the explicit numerical time-step restrictions of parabolic operators. To do this, the operators are first isolated from the rest of the terms in their equations of motion through operator-splitting [2] and are written as where F is the parabolic operator acting on the variable u.
There exists a number of methodologies which allow for the efficient integration of Eq. (1) including implicit finite difference methods (e.g. backward-Euler, Crank-Nicholson, etc.), Saul'yev unconditionally-stable explicit schemes [3,4], and super time-stepping schemes (also known as extended-stability Runge-Kutta schemes) [5]. In some cases, smaller time-scale nonparabolic operators yielding the effect of the original operator can be found (e.g. see Ref [6] where a wave formulation of thermal conduction is used). Implicit schemes typically lead to large systems of equations at each time-step that are often solved using Krylov subspace methods (in production run described in Sec. 5. The validation of the two methods is discussed in Sec. 6 including important applicability concerns for RKL methods. The timing results and analysis are described in Sec. 7, and we conclude in Sec. 8.

Implicit Backward Euler Solved with Preconditioned Conjugate Gradient
There exists a wide range of implicit discretization schemes to integrate Eq. (1) and numerous methods to solve the resulting system of equations. For the present context, we limit ourselves to the backward Euler (BE) scheme solved with a preconditioned conjugate gradient (PCG) sparse solver (we denote the full method as BE+PCG). The BE+PCG method is chosen as it is commonly used and is the current method applied in the MAS code. Although the method is well-known, we review it in this section in order to highlight its advantages and disadvantages, especially in relation to parallel performance.

Implicit Backward Euler
One of the easiest-to-implement implicit methods for integrating Eq. (1) is the BE scheme given by where ∆t = t n+1 − t n is the chosen time-step and u n = u(t n , x). If one assumes that F is a linear self-adjoint operator, then Eq. (2) can be written as where M is a symmetric positive-definite matrix. Given the solution at step n (u n ), Eq. (3) leads to a system of equations, where A = 1 − ∆t M, which, when solved yields u n+1 . The BE method is a low-order method (O(∆t) accurate in time), but because it is unconditionally stable, L-stable (i.e. its amplification factor tends to 0 as the wave mode number tends to ∞), and very simple to implement, it remains a useful and wide-spread method.

Preconditioned Conjugate Gradient
The matrix A of Eq. (4) is often large and very sparse. A common method for solving such sparse systems are Krylov subspace iterative solvers (see Ref. [13] for a detailed review), and in the case of symmetric operators, the Conjugate Gradient (CG) method. In order to allow the CG method to converge to the solution efficiently, the system is preconditioned with an approximate inverse of the matrix P −1 ≈ A −1 , leading to the preconditioned-CG method (PCG). The application of the preconditioner (P −1 x) is almost always performed without explicitly formulating the inverse matrix P −1 . The PCG algorithm as used in MAS is shown in Fig. 1 where we have highlighted the expressions that require inter-process communication with a green box to denote local communication/synchronization (stencil interchange) and a red box to denote global communication/synchronization (global summation for the dot products). While the basics of the PCG method are relatively simple to implement, it is difficult to properly choose optimal convergence criteria/values [14] and, even more so, an efficient preconditioner P.

Preconditioners
Check r r for convergence A major difficulty with using the PCG method is selecting and computing an inexpensive yet efficient preconditioner (PC). Preconditioning is a very complex topic with many decades of research and a multitude of methods [15]. The use of any method has many weighing factors such as the amount of computational work, complexity of implementation, ability to parallelize efficiently, applicability to the linear solver and matrix representation, as well as problem-specific considerations that can reduce the usefulness of the PC (and in some cases can even cause the PCs to break down [16]).
Another difficulty is that certain PCs which work very well to improve convergence, can be very difficult to implement efficiently (or at all) on large HPC systems.
For the comparisons in this paper, we use two relatively simple communication-free PCs: 1) A Point-Jacobi/diagonal scaling (PC1), which simply uses the inverse of the diagonal of A. The PC1 is very inexpensive to formulate and apply, but is limited in its effectiveness. 2) A non-overlapping domain decomposition with zero-fill incomplete LU factorization (PC2), which is more expensive to formulate and apply than PC1, but is also much more effective at reducing iterations. One drawback of PC2 is that, due to the non-overlapping implementation (used to avoid adverse scaling performance), it becomes less effective as the number of processors increases. In fact, in the non-practical upper limit of one processor core per grid point, PC2 becomes equivalent to PC1. However, for realistic uses, the degradation in convergence remains small for reasonable grid points per processor ratios (see Sec. 7). Pseudo-code descriptions of the implementation for the initial formulation and subsequent application of both PCs is described in Appendix A.
Many new methods/preconditioners such as Geometric/Algebraic multigrid [17] and interprocessor overlapping PCs [18] exist, which would yield different performance and scaling results than presented here. However, such methods are often substantially more difficult to implement and parallelize (especially into legacy code) and often have trade-offs in effectiveness and parallel scaling that are hard to predict.

Advantages and disadvantages of the BE+PCG method
Some advantages of using the BE+PCG approach are that; (a) it is widely used, robust, and has well-known convergence properties; (b) the BE scheme is L-stable, which efficiently damps out high wave modes (see Sec. 6); (c) it can be used for slightly non-symmetric operators [19]; (d) its basic form is fairly straightforward to implement; (d) there exists many available 'black-box' implementations, and (e) it can be efficient given a good preconditioner and implementation.
However, the BE+PCG approach does suffer from some important disadvantages. One limitation in using the PCG method (or any Krylov solver) is that it requires a linear operator. Often, (as in the case of thermal conduction) the parabolic terms have non-linear coefficients and/or are desired to be flux-limited. In order to do this, using BE, one would have to employ a nonlinear system solver such as the popular Newton-Krylov methods [7] which can be computationally expensive. Efficiently porting the PCG method to accelerator hardware is a formidable challenge and, as mentioned in the previous section, choosing an optimal stopping criteria and efficient preconditoner is often very difficult.
Running PCG solvers on large HPC systems exposes other disadvantages. The global communication/synchronization requirement of the dot-products in the PCG algorithm can adversely effect the scaling of the method, especially in the presence of (even small) load imbalance (see Sec. 7). There are ongoing efforts to mitigate this problem [20], but they have yet to show substantial improvement. This is problematic for domain decomposed problems (such as ours) because the grid resolution in each dimension does not typically divide evenly into the number of processors assigned to that dimension, leading to some sub-domains having more grid points than others. Therefore, as the problem is spread to more and more processors, the load imbalance will increase (see Fig. 10 in Sec. 7).

Explicit Runge-Kutta Legendre Super Time-stepping scheme
A relatively recent class of schemes for integrating parabolic operators explicitly without numerical restriction on the time-step size are 'super time-stepping'/'extended stability' (STS) methods. The main idea behind them is to use a Runge-Kutta multi-stage scheme with stages added for stability rather than accuracy. For a given desired time-step, a number of stages/sub-steps (s) and specialized coefficients for each stage are found such that the overall advance is stable. STS methods are based on either Chebychev polynomials (RKC) [11] or Legendre polynomials (RKL) [8,9], and are either formulated as factored [21] or recursive [5] methods.
The RKC STS method has been used successfully for parabolic operators in MHD codes [22,23,24] and the RKL method is currently being implemented into the Lare3D MHD code [25,26] and has been shown to be accurate and efficient for some coronal loop simulations [27]. Ref. [9] has shown that for nonuniform/nonlinear diffusion coefficients (such as ours), the RKL STS method has more desirable properties than the RKC. We therefore use the RKL method, and for accuracy and robustness, we use the second-order accurate version (RKL2) [9].
The algorithm for the RKL2 scheme is shown in Fig. 2. The pre-computed coefficients for each stage are defined as The number of stages/sub-steps needed for stability (given a desired ∆t) is given by where ∆t Euler is the explicit Euler time-step stability limit (for a simple and fast way to calculate ∆t Euler that avoids too much underestimating, see Appendix B). For better accuracy, the authors of Ref. [9] recommend always ensuring s is odd (although in our experience, when using more sub-steps than Eq. (5), this is unnecessary).  . Estimated speedup of using the RKL2 versus the explicit Euler method assuming one Euler step has the same computational cost as one RKL2 sub-step. The jaggedness of the plot is due to the discrete nature of Eq. (5).
Since ∆t Euler ∝ ∆x 2 , the number of sub-steps follows s ∝ 1/∆x whereas subcycling the Euler method would take ∝ 1/∆x 2 steps. Therefore, the RKL2 scheme requires many less steps than the Euler method and the larger the overall time-step is, the more speedup the RKL2 method exhibits (see Fig. 3).
As in describing the PCG solver, in Fig. (2) we have highlighted the terms in the RKL2 method that require parallel communication. Unlike the PCG solver, the RKL2 scheme only has local communication/synchronization, so is expected to scale much better in the presence of load imbalance and system performance fluctuations (see Sec. 7 where this is indeed the case).

Advantages and disadvantages of RKL2
Some of the main advantages of the RKL2 method (and other STS methods) is that it is very easy to implement (both in a matrix and matrix-free manner), lends itself to easy implementation of directive-based parallelism (such as OpenMP/OpenACC), does not require global synchronizations/communications, and can be used with nonlinear operators/fluxlimiting. Its performance gains can vary, but in cases where the accuracy of the fast-time scale of the operator is not needed, it can have substantial speedup over standard explicit methods, and, as we will show in Sec. 7, can sometimes outperform implicit methods.
One disadvantage of the RKL2 method is that it is very new with very few real-world implementations to test its robustness in a variety of settings. Also, in some cases, the number of sub-steps/stages can become very large, in which case implicit methods may outperform it. One additional aspect of the RKL2 scheme is that, as will be discussed in more detail in Sec. 6, its amplification factor does not generally monotonically decrease for increasing wave mode numbers, and is quite high for the highest modes. This can cause solution artifacts when using only a few total time-steps on solutions containing high wave mode structures, and, when using large time-steps in coupled systems, may fail to damp the high wave modes.

The MAS Thermodynamic MHD Model
The MHD model we use to test the RKL2 and BE+PCG methods is the Magnetohydrodynamic Algorithm outside a Sphere (MAS) code [28]. MAS integrates the time-dependent resistive thermodynamic MHD equations in spherical coordinates and has been used extensively in models of coronal structure [29,30,28,31], coronal dynamics [32,33,34] and coronal mass ejections [35,36]. It is also the primary MHD model in CORHEL (Corona-Heliosphere), a suite of models [37] for describing the solar corona and inner heliosphere that is available for public use at NASA's CCMC 1 . In this section we describe the model equations and numerical methods used in MAS as they are configured for our comparisons (see Sec. 5).

Model equations
In MAS, the one-fluid ideal MHD equations [1] are extended to include additional physics with a wide variety of models options and parameters. For the problem used in this paper, the governing equations take the form: where A is the magnetic vector potential, B = ∇ × A is the magnetic field, J = c 4 π ∇ × B is the current density, ρ is the plasma density, T is the temperature, p = 2 k T ρ/m p is the plasma pressure, v is the plasma velocity,b = |B|/B is the normalized direction of the magnetic field, c is the speed of light in a vacuum, γ = 5/3 is the adiabatic index, m p is the proton mass, k is is a profile that limits the radial extend that collisional thermal conduction is active, f nc (r) = 1−f c (r) is the equivalent profile for collisionless thermal conduction (the last term of Eq. (8) [38]), Q(T ) is a radiative loss function (here we select the piece-wise exponential approximation of Ref. [39]), H is a series of combined empirical heating models that are a function of B [28], v A = |B| 2 /4 π ρ is the Alfvén wave speed, g = −g 0 R 2 r/r 2 is the gravitational force, and the last expression in Eq. (11) is a semi-implicit term described in the next section. Eqs. (9) and (10) are a WKB approximation for Alfven wave pressure advance [29] where + and − are the forward and backward Alfvén wave pressures.
The resistivity term in Eq. (6) is added in order to dissipate structures that cannot be resolved since they are smaller than the cell size. For the current simulations, we use a constant value of η = 4.678×10 −8 s corresponding to a magnetic diffusion coefficient of ∼ 10 12 cm 2 /s. While this is much larger than the typical chromospheric (∼ 10 7 cm 2 /s) and coronal (∼ 10 4 cm 2 /s) values [1], the corresponding magnetic diffusion time scale remains much smaller than the typical Alfvén travel time (yielding Lundquist numbers of ∼ 10 6 ). Therefore, although sub-grid level structure is diffused, the overall solution is not affected substantially. The artificial viscosity term of Eq. (11) is added for a similar reason, in that it serves to dissipate unresolvable stricture. For our problem, we use a kinematic viscosity value of ν = 6.7 × 10 15 cm 2 /s, which, while very large, yields a diffusion time-scale for velocity that is still 500 times smaller than the typical Alfvén travel time.
The function β Tcut (T ) is a cut-off function that sets a minimum temperature that can be used for the thermal conduction coefficient as β Tcut = (T /T cut ) 5/2 for T < T cut and β Tcut = 1 for T ≥ T cut , where T cut = 5 × 10 5 K. Applying this cut-off function serves to broaden the width of the transition region (to make it more numerically resolvable) in such a way that it has a minimal effect on the global coronal solution [40].
In describing the MHD equations above, we have boxed the parabolic terms, where those used to test the two methods are colored blue (the resistivity term uses a very small number of PCG iterations to solve, and is therefore excluded from the method comparisons).

Numerical methods
The MAS code is written in FORTRAN90 and parallelized using MPI. It computes the MHD equations using finite difference on a logically rectangular non-uniform spherical grid. The non-uniformity of the grid allows MAS to efficiently resolve small-scale structures such as the transition region and active regions, while allowing for coarser grid points over the global scale. Each component of the field vectors are staggered which, combined with the use of A as a primitive, ensures that ∇ · B = 0 exactly. Advective terms are differenced using first-order upwinding, while parabolic and gradient terms are centrally differenced. The non-uniformity of the grid is implemented into the discretizations using the standard method 'A' of Ref. [41]. For the anisotropic thermal conduction operator, we use the 'standard' scheme of Ref. [42]. The viscosity operator is written in a self-adjoint form (excluding boundary conditions) that leads to a nearly-symmetric solver matrix.
The code uses an adaptive time-step that conforms to the plasma flow CFL condition. As this is often a much slower time-scale than the magnetosonic waves, a semi-implicit term is added to Eq. (11) that stabilizes the algorithm for time-steps larger than the fast magneto-sonic wave limit. The semi-implicit factor is given as S = (∆t 2k2 ) −1 (C 2 w /(1 − C f ) 2 − 1), wherek 2 is the upper limit of the combined inverse grid spacing, C f is the flow CFL, and C w is the fast magnetosonic wave CFL [43].
The temporal differencing uses a predictor-corrector scheme for the advective and semiimplicit operator, where any reactive terms are included in the corrector step. The parabolic terms are first-order operator split. The WKB approximate Alfvén wave pressure advance of Eqs. (9) and (10) are sub-cycled at their explicit advective CFL condition based on v A .
All parabolic operators and the semi-implicit solves are currently computed using the BE+PCG method with PC1 for the resistivity, and PC2 for the remaining solves. The operator matrices are stored in a DIA sparse storage format [44] for internal grid points, while the boundary conditions are applied matrix-free. The PC2 preconditioner LU matrix is stored in the CSR sparse storage format optimized for memory access described in Ref. [45]. In our implementation of the STS method for this comparison, we use the same matrix format used for the solver operators.

Real-world Test Case
To provide a useful comparison of the methods, we use a production run of the MAS code. The procedure for producing an MHD model of the solar corona has been described previously [30,28]. In brief, a full-sun map of the radial magnetic field derived from observations is used to specify the boundary conditions at the solar surface. A potential solution is computed Table 1. Grid information for the coronal relaxation run.
using this boundary condition to supply the initial magnetic field condition (see Fig. 4). A spherically symmetric solar wind solution (see Fig. 5) is used as the initial condition for the other variables (ρ, T , v). This initially non-equilibrium state is advanced in time until a steadystate is approached. The resulting solutions can be compared against a range of solar and heliospheric observations [37], including EUV and X-ray emission [28], and white-light eclipse images [46]. For the case shown here, the surface radial magnetic field is taken from a flux evolution model (ADAPT) that assimilates observational Earth-based magnetograms to create a full-sun magnetogram at a one particular time [47,48]. Sometimes active regions (AR) can emerge on the far-side of the Sun that are not assimilated into the ADAPT map until they rotate into Earth's view. In the case being simulated here, such an AR was spotted by the STEREO spacecraft in EUV emission images. In order to better approximate the synchronic state of the surface field, the far-side AR (as observed later in time) is added into the ADAPT map. The map is then interpolated onto the computational grid and smoothed. In Fig. 4, we show the final form of the ADAPT map used, and the initial condition of B derived from the resulting potential field solution. The full simulation integrates the initial conditions for 48 simulation-hours. For validation runs (Sec. 6), we integrate from the initial conditions to t = 8 simulation-hours as this contains the most dynamical/difficult portion of the total run. For the timing runs, we start with the solution at t = 8 simulation-hours (computed using MAS's default algorithms) and then integrate using the two methods for another t = 6 simulation-minutes.
The domain is global and covers a radial range from 1 R to 30 R . It is discretized into a logically-rectangular nonuniform spherical grid (r, θ, φ). The grid (summarized in Table 1) is nearly uniform in φ, coarsens slightly towards the pole in θ, and is highly non-uniform in r (in order to better resolve the transition from the upper chromosphere to the corona). The time-step changes over the runs to conform to the flow CFL condition and during the 8-simulation-hour  Portions within a r-φ cut at θ = π are shown for ρ, v r , and T .
validation simulation, ranged from ∼ 2.5 seconds to ∼ 3.4 seconds with an average step-size of ∼ 2.7 seconds. In Fig. 5, we show some plasma values within an r-φ slice at θ = π/2 for the initial condition and the final relaxed corona at t = 48 simulation-hours.

HPC environment
To test the parallel scaling and performance of the methods, we run the simulations on stateof-the-art HPC systems. In order not to limit results to one system, we use both the Comet machine at the San Diego Supercomputer Center (SDSC) and the Stampede machine at the Texas Advanced Computing Center (TACC). The hardware and software configuration used for each system are shown in Table 2. We note that although the systems are similar in design, Comet has newer hardware (Haswell AVX2-vectorization processors, more and faster RAM, etc) but on a standard allocation, Stampede allows runs to be performed with more than twice as many processors as Comet.

Validation
Since our focus is on comparing the two methods (BE+PCG and RKL2) to each other, we do not analyze the solution accuracy of the original BE+PCG method. As mentioned in Sec. 4.1, the MAS code has been used extensively and its accuracy validated [49]. Instead, we focus on comparing the two methods to each other by using a long simulation (t = 8 simulation-hours) and qualitatively comparing the results. In Fig. 6, we show r-θ cuts at φ ≈ 137 • of various field quantities for each run.
We see that the two methods produce very similar global results even with complicated structures present. However, as much of the grid is focused near the transition region at roughly r = 1.02 R , it is difficult to compare the solution in those regions when viewing the full domain. On closer inspection near the solar surface (Fig. 7), we once again see general agreement between   the two methods, but also see an alarming difference in some localized regions where the RKL2 method exhibits high wave-mode oscillations in some fields. A possible explanation of these oscillations is that they are exhibited due to the RKL2's inability to adequately damp high wave modes within one large time-step (as shown below). Due to the extremely difficult conditions in transition-region physics, the standard algorithms in MAS sometimes exhibit oscillatory behavior if the spatial structures are not fully resolved. In order to combat this effect, the artificial viscosity term is added to damp out such oscillations (more than the numerical viscosity due to the algorithm is capable), and smooth out the solution. The true viscosity of the solar corona is so small, this term would not have been included otherwise. In contrast to this, the thermal conduction term is an essential part of the physics of the model. original BE+PCG for viscosity), the solution becomes visually identical to the BE+PCG case as shown in Fig. 7.
To gain some insight into why the RKL2 method does not damp out the high wave modes, we first note that even though the amplification factor of a convergent scheme is less than 1 (i.e. it is stable and diffusive), it is possible that it is too large to properly damp out a growing wave mode. In Fig. 8, we show the amplification factors for the BE and RKL2 methods for the one-dimensional case of Eq. (1) with a constant diffusion coefficient discretized on a uniform grid with spacing ∆x. The factors are derived from the analytic eigenvalues of the system and are shown for integrating at a time-step 1/5, 5, 50, and 500 times the explicit Euler limit ∆t Euler (in our simulations, viscosity is being integrated at over 700× the limit in some portions of the grid, and thermal conduction over 10000×). The amplification factor for the exact solution of the semi-discrete system and for the explicit Euler method (in the case where it is stable) are shown for comparison.  Figure 8. Amplification factors for the BE and RKL2 methods for various values of ∆t/∆t Euler . The exact discrete solution is shown for comparison, and the explicit Euler method is shown for the case where it is stable.
We first notice that for small wave modes, the RKL2 is much more accurate than BE, which is not surprising as it is a second-order accurate method. We also see that when exceeding the Euler limit by a large factor, the amplification factors for both BE and RKL2 are much larger than the true solution, and the RKL2's amplification factor is much larger than the BE's. Also, the RKL2's amplification factor does not tend towards 0 as k∆x → π implying that the RKL2 method is generally not L-stable, although it appears to be strongly A-stable [50] and is positive preserving. The large amplification factor of the RKL2 for high modes will not damp high-frequency solutions efficiently within large time-step because modes which should be virtually eliminated according to Eq. (1) will only be damped by a factor of ∼ 0.5. However, with successive application of the method, the wave modes will eventually damp. Therefore, if Eq. (1) were integrated by itself, the RKL2 would be convergent (given enough steps and total simulation time), but when coupled to a larger multi-physics model, the modes may not be damped. We note that BE is not exempt from this issue as its amplification factor is many times larger than the true solution, but since its amplification factor tends towards 0 for high modes, and is generally small, it is expected to be more robust than RKL2.
To support this conclusion, we have tested sub-cycling the RKL2 method within each large time-step, and find that it yields a solution visually identical to the BE method given enough sub-cycles (∼ 10 in our case). We also found that sub-cycling explicit Euler steps for ∼ 1/4 of the time-step and completing the remainder of the step with RKL2 also yielded correct results. This is understandable as the explicit Euler method damps the high modes very efficiently such that running it for ∼ 1/4 of a time-step damps high modes more than even a full step with BE. These considerations may lead to new scheme strategies that combine methods in order to get both an efficient and overall correct solution.
This analysis indicates that the RKL2 (and other STS methods) may be inadequate for use in multi-physics codes when the parabolic operator is being used either for smoothing oscillations and/or shocks (like artificial viscosity) and perhaps even if they are being used in a balanced equation where the diffusion must counter high-mode growth due to reaction or other driving terms.
The amplification factors shown in Fig. 8 for RKL2 are qualitatively different than those shown in Ref. [8] and Ref. [9]. This is because the factors shown there are for the RKL methods where the number of sub-steps s is chosen first, and the time-step factor is then set exactly from the equation for s (such as Eq. (5)). In such an exact case, the RKL2's amplification factor does tend towards zero for very high modes, but still remains near ∼ 1/2 for a large portion of the frequency spectra. In a real-world case where the time-step is set by other factors, and then the minimum stable s is computed (the suggested procedure given in Ref. [9]), the amplification factors become like those shown in Fig. 8. Even one extra s step over the exact amount needed causes the qualitative change. Interestingly, using any extra number of sub-steps for the RKL1 scheme eliminates the problem shown in Ref. [9] of it approaching magnitude 1 at k∆x = π, possibly making that scheme more useful than otherwise assumed (although it is not positive preserving).
For uniform grid methods, if one could compute ∆t Euler exactly, one could modify the timestep of the system to exactly follow Eq. (5) for the nearest whole value of s and potentially mitigate this problem. However, if computing ∆t Euler exactly is not efficiently possible (which is often the case) and/or one is using non-uniform grids, the number of sub-steps s will always be more than exact, at least somewhere on the grid.
Even though our solutions exhibit the oscillations shown in Fig. 7, since they are limited to a few localized regions and do not seem to effect the rest of the global solution (as shown in Fig. 6), we still report the timing results for the runs as a comparison of the scaling of the methods. Since our main concern in the timings is parallel scaling, we note that any modifications along the lines described above to help the RKL2 method damp the high modes (e.g. sub-cycling Euler steps) may increase the total compute time, but should not effect the scaling, as those steps have the same communication profile as the RKL2 method itself. Also, since using the RKL2 for thermal conduction alone produces a consistent solution, the timing results for thermal conduction have increased relevance.

Timing Results
To time the simulations, we use calls to the MPI library's timing API routines. It is well known that massive parallel codes can have a very wide range of run times when using many processors due to fluctuations of system load and communication occupancy. In order to extract the scaling out of the methods in as an ideal situation as possible, we use multiple runs (between 4 and 6) of each processor count and take the best result. In addition to recording the total time for each portion of the run tested (thermal conduction and viscosity), we also record the 'non-compute' time for each portion which we define as the time the algorithms are involved in inter-processor communication (including buffer loading and unloading, data transfer, and synchronization/waiting).
Since Comet and Stampede each have different number of processor cores per node, the domain decomposition is unique for runs on each system, along with the number of total processors for each run. On Comet, the runs are performed using 1 node up to the maximum of 72 nodes, while on Stampede, due to the smaller amount of system memory per node, the simulations could not be run on fewer than 4 nodes. The domain is split up along processors in each direction yielding 3D block sub-domains. Since the grid size in each dimension will typically not divide evenly into the processor count along that dimension, some cores will have one extra grid point along some dimensions than others. In the worst case, cores may have one extra grid point along all three dimensions. For a fixed problem size, this causes an intrinsic load imbalance that increases as the number of total processors increases. In Fig. 9 we show the processor topologies used on each system and the maximum grid-based load imbalance. An additional possible source of load imbalance comes from cores containing sub-domains along  physical boundaries, which have more or less work to perform based on the boundary conditions. For example, in the case of the viscosity solver, the cores containing sub-domains along the polar boundaries require calculating a polar average involving communication/synchronization with all other cores conditioning the polar boundaries.
The timing results for thermal conduction and viscosity are shown in Fig. 11. Since the BE+PCG simulations with PC1 were found to be much slower than either BE+PCG PC2 or RKL2 during the Comet runs, they were not repeated on Stampede.
We first observe that all methods scale well up to 512 processors. Higher then that, we see that the BE+PCG method starts to loose scaling, and in some cases, the run time starts to increase with larger number of cores. The BE+PCG PC1 seems to scale a bit better than PC2, but is overall a much slower method. Therefore, if the RKL2 method is within the desired accuracy, in this case it is shown to perform far better than BE+PCG PC1 (which is still a commonly used method). The RKL2 is shown to scale better than the BE+PCG PC2 method in all cases. For thermal conduction, the run time for RKL2 is very similar to the BE+PCG PC2 up to the point that the scaling starts to diminish. This is so even though it is using many more sub-steps than the number of iterations of BE+PCG PC2 (see Fig. 10). This shows that one sub-step of the RKL2 method is much faster than one iteration of the PCG PC2.
Using the maximum number of processors on a standard allocation, the thermal conduction using RKL2 exhibited a speedup of over 2.5X for Comet and 13.7X for Stampede compared to the BE+PCG PC2 method. For viscosity, the RKL2 is at least twice as fast as BE+PCG PC2 for all processor counts, and due to its better scaling, reaches a maximum speed increase over BE+PCG PC2 of ≈ 3X on Comet and over 6.5X on Stampede.
Another interesting observation is that it appears that the thermal conduction operator does not scale as well as the viscosity operator when using BE+PCG. One possible reason for this is that the number of iterations required for the PCG solver increases as the processor count increases (due to the degradation of the PC2 -see Sec. 2.3) much more so for thermal conduction than it does for viscosity.  It is informative to explore the communication (or 'non-compute') time of the runs, which we display (per iteration/sub-step) in Fig. 12. We have separately recorded the non-compute time for both local (peer-topeer) and global communications. We see that the peer-to-peer non-compute time for the RKL2 method is somewhat higher than for the PCG methods (which is unexpected as they both use the same matrix-multiply routines). As the processor count increases, this non-compute time decreases (due to the reduction in the number of transferred data points) as expected. However, for the PCG methods, the global non-compute time (which is always larger than the point-to-point time) stays relatively constant for a while, but then starts to increase as the processor count gets high. For Stampede, this effect is the greatest. Since at the time these runs were performed, Stampede was much more oversubscribed then comet, the network congestion could be a cause of this drastic increase. In either case, we see that the RKL2's lack of global communications is key to its improved scaling. Through testing using MPI barriers, we have found that it is not the AllReduce MPI routine itself in the dot products that is the main cause of the PCG's lack of scaling, but rather it is the routine's global synchronization acting as a barrier, causing all imbalances encountered before the routine to 'pile up'. As the amount of computational work per processor decreases, the effect of these imbalances increases.
Finally, although our focus is on the timing of the thermal conduction and viscosity operators, in Fig. 13 we show the total wall time of the MAS code for the Stampede runs of Fig. 11 and what percentage of the total run time each part of the MAS code contributed to (based on the validation runs of Sec. 6). It is interesting to note that although the parabolic operators used to test the methods only accounted for ∼ 30% of the code's run time when using BE-PCG PC2, the effect of using RKL2 seems to improve scaling to the extent that the overall code exhibits a speedup of over 3X for 4096 cores on Stampede. Why this is the case is still unclear, but may have to do with easing the load on the infiniband buffers, and/or allowing the load imbalance to equalize/catch-up between operators.

Conclusions
In this paper we have compared a representative algorithm from two classes of methods for integrating parabolic operators past the standard explicit time-step limits in the context of thermodynamic MHD simulations of the solar corona. The first method is an implicit scheme solved with a Krylov solver: the backward-Euler scheme with the preconditioned conjugate gradient (BE+PCG) with either a diagonal (PC1) or incomplete LU (PC2) preconditioner. The other is a super time-stepping method: the second-order Runge-Kutta Legendre (RKL2) scheme. The methods were described in detail and tested on a production-level coronal relaxation simulation of the described MAS code. Special attention was given to the performance and scaling of the methods.
We found for our specific case, that the RKL2 scales much better for high core counts than the BE+PCG methods, and matches or exceeds the performance for low numbers of cores. For integrating viscosity, the RKL2 method performed over twice as fast as the BE+PCG. In other tests (not reported here), the speedup of the RKL2 method was even larger (over 3X). The global synchronization of the dot products in the BE+PCG method were found to be the most likely cause of the poor scaling at high core counts, since any load imbalance would be bottle-necked there.
Although the RKL2 method has superior performance and scaling, we found that, when used for viscosity, it exhibited slowly-growing grid-level oscillations in a few localized portions of the domain. An analysis of the methods showed that these oscillations (generated by other parts of the model due to the presence of unresolved structures) are not adequately damped by the RKL2 method due to its high amplification factor for large wave modes. Some methods of avoiding these errors were discussed, but each would slow down the algorithm. Using RKL2 for only the thermal conduction operator yielded correct results, implying that its use may be limited to specific use-cases. Our analysis also showed that the BE+PCG method, while not immune to the same problem, is much more robust due to it being L-stable.
Future work will include searching for ways to modify the RKL methods to alleviate the damping issues discussed above. As for the Krylov solvers, investigation into new non-blocking PCG methods has the potential to improve their scaling. Another interesting direction would be to formulate hybrid schemes, which combine the advantages of the multiple methodologies. Research into the application of STS-type methods to wave equations is of particular interest, as it could yield a scalable alternative to semi-implicit operators.