Large-eddy simulations with ClimateMachine v0.2.0: a new open-source code for atmospheric simulations on GPUs and CPUs

. We introduce ClimateMachine, a new open-source atmosphere modeling framework using the Julia language, designed to be scalable on central processing units (CPUs) and graphics processing units (GPUs). ClimateMachine uses a common framework both for coarser-resolution global simulations and for high-resolution, limited-area large-eddy simulations (LES). Here, we demonstrate the LES conﬁguration of the atmosphere model in canonical benchmark cases and atmospheric ﬂows, using a total energy-conserving nodal discontinuous-Galerkin (DG) discretization of the governing equations. Resolu-5 tion dependence, conservation characteristics, and scaling metrics are examined in comparison with existing LES codes. They demonstrate the utility of ClimateMachine as a modelling tool for limited-area LES ﬂow conﬁgurations


Introduction
Hybrid computer architectures and the need to exploit the power of graphics processing units (GPUs) are increasingly driving developments in atmosphere and climate modeling (e.g., Schalkwijk et al., 2012;Palmer, 2014;Schalkwijk et al., 2015;Marras et al., 2015;Abdi et al., 2017b, a;Fuhrer et al., 2018;Schär et al., 2020).The sheer computing power available on modern hardware architectures presents opportunities to accelerate atmosphere and climate modeling.However, exploiting this computing power requires re-coding atmosphere and climate models to an extent not seen in decades, and portable performance and scaling across different platforms remain difficult to achieve (Fuhrer et al., 2014;Balaji, 2021).
In this paper, we introduce ClimateMachine, a new open-source atmosphere model written in the Julia programming language (Bezanson et al., 2017) to provide a computational framework that is portable across CPU and GPU architectures.
Additionally, the model is designed to be usable across a range of physical process scales, from large-eddy simulations (LES) with meter-scale resolution to global circulation models (GCM) with horizontal resolutions of tens of kilometers, as in a few other recent models (Dipankar et al., 2015).The use of Julia aims to increase accessibility and utility of ClimateMachine as a simulation tool.We focus on the LES configuration of ClimateMachine in this paper.
One distinguishing aspect of the ClimateMachine LES is that it uses a nodal discontinuous Galerkin (DG) formulation to approximate the Navier-Stokes equations for compressible flow (Giraldo et al., 2002;Hesthaven and Warburton, 2008a;Giraldo and Restelli, 2008;Kopriva, 2009;Kelly and Giraldo, 2012;Giraldo, 2020).The DG method is a spectral-element generalization of finite-volume methods.It lends itself well to modern high-performance computing architectures because its communication overhead is low, enabling scaling on manycore processors including GPUs (Abdi et al., 2017b).Another important consideration within ClimateMachine is the use of total energy of moist air as a prognostic variable, ensuring energetic consistency of the simulations.We demonstrate that the ClimateMachine LES can be successfully used to simulate canonical LES benchmarks, including simulations of flows over mountains and different cloud and boundary-layer regimes (e.g., Straka et al., 1993;Schär et al., 2002;Stevens et al., 2005).
In what follows, we describe the conceptual and numerical foundations and governing equations of ClimateMachine and demonstrate the model in a set of standard two-and three-dimensional benchmark simulations.Section 2 begins by highlighting the governing equations.Their numerical approximation through the DG representation is described in Section 3. Section 4 presents sub-grid scale models used in the LES to represent under-resolved flow physics, with results from key benchmarks presented in Section 5. Conservation properties are examined in Section 6, and performance on CPU and GPU hardware is described in Sections 7 and 8, respectively.Section 9 contains closing remarks.Additional details about the model, boundary conditions, statistical definitions, and computer hardware are summarized in the appendices.

Working fluid
The working fluid of the atmosphere model is moist, potentially cloudy air, considered to be an ideal mixture of dry air, water vapor, and condensed water (liquid and ice) in clouds.Dry air and water vapor are taken to be ideal gases.The specific volume of the cloud condensate is neglected relative to that of the gas phases (it is a factor 10 3 less than that of the gas phases).All gas phases are assumed to have the same temperature, and are advected with the same velocity u = (u, v, w) T .Cloud condensate is assumed to sediment relative to gaseous phases slowly enough to be in thermal equilibrium with the surrounding fluid.
The density of the moist air is denoted by ρ.We use the following notation for the mass fractions of the moist air mixture (mass of a constituent divided by the total mass of the working fluid): q d : dry air mass fraction, q v : water vapor specific humidity, q l : liquid water specific humidity, q i : ice specific humidity, q c = q l + q i : condensate specific humidity, q t = q v + q c : total specific humidity.
Because this enumerates all constituents of the working fluid, we have q t + q d = 1.In Earth's atmosphere, the water vapor specific humidity q v dominates the total specific humidity q t and is usually O(10 −2 ) or smaller; the condensate specific humidity is typically O(10 −4 ).Hence, water is a trace constituent of the atmosphere, and only a small fraction of atmospheric water is in condensed phases.The working fluid pressure is the sum of the partial pressures of dry air and water vapor such , where R d is the specific gas constant of dry air, and R v is the specific gas constant of water vapor.

Mass balance
Moist air mass satisfies the conservation equation Moist air mass is not exactly conserved where precipitation forms, sublimates, or evaporates, where water diffuses, or where condensate sediments relative to the gas phases (Bott, 2008;Romps, 2008).The right-hand side involves the local source/sink of water mass Ŝqt owing to such non-conservative processes, which we take into account although it is small because water is a trace constituent of the atmosphere.

Total water balance
Total water satisfies the balance equation Here, the source/sink S qt arises from evaporation or sublimation of precipitation and formation of precipitation.Diffusive fluxes of moisture are captured by d qt .The effective sedimentation velocity of cloud condensate w c is defined such that q c w c = q l w l + q i w i (3) with w l and w i defined to be positive downward ( k being the upward pointing unit vector).The right-hand side ρ Ŝqt of the total water balance equation is the same as the right-hand side of the mass balance equation (1).

Momentum balance
The coordinate independent form of the conservation law for momentum is where I 3 is the rank-3 identity matrix, Φ is the effective gravitational potential including centrifugal accelerations, τ is a viscous and/or subgrid-scale (SGS) momentum flux tensor; and F u (typically with F u • u < 0, so that F u represents a momentum sink) is any other drag force per unit mass that may be applied, for example, at the lower boundary.The term involving the planetary angular velocity Ω accounts for Coriolis forces.To improve numerical stability, we have factored out a reference state with a pressure p r (z) and density ρ r (z) that depend only on altitude z and are in hydrostatic balance, so that they satisfy The tensor involving the diffusive flux d qt of water on the right-hand side of (4) represents the momentum flux carried by water that is diffusing; this term is usually very small, but we take it into account.

Energy balance
The specification of a thermodynamic or energy conservation equation closes the equations of motion for the working fluid.
We use the total specific energy, e tot , as the prognostic variable.Total energy is conserved in reversible moist processes such as phase transitions of water.
Total energy satisfies the conservation law (Romps, 2008;Bott, 2008) where the total specific energy e tot is defined by The constituents (dry air and moisture components) here are assumed to be moving with the same velocity u (that is, we neglect, as is common, the diffusive and sedimentation fluxes of water in the kinetic energy).The constituents are also assumed to be in thermal equilibrium at the same temperature T , so that the specific internal energy of moist air is the weighted sum of the specific energies of the constituents: dry air (I d ), water vapor (I v ), liquid water (I l ), and ice (I i ): Here, c vk for k ∈ {d, v, l, i} are isochoric specific heat capacities for the appropriate species denoted by k; they are taken to be constant.The reference specific internal energy I v,0 is the difference in specific internal energy between vapor and liquid at the arbitrary reference temperature T 0 ; I i,0 is the difference in specific internal energy between ice and liquid at T 0 (Romps, 2008).
The reference internal energies are related to specific latent heats of vaporization and fusion, L v,0 and L f,0 , at the reference temperature T 0 through The values of the thermodynamic constants we use are listed in Table 1.
Furthermore, the flux F R is the radiative energy flux per unit mass; J is the conductive energy flux per unit mass, and D is the specific enthalpy flux associated with the diffusive flux of water The flux u • ρτ is the energy flux associated with the viscous and/or SGS turbulent momentum flux; and Q is any internal energy source (e.g., external diabatic heating).The flux represents the downward energy flux due to sedimenting condensate.
The terms involving ρC(q j → q p ) (j ∈ {v, l, i}) represent the loss of internal and potential energy of moist air masses owing to precipitation formation; the kinetic energy loss is neglected, consistent with the neglect of the source/sink associated with precipitation formation in the momentum balance (4).Additional energy sinks involve the energy loss owing to heat transfer from the working fluid to precipitation as it falls through air and possibly melts at the freezing level (Raymond, 2013); the associated energy sources/sinks are generally provided by a microphysics parameterization and are subsumed in the term M .

Equation of state
Pressure p is calculated from the ideal-gas law where R m is the gas "constant" of moist air, with the ratio of the gas constants of water vapor and of dry air

Saturation adjustment
Gibbs' phase rule states that in thermodynamic equilibrium, the temperature T and liquid and ice specific humidities q l and q i can be obtained from the three thermodynamic state variables density ρ, total water specific humidity q t , and internal energy I. Thus, the above equations suffice to completely specify the thermodynamic state of the working fluid, given ρ, q t , and I, the latter obtained from the total energy via its definition (6).
Obtaining the temperature and condensate specific humidities from the state variables ρ, q t , and I is the problem of finding the root T of where I * (T ; ρ, q t ) is the internal energy at equilibrium, when the air is either unsaturated and there is no condensate (q v = q t ), or water vapor is in saturation and the saturation excess q t −q v is apportioned, according to temperature T , among the condensed phases q l and q i .We solve this nonlinear "saturation adjustment" problem by Newton iterations with analytical gradients (cf.Tao et al., 1989;Pressel et al., 2015).To obtain the saturation vapor pressure and derived functions needed in this calculation, we assume all isochoric heat capacities to be constant (i.e., we assume the gases to be calorically perfect); with this assumption, the Clausius-Clapeyron can be integrated analytically, resulting in a closed-form expression for the saturation vapor pressure (Romps, 2008).
This procedure allows the use of total moisture q t as the sole prognostic variable, but confines the system to the assumption of equilibrium thermodynamics.Alternatively, using explicit tracers for the condensate specific humidities q l and q i allows non-equilibrium thermodynamics to be considered and mixed-phase processes to be explicitly modeled.
3 Discretization of the governing equations

Space discretization
The governing equations are discretized in space via a nodal DG approximation.To describe the DG procedure, we recast the equations ( 1)-( 5) in compact notation as where Y = [ρ, ρu, ρe tot , ρq t ] T is an abstract vector of state variables; F 1 contains the fluxes not involving gradients of state variables and functions thereof; F 2 contains the fluxes involving gradients of state variables (e.g., diffusive fluxes); and S(Y ) contains the sources.
The DG solution of ( 16) is approximated on the finite-dimensional counterpart Ω h of the flow domain Ω, which consists of N Ωe non-overlapping hexahedral elements Ω e such that where a superscript h indicates the discrete analog of a continuous quantity.By virtue of tensor-product operations allowed on hexahedral elements and the ability to rely on inexact quadrature when elements of order greater than 3 are utilized, high-order Galerkin methods are particularly attractive for operation intensive solutions (Kelly and Giraldo, 2012).Within each element, the finite dimensional approximation of Y (x, t) is given by the expansion where (N + 1) 3 is the number of collocation points within the three-dimensional element of order N , and ψ e α are the interpolation polynomials evaluated at local point α inside element e.
From now on, the subscript/superscript e is omitted with the understanding that all operations are executed element-wise unless otherwise stated.Furthermore, the physical elements in the x = (x, y, z) space are mapped to a reference element ξ = (ξ, η, ζ).The three-dimensional basis functions ψ α result from the one-dimensional functions L i (ξ), L j (η), and L k (ζ) as the tensor product: Each function L is a one-dimensional (1D) Lagrange polynomial defined on the 1D reference element [−1, 1].The Lagrange function evaluated at points i along the ξ direction within the element is where ξ i are the N + 1 co-located interpolation points along ξ.The polynomials L j and L k in the two other directions η and ζ are built in the same way.The N + 1 interpolation points may be chosen in variety of ways (Deville et al., 2002;Karniadakis and Sherwin, 1999); here we choose Legendre-Gauss-Lobatto (LGL) points (Giraldo and Restelli, 2008).The Kronecker δ property of the Lagrange polynomials is such that This allows us to reduce the operation count, as follows.
We construct the space and time derivatives as By virtue of the 3D Kronecker δ property, the spatial derivatives of the basis functions appearing here are given by Using this property reduces the operation count significantly since we only require 3N operations instead of the N 3 operations otherwise needed to compute the derivatives at a given node (Abdi et al., 2017b).
The operators defined on the reference elements are mapped onto the physical space by means of the transformation where is the inverse Jacobian of the transformation from physical space to the reference element.
The DG approximation of the differential equations ( 16) is constructed by multiplying, within each element, the equation by the test function ψ α and then integrating over the element volume Ω e , such that where ψ α within each element belongs to the function space of square integrable piecewise polynomials of order N (i.e., ψ ∈ L 2 ).By definition, these functions are discontinuous across element boundaries; differentiability is not globally required but only within each element (Hesthaven and Warburton, 2008b).Integrating the divergence term by parts yields where Ω e and Γ e are, respectively, the volume and boundary of each element, n is the outward facing unit vector orthogonal to each element face, and F * 1 is a numerical flux.The imposition of the numerical fluxes across element boundaries is the numerical mechanism that promotes continuity of the discontinuous solution across the elements.The numerical fluxes are calculated as the approximate solution to a Riemann problem across two neighboring elements.ClimateMachine currently implements the Rusanov (1961), Roe (1981), and Harten-Lax-van Leer-Contact (HLLC) (E.F. Toro et al., 1994;Harten, 1983) numerical fluxes.The Rusanov flux, for instance, is constructed as where Y − is the state at the internal interface of element e, Y + is the state at the external interface of e, and λ Γe (Y − , Y + ) is an estimate of the maximum flow speed (e.g., the maximum eigenvalue of the Jacobian of the flux F 1 with respect to the state variables, which is the speed of sound).
Because the second-order derivatives in ∇ • F 2 cannot be directly built with the weak variational formulation if a discontinuous function space is used (Bassi and Rebay, 1997), an auxiliary variable Y is introduced such that which can then be discretized via DG as Here, Y * is approximated via centered flux like in Bassi and Rebay (1997).We also refer to Abdi et al. (2017b) for more details.
For algorithmic efficiency, inexact quadrature is used to calculate the integrals above.By virtue of inexact integration and of equations ( 17), ( 19), and ( 24), the variational DG equations yield the semi-discrete matrix problem where w s i are interpolation weights.The algebraic details to obtain this expression can be found in Giraldo and Restelli (2008), where the s superscript indicates a value that is defined on the element boundary surface.The system (32) is integrated on each element with respect to time.
In order to achieve good parallel scaling it is necessary to overlap communication and computation to the fullest extent possible.With DG (and all element-based Galerkin methods) this can be naturally achieved by splitting Eq. ( 32) into terms that arise from the approximation of volume integrals and surface integrals.All volume contributions can be calculated independently of element-to-element communication regardless of the order of the spatial approximation, as can surface integrals that are not on elements which share boundaries across MPI ranks.Thus, in the code, we start with message passing interface (MPI) communication, do all volume calculations and surface calculations for elements not on boundaries shared across ranks, and then apply surface calculations for elements on the rank boundaries after communication operations have been completed.This approach makes DG naturally effective with respect to parallel computing as previously shown by, e.g., Müller et al. (2018) on CPUs, Abdi et al. (2017b) on GPUs.At high order, element-based Galerkin methods such as DG require fewer neighboring degrees of freedoms than high-order finite difference and finite volume discretizations.
The benchmarks presented in this paper with isotropic grid spacing are run using the 4th-order 14-stage method of Niegemann et al. ( 2012), which has a large explicit time-stepping stability region.One of the benchmarks, however, uses a highly anisotropic grid, which benefits from the use of a 1-D IMEX approximation; there we use a variant of the horizontally explicit, vertically implicit (HEVI) schemes by Bao et al. (2015).While 3-D IMEX is also an option, its performance in terms of time-to-solution is ultimately limited by the availability of scalable 3-D implicit solver algorithms.
The governing equations are resolved with the discretizations presented in Section 3.This leaves unresolved but dynamically significant scales on the computational grid that must be modeled with three-dimensional SGS models.In general, SGS fluxes are modeled as diffusive fluxes, which capture down-gradient transport of conservable scalar quantities assuming that mixing lengths are small compared with the scales over which the gradients of the scalars vary.We address the physical form of the diffusive flux components in equations ( 1)-( 5), following which we describe standard models of subgrid-scale turbulence available for use in ClimateMachine.
The diffusive momentum flux tensor τ is represented in terms of the symmetric rate of strain tensor S such that with Here, ν t is a turbulent viscosity tensor whose components are typically orders of magnitude larger than the molecular viscosity and are a function of the velocity gradient tensor.
The diffusive flux d qt of total water specific humidity in equation 2 is modeled as where D t is a turbulent diffusivity vector.The turbulent diffusivity D t is related to the turbulent viscosity tensor ν t via the turbulent Prandtl number such that where Pr t takes a typical value of Pr t = 1/3.
The unresolved flux of total enthalpy h tot results in a diffusive subgrid flux term of the form where J is the thermal diffusion flux analogous to the molecular conductive heat flux, and D is the energy flux carried by water vapor, defined in equation 11.For energetic consistency, we use the same turbulent diffusivity D t for moist enthalpy and water.

Smagorinsky-Lilly model
The turbulent eddy viscosity ν t in the model by Smagorinsky (1963) and Lilly (1962) (SL henceforth) is defined by means of the magnitude of the rate of the strain tensor S, whose components are S ij = 1 2 ∂ui ∂xj + ∂uj ∂xi , according to for i, j = 1, 2, 3; C s is a constant Smagorinsky coefficient usually within the range 0.12 < C s < 0.21; and ∆ is the LES filterwidth.Inside each hexahedral element of order N and side lengths L x,y,z along the x, y, z directions, the effective grid resolution is ∆(x, y, z) = L x,y,z /N , which is the average distance between two consecutive nodal points.We use an isotropic eddy viscosity tensor ν t in LES, with its components defined by equation (38).

Vreman eddy viscosity model
The SGS model developed by Vreman (2004) is of interest because of its robustness across flow regimes and because it has low dissipation near wall boundaries and in transitional flows.Its computational complexity is similar to the classical SL model.
While the Vreman model is extensively used in engineering LES, it is uncommon in atmospheric flows, where a constant coefficient SL or the 1-equation TKE model by Deardorff (1970Deardorff ( , 1980) ) are the most common choices (see, e.g., Stevens et al. ( The turbulent eddy viscosity of this model depends on first-order derivatives of velocities and is given by where Here, summation over repeated indices i, j ∈ {1, 2, 3} is implied, C s is the constant Smagorinsky coefficient, and u i represent the components of the resolved-scale velocity vector, so that u i,j is the velocity gradient tensor.The mixing lengths ∆ m can be determined as the grid spacing in the direction implied by subscript m; in this paper ∆ m ≡ L x,y,z /N .
Richardson correction in stable regions of the atmosphere To account for the atmospheric stability and, effectively, reduce turbulence generation to zero in stably stratified atmospheres, we multiply the eddy viscosity by the correction factor where Here, Pr t = 1/3 is a constant turbulent Prandtl number, and θ v is the virtual potential temperature for a given specific humidity q t and specific liquid-water content q l (see, e.g., Deardorff (1980)).

Numerical stability
When high-order Galerkin methods are used to solve non-linear advection dominated problems, spurious Gibbs oscillations affect the solution and need to be addressed.ClimateMachine provides a set of spectral filters, cut-off filters, and artificial diffusion methods to remove these oscillations.While filters may be effective, we found that stabilizing the LES solution by means of the SGS eddy viscosity alone is effective and robust; this is in agreement with results shown by Marras et al. (2015) and Reddy et al. (2021) in the case of continuous Galerkin methods.This approach stems from the idea that the unresolved scales are responsible for the numerical oscillations of numerical solutions.Detailed analyses of the interactions of subgridscale models and filtering techniques with DG numerics in the context of atmospheric flows will be presented in a forthcoming paper.

Numerical experiments and discussions
ClimateMachine is tested against the following set of standard benchmarks: (1) dry rising thermal bubble in a neutrally stratified atmosphere; (2) dry density current; (3) hydrostatic and non-hydrostatic mountain-triggered linear gravity waves; (4) the Barbados Oceanographic and Meteorological Experiment (BOMEX); and (5) decaying Taylor-Green vortex in a triply periodic domain.
All tests are executed in a 3D domain even when the problem is effectively two dimensional, with effectively zero tendencies in the third dimension; this setup is identified as 2.5D in what follows.
5.1 2.5D Rising thermal bubble in a neutrally stratified atmosphere A neutrally stratified atmosphere with uniform background potential temperature θ 0 = 300 K is perturbed by a circular bubble of warmer air.The hydrostatic background pressure decreases with z as The perturbation is as defined in Ahmad and Lindeman (2007), where  (Boyd, 1996;Vandeven, 1991) was used by Giraldo and Restelli (2008).

2.5D Density current in a neutrally stratified atmosphere
The density current problem by Straka et al. (1993) is used to test the LES framework in a flow with Kelvin-Helmholtz instabilities.As for the rising thermal bubble, the background initial state is in hydrostatic equilibrium at uniform potential temperature θ 0 = 300 K.A perturbation of θ centered on (x c , z c ) = (0, 3000) m and with radii (r x , r z ) = (4000, 2000) m is given by the function  To reach solution grid convergence, this test is classically executed with a constant kinematic viscosity ν = 75 m 2 s −1 .Increasingly finer structures are resolved when the resolution increases (Marras et al., 2012(Marras et al., , 2015)).A measure of solution fidelity is the front position, which we compare against other models in Table A1 for different resolutions.The ClimateMachine results show quantitative agreement with respect to the frontal location from a range of models with varying spatial discretizations at resolutions ranging from 12.5 m to 100 m.This demonstrates the scope for capturing small-scale flow features with the numerics described in Section 3.
The structure of potential temperature at the final time t = 900 s is shown in Figure 3 for the Vreman and SL solutions.
When Rusanov is the chosen numerical flux (Figures 3a and 3b), the solutions are very similar although Vreman is visibly less dissipative for a prescribed value of the Smagorinsky coefficient C s = 0.18, with finer scales of motion apparent in the contours of potential temperature.Further quantitative analysis of the dissipative properties of the SGS models is reported in Section 5.5.Despite Vreman being less dissipative than SL, the Roe (Figure 3c) and HLLC (Figure 3d) fluxes contribute to additional numerical diffusion when compared to Rusanov fluxes.Detailed analysis of the interaction between numerical fluxes and subgrid-scale models will be presented in future articles.

Passive transport over warped grids
To verify the correct behavior of the DG implementation in the presence of topographic features, the simple passive advection test described by Schär et al. (2002) is used.The conservation law for the diffusive transport of a passive tracer χ is which is approximated via DG in the same way as Eq. ( 16).For a scalar tracer variable χ, we model diffusive fluxes d χ such that where δ χ relates the ratio of turbulent diffusivity of the tracer to that of the energy and moisture variables.For this test case, however, tracer diffusivity is set to zero to assess the stability and transport properties when using a warped grid.The volume grid in ClimateMachine is built by stacking elements above the surface and warping them around the terrain profile.To reduce the element distortion across the domain, a linear grid damping function is used such that a topography conforming surface of nodal points near the domain's bottom surface decays to a horizontal plane at higher altitudes (Gal-Chen and Somerville, 1975).
The initial scalar field χ is described by an elliptical perturbation centered on (x c , z c ) = (25, 9) km, with radii (r x , r z ) = (25, 3) km, such that An effective uniform grid resolution ∆x = ∆z = 500 m is used for this test.The initial velocity profile is given by where u 0 = 10 m s −1 , z 1 = 4 km, and z 2 = 5 km.
The topography is defined by the function where h 0 = 3 km, a = 25 km, λ = 8 km, and x 0 = 75 km.
The contours of χ in Figure 4 show minimal distortion in spite of the warped elements directly above the topographical feature, indicating that the DG transformation metrics from physical to logical space in the presence of topography do not adversely affect the solution.Deviation from the initial profile of tracer magnitudes lie between -4% and +2% when the tracer is above the topographical feature, with maximum deviation amplitudes of -5% and +3% at the end of the test, showing a favorable comparison with the hybrid and SLEVE coordinate results presented in Schär et al. (2002).

Mountain-triggered gravity waves
To assess the correct implementation of a Rayleigh sponge layer to attenuate fast, upward propagating gravity waves before they reach the top of the domain, two steady-state mountain-triggered gravity wave problems suggested by Smith (1980) are solved.The sponge layer is described in Appendix A2.These problems consist of a flow that moves eastward with uniform horizontal velocity u = (u, 0, 0) m s −1 in a doubly periodic domain.The flow impinges against a mountain of height h m and base length a centered at x c as The background state is in hydrostatic balance with Brunt-Väisälä frequency N , such that for a given surface potential temperature θ sfc = T sfc .The hydrostatically balanced pressure is which yields, by means of the ideal gas law, the background density These tests are affected by spurious oscillations that appear at approximately 5000 s into the simulation.In the absence of shear, because of the free-slip bottom and top boundaries, the SGS models are unable to introduce sufficient diffusion to remove the Gibbs modes so that an exponential filter (Hesthaven and Warburton, 2008b) of order 64 was applied on the velocity field to remove spurious modes.The filter assumes the form of where s is the filter order, η is a function of the polynomial order, and α = − log(ε M ) is a parameter that controls the smallest value of the filter function for machine precision ε M .In double precision, α ≈ 36.The filter in this form is applied to perturbations of the prognostic variables from the balanced background state.

Linear hydrostatic
The linear hydrostatic case proposed by Smith (1979) consists of a neutrally stratified isothermal atmosphere with θ = θ sfc = 250 K.The background atmosphere is isothermal with temperature T 0 , resulting in a Brunt-Väisälä frequency of The flow moves in a periodic channel along the x-direction with velocity u = (20 m s −1 , 0, 0) over a mountain with h m = 1 m and a = 10000 m.A Rayleigh absorbing layer is added at z s = 25 km with relaxation coefficient α = 0.5 s −1 , power γ = 2 and domain top z top = 30 km (see Appendix A2 for details).The domain extends from 0 to 240 km in the horizontal direction.
The steady-state solution at t = 15000 s is shown in Figure 5a.It is consistent with the DG results shown by Giraldo and Restelli (2008).

Linear non-hydrostatic
The linear non-hydrostatic mountain waves are forced by a flow of uniform horizontal velocity u = (10 m s −1 , 0, 0) over a mountain with h m = 1 m and a = 1000 m.The domain extends from 0 to 144 km in the horizontal direction and is 30 km high.The steady-state solution at t = 18000 s is shown in Figure 5b.It is consistent with results shown by Giraldo and Restelli (2008).

Decaying Taylor-Green Vortex
The decaying Taylor-Green vortex (TGV) is a classical test to estimate the dissipative properties of turbulence models in the absence of solid boundaries.The gravity-free flow is initialized in a triply periodic cube of dimensions [−π, π] 3 .The solenoidal initial velocity field u 0 = (u 0 , v 0 , w 0 ) is defined as with initial pressure where k is the wavenumber, U 0 = 100 m s −1 , ρ 0 = 1.178 kg m −3 , and p ∞ = 101325 Pa.Fourth-order polynomials are used for all simulations considered in this section.
We first consider the volume-averaged kinetic energy, which provides insight into the dissipation characteristics of the flow with respect to non-dimensionalized time t * = kU 0 t.In integral form, the kinetic energy can be written as: where • denotes a volumetric average over the volume Ω h .If the flow is inviscid, the kinetic energy should be conserved.This is only valid if the numerics or SGS models do not introduce numerical dissipation, or if all flow scales are well resolved.
As such, the time series of kinetic energy is a metric that shows the point along the simulation at which the solution becomes under-resolved.The kinetic energy dissipation rate is the second quantity of interest, which allows us to quantify the rate of decay of kinetic energy over time.This is defined as A third quantity of interest for this analysis is enstrophy, which is defined as the square of the vorticity norm: The enstrophy of a fully resolved flow should go to infinity if the flow is inviscid.Therefore, enstrophy can be used as a criterion to estimate the effect of numerical dissipation.
By means of a three-dimensional fast fourier transform (FFT) of the velocity field, the kinetic energy spectrum is calculated as: where K = 2π/L, L is the characteristic length, A is a three dimensional array of Fourier mode amplitudes, υ = k z /k and The TGV flow is simulated using both the SL and Vreman models on grids with 32 3 , 64 3 , 128 3 , and 192 3 points.Figure 6 shows a 3D visualization of the flow at two different non-dimensional times using zero Q-criterion isosurfaces, which identify balance between rotation and shear in the flow (Hunt et al., 1988).As the flow evolves, the flow generates smaller and smaller-scale vortices.Eventually, the flow becomes under-resolved, making it impossible to conserve kinetic energy.As the flow continues to evolve, an instability occurs, which causes the disintegration of the vortex sheet.After this point, the TGV's dynamics are controlled by the interaction of small-scale vortical structures formed by vortex stretching.
Results for the coarse-resolution simulations are presented in Figure 7, and those for the fine-resolution simulations are shown in Figure 8. Figure 7a shows that the kinetic energy changes with time for the 32 3 resolution simulations are distinctly different from their higher-resolution counterparts in Figure 8a.The severe under-resolution of the flow seems to generate much larger amounts of dissipation early on.This is further demonstrated when comparing Figures 7b and 8b.Beyond 64 3 resolution, the peaks in dissipation are larger as the resolution increases, since smaller vortices can be resolved before the instability eventually happens.The 32 3 simulations (Figure 7b) have larger peaks than even the 192 3 (Figure 8b) simulations, demonstrating that the under-resolution of the vortex structures leads to different, more dissipative early-flow behavior.
Figure 8b shows that the 128 3 and 192 3 simulations using both SGS closures are characterized by a peak in the kinetic energy dissipation at t * = 9.Brachet et al. (1983) and Brachet (1991) demonstrate this result for their direct numerical simulations (DNS) of the Taylor-Green vortex for a Reynolds number R e = U 0 /kν ≥ 3000.Examining Figure 7b, the 64 3 simulations suggest a dissipation peak time of t * = 8, and the 32 3 simulations suggest a dissipation peak time of t * = 6.The underprediction of the time at which the peak occurs is due to an inability to resolve the vortices that appear early on in the flow's evolution at extremely coarse resolutions.This leads to the early appearance of the instability which causes the dissipation peak.Furthermore, we see that the kinetic energy decay occurs sooner for the lower-resolution simulations, as a result of increased dissipation from the SGS models.
Figure 8c also shows that the flow's enstrophy behaves as expected, with peak values at t * = 9 coinciding with peaks in kinetic energy.The higher-resolution simulations are able to reach a higher enstrophy than the lower-resolution simulations as they are naturally able to resolve more vortical motion.On the other hand, the choice of SL or Vreman models seems to have very little impact on the ability to resolve more small-scale eddies for low resolutions.However, the SL SGS scheme leads to higher enstrophy than the Vreman SGS scheme, increasingly so as resolution increases.
Figure 9 shows the kinetic energy spectra obtained for the higher-resolution simulations used for this test.All simulations present peaks in their respective spectra at k = 2 to 4, which persist even as the flow evolves over time.These peaks are explained by Drikakis et al. (2007) as being imprinted on the spectra by the initial velocity field.Furthermore, as the flow evolves, all spectra show a slope close to the theoretical k −5/3 of homogeneous turbulence and eventually decay towards a k −3 slope at higher wavenumbers.This behavior is consistent with the DNS results of Brachet et al. (1983) who showed, using DNS, this transition occurs around k=60.

Barbados Oceanographic and Meteorological Experiment (BOMEX)
BOMEX features a shallow cumulus topped boundary layer as described in Holland and Rasmusson (1973).The setup of this test follows Siebesma et al. (2003).The initial profiles are characterized by a well-mixed sub-cloud layer below 500 m, a cumulus layer between 500 m and 1500 m, an inversion layer up to 2000 m, and a free troposphere above.Large-scale forcing includes prescribed large-scale subsidence, horizontal advective drying, radiative cooling, and Coriolis acceleration.Sensible and latent heat fluxes at the surface are prescribed to SHF = n•(ρJ sfc ) = 9.5 W m −2 and LHF = n•(ρD sfc ) = 147.2W m −2 .Additional detail on the application of boundary conditions is presented in Appendix A. The domain, Ω = 6400 × 6400 × 3000 m 3 , is doubly periodic in the x and y directions.A Rayleigh sponge layer (see Appendix A2 for details) is applied along Figure 10 shows the vertical profiles of the domain-mean thermodynamic and turbulence properties over the last hour of the simulations.Figure 11 shows the time series of liquid water paths (LWP), cloud cover, and turbulence kinetic energy.The time averaged results of the vertical profiles of θ l , q l , cloud fraction, and q v = q t − q l during the last hour are in good agreement with the same quantities presented by Siebesma et al. (2003).The SGS model does not have much effect on the simulation characteristics, except that the Vreman SGS model produces a stronger peak in the variance of vertical velocity, w w , near the cloud top.Although the difference is mild, it is possibly due to the low dissipation nature of the Vreman's model.Excluding the first hour of flow spin up, the results compare well with PyCLES (Pressel et al., 2015) and fall within the ensemble range shown in Figure 2 of Siebesma et al. (2003).Details on the computation of horizontally-averaged profiles can be found in Appendix B.
A large domain simulation of BOMEX with effective horizontal resolution ∆x = ∆y = 50 m and vertical resolution ∆z = 20 m in a 30 × 30 × 3.6 km 3 domain was executed using 16 GPUs on the Google Cloud Platform.The simulation was executed using 1D IMEX time integration (1D implicit in the vertical direction and 2D explicit in the horizontal direction) at maximum horizontal advective Courant number C = 0.9.A visualization of instantaneous shallow cumulus structures is shown in Figure 12.Siebesma et al. (2003).Line colors as in Figure 10.

Mass and energy conservation
We define the time dependent normalized total mass and energy changes as, respectively, and where t 0 indicates the initial time and Ω is the full domain.Figure 13 shows ∆M (t) and ∆ρe tot (t) for a 1 hour simulation of a moist rising thermal bubble.The SL eddy viscosity model was used to represent under-resolved diffusive fluxes.This simulation was run using free-slip boundary conditions with adiabatic walls.We note that the loss of energy and mass in the system is contained to O(10 −15 ), that is, numerical roundoff-error.This result highlights a key benefit of the general formulation of the prognostic conservation equations in flux form, which guarantees conservation properties up to source or sink contributions.

CPU strong-scaling
Demonstration of favorable scaling capabilities across multiple hardware types is critical to the utility of ClimateMachine as a competitive tool for large-eddy simulations.Toward this, we first examine strong scaling on CPU architectures.The rising thermal bubble problem described in Section 5.1 is used as the test problem, with its domain extents modified to form an 8.192 km 3 cube, with an effective nodal resolution of 32 m to ensure that CPU memory on a single rank is maximally loaded.
For tests with multiple MPI-ranks, each rank resides on a unique node, to ensure communication overhead is appropriately represented; in practice, one would expect to use multiple ranks per node.Scaling across multiple threads is not assessed in the present work.Figure 14 shows the speedup in time-to-solution for 10 time-integration steps of the test problem with N ranks ∈ {1, 2, 4, 8, 16, 32} for both dry and moist simulations.We exclude checkpoint, diagnostic and periodic run-time output steps from time-to-solution measurements.In both dry and moist simulations, we see a speedup of approximately 19.7 when using 32 ranks compared with the corresponding single-rank simulation.
A single rank GPU run of the test problem on a 6.144 km 3 domain with 32 m effective resolution (restricted by GPU memory capacity) has a wall-clock time for ten integration steps of 314 s.The wall-clock time for a 32-rank CPU run was 449 s for a 2.37 times larger problem.
This provides an estimate for a comparison between CPU and GPU hardware performance.However, the balance between memory bandwidth limits and compute operation limits guides the maximum scaling possible on the GPU hardware relative to its CPU counterpart, so this cannot be interpreted as a direct comparison across hardware types.Based on the present results, we conclude that it is more feasible to pursue strong-scaling improvements on CPU hardware than on GPU hardware.
Further optimization and exploration of scaling in ClimateMachine is ongoing work.Additional details on the hardware used for scaling tests can be found in Appendix C.

GPU weak-scaling
To test the multi-GPU scalability of ClimateMachine, we first execute a BOMEX setup that is sufficiently large to saturate one GPU.The single-GPU execution represents the baseline from which we calculate the average time per time-step denoted by t 1 .We then expand the domain size to match an increase in the number of GPUs and measure the average time per time step.Our scaling is then obtained as the ratio t 1 /t n , with t n being the average time per time-step obtained with n GPUs.The results are obtained using up to 16 NVIDIA Tesla V100 GPUs running Julia version 1.4.2,CUDA 10.0, and CUDA-aware OpenMPI 4.0.3.Figure 15 shows excellent weak scaling for up to 16 GPUs on Google Cloud Platform resources.Over 95% weak scaling was achieved with 1D-IMEX time integration, and over 98% for the simulation with explicit timestepping.This is an encouraging result and supports the ability to prototype smaller problem setups and deploy larger simulations in the ClimateMachine limited area configuration at identical resolutions without significantly compromising the time to solution.
This paper introduced and assessed the LES configuration of ClimateMachine, a new Julia language simulation framework designed for parallel CPU and GPU architectures.Notable features of this LES framework are: -Conservative flux form model equations for mass, momentum, total energy and total moisture to ensure global conservation of dynamical variables of interest (up to non-conservative source or sink processes) -Discontinuous-Galerkin discretization with element-wise evaluation of the approximations to volume and interface inte- 5000, 2000)  m, and θ c = 2 K.The initial velocity field is zero everywhere.Periodic boundary conditions are used along y, and solid walls with impenetrable, free-slip boundary conditions are used in the x and z directions.Detailed information on boundary conditions for all test cases is provided in Appendix A. Five runs are performed at effective uniform resolutions ∆x = ∆z = 250 m, 125 m, 62.5 m, and 31.25 m, and 15.625 m, with polynomial order N = 4. Potential temperature θ and the two velocity components u and w are plotted at t = 1000 s in Figure1for the grid resolution of 15.625 m, which represents a reference solution for comparison with solutions at coarser resolutions.The value of the maximum potential temperature perturbation ∆θ max and of the horizontal and vertical velocity components agree with the 125 m resolution results shown byAhmad and Lindeman (2007).The grid dependence of the solution is shown in Figure2, where potential temperature is plotted for ∆x = ∆z = 31.25 m, 62.5 m, 125 m, and 250 m.While the solution is visibly more dissipative at coarser resolutions, the bubble's leading edge position (and hence propagation speed) is not sensitive to the grid resolution.The SL and Vreman closures are used to model diffusive fluxes in this problem.The solutions show no discernible differences, and only the SL solution is shown.A visual comparison of the two becomes more meaningful when shear triggers mixing, which is shown for the density current test in Section 5.2.Although DG inherits an implicit numerical diffusion as an effect of the numerical flux calculation across elements, under-resolved advection-dominated problems still require a dissipation or filtering mechanism to preserve the solution's stability.In the case ofMarras et al. (2015), a dynamically adaptive SGS model was used whereas a Boyd-Vandeven filter Figure 1.2.5D rising thermal bubble with effective resolution ∆x = ∆z = 15.625 m and N = 4. Panels depict (a) potential temperature θ, (b) horizontal velocity u, and (c) vertical velocity w at t=1000 s in Ω = 10 × 10 km 2 .

Figure 2
Figure 2. 2.5D rising thermal bubble solution with N = 4 at decreasing effective resolution.Grid convergence of potential temperature θ at four different resolutions to be compared against the 15.625 m resolution results shown in Figure 1.From left to right: ∆x = ∆z = 31.25 m, 62.5 m, 125 m, and 250 m.Although the solution is visibly more dissipated at coarser resolution, the bubble's leading edge (and hence propagation speed) is not affected.

Figure 3 .
Figure 3. 2.5D density current.Potential temperature θ (K) at t = 900 s computed with an effective DG resolution of ∆x = ∆z = 12.5 m with domain extents shown in meters.(a) Rusanov numerical flux with SL SGS.(b) Rusanov numerical flux with Vreman SGS.(c) Roe numerical flux with Vreman SGS.(d) HLLC numerical flux with Vreman SGS.The color scale ranges from θ = 285 to 300 K. Shared colorbar for all plots shown in panel (a) for clarity.

Figure 4 .
Figure 4. Solution of the passive transport of scalar χ at three different time instances, with contours of tracer quantity χ (from 0, outermost contour, to 1, innermost contour) overlaid on a representation of nodal points on the underlying mesh.Tracer advection is driven by a prescribed velocity profile from left to right.As it crosses the deformed grid above the mountain ridge, only a minimal distortion of tracer contours is observed, which is completely recovered back to a smooth solution downwind of the ridge.

Figure 5 .
Figure 5. Vertical velocity w for two linear mountain wave tests.Velocity contours are shown in the range from −4 × 10 −3 m s −1 (blue) to 4×10 −3 m s −1 (red).(a) Hydrostatic test at t = 15000 s, with an effective horizontal resolution ∆x = 500 m and effective vertical resolution ∆z = 240 m with N = 4.(b) Non-hydrostatic test at t = 18000 s, with an effective horizontal resolution ∆x = 200 m and effective vertical resolution ∆z = 160 m with N = 4.

Figure 6 .
Figure 6.Taylor-Green Vortex.Isosurfaces of zero Q-criterion (these idenfity surfaces where the vorticity norm is identical to the strain-rate magnitude) on a 192 3 grid, corresponding to 48 elements of order 4 at t * = 4 (left) and t * = 60 (right).Plots are shown for the Vreman solution.The isosurfaces of the Q-criterion are colored by dimensionless kinetic energy.

Figure 7 .
Figure 7. Evolution of volumetrically averaged (a) kinetic energy represented on a semi-logarithmic scale, (b) kinetic energy dissipation, and (c) enstrophy, each computed on 32 3 and 64 3 grids with the Vreman and SL SGS models.

Figure 8 .
Figure 8. Evolution of volumetrically averaged (a) kinetic energy represented on a semi-logarithmic scale, (b) kinetic energy dissipation and (c) enstrophy, each computed on 128 3 and 192 3 grids with the Vreman and SL SGS models.

Figure 10 .
Figure 10.BOMEX.Profile of the mean state of liquid potential temperature, total specific humidity, cloud fraction, liquid water specific humidity, and variance of the vertical velocity fluctuations averaged along the last hour of the simulation.The solutions with PyCLES and ClimateMachine were calculated with effective grid resolution ∆x = ∆y = 100 m and ∆z = 40 m.Results calculated with Vreman (blue lines) and SL SGS (orange lines) models in ClimateMachine are compared against results of PyCLES (grey lines) in its Paired-SGS modality using the SL model.

Figure 11 .
Figure11.BOMEX.From left to right, time series of horizontally averaged LWP, cloud cover, and turbulence kinetic energy diagnosed from ClimateMachine using Vreman and SL.These results are consistent with ensemble results presented in Figure2of the intercomparison study bySiebesma et al. (2003).Line colors as in Figure10.

Figure 12 .
Figure 12.BOMEX.Instantaneous visualization of the shallow cumulus structures on a 30 km×30 km×3.6 km domain.The white shading of the volume rendering of the cloud corresponds to a maximum value of q l = 2.5 × 10 −4 kg kg −1 ; a maximum qt = 1.8 × 10 −2 kg kg −1 is shown on the bottom surface in light yellow shading.The simulation was executed using 16 GPUs on the Google Cloud Platform.

Figure 13 .
Figure 13.Time evolution of the mass and total energy loss (relative change when compared against initial conditions) for a moist thermal bubble simulation.Blue line: relative change in energy; Orange line: relative change in mass.

Figure 14 .Figure 15 .
Figure 14.Speedup of time-integration (solver) step relative to time to solution for single-rank simulation of the rising thermal bubble problem in an 8.192 km 3 domain with uniform effective resolution of 32 m (with fourth-order polynomials).Blue circles: dry simulation; Orange squares: moist simulation.
grals resulting in reduced time-to-solution due to MPI operations -Application of model equations to the solution of benchmark problems in typical LES codes, including atmospheric flows in the shallow cumulus regime (BOMEX) -Demonstration of strong-scaling on CPUs with up to 32 MPI-ranks (speed-up of 19.7 in time-to-solution), and weak scaling up to 16 GPUs (95-98%), in both dry and moist simulation configurations.Because the momentum flux tensor depends linearly on velocity derivatives, this amounts to homogeneous Neumann boundary conditions on velocity components parallel to the surface.On the other hand, viscous drag is imposed by the classical aerodynamic drag law n • (ρτ ) = −ρC d,int u p,int u p,int , (A2) where the quantities with subscript int are evaluated at an interior point x int adjacent to the surface.The drag coefficient C d,int = C d (Y ; x int ) can depend parameterically on state variables Y (x, t) and on the position x int of the interior point relative to the surface.In the present implementation, the plane of interior points relevant to boundary flux evaluation is interpreted as the first layer of interior nodes in the surface-adjacent elements.The drag law boundary condition amounts to inhomogeneous Neumann boundary conditions on velocity components parallel to the surface.Specific humidity As for momentum, the advective specific humidity fluxes normal to a rigid surface vanish, but the diffusive or SGS specific humidity fluxes normal to the surface may not vanish.Normal components of SGS fluxes of condensate, q l , are generally set to zero at boundaries n • (ρd q l ) sfc = 0, (A3) implying homogeneous Neumann boundary conditions (n • ∇q l = 0) on the condensate specific humidities.The total SGS specific humidity flux then reduces to the vapor flux at the surface.SGS turbulent deposition of condensate (fog) on the surface can in principle occur; representing this would require nonzero condensate fluxes at the surface.With the assumption of zero condensate boundary fluxes, we have n • (ρd qt ) sfc = n • (ρd qv ) sfc ,where evaporation (measured in kg m −2 s −1 ), or condensation if negative, is given byE = n • (ρd qv ) sfc = −n • (ρD t ∇q t ) sfc .Evaporation can be zero (water impermeable) or it can be given as a function of Y at the surface according toE = n • (ρd qv ) sfc = E(Y ; x sfc , t),which, numerically, translates into an inhomogeneous Neumann boundary condition on the vapor specific humidity.EnergyAs for momentum and humidity, the advective energy fluxes normal to a rigid surface vanish, but the diffusive or SGS flux of total enthalpy, h tot , normal to the surface (units of W m −2 ) n • ρ(J + D) = −n • (ρD t ∇h tot ),