Deep Neural Networks to Correct Sub-Precision Errors in CFD

Information loss in numerical physics simulations can arise from various sources when solving discretized partial differential equations. In particular, errors related to numerical precision ("sub-precision errors") can accumulate in the quantities of interest when the simulations are performed using low-precision 16-bit floating-point arithmetic compared to an equivalent 64-bit simulation. On the other hand, low-precision computation is less resource intensive than high-precision computation. Several machine learning techniques proposed recently have been successful in correcting errors due to coarse spatial discretization. In this work, we extend these techniques to improve CFD simulations performed with low numerical precision. We quantify the precision-related errors accumulated in a Kolmogorov forced turbulence test case. Subsequently, we employ a Convolutional Neural Network together with a fully differentiable numerical solver performing 16-bit arithmetic to learn a tightly-coupled ML-CFD hybrid solver. Compared to the 16-bit solver, we demonstrate the efficacy of the hybrid solver towards improving various metrics pertaining to the statistical and pointwise accuracy of the simulation.


Figure 1: Schematic of the proposed method, which runs a low-precision solver and uses deep learning to correct the resulting errors relative to a high-precision solver.
The rise of machine learning (ML) as a powerful tool for information processing, together with the large amounts of data generated in CFD simulations, has led to the use of ML for understanding, modelling, optimising and controlling fluid flows [1]. Whereas purely data-driven approaches learn spatio-temporal flow behaviour directly from data, physics-informed approaches incorporate the governing PDEs in the loss function and are hence biased towards producing physically accurate outputs [5, 6, 7]. Machine learning has also been used in turbulence closure modelling [8, 9, 10].
Recent efforts to use machine learning in scientific computing have involved differentiable programming [11, 12, 13]. Software frameworks for deep learning (e.g. PyTorch and JAX) implement reverse-mode automatic differentiation to compute gradients in neural networks [14, 15]. This capability can be leveraged by re-implementing numerical solvers in these frameworks, making them end-to-end differentiable; this approach has been demonstrated to improve generalisation to new inputs and the adherence to physical laws [16, 17]. Since the gradients can be readily calculated end-to-end, a solver-ML system trained in this manner can also be used for flow control [18], inverse design and shape optimization [19, 20]. Several numerical solvers written in automatic differentiation frameworks have been recently open-sourced [17, 21, 22].

In recent times, the realisation that deep neural networks retain their learning and predictive power even when using reduced-precision numerics has spurred the widespread adoption of floating-point standards such as the half-precision IEEE 16-bit format (float16), which consists of 10 mantissa bits resulting in roughly 4 significant digits of precision [23]. This format is currently well supported on modern GPUs, with around 2-3 times more theoretical throughput than float32 [24]. The GPU memory savings obtained from the reduced numerical precision can also enable simulations with more mesh elements than would otherwise be possible. However, with only 5 exponent bits, float16 has a maximum representable value of 65504. This can lead to overflow issues in scientific computing applications, including deep neural networks [25]. The 'Brain' floating-point format (bfloat16), developed at Google Brain, mitigates this issue by trading mantissa bits for a wider 8-bit exponent [26]. Previous studies on numerical precision in scientific computing noted that certain operations are more susceptible to error accumulation than others.
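These format limits are easy to observe directly. The following minimal illustration uses NumPy's built-in float16 (bfloat16 is not a NumPy built-in and is omitted here):

```python
import numpy as np

# float16: 10 mantissa bits (~4 significant decimal digits) and
# 5 exponent bits, giving a maximum representable value of 65504.
assert float(np.finfo(np.float16).max) == 65504.0

# Values beyond the representable range overflow to infinity:
x = np.float16(60000.0)
assert np.isinf(x + x)

# Increments below the rounding precision are silently lost:
assert np.float16(1.0) + np.float16(1e-4) == np.float16(1.0)
```

The last line shows the mechanism behind sub-precision error accumulation: small per-step updates can be partially or wholly absorbed by rounding.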
In particular, large summations or accumulation operations, repeated operations in loops, large-scale parallel computation (since floating-point arithmetic is non-associative), and resolving small-scale phenomena cause the most error accumulation [27]. Simulations of turbulent fluid flows satisfy all of these conditions: resolving small-scale eddies is important; the PDEs are inherently non-linear, resulting in vastly different evolution trajectories with small changes in the initial conditions; and they are solved using large-scale matrix operations on parallel processing devices such as GPUs. Climate modelling is an application where the accumulation of numerical error makes reproducibility and verification of the results difficult. He and Ding [28] showed that employing double-double precision arithmetic in long inner product loops in atmospheric simulations dramatically reduces the variability of results between subsequent runs.
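The non-associativity mentioned above is simple to demonstrate: summing the same three numbers in a different order yields a different result, which is why large parallel reductions are generally not bitwise reproducible.

```python
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

left = (a + b) + c   # a and b cancel exactly first, so c survives
right = a + (b + c)  # c is absorbed into b by rounding and lost
assert left == np.float32(1.0)
assert right == np.float32(0.0)
assert left != right
```

In a parallel reduction, the summation order depends on how work is split across threads, so each run may land on a different one of these outcomes.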
To examine the benefits of reduced precision, we performed preliminary tests on the canonical Taylor-Green Vortex test case on a Cartesian grid spanning a cubic domain of side 2π [29], using our in-house high-order CFD solver COMP-SQUARE [30, 31]. The remainder of this paper is organised as follows. In Section 2, we quantify the error accumulation in a Kolmogorov forced turbulence test case. In Section 3, we explore deep learning strategies to correct this error. We present the key results from our experiments in Section 4, following which we conclude with a discussion on future directions in Section 6.

Test case and solver
In this section, we first quantify the error accumulation due to reduced numerical precision and subsequently attempt to correct it using machine learning. For this, we simulate a 2D forced turbulence test case with JAX-CFD, the open-source CFD solver of Kochkov et al. [17]. The solver uses a finite volume formulation for spatial discretisation on a staggered grid: pressure is stored at the centre of each cell, while the velocity components are defined on the corresponding faces. Diffusion is discretised with a second-order central difference approximation of the Laplace operator, and the advection term is solved using a second-order accurate scheme based on the van Leer flux limiter [33]. The solver uses first-order Euler temporal discretisation and explicit time-stepping for advection and diffusion. The incompressible Navier-Stokes equations are solved using a pressure projection method; depending on the dimensions of the grid, the code uses either a fast diagonalisation approach with explicit matrix multiplication [34] or a real-valued fast Fourier transform. It is worth noting that the main scope of the present study is to develop, assess and demonstrate a framework for reducing the numerical error in low-precision CFD. Hence, the choice of relatively low-order schemes does not unduly affect the present conclusions.

We use a 256 × 256 grid with periodic boundary conditions and set the Reynolds number (based on the domain dimension and initial maximum velocity) to 4000. In order to produce a statistically stationary turbulent flow, we apply the Kolmogorov forcing function [35], which, following [17], takes the form

f = sin(4y) x̂ − 0.1 u x̂,

where y is the vertical coordinate, x̂ is the unit vector along the horizontal axis and u is the component of velocity along x̂. These forcing terms are added to the right-hand side of the x and y momentum Navier-Stokes equations. We note that in the sub-grid modelling setting of [17], the error between the coarse- and fine-grid simulations is considerable.
On the other hand, in the present study, the forcing function ensures that the turbulent flow does not decay in time thereby resulting in a noticeable error accumulation with time.
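A minimal sketch of this forcing term is given below. The wavenumber 4 and drag coefficient 0.1 follow the setup of Kochkov et al. [17] and are assumptions here, not values this section states explicitly:

```python
import numpy as np

def kolmogorov_forcing(u, y, k=4, drag=0.1):
    """Forcing for the x-momentum equation: a sinusoidal body force in y
    plus a linear drag on the x-velocity u.

    k=4 and drag=0.1 are assumed values following Kochkov et al.
    u and y may be scalars or arrays of matching shape.
    """
    return np.sin(k * y) - drag * u
```

For example, at rest (u = 0) and y = π/8 the forcing equals sin(π/2) = 1, the peak amplitude of the sinusoidal band.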
The open-source solver is implemented in JAX [15], a high-performance numerical computing library. This allows us to run the simulation on GPUs and efficiently calculate gradients using automatic differentiation. We run this solver using either float64, float32 or bfloat16 numerics by setting the datatype of the initial velocity field. A run that uses float64 is treated as the reference high-fidelity simulation, and a run that starts at the same initial velocity field downcast to float32 or bfloat16 is treated as the corresponding low-fidelity simulation.
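This precision-selection mechanism can be sketched as follows: the same initial field is downcast to the target dtype, and the rest of the simulation inherits that precision (float32 is shown here; bfloat16 requires JAX or the ml_dtypes package):

```python
import numpy as np

# Reference (high-fidelity) initial velocity field in float64.
v0_ref = np.random.default_rng(0).standard_normal((256, 256))

# Downcasting the same field initialises the low-fidelity run.
v0_low = v0_ref.astype(np.float32)

# Information is lost immediately, before a single solver step:
diff = np.max(np.abs(v0_ref - v0_low.astype(np.float64)))
assert 0.0 < diff < 1e-6
```

The two runs therefore start from fields that differ only by rounding error, and any subsequent divergence is attributable to precision.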

Low-precision simulation characteristics
We estimate the temporal evolution of the Mean Squared Error (MSE) between the low-precision and reference velocity fields. Once the fields decorrelate, pointwise metrics like MSE fail to capture the differences between the fields, which we explore further in Section 4. It is worth mentioning that MSE is most useful in the early part of the decorrelation process, where we can see that the bfloat16 error sets in significantly earlier and rises more steeply compared to float32.

The average bfloat16 MSE saturates at a lower value than the average float32 MSE. However, the MSE saturation level for bfloat16 being marginally lower than that of float32 does not imply that bfloat16 is more accurate than float32: once the fields have become completely decorrelated, i.e. the vortical structures are completely different, MSE is no longer a very useful metric to determine similarity. Similar observations have been made in the literature [36].
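This saturation behaviour follows from a basic property of the MSE: for two fully decorrelated zero-mean fields of equal variance σ², the expected MSE is 2σ², irrespective of how similar the flows are statistically. A quick numerical check:

```python
import numpy as np

def mse(a, b):
    """Pointwise mean squared error between two fields."""
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

# Two independent (fully decorrelated) unit-variance fields:
rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 256, 256))

assert mse(a, a) == 0.0
assert abs(mse(a, b) - 2.0) < 0.05  # saturates near 2 * variance
```

The saturation level thus reflects the variance of the two flows, not their similarity, which is why spectra are used later as a complementary metric.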
We measure the computational throughput of the solver on our hardware to be 6112 (steps/s) for float64, 10951 for float32 and 7891 for bfloat16.
Although, in principle, bfloat16 should be faster than float32, it is observed to be slower here because our hardware (a V100 GPU) is not optimised for 16-bit operations [37]. Due to its faster accumulation of error, we choose the 16-bit simulation for the rest of our experiments, where we use deep learning techniques to improve the accuracy of the predictions.
In order to use deep learning to learn a transformation from a low-precision flow into the corresponding high-precision flow, there must exist some generalisable structure in a flow snapshot determined by whether it was produced by the low- or high-precision solver. As a preliminary experiment to test for the presence of such a generalisable structure, we train a binary classifier on our data. In Fig. 3, the low-precision (LP) and high-precision (HP) snapshots are generated using the same solver and flow parameters but with different numerical precision (bfloat16 and float64, respectively). The CNN classifier is trained and subsequently applied on a held-out validation set. The trained classifier distinguishes low- and high-precision snapshots with 85% accuracy, significantly better than random. This result suggests that there is an inherent structure to the flow snapshots produced by the low-precision solver and that deep learning can model it.

Methods to reduce error
In this section, we attempt to use deep learning to learn a function that can correct the error arising over a small duration of time between the low-precision and high-precision solvers. We use a neural network that interacts with the solver through differentiable programming [16, 17]. For our experiments, we define ∆ to be the total time duration (number of solver iterations) that the model sees in a single training sample. We define N as the number of corrections the neural network makes to the low-precision simulation during the time duration ∆. Therefore T, given by the relation T = ∆/N, is the number of iterations the solver steps forward before passing the output to the neural network for a correction.
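The interaction between the solver and the network can be sketched as follows (the function names are illustrative placeholders, not the JAX-CFD API):

```python
def hybrid_rollout(v, solver_step, nn_correct, delta, n_corrections):
    """Advance the solver T = delta // n_corrections iterations, apply
    the learned correction, and repeat n_corrections times."""
    T = delta // n_corrections
    snapshots = []
    for _ in range(n_corrections):
        for _ in range(T):
            v = solver_step(v)   # low-precision physics step
        v = nn_correct(v)        # learned correction
        snapshots.append(v)
    return snapshots
```

For example, with delta = 6 and n_corrections = 3 the solver takes T = 2 steps between successive corrections.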

Data generation
The pseudocode describes the process of generating data for training and validating the model. The initialization function takes a random seed as an argument and generates different realizations of synthetic turbulence having different vortical structures but the same spectral content. This is in contrast to generating different flow realizations by adding random numerical noise to the flow field, which would in fact quickly dissipate. The generated velocity field is filtered to have a maximum amplitude of 10 and a maximum wavenumber of 4. It then steps forward by 1500 timesteps and discards this initial transient. The returned velocity field is used to initialize both the high- and low-precision simulations by performing the appropriate typecasting. Starting from this initial condition, the velocity field is saved every T iterations N times, covering a total time duration of ∆ = T × N. The simulation is repeated using both high- and low-precision numerics, which constitutes a single training sample of N high- and low-precision snapshots each. Note that the low-precision snapshots are only involved during the validation phase since, during training, they are generated on the fly by the hybrid solver.

Figure 4: The Convolutional Neural Network architecture used to make corrections to the low-precision snapshot. C @ K × K denotes a convolutional layer with a K × K kernel and C output channels.
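The generation procedure described above can be sketched as follows; all function arguments are illustrative placeholders, not the JAX-CFD API:

```python
def generate_sample(seed, T, N, step_hi, step_lo, init_fn, cast,
                    transient=1500):
    """Generate one sample of N high- and N low-precision snapshots.

    init_fn(seed) produces a synthetic turbulence realisation; cast()
    downcasts a field to low precision. Both are placeholders.
    """
    v = init_fn(seed)
    for _ in range(transient):        # step past the initial transient
        v = step_hi(v)
    v_hi, v_lo = v, cast(v)           # same field, two precisions
    hi, lo = [], []
    for _ in range(N):
        for _ in range(T):            # save every T iterations, N times
            v_hi, v_lo = step_hi(v_hi), step_lo(v_lo)
        hi.append(v_hi)
        lo.append(v_lo)
    return hi, lo                     # total duration covered: T * N
```

With stand-in steppers the bookkeeping is easy to verify: both trajectories share the initial condition and are saved on the same schedule.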

Model architecture
We use a convolutional neural network (CNN) architecture consisting of 6 Residual Blocks (Fig. 4). These blocks consist of 2D convolutional layers followed by ReLU activation functions and contain "skip connections", which significantly improve the gradient flow through the network [38]. We use periodic convolutional layers instead of standard convolutional layers, consistent with the periodic boundary conditions of the domain. We study the effect of the hyperparameters T, N and ∆ in detail in Section 4. For the model and optimisation hyperparameters, we stick to simple default values, as Kochkov et al. [17] report that their model performance does not have a consistent dependence on these parameters. We use the Adam optimisation algorithm [39] with a learning rate of 1e-4 and a batch size of 2; each model takes around 24 hours to train on our hardware.
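A minimal NumPy sketch of a periodic convolution and a residual block is given below (single-channel and unbatched for clarity; the actual model uses multi-channel layers in JAX):

```python
import numpy as np

def periodic_conv2d(x, kernel):
    """2D cross-correlation with periodic ('wrap') padding, matching the
    periodic boundary conditions of the flow domain."""
    k = kernel.shape[0] // 2
    xp = np.pad(x, k, mode="wrap")    # wrap-around padding
    out = np.zeros_like(x)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def residual_block(x, k1, k2):
    """Conv -> ReLU -> Conv with a skip connection [38]."""
    h = np.maximum(periodic_conv2d(x, k1), 0.0)  # ReLU
    return x + periodic_conv2d(h, k2)            # skip connection
```

The wrap padding means a feature near one edge of the domain correctly "sees" the flow on the opposite edge, exactly as the periodic boundary conditions require.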

Advantages of the hybrid solver
The advantages of this hybrid solver approach over a pure ML supervised learning approach are as follows. First, physics constraints such as conservation of mass and momentum are more easily achieved since the actual time stepping is performed by the physics solver, not by the neural network. Second, when a correction is computed, the input to the neural network is a flow field that resulted from a correction by the same neural network at an earlier timestep. Furthermore, the current correction will affect inputs to the neural network correction function at later timesteps. This allows the neural network to account for the future performance of a correction during the learning process. In a standard supervised learning approach, by contrast, the inputs to the neural network are pre-computed prior to training and are not affected by past corrections in any way [16, 17, 40]. Figure 6 visualises two example timesteps produced by a model trained with the standard supervised approach: the output is highly distorted and physically inaccurate. Therefore, we do not pursue further experiments using the standard supervised learning approach, and in the subsequent section we demonstrate the improvements produced by the hybrid solver over this method.
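The training objective can be sketched as an unrolled rollout loss; because every operation is differentiable, gradients flow back through both the corrections and the intervening solver steps (the names are illustrative; a scalar field stands in for the velocity field):

```python
def unrolled_loss(params, v0, step_lo, nn_correct, targets, T):
    """Mean squared error of the hybrid trajectory against N
    high-precision snapshots, accumulated over the whole unroll.

    Each corrected state feeds back into the next solver segment, so a
    correction is penalised for its downstream effects as well.
    """
    v, loss = v0, 0.0
    for v_ref in targets:            # N reference snapshots
        for _ in range(T):
            v = step_lo(v)           # low-precision solver steps
        v = nn_correct(params, v)    # learned correction
        loss += (v - v_ref) ** 2     # scalar stand-in for the MSE term
    return loss / len(targets)
```

In the actual setup this loss would be differentiated with respect to params (e.g. with jax.grad), which is exactly where the differentiability of the solver is required.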

Experiments and Results
We experiment with training the model on data generated using different values of ∆, N and T. We benchmark the solvers on our workstation using a single NVIDIA V100 GPU. For a given ∆, the more frequently the neural network is invoked to make a correction, the more computational overhead is introduced. This presents a trade-off, which can be seen in Tables 1 and 2: making more frequent corrections results in lower error but incurs greater computational cost, resulting in lower throughput. The throughput can be further improved by training and evaluating the neural network using bfloat16 arithmetic, which has been shown to preserve the learning capability of neural networks [26]. However, due to the lack of support for bfloat16 on the Tensor Cores of our available hardware [37], this could not be done in the present study. For the same reason, a direct comparison of throughput with the base 16-bit solver is not meaningful.

As discussed earlier, the MSE does not fully reflect the physical characteristics of numerical simulations [36]. Therefore it is not a sufficient metric on its own to capture some of the differences seen in Fig. 8, which can be better characterised by inspecting flow properties such as the kinetic energy spectrum. We therefore compare the kinetic energy spectra E(k) = ½|û(k)|² of the low-precision, corrected and high-precision sequences in each experiment. Figure 9 shows the average of the scaled kinetic energy spectrum E(k)·k⁵ over 1000 snapshots from the long-run test set. These 1000 snapshots are taken between 3000 and 150000 timesteps; from Fig. 2, it is evident that the corrected and reference flows are uncorrelated at these timesteps. Generally, both the low-precision and corrected sequences yield spectra which match closely with the high-precision spectra in the low to mid wavenumber range (k = 4-20). However, we see that the graininess caused by low numerical precision results in surplus energy accumulation at high frequencies (larger k).
The models trained with N = 2 fail to produce accurate spectra compared to the high-precision simulation at all frequency ranges. The learned solvers with more frequent corrections (N = 30 and N = 50) match the high-precision spectra more closely, although at the highest wavenumbers all the models deviate from the high-precision spectra.
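A radially binned kinetic energy spectrum of the kind compared above can be computed as follows (a standard construction; the paper's exact binning convention is assumed):

```python
import numpy as np

def energy_spectrum(u, v):
    """E(k) = 0.5 * |u_hat(k)|^2 summed over integer wavenumber shells,
    for a doubly periodic 2D velocity field on an n x n grid."""
    n = u.shape[0]
    uh = np.fft.fft2(u) / n**2            # normalised Fourier coefficients
    vh = np.fft.fft2(v) / n**2
    e = 0.5 * (np.abs(uh) ** 2 + np.abs(vh) ** 2)
    k = np.fft.fftfreq(n) * n             # signed integer wavenumbers
    shell = np.rint(np.hypot(*np.meshgrid(k, k, indexing="ij"))).astype(int)
    E = np.bincount(shell.ravel(), weights=e.ravel())
    return np.arange(E.size), E
```

With this normalisation the spectrum satisfies Parseval's relation, i.e. the shell sums add up to the mean kinetic energy 0.5·mean(u² + v²), which provides a convenient sanity check.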
We compute the divergence of the velocity fields in the high-precision, low-precision and corrected sequences using finite difference derivatives and inspect its mean value over the domain. We observe that all three flows have a mean divergence of order 1e-11, which shows that the correction made by the neural network preserves the net-zero divergence property of the field. We also observe that the corrections made to the low-precision flow delay its decorrelation with the reference flow. Compared to the coarse-grid simulation of Kochkov et al. [17], the rate of decorrelation of the low-precision simulation is significantly slower; consequently, our models achieve a smaller relative delay in decorrelation than their models.
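This check can be sketched with central differences on a periodic collocated grid (the paper's staggered-grid divergence operator is analogous but not reproduced here):

```python
import numpy as np

def mean_abs_divergence(u, v, dx):
    """Mean |du/dx + dv/dy| using second-order central differences with
    periodic wrap-around (np.roll)."""
    dudx = (np.roll(u, -1, axis=0) - np.roll(u, 1, axis=0)) / (2 * dx)
    dvdy = (np.roll(v, -1, axis=1) - np.roll(v, 1, axis=1)) / (2 * dx)
    return float(np.mean(np.abs(dudx + dvdy)))
```

For a velocity field derived from a streamfunction, e.g. u = sin(x)cos(y), v = -cos(x)sin(y), the two central-difference terms cancel and the discrete divergence vanishes to machine precision, which is the behaviour the neural network correction is required to preserve.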
As a test of generalization, we take the neural network obtained using the best-performing set of hyperparameters (T = 3, N = 50) and use it to generate corrections to a simulation performed at a 10 times higher Reynolds number (Re = 40000 instead of 4000). From Figure 9, the role of the proposed NN correction appears to be similar to that of a filter that damps energy content over a range of frequencies. An experiment has been performed along these lines (refer to Appendix A) in which we compared the spectra and vortical structures after replacing the NN correction with a spectral filter. The physics-informed NN correction function learned from data is observed to be superior when compared to a simple filtering operation. Further investigations can explore the efficacy of combining the NN correction (to control mid-high frequencies) with a standard filtering operation (to control the high-frequency spectra), considering the computational overheads.

Limitations
While our method achieves improvements in error metrics and physical quantities, the hybrid solver does not beat the reference high-precision solver in computational throughput. This is due to limitations of the hardware available to carry out the study, as mentioned in Sections 2.2 and 4. The computational efficiency can be further optimized by running on hardware designed for low-precision computation, training the neural network itself using bfloat16 [26], applying convolutions in the Fourier domain [41], and applying model quantization techniques [42]. Improving the throughput of the hybrid solver is an important direction for future work.
The motivation for this method stems from the related works of Kochkov et al. [17] and Um et al. [16]. In both these studies and in the current study, snapshots from a Navier-Stokes solver are used as the targets during training, so the neural network is only implicitly trained to predict flow fields that satisfy the governing equations. A limitation of this strategy is that the corrections themselves need not explicitly satisfy the governing equations of the flow. A direction for future work that addresses this limitation is to combine our method with the method proposed by Raissi et al. [6], where the network's output is explicitly regularised to satisfy the governing equations by including an additional physics-based term in the loss function.

Conclusion
In this work, we explored the error accumulation due to low precision in a Kolmogorov forced turbulence test case and used deep learning together with a differentiable solver to improve the simulation. The learned ML-CFD hybrid reduces the Mean Squared Error accumulation over time durations longer than it saw during training, improves the kinetic energy spectra and preserves the mean divergence of the flow. We show that making more frequent corrections to the simulation achieves a greater reduction in error, although it comes with an increased computational cost. The neural network's corrections improve the visual fidelity of the low-precision flow, making it look less grainy and distorted. Furthermore, the correction depends on only two parameters (the frequency of corrections and the total time window seen during training), apart from the usual parameters involved in neural network training. We quantified the dependence of the model's performance on these parameters and identified trends which will help implement the method in practice. The GPU memory savings obtained from the reduced numerical precision can enable simulations with more mesh elements than would otherwise be possible. Furthermore, since the hybrid solver is fully differentiable, it is suitable for use in control and inverse problems. While established industrial solvers written in FORTRAN or C are not directly compatible with our deep learning method implemented in JAX, new high-order accurate differentiable solvers are under development using the JAX framework [43, 44], which solidifies the long-term viability of our method in real-world use. We discussed the limitations of the method and identified opportunities to extend and improve it in future work. Although the current study focuses on non-reacting turbulent flows, the methods and the neural network framework discussed here can be readily extended to reacting flows.
Appendix A

where φ is the filtered velocity field and u is the unfiltered velocity field.
The strength of the filter is controlled by the tunable parameter α_f. Further studies can be carried out along these lines to explore the possibility of developing 'precision-based' filters. The efficacy of a combination of NN correction (to control mid-high frequencies) with a standard filtering operation (to control high-frequency spectra), considering the computational overheads, can also be verified.