Numerical wave propagation aided by deep learning

We propose a deep learning approach for wave propagation in media with multiscale wave speed, using a second-order linear wave equation model. We use neural networks to enhance the accuracy of a given inaccurate coarse solver, which under-resolves a class of multiscale wave media and wave fields of interest. Our approach involves generating training data by the given computationally efficient coarse solver and another sufficiently accurate solver, applied to a class of wave media (described by their wave speed profiles) and initial wave fields. We find that the trained neural networks can approximate the nonlinear dependence of the propagation on the wave speed as long as the causality is appropriately sampled in the training data. We combine the neural-network-enhanced coarse solver with the parareal algorithm and demonstrate that the coupled approach improves both the stability of parareal algorithms for wave propagation and the accuracy of the enhanced coarse solvers.


Introduction
The computation of wave propagation is a critical ingredient in many important applications. For example, in inverse problems involving the wave equation as the forward model, efficient numerical simulation of wave propagation directly corresponds to a more precise and faster imaging pipeline.
In this work, we will focus on the model initial value problem: u(x, 0) = u 0 (x), u t (x, 0) = p 0 (x), with periodic boundary conditions. Furthermore, we shall assume that (i) the wave speed c(x) is only piecewise smooth with non-trivial total variation, and (ii) the wavelengths in the initial wave field are at least one order of magnitude smaller than the domain size.
Performing large-scale high-fidelity simulations of wave propagation in heterogeneous media is a challenging task. Standard numerical methods typically require very fine spatio-temporal grids to fully resolve the smallest wavelength in the solutions, the geometry in the wave speed's discontinuities, and to reduce excessive numerical dispersion errors. Furthermore, the stability and accuracy requirement imposes an additional restriction on the time step size, making long-time simulation even more difficult.
Many existing multiscale methods tackle the wave equation by assuming a separation of scales in the medium's wave speed. Assumptions on more global structures, such as periodicity or stationary ergodicity, are also common. Such assumptions allow a cleverly designed algorithm to achieve a computational complexity that is essentially independent of the need to resolve the smallest scale globally in space-time. These algorithms may employ either online or offline computations to approximate the effective wave propagation accurately. We mention the Heterogeneous Multiscale Methods (HMM), see e.g. [29] [1] [11], which typically solve an effective problem, free of small parameters, defined by localized online computations. The Multiscale Finite Element Method (MsFEM) style algorithms [10], including the Localized Orthogonal Decomposition (LOD) method, e.g. [2], compute specialized basis functions for each given multiscale medium. These methods do not require scale separation, but need to fully resolve the medium by the basis functions. As the basis functions are medium-specific, these algorithms typically aim to solve the same equation repeatedly with different initial conditions. See also [12] and [21]. It is also possible to construct low-rank approximations of the numerical wave propagator using data. Druskin et al. [9,7] introduced a data-driven reduced order model (ROM) for the wave equation. Specifically, using wave field time snapshots, the authors construct a wave propagator allowing very accurate inversion of the medium.
In [20], we proposed a multiscale algorithm in which the computation of a coarse, under-resolving solver is corrected online, using data computed in parallel, on localized subdomains, by another accurate solver.

The proposed approach
Our approach departs from the above multiscale methodologies. We propose a supervised deep learning framework to enhance the accuracy of a low-fidelity and computationally cheaper solver for (1). After describing how neural networks are used in our approach, we present our choice of neural network model (the architecture, the input, and output variables), and our approach to generating training data that consistently samples the causality of wave propagation in a class of media with piecewise smooth wave speeds. We also discuss the rationales behind our approach and report several representative numerical studies.
We assume that a time-reversible coarse solver, G ∆t u ≡ G ∆t [u, c], is given. It advances a given wave field u(x, t) = (u, ∂ t u) to time t + ∆t, using the given wave speed profile c. Additionally, an accurate fine solver, F ∆t u ≡ F ∆t [u, c] is also available. The fine solver is used to generate data for improving the coarse solver. As these two solvers operate on grids of different resolutions (a coarse and a fine grid), we will extend the coarse solver to one defined on the same grid as the fine solver for comparing the wave fields propagated by the two solvers. The extension is defined via a restriction (projection) operator R, which maps fine grid functions to the coarse grid. We shall also need a prolongation (e.g. interpolation) operator I, which maps coarse grid functions to the fine grid. R and I are typical elements in a standard multigrid algorithm.
We reorganize the time stepping by the fine solver into F ∆t u n = F ∆t (IRu n ) + [F ∆t u n − F ∆t (IRu n )]. In general, R can be thought of as a low-pass filter, and IR is not the identity; i.e. IRu ≠ u. The second term above, F ∆t u n − F ∆t (IRu n ), should consist mostly of the higher frequency components of F ∆t u n . As for the first term, since R filters out the higher frequency parts in u, it is reasonable to approximate F ∆t (IRu n ) by a computationally cheaper strategy using a coarser grid: we consider IG ∆t (Ru n ).
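As an illustration, the following is a minimal sketch of a restriction/prolongation pair on a uniform 2D grid. The specific choices here (2x2 cell averaging for R, piecewise-constant prolongation for I) are illustrative assumptions; the paper only requires standard multigrid-style transfer operators. The sketch also demonstrates that IR is not the identity:

```python
import numpy as np

def restrict(u):
    # R: 2x2 cell averaging (a simple low-pass filter), fine grid -> coarse grid
    return 0.25 * (u[0::2, 0::2] + u[1::2, 0::2]
                   + u[0::2, 1::2] + u[1::2, 1::2])

def prolong(uc):
    # I: piecewise-constant prolongation, coarse grid -> fine grid
    return np.kron(uc, np.ones((2, 2)))
```

For this particular pair, RI is the identity on the coarse grid while IR discards the high-frequency content of a fine-grid function, which is exactly the loss the enhanced solver must compensate for.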
The second term on the right-hand side of (3) serves as an attempt to add back the missing high frequency components, using information from the previous iteration. The iteration in (3) is often unstable when the solvers have too little dissipation. In such cases, the stability and convergence of (3) rely heavily on IG ∆t R being a good approximation of F ∆t IR. However, in heterogeneous wave media, F ∆t IRu will likely produce higher frequency modes, even if Ru has only low frequency modes.
Consequently, we should not expect the approximation by IG ∆t R to be globally sufficient.
One idea is to replace I in IG ∆t R by something more elaborate. Our objective is to construct neural networks that approximate this operation using the data (G ∆t Ru, F ∆t u), with u and c sampled from certain distributions of interest. We then use the resulting enhanced solver for wave propagation on the fine grid. Note that in most parts of this paper the solvers' dependence on the wave speed, c, is hidden for brevity of notation. We rely on the parareal iteration to add back the missing high frequency components. The enhanced parareal scheme takes the same form as above: u^{k+1}_{n+1} := θ ∆t G ∆t R u^{k+1}_n + F ∆t u^k_n − θ ∆t G ∆t R u^k_n , k = 0, 1, . . . , K, with u^0_{n+1} := θ ∆t G ∆t R u^0_n .
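The enhanced parareal update has the same algebraic structure as the classical parareal iteration. A minimal sketch, with the (enhanced) coarse solver and the fine solver abstracted as functions `G` and `F` acting on wave-field vectors:

```python
import numpy as np

def parareal(u0, G, F, N, K):
    """Parareal iteration u^{k+1}_{n+1} = G(u^{k+1}_n) + F(u^k_n) - G(u^k_n).

    G stands in for the (enhanced) coarse solver theta*G*R and F for the
    fine solver; the list Fu below can be computed in parallel over n.
    """
    u = [u0]
    for _ in range(N):                  # k = 0: serial coarse sweep
        u.append(G(u[-1]))
    for _ in range(K):
        Fu = [F(un) for un in u[:-1]]   # parallel-in-time fine sweeps
        Gu = [G(un) for un in u[:-1]]
        new = [u0]                      # serial corrected coarse sweep
        for n in range(N):
            new.append(G(new[-1]) + Fu[n] - Gu[n])
        u = new
    return u
```

A well-known finite-termination property: after K = N iterations the parareal trajectory coincides with the serial fine-solver trajectory, regardless of the coarse solver.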
The wave solution operator, and most numerical solvers, are linear and reversible with respect to the initial data but nonlinear with respect to the wave speed. So is θ ∆t . However, the operator θ ∆t (u, c) is a so-called "near-identity" operator for sufficiently smooth wave speeds and wave fields. Therefore, near a fixed reference speed function c 0 , the linear operator θ ∆t (·, c 0 ) has the potential to offer an adequate, though non-optimal, approximation of F ∆t I(G ∆t ) −1 .
It may be helpful to inspect the approximation of an individual Fourier mode of the wave field and consider errors in the phases and the amplitudes. The additive correction in the parareal iterations is very effective in reducing the amplitude errors. However, the phase errors generally have a detrimental effect on the stability of parareal iterations. See e.g. [3][20] [26] [16]. It is therefore imperative that θ ∆t corrects the numerical dispersion errors in G ∆t Ru compared to F ∆t u. Furthermore, as discussed above, we also expect θ ∆t to introduce some of the higher frequency modes that come out of propagating a smooth given wave field through the given non-constant media.
Notice that for each instance of c, F ∆t I (G ∆t ) −1 is a linear operator, provided that the prolongation I is linear. With the nonlinear activations and biases in the network removed, the optimization model defines a regression model that constructs data-driven (factorized) linear approximations of the operator F ∆t I (G ∆t ) −1 . In which regime will this type of linear approximation work? We shall present some experiments and comparisons regarding this question.
In Section 2, we present the proposed learning model and our approach in generating training data. In Section 3, we report results showing that a properly trained, simple, multi-level network can be an effective corrector of the coarse solver in a range of media. In Section 4, we apply the proposed networks to the parareal scheme in four wave speed models.

Other related work
The authors in [27][23] train a convolutional neural network (CNN) to correct low-fidelity solutions. As a supervised learning task, the data consist of low-fidelity and high-fidelity pairs, computed using second-order and twentieth-order finite differences, respectively. The data are time snapshots of the solutions for various media and sources. The experimental results in these papers indicate that the trained CNN can remove the numerical dispersion in complex velocity models. This result is encouraging since their setup is quite straightforward.
In [19], a neural network model is proposed to learn wave propagation on a fixed medium. The Physics-Informed Neural Network (PINN) [22] approach is used to set up the neural network's optimization model: the given PDE, evaluated on a chosen set of points in the domain, serves as the penalty term. The fully-connected neural network learns the linear relation between the initial wave source and the resulting wave field at a later time. Each forward feed outputs the estimated wave field at a specified location. Applying the network for simulation on a different wave medium requires retraining the network.
Another approach to using neural networks in a parareal framework is discussed in [18]. The authors propose the parareal physics-informed neural network (PPINN). In particular, the sequence of fine solvers, mapping solutions from t n to t n+1 , n = 1, 2, . . ., are fully connected neural networks (PINNs) trained with loss functions that penalize the network outputs' misfit to the given differential equations. See [22]. The key idea of [18] is to use the parareal scheme to train the sequence of PINNs in parallel. However, to the best of our current knowledge, the accuracy and stability of the PINN method for wave problems are not well understood.
Our approach is similar to [27][23], but with additional key findings and further improvements. Most notably, our algorithm extensively uses the wave energy variables, i.e. the spatial gradient ∇u and the time derivative of u weighted by the wave speed. We think that this strategy allows the algorithm to retain more physical properties of the wave system. This setup also leads to dimensionless input variables, which allows the algorithm to scale better. Such approaches involving the energetic variables have been used successfully in [28][20].

The numerical solvers
In this section, we describe the coarse and fine solvers used in this paper. Both solvers are defined by the standard central differencing for the partial derivatives, applied on uniform Cartesian grids for the spatial domain and uniform stepping in time.
Starting with the discretization of the spatial derivatives: e 1 , e 2 denote the standard basis of R 2 . The central scheme in time (which is the leapfrog scheme for (9)) can be written as the velocity Verlet time integrator, where λ = k/2h 2 and D 2 denotes the central second-difference operator. This scheme is widely used in practice for large-scale computations, particularly for media with less smooth wave speeds.
We will denote the mapping defined in (11) by S h,k (u). Both the fine and the coarse solvers will be defined by S h,k with different k and h.
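A minimal sketch of a step of S h,k as a velocity Verlet integrator for the semi-discretized wave equation u_tt = c² Δu with periodic boundary conditions. The 5-point Laplacian and the constant wave speed here are simplifying assumptions; the paper's scheme handles variable, piecewise smooth c:

```python
import numpy as np

def laplacian(u, h):
    # 5-point discrete Laplacian with periodic boundary conditions
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0)
            + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u) / h**2

def step(u, ut, c, h, k):
    """One velocity Verlet step of size k for u_tt = c^2 * Lap(u)."""
    a0 = c**2 * laplacian(u, h)
    u1 = u + k * ut + 0.5 * k**2 * a0      # position update
    a1 = c**2 * laplacian(u1, h)
    ut1 = ut + 0.5 * k * (a0 + a1)         # velocity update
    return u1, ut1
```

For time steps within the CFL limit, this integrator approximately conserves the discrete wave energy, consistent with the energy discussion below.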
The fine solver F ∆t used in this paper is defined by the same scheme (11) on a finer grid, δxZ 2 × δtZ + . The coarse solver G ∆t * is defined by running (11) for M time steps on the coarse grid ∆xZ 2 × ∆tZ + ; i.e.
We assume that both the fine and the coarse solvers are stable on their respective grids. Furthermore, we assume that the fine solver is sufficiently accurate for wave speed c ∈ (0, c max ], where c max is an upper bound for the wave speed in the class of wave media of interest. If the two solvers are defined on different Cartesian grids (∆x > δx), we need to employ additional operators to map grid functions between the two grids. In this paper, R denotes the restriction operator that maps a fine grid function defined on δxZ 2 to one on ∆xZ 2 . I denotes the prolongation operator that maps a coarse grid function to the fine grid.
With a slight abuse of notations, we will also use u, u t , and u = (u, u t ) to denote the corresponding discrete versions computed on the fine grid.
Wave energy components The solvers approximately preserve the wave energy or its discretized version, which also defines a discrete semi-norm on the grid. Due to this property, we shall use a discretized version of the wave energy as a semi-norm for comparing wave fields computed by the central scheme. In this paper, we design algorithms that are defined as functions of the energy components (∇u, u t ) of the wave field u = (u, u t ).
On the spatial grid hZ 2 ∩ [−1, 1] 2 with periodic boundary conditions, define the mapping into energy components, where ∇ h u is computed numerically. Λ † h denotes the pseudo-inverse of Λ h and is defined in Appendix A.1.
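A minimal sketch of the mapping into energy components. The periodic central-difference gradient and the u_t /c scaling are assumed choices, consistent with the energy variables (∇u, c −1 u t ) described in the text:

```python
import numpy as np

def energy_components(u, ut, c, h):
    """Map a wave field (u, u_t) to its energy components (grad_h u, u_t / c)."""
    # periodic central differences for the spatial gradient
    ux = (np.roll(u, -1, 0) - np.roll(u, 1, 0)) / (2.0 * h)
    uy = (np.roll(u, -1, 1) - np.roll(u, 1, 1)) / (2.0 * h)
    return ux, uy, ut / c
```

Note that all three components share the same physical dimension, which is the source of the dimensionless scaling mentioned earlier.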
The enhanced solvers Our goal is to construct neural network functions H w , parameterized by w, such that the energy components computed by the coarse solver G ∆t * R are corrected according to those computed by F ∆t * , for a class of piecewise smooth wave speed functions. With H w , we define the enhanced solver θ ∆t * G ∆t * R. The notation is summarized below:
• ∆t * : the time step size of the enhanced coarse solver
• u: the wave field vector (u, ∂ t u)
• θ ∆t * : the correction operator
• G ∆t * : the coarse solver defined on a coarse grid
• F ∆t * : the fine solver
• θ ∆t * G ∆t * R: the enhanced solver on the fine grid
• H w : the neural networks used for defining θ ∆t *
• ||u||: the discrete wave energy norm of u
• D 0 : an elementary training data set
• D t : the training data set derived from D 0

Correction of dispersion errors
Here we present a simple and classical dispersion property of the central scheme. The point is to identify a regime in which θ ∆t * (u, c) is an identity plus a minor perturbation. In such a regime, the enhanced solver should approximate the perturbation. We analyze a simplified setup involving plane waves exp i(ωt + kx) propagating in a medium with constant speed c in one dimension. We assume the fine solution operator F ∆t to be exact, mapping the given wave field at time t to t + ∆t * , with the linear dispersion relation ω(c, k) = ±ck. Here we exclusively use k to denote the wave number (not the parareal iteration). On the other hand, we approximate the fully-discrete coarse solver by a semi-discrete scheme. More precisely, in this scheme, denoted G sd ∆t * , central differencing with step size ∆x is used to approximate the spatial derivatives, and the resulting system is integrated in time exactly. We denote the coarse solution by ũ. The semi-discrete scheme G sd ∆t * gives us the numerical dispersion relation ω ∆x (c, k) = ±(2c/∆x) sin(k∆x/2). Define the difference between the positive-signed ω and ω ∆x as the numerical dispersion error. We see the well-known property that higher wave numbers lead to larger dispersion errors and, asymptotically, the need to decrease ∆x at a faster rate than 1/k. We can thus approximate the energy components of the fine solution, (∇u, c −1 u t ), by correcting the numerical dispersion errors in the coarse solution (∇ũ, c −1 ũ t ). The correction can be written as a convolution, where F [·] denotes the Fourier transform from the frequency k to x. For details, we refer readers to Appendix A.2. We notice that in Equation (16), for small k∆x and ∆t * , the leading order term in the correction is an identity map. This fact could serve as a theoretical motivation for adding a skip connection between every corresponding pair of layers in the multilevel network described below.
The numerical dispersion relations (16) and (18) also predict that the main role of the neural network is to overcome the challenges resulting from the higher frequency wave components.
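For concreteness, the dispersion error of the semi-discrete central scheme can be evaluated numerically. A sketch, assuming the standard form E(c, k, ∆x) = ck − (2c/∆x) sin(k∆x/2) for the second-order central stencil:

```python
import numpy as np

def dispersion_error(c, k, dx):
    """Phase error of semi-discrete central differencing:
    E = c*k - (2c/dx) * sin(k*dx/2), assumed consistent with the text."""
    return c * k - (2.0 * c / dx) * np.sin(0.5 * k * dx)
```

A Taylor expansion gives the leading-order behavior E ≈ ck (k∆x)² / 24, which quantifies the claim that higher wave numbers require a faster decrease of ∆x.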
In Section 3.2, we present an example illustrating the numerical dispersion errors in a piecewise constant wave medium. There, we compare the correction of such errors by the proposed networks.

The neural networks and training
In this section, we describe the network architecture, the optimization model, and the generation of training examples.
The network's input variables are the energy components of a wave field on the coarse grid and the wave speed function; these are functions on the coarse grid. The outputs are the energy components of the processed wave field, but on the fine grid. The network does not explicitly involve any approximation of the derivatives of the wave field.

The JNet architecture
Our chosen network architecture is a U-net architecture [25] with skip connections, though in our setup the downward-upward path in the network more closely resembles the letter "J" (hence the name JNet below). The network contains a cascade of convolution layers encoding the input tensor into a so-called feature space. Then a cascade of convolution layers decodes the feature space to produce the final output image. The encoding pathway utilizes average-pooling layers to downsample the intermediate output, while the decoding pathway uses bilinear interpolation. The network structure is also similar to the multigrid network MgNet [15].
The input tensor has 4 channels consisting of the energy components of the coarse solution and the wave speed function defined on the coarse grid. Figure 1 illustrates a 3-level network. Networks defined on such a structure are the focus of this paper. A network with L down-sampling stages is referred to as an L-level network. In particular, we shall report computational results obtained from trained 3-level and 6-level networks. The ones with ReLU activations will be referred to as JNets. The networks with ReLU replaced by the identity function and with no biases will be referred to as the linear networks. Finally, as a reference, the forward feed of a 3-level JNet, implemented in PyTorch with a 128 × 128 × 4 input tensor, takes roughly 0.028 seconds on the CPU of a moderately equipped laptop. On the same computer, the fine solver, implemented in NumPy, takes roughly 0.058 seconds to evolve a given initial wave field of size 128 × 128 × 2 to time t = 0.1.

Data sets and optimization model
To learn the coarse-to-fine solution map H w : X → Y, the training data should contain sufficiently representative examples, sampling the wave speed c of interest and the wave fields of relevance: those propagated by the coarse and fine solvers, starting from a class of initial conditions of interest. In our setup, the wave speed c and the wave field u are sampled from the distributions described below.
Both the coarse and the fine solvers use the standard second order scheme defined in Section 1.3. The discretization parameters of the solvers are tabulated in Table 2.
The data sets discussed below will be made available for download.
The wave speed and initial wave field samples D 0 and D p 0 (∆t * ) The data set for the medium wave speed comprises randomly selected subregions (of different areas and rotations) of the full Marmousi model [8] and the BP model [6]. The wave speed in each of the subregions is mapped (subsampled and rescaled 1 ) to a 128 × 128 array, which is regarded as a grid function defined on hZ 2 ∩ [−1, 1) 2 , h = 2/128. See Figure 2. Figure 3 illustrates the statistics of the wave speeds sampled from these two data sets. Additionally, we remark that Wu et al. [30] developed a comprehensive methodology to synthesize media with realistic earth layers and faults. Their approach may be used to further enrich the data set.
For each of the sampled subregions, an initial wave field, u 0 = (u 0 , ∂ t u 0 ), is sampled from the Gaussian pulses of the form: The pair (c, u 0 ) thus sampled is collected in D 0 .
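A sketch of how such samples might be drawn. The isotropic Gaussian form, center range, and width range below are illustrative assumptions, since the exact parametrization appears in the elided formula:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pulse(n=128, sigma_inv_range=(5.0, 20.0)):
    """Sample an initial wave field (u0, 0): a Gaussian pulse with a random
    center and width (hypothetical parametrization for illustration)."""
    h = 2.0 / n
    x = -1.0 + h * np.arange(n)
    X, Y = np.meshgrid(x, x, indexing="ij")
    x0, y0 = rng.uniform(-0.5, 0.5, size=2)
    sigma = 1.0 / rng.uniform(*sigma_inv_range)
    u0 = np.exp(-((X - x0) ** 2 + (Y - y0) ** 2) / sigma**2)
    return u0, np.zeros_like(u0)
```

Larger values of σ⁻¹ give sharper pulses with more high-frequency content, matching the test fields used in Section 3.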
Next, we generate a set of training examples that have more complex initial wave fields. For each (c, u 0 ) in D 0 , we collect u k n that are computed by the Procrustes Parareal scheme, as defined in (26)-(27), for n = 0, 1, . . . , N, and k = 0, 1, 2, 3, 4. We shall refer to this collection as D p 0 (∆t * ).
The example wave fields D t (∆t * ) and D p t (∆t * ) We recognize the need to sample the strong causality in the wave dynamics. We therefore sample the trajectories passing through the wave speed-initial wave field pairs in D 0 and D p 0 . Figure 4 illustrates the scheme which we use to generate the two training sets. The trajectories are discretized by a finite number of points computed from the fine solver (one could use the coarse solver for this as well). The points on the trajectories are paired with the associated wave speed function and collected into the training data sets.
D p t (∆t * ) This data set consists of the corresponding coarse-fine pairs. We generate 10,000 examples from D 0 and D p 0 (∆t * ), and name the resulting databases D t (∆t * ) and D p t (∆t * ) respectively. When the context is clear, we will use D t and D p t for simplicity. Figure 5 shows the Fourier mode statistics of the wave fields in D t and D p t . We observe that the wave fields in D p 0 have more high frequency modes.
Figure 4: The sampling of the solution manifold: on each random sample of medium wave speed, the coarse and fine solvers propagate initial wave fields consisting of random Gaussian pulses. For each data point, x corresponds to a blue box above, and y to a corresponding red box.
The optimization model and training We use a simple stochastic gradient descent algorithm to minimize the mean squared error (MSE), where D is the training data set and |D| denotes the number of training examples. D is either D t (∆t * ) or D p t (∆t * ) defined above. Here the summands correspond to the approximation errors in the discrete wave energy (13). Similar strategies of using wave energy to compare wave fields for wave propagation purposes have been used successfully, for example in [24] [28], and [20].
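A sketch of such a loss: the squared error between predicted and reference energy components, summed per example and averaged over the data set. The tensor layout (batch first) is an assumption:

```python
import numpy as np

def energy_mse(pred, target):
    """MSE over a batch of energy-component tensors: per-example sum of
    squared differences, averaged over the batch."""
    pred, target = np.asarray(pred), np.asarray(target)
    per_example = np.sum((pred - target) ** 2, axis=tuple(range(1, pred.ndim)))
    return np.mean(per_example)
```

In the paper's setting, `pred` would hold the network-corrected energy components and `target` those computed by the fine solver.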
The neural networks are trained by mini-batch stochastic gradient descent (SGD) with momentum, starting with the initial conditions prescribed by PyTorch's built-in routines. The training data is randomly divided into batches of 128 examples. We report the MSE of a typical training progress for each data set in Figure 6. Notice that the training errors in the right subplot appear approximately one order of magnitude smaller than those on the left. This is a subject of future investigation. It is possibly due to the Procrustes solutions being closer to each other in some suitable sense.
In Table 3, we further report a comparison of the final training errors and testing errors between the 3-level and 6-level JNet, using D p t .

Propagation
In this section, we study the enhanced propagator θ ∆t * in a few setups. We shall report the computed results using the relative energy error, where u is the computed wave field and v is the reference solution.
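A sketch of this error measure, interpreting it as ||u − v|| E / ||v|| E on the energy components; this normalization is an assumption consistent with the phrase "relative energy error":

```python
import numpy as np

def relative_energy_error(u_comp, v_comp):
    """Relative error in the discrete energy semi-norm, with the wave fields
    given as stacked energy components (u_x, u_y, u_t/c)."""
    u_comp, v_comp = np.asarray(u_comp), np.asarray(v_comp)
    diff = u_comp - v_comp
    return np.sqrt(np.sum(diff**2) / np.sum(v_comp**2))
```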

Comparison of propagation errors on constant media
We test the enhanced solvers on constant media of different wave speeds, using initial wave fields consisting of pulses of various widths. The networks are trained on data sets of different cardinalities, sampled from the same distribution.
The test initial wave field is defined as follows, where σ is sampled uniformly via σ −1 ∈ [5, 20]. Larger values of σ −1 correspond to wave fields whose Fourier modes for higher wave numbers have larger magnitudes ("sharper" wave fields). Coarse and fine solutions are propagated in a constant medium with c ∈ [0.1, 3.0]. The error is measured by the relative energy error e(σ, c); we remark that the dependence on c for G ∆t * and F ∆t * is hidden from the notation. We report a set of comparisons of e(σ, c) in Figures 7 and 8 for ∆t * = 0.1 and 0.2 respectively. The results are obtained from the 6-level linear and ReLU JNets, with the parameters reported in Table 2. The networks are trained on D t (0.1) and D p t (0.2). In Figure 7, from the two rows of error images, we observe that the linear networks produce errors roughly comparable to those from the ReLU networks in the region of lower wave speed and smaller σ −1 . In the opposite regime (larger c and σ −1 ), the nonlinear ReLU networks perform better. As discussed in Section 1.3, sharper wave fields and faster wave speeds result in larger numerical dispersion errors. The approximation power of the nonlinear networks seems to contribute to the better performance. We also notice that the ReLU networks trained with 10,000 examples seem to achieve slightly larger minimal errors than those trained with 5,000 examples. We will see in Section 4 that a more significant number of training examples does result in better generalization capacity for the ReLU networks and, perhaps not surprisingly, has relatively little impact on the linear networks. Figure 8 presents a different situation. With ∆t * = 0.2, the difference between the coarse solver and the fine solver is larger. Here we see a clear advantage in using the nonlinear networks.

Numerical dispersion and refraction
In this section, we study the enhanced solvers in a wave medium described by a piecewise constant wave speed of moderate contrast. Figure 9 shows the model layered medium and the anisotropic initial Gaussian pulse. We apply different solvers to propagate this initial wave field. The wave fronts associated with different wavenumbers travel at different speeds and transmit through the interfaces (the wave speed discontinuities) at different times. Figure 10 shows the energy of the propagated wave fields. The networks used in these simulations are trained on D p t . The proposed networks provide a certain level of non-trivial correction to the dispersion and refraction in the coarse solution.

More complex wave speeds
We demonstrate the effectiveness of the trained neural networks in improving the accuracy of the coarse solver. We compute the evolution of the wave field with the initial condition u 0 = (e −200(x 2 +y 2 ) , 0).
We consider four test media, including: • Waveguide: c(x, y) = 0.7 − 0.3 cos(πx). The discretization parameters are reported in Table 2. Figure 13 presents a comparison of the coarse solver and a few enhanced solvers θ ∆t * G ∆t * R (linear and ReLU networks trained with 2K, 5K, and 10K examples). Compared to the training and testing errors presented in Table 3, we see that the effect of the "downsampling" operator R dominates to a large degree. We encourage the readers to compare this result to the one presented in Figure 18 in the next section.
In Figure 14 we show the "phantom" energy in the output of H w . While this can be regarded as an undesirable effect, it does improve the amplitude of the propagated wave field. We also observe that the trained 6-level network produces a drastically reduced amount of "phantom" energy.

Improvement by parareal iterations
One of the shortcomings of employing a current deep learning strategy for scientific computing tasks is the lack of rigorous error estimates. Our idea is to use the proposed deep learning strategy in a self-improving loop, in which one can show that the computed errors will contract along the iterations. Here, we adopt the  parareal framework, originally proposed in [17], as the potentially self-improving looping mechanism.
Parareal methods can be regarded as a fixed point iteration involving the coarse and the fine solvers, where the loop is enumerated in k and the subscript n denotes the time stepping. The initial conditions for the iterations are defined by u^0_{n+1} = G̃ ∆t * u^0_n , n = 0, 1, . . . , N − 1.
In the context of this paper, u^k_n := (u^k_n , ∂ t u^k_n ) is defined on the fine grid, and G̃ ∆t * := IG ∆t * R.
The convergence of parareal methods has been analyzed in different contexts in several papers. See e.g. [5], [13], [4] [26] [16]. We adhere to a simple version, which states that at the discrete level, with the vector ∞-norm, the error amplification factor for the parareal iteration is the product of ||F ∆t * − G̃ ∆t * || and (1 − r^N)/(1 − r), with r ≡ ||G̃ ∆t * ||. The iteration is convergent when the coarse solver is sufficiently accurate, depending on the magnitude of ||G̃ ∆t * || and the size of the iterative system, N. It is a challenge to construct a stable and convergent parareal method for purely hyperbolic problems, especially when the coarse solver under-resolves the medium.
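The quoted amplification factor is straightforward to evaluate; a sketch (in the limit r → 1 the geometric sum reduces to N):

```python
def parareal_amplification(err_fg, r, N):
    """Error amplification bound ||F - G~|| * (1 - r^N) / (1 - r) from the
    simple convergence statement quoted above."""
    geom = float(N) if r == 1.0 else (1.0 - r**N) / (1.0 - r)
    return err_fg * geom
```

This makes the trade-off explicit: a smaller coarse-fine discrepancy ||F − G̃|| (which the neural network correction aims to deliver) directly shrinks the bound, while a large N or r near 1 inflates it.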
As we have seen in the serial computation examples in Section 2.2, our network reduces the errors in the coarse solutions by 90% in very challenging media. Hence, we expect the parareal iteration involving the enhanced coarse solver to be more stable and, in turn, be used to improve approximation accuracy.
Following the framework of the θ-parareal schemes [4], [20], the focus of this section rests on schemes of the form
$$ u^{k+1}_{n+1} = \theta_{\Delta t^*}\big(\tilde{G}_{\Delta t^*} u^{k+1}_n\big) + F_{\Delta t^*} u^k_n - \theta_{\Delta t^*}\big(\tilde{G}_{\Delta t^*} u^k_n\big), $$
where the operators $\theta_{\Delta t^*}$ under consideration include:
• $\theta_{\Delta t^*} u = \Omega u$, where $\Omega$ is a unitary matrix derived from the online data $\{u^k_n\}$;
• $\theta_{\Delta t^*}$ defined by a JNet, trained off-line.

The Procrustes parareal method
The Procrustes parareal method is defined as follows:
$$ u^{k+1}_{n+1} = \Omega_k \tilde{G}_{\Delta t^*} u^{k+1}_n + F_{\Delta t^*} u^k_n - \Omega_k \tilde{G}_{\Delta t^*} u^k_n, \tag{26} $$
where $\tilde{G}_{\Delta t^*} := I G_{\Delta t} R$, and for $k > 0$, $\Omega_k$ solves the following optimization problem:
$$ \Omega_k = \arg\min_{\Omega^* \Omega = I} \big\| \Omega G_k - F_k \big\|_F. $$
The data matrices $G_k$ and $F_k$ are defined as follows:
• The $n$-th column of $G_k$ corresponds to the vector $\Lambda \tilde{G}_{\Delta t^*} u^k_n$, $n = 0, 1, \ldots, N$.
• The $n$-th column of $F_k$ corresponds to the vector $\Lambda F_{\Delta t} u^k_n$, $n = 0, 1, \ldots, N$.
For $k = 0$, the $u^0_n$ are defined in (24). We call this method the Procrustes parareal method since $\Omega_k$ is constructed by solving the Orthogonal Procrustes Problem [14].
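The Orthogonal Procrustes Problem has a closed-form solution via the singular value decomposition, which makes the construction of $\Omega_k$ inexpensive. A minimal sketch (the matrix names mirror the data matrices above; the weighting $\Lambda$ is assumed to be already applied to the columns):

```python
import numpy as np

def procrustes(G, F):
    """Solve min_{Omega unitary} || Omega @ G - F ||_F.

    Columns of G and F hold the (weighted) coarse and fine snapshot
    vectors. The minimizer is the orthogonal polar factor of F @ G.T:
    with F @ G.T = U @ S @ Vh, the solution is Omega = U @ Vh.
    """
    U, _, Vh = np.linalg.svd(F @ G.T)
    return U @ Vh

# sanity check: if the fine data is an orthogonal transform of the
# coarse data, F = Q @ G, the method recovers Q exactly
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
G = rng.standard_normal((5, 8))
assert np.allclose(procrustes(G, Q @ G), Q)
```

For complex-valued data the transpose becomes a conjugate transpose; the real case shown here is enough to convey the construction.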
The Procrustes parareal algorithm was introduced in [20]. In the original version, the fine and coarse solvers communicate on the coarse grid; the data matrices record the wave fields on the coarse grid, and $\Omega_k$ is correspondingly smaller. In contrast, in the version presented above, the solvers communicate on the fine grid. This allows the resulting algorithm to correct for details corresponding to higher wave numbers.
Both versions of the Procrustes parareal method are remarkably stable for wave propagation. We point out again that, due to this stability, we are able to use this method to generate training examples that lead to more performant neural networks; see Section 2.2.

The neural network approach
Of course, the center of this paper is the scheme with $\theta_{\Delta t^*} = H_w$:
$$ u^{k+1}_{n+1} = H_w\big(\tilde{G}_{\Delta t^*} u^{k+1}_n\big) + F_{\Delta t^*} u^k_n - H_w\big(\tilde{G}_{\Delta t^*} u^k_n\big), \tag{25} $$
where $H_w$ is a pretrained neural network discussed above. We initialized the parareal iterations by the enhanced coarse solutions: $u^0_{n+1} = H_w(\tilde{G}_{\Delta t^*} u^0_n)$, $n = 0, 1, \ldots, N-1$.
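In pseudocode terms, the network-enhanced scheme differs from plain parareal only in wrapping the coarse propagator. The sketch below shows a single update; `Hw` is a placeholder callable standing in for the trained network, and the scalar propagators are illustrative only:

```python
def enhanced_parareal_step(u_new, u_old, Hw, coarse, fine):
    """One update u^{k+1}_{n+1} of the network-enhanced parareal scheme.

    u_new  : u^{k+1}_n, the state from the current (new) sweep
    u_old  : u^k_n, the state from the previous iterate
    Hw     : trained correction network (placeholder callable)
    coarse : under-resolving coarse propagator G tilde
    fine   : fine propagator F
    """
    return (Hw(coarse(u_new))    # enhanced coarse step on the new sweep
            + fine(u_old)        # fine correction from iterate k
            - Hw(coarse(u_old))) # minus the old enhanced coarse term
```

With `Hw` set to the identity, the update reduces to the classical parareal correction, which makes the role of the network explicit: it only modifies the coarse channel of the iteration.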

Numerical simulations and comparisons
First, we follow up on the dispersion study in Section 3.2 and present the wave energy field computed in the first four parareal iterations in Figure 11. There we observe the effect of the additive correction of the parareal updates: with accurate phase, the amplitudes are restored effectively.
In Figures 15-19, we observe that the initial errors ($k = 0$) from different enhanced solvers are similar to the results reported in Figure 13. The initial errors are dominated by the effect of the restriction operator $R$ in $\tilde{G}_{\Delta t^*} = I G_{\Delta t^*} R$ being applied repeatedly in the time stepping process. However, as $k$ increases, the effect of the smaller phase errors in the JNets can be observed. In some situations, the advantage of having more dissipation can also be observed.
We present a comprehensive accuracy study in Figure 15, comparing the solvers that we have discussed so far. There, we observe the importance of the training data size for the nonlinear networks. We also notice that the linear networks perform surprisingly well in the studied setup (particularly the linear network trained on 2000 examples). One possible explanation is that the operator to be approximated is close to the identity matrix with a nonlinear perturbation (both $F_{\Delta t^*}$ and $\tilde{G}_{\Delta t^*}$ are nonlinear functions of $c$), at least for $\Delta t^* = 0.1$ and the class of wave speeds under consideration. Hence, linear regression may yield reasonable approximations. However, for larger $\Delta t^*$, we have seen in Figure 8 that linear networks are inferior. Figure 16 shows a comparison of the networks trained on $D_t$ and $D^p_t$. We observe that the network trained on $D^p_t$ yields better performance after a few iterations. It is possible that the wave fields in $D^p_t$ are closer to the ones encountered in the simulations performed in this experiment, since those wave fields are computed by parareal-style updates. In Figure 17 we compare the accuracy of the three-level JNets trained for $\Delta t^* = 0.2$ and $0.25$. We first observe from the errors at $k = 0$ that the enhanced solver with the smaller $\Delta t^*$ is more accurate. Except for the BP model, the performance of the parareal iterations with the two networks seems comparable.
In Figure 18, we present a comparison of the performance of the proposed parareal correction, (25), using the three-level and the six-level networks, both trained on $D^p_t$. The iterations with the three-level network appear to yield smaller errors because the three-level network has larger numerical dissipation (see Figure 10 and Section 3.2).
In Figure 19, we compare the Procrustes approach defined in (26) and the coarse solver enhanced by a 6-level JNet with $\Delta t^* = 0.25$.
So what can we conclude from these examples? First, the phase accuracy is of utmost importance in the performance of the parareal-style iterations. Second, the parareal scheme itself is sensitive to the amount of dissipation in the system. For systems with little or no dissipation, such as the hyperbolic system that we consider in this paper, the size of the iterative system, $N = T/\Delta t^*$, plays a critical role: the larger it is, the smaller the difference between $F_{\Delta t^*}$ and $\tilde{G}_{\Delta t^*}$ has to be in order to have stability and achieve the desired correction. Of course, the stability and convergence of such iterations is well understood, e.g. [31]. Our examples not only confirm such theories, but also demonstrate the feasibility of adopting a data-driven approach, e.g. the Procrustes approach [20] using online data or the deep learning approach using off-line data. Finally, we noticed that the BP model presents more difficulty for the proposed strategy. One possible cause may be the higher contrast in the BP model, particularly the large discontinuity across the periodically extended top and bottom boundaries; the training data should perhaps include more instances of waves propagating through such interfaces. This is a subject of future investigation.

Summary
In this paper, we presented a deep learning approach for improving the accuracy of wave propagation computed by an under-resolving numerical solver. The data used for training the neural networks come from simulations involving the coarse solver to be corrected and a more accurate fine solver. We incorporated the developed deep learning framework into the parareal scheme, thereby improving its convergence property and the accuracy of the simulated wave propagation. Experimental results show that the presented approach dramatically enhances the stability and accuracy of the resulting parareal scheme in a class of challenging media.
There is much room for generalization and improvement. We presented an approach to learn the correction operator $\theta_{\Delta t^*}$ for fixed $\Delta t^*$. Would it be possible to learn $\theta_{\Delta t^*}$ as a function of $\Delta t^*$? The presented numerical examples comparing networks trained on $D_t$ and $D^p_t$ demonstrated the importance of sampling the computationally relevant, strongly causal wave fields. A more rigorous understanding of the manifold of wave fields and devising efficient sampling algorithms would undoubtedly lead to more effective learning results.
Since $\omega - \omega_{\Delta x} = ck\,\varepsilon_{c,\Delta x}(k) = \omega(c,k)\,\varepsilon_{c,\Delta x}(k)$, the exact phase factors as $e^{-i\omega t} = e^{-i\omega_{\Delta x} t}\, e^{-i\varepsilon\omega t}$, where $\varepsilon\omega t \equiv \varepsilon_{c,\Delta x}(k)\,\omega(c,k)\,t$. Applying the Fourier transform to both sides of (37) in the physical domain, the matrix multiplications become convolutions. Here, the convolution and the Fourier transform are defined component-by-component. If the term $\varepsilon\omega t$ is sufficiently small (e.g. small wave number, slow medium wave speed, and small $\Delta t^*$), the correction operator is a near-identity map.
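A quick numerical check makes the near-identity claim concrete. For illustration we assume the discrete dispersion relation of a standard second-order centered scheme, $\omega_{\Delta x} = (2c/\Delta x)\sin(k\Delta x/2)$, which gives $\varepsilon_{c,\Delta x}(k) = 1 - \mathrm{sinc}(k\Delta x/2)$; the grid spacing and wave numbers below are made-up values, not those of the paper's experiments:

```python
import numpy as np

def eps(k, dx):
    """Relative phase error eps_{c,dx}(k) = 1 - sin(k*dx/2)/(k*dx/2),
    assuming the second-order centered-difference dispersion relation
    omega_dx = (2c/dx) * sin(k*dx/2) (an illustrative choice)."""
    x = k * dx / 2.0
    return 1.0 - np.sin(x) / x

c, dx, t = 1.0, 0.01, 0.1            # illustrative parameters
k_small, k_large = 2 * np.pi, 100 * np.pi

# accumulated phase error eps * omega * t, with omega = c * k:
# well-resolved wave numbers accumulate a negligible phase error,
# so the correction operator is close to the identity ...
assert eps(k_small, dx) * c * k_small * t < 1e-3
# ... while wave numbers near the grid limit accumulate an O(1) error
assert eps(k_large, dx) * c * k_large * t > 1.0
```

The same computation also explains why the correction operator must do real work for under-resolved media: the phase defect grows with $k$, $c$, and the elapsed time $t$.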