Parallel GPU-based Implementation of One-Way Wave Equation Migration

We present an original algorithm for seismic imaging based on depth wavefield extrapolation with the one-way wave equation. The parallel implementation of the algorithm exploits several levels of parallelism. Input-data parallelism allows processing full coverage of an area (up to one square kilometer); the data are divided into several subsets, and each subset is processed by a single MPI process. The mathematical approach allows dealing with each frequency independently and treating the solution layer by layer; thus, a set of 2D cross-sections, rather than the initial 3D common-offset vector gathers, is processed simultaneously. This part of the algorithm is implemented using GPU. Finally, each common-offset vector image can be stacked, processed, and stored independently. As a result, we designed and implemented a parallel algorithm for a hybrid CPU-GPU architecture that computes common-offset vector images using one-way wave equation-based amplitude-preserving migration. The algorithm was used to compute seismic images from real seismic land data.


Introduction
Seismic imaging is a procedure which constructs subsurface images from seismic data. In essence, a seismic image is a convolution of the recorded signal with two Green's functions (one for the source position, the other for the receiver position). This procedure must be applied to all possible source-receiver pairs; thus, the most computationally intensive part is the computation of the Green's functions. To simplify this step, depth extrapolation of the wavefield was suggested in [1], where the wavefield is computed by solving an initial value problem for the one-way wave equation (OWE). A pseudo-spectral method to solve the one-way wave equation was suggested recently in [2,3].
OWE-based migration has several computational features which make it suitable for processing large common-offset datasets. First, the Green's functions for different source and receiver positions can be computed independently and then combined to construct the image in the (ω-x) domain. Second, the Green's functions, and thus the images, are extrapolated layer by layer, so at each step of the algorithm one deals with 2D cross-sections of the solution defined on the plane z = const, and a large number of solutions can be processed simultaneously. As a result, one can consider any combination of the Green's functions for a fixed frequency at a fixed depth and construct an image of any type: common-shot, common-receiver, common-azimuth, etc.

Mathematical Background
Consider the initial value problem for the one-way wave equation stated in the (ω − x) domain:

\[
\frac{\partial u}{\partial z} = \frac{i\omega}{v(x,y,z)} \sqrt{1 + \frac{v^2(x,y,z)}{\omega^2} \left( \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} \right)}\, u, \qquad u|_{z=0} = u_0(x,y). \tag{1}
\]

To solve equation (1), we suggest using the pseudo-spectral method presented in [2]. This approach allows explicit stepping in the vertical direction; i.e., assuming the solution to be known at a plane z = z_0, we can compute it at the plane z = z_0 + ∆z by the following rule:

\[
u(\omega, x, y, z_0 + \Delta z) = \sum_j \alpha_j(x,y)\, F^{-1}\!\left[ \exp\!\left( i\, \Delta z \sqrt{\omega^2/v_j^2 - k_x^2 - k_y^2} \right) F\!\left[ u(\omega, x, y, z_0) \right] \right], \tag{2}
\]

where k_x and k_y are the spatial frequencies, F[·] and F^{-1}[·] denote the forward and inverse Fourier transforms with respect to the directions x and y, v_j are the nodal velocities, so that for each point v(x, y, z) ∈ [v_n, v_{n+1}], and the parameters α_j are the coefficients of interpolation, as described in [2].
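The stepping rule (2) can be sketched as follows. This is a minimal single-frequency illustration of the phase-shift-plus-interpolation idea, not the authors' GPU code; the function name `owe_step`, the two-velocity interpolation, and the handling of evanescent components are assumptions made for the sketch.

```python
import numpy as np

def owe_step(u, dz, omega, v_slab, v_ref, dx, dy):
    """One depth-extrapolation step for a single frequency omega.

    Illustrative sketch: extrapolate u with each reference velocity v_j
    in the spectral domain, then blend the results pointwise with the
    interpolation weights alpha_j (here: linear between two velocities).
    """
    ny, nx = u.shape
    kx = 2 * np.pi * np.fft.fftfreq(nx, d=dx)
    ky = 2 * np.pi * np.fft.fftfreq(ny, d=dy)
    KX, KY = np.meshgrid(kx, ky)  # shape (ny, nx)

    U = np.fft.fft2(u)
    extrapolated = []
    for vj in v_ref:
        kz2 = (omega / vj) ** 2 - KX**2 - KY**2
        # propagating components get a phase shift; evanescent ones decay
        kz = np.where(kz2 >= 0, np.sqrt(np.abs(kz2)),
                      1j * np.sqrt(np.abs(kz2)))
        extrapolated.append(np.fft.ifft2(np.exp(1j * kz * dz) * U))

    # linear interpolation weights between the bracketing nodal velocities
    v0, v1 = v_ref
    alpha1 = np.clip((v_slab - v0) / (v1 - v0), 0.0, 1.0)
    return (1 - alpha1) * extrapolated[0] + alpha1 * extrapolated[1]
```

For a laterally constant medium the rule reduces to a pure phase shift, which preserves the amplitude of propagating components.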
Using delta-functions as initial data, one can compute the Green's functions corresponding to the positions of the sources and the receivers and construct an image:

\[
I(x,y,z) = \int d\omega \int_{D_s}\!\int_{D_r} \overline{G(\omega, x, y, z, \mathbf{x}_s)\, G(\omega, x, y, z, \mathbf{x}_r)}\, f(\omega, \mathbf{x}_s, \mathbf{x}_r)\, d\mathbf{x}_r\, d\mathbf{x}_s, \tag{3}
\]

where x_s = (x_s, y_s) is the vector of source coordinates at the free surface, and G(ω, x, y, z, x_s) is the Green's function corresponding to the source position x_s. The same notation is used for the receivers. The function f is the impulse recorded by the receiver at point x_r and emitted by the source at point x_s. The choice of the domains of integration D_s and D_r defines the type of the constructed image (common-shot, common-receiver, common-offset, etc.). Note that a coarse grid (50 m step) in the vertical direction is used to solve the OWE; after that, the solution is interpolated to a grid with a step of 5 m to construct the image. In this work, we use a linear frequency-dependent interpolation, which can be applied directly to the image rather than to the Green's functions, which drastically reduces the number of computations.
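The discrete counterpart of the imaging condition (3) at one depth level can be sketched as below. The function name `image_slice`, the array layout, and the placement of the complex conjugate are assumptions of this illustration, not the authors' implementation:

```python
import numpy as np

def image_slice(G_src, G_rec, f_hat, omegas):
    """Cross-correlation-type imaging condition at one depth level.

    G_src[m, s] and G_rec[m, r] are the Green's functions on the image
    grid for frequency omega_m and source/receiver index s/r;
    f_hat[m, s, r] is the spectrum of the recorded trace.  The image is
    the sum over frequencies and source-receiver pairs.
    """
    img = np.zeros(G_src.shape[2:], dtype=float)
    for m, _ in enumerate(omegas):
        for s in range(G_src.shape[1]):
            for r in range(G_rec.shape[1]):
                # combine the two Green's functions with the data sample
                img += np.real(np.conj(G_src[m, s] * G_rec[m, r])
                               * f_hat[m, s, r])
    return img
```

Restricting the `s` and `r` loops to subsets of sources/receivers plays the role of the integration domains D_s and D_r.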

Parallel Implementation
To exploit data parallelism, we divide the input data into subsets, each processed by a single MPI process. We use the coordinates of the mid-points to parametrize the imaging domain, so we apply 2D domain decomposition and consider all the sources and receivers corresponding to the mid-points from one subdomain as a single dataset, or "stencil". This division degrades the scalability of the algorithm because some of the sources/receivers belong to several datasets, so the Green's functions for them have to be computed several times; however, we keep the average number of Green's function recomputations as low as 2.3.
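The decomposition and the recomputation overhead can be sketched as follows; the trace layout, the function name `decompose`, and the uniform block grid are simplifying assumptions, not the production data format:

```python
from collections import defaultdict

def decompose(traces, block_size):
    """2D mid-point decomposition into "stencils" (simplified sketch).

    Each trace is (mid_x, mid_y, src_id, rec_id).  Traces whose
    mid-points fall into the same block form one stencil; a source or
    receiver shared by several blocks must have its Green's function
    recomputed once per block.
    """
    stencils = defaultdict(set)  # block index -> set of src/rec ids
    for mx, my, src, rec in traces:
        block = (int(mx // block_size), int(my // block_size))
        stencils[block].update((("s", src), ("r", rec)))

    # recomputation factor: total Green's functions over unique ones
    total = sum(len(ids) for ids in stencils.values())
    unique = len(set().union(*stencils.values()))
    return stencils, total / unique
```

A factor of 1.0 would mean no source/receiver is shared between stencils; the paper reports an average of 2.3 for the real dataset.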
Parallel implementation of the computational part of the algorithm includes two aspects. First, for each time-frequency, we use a loop with respect to depth; i.e., we compute the Green's functions for all source and receiver positions, combine them, and multiply by the corresponding signal. Thus, at a fixed depth level we construct all possible images I_k(ω_m, x, y, z_l). The GPUs perform the computation of the Green's functions and the construction of the images. After that, a set of 2D cross-sections of the images I_k(ω_m, x, y, z_l) is transferred to host RAM. Next, we apply all-to-all MPI communications to exchange the images, so that each MPI process accumulates a single common-offset image for all stencils. Second, we sum up the images from different stencils (integrate over source/receiver positions) and interpolate the result within a slab [z, z + ∆z]. The CPU performs these operations in parallel with the computation of the Green's functions described above. The output of each MPI process is a single common-offset vector image for all stencils. The block scheme of the algorithm is presented in Fig. 1.
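The overlap of GPU-side extrapolation with CPU-side accumulation described above amounts to a two-stage pipeline over depth levels. The sketch below imitates it with a bounded queue between two threads; the function names and the queue-based handoff are illustrative stand-ins, not the authors' MPI/CUDA machinery:

```python
import threading
import queue

def pipelined_depth_loop(n_depths, compute_slice, accumulate):
    """Overlap slice computation with accumulation (sketch).

    compute_slice(l) stands in for the GPU extrapolation at depth
    level l; accumulate(l, data) stands in for the CPU-side stacking
    and interpolation within the slab.  The bounded queue lets the two
    stages run concurrently while limiting buffered slices.
    """
    q = queue.Queue(maxsize=2)

    def producer():
        for l in range(n_depths):
            q.put((l, compute_slice(l)))
        q.put(None)  # sentinel: no more depth levels

    t = threading.Thread(target=producer)
    t.start()
    while (item := q.get()) is not None:
        accumulate(*item)  # runs while the producer computes the next slice
    t.join()
```

In the real algorithm the consumer stage also performs the all-to-all MPI exchange before accumulating.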

Weak Scaling
To estimate the weak scaling of the algorithm, we consider the open-source SEG Salt model, for which we compute 17 images from 2600 datasets. We perform simulations using 17, 34, 51, and 68 MPI processes, so that each node deals with one dataset at a time. The simulation times are provided in Tab. 1. The loss of efficiency is caused by the MPI exchanges, which are implemented with enforced synchronization.
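Under weak scaling the per-process workload is fixed, so the ideal wall-clock time is constant and the efficiency is the ratio of the smallest run's time to each larger run's time. A small helper for reading Tab. 1 this way (the numbers in the test are placeholders, not the measured values):

```python
def weak_scaling_efficiency(times):
    """Weak-scaling efficiency relative to the smallest run.

    times: mapping {n_processes: wall_time}.  With the problem size
    growing proportionally to the process count, efficiency for n
    processes is T(n_min) / T(n); 1.0 means perfect weak scaling.
    """
    base_n = min(times)
    return {n: times[base_n] / t for n, t in sorted(times.items())}
```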

Real-Data Example
We use the algorithm to construct a seismic image from real onshore seismic data. The size of the model is 15525 m in the InLine (x) direction and 11250 m in the CrossLine (y) direction; the depth is 5000 m. We compute 40 common-offset vector images using 320 datasets and 40 computational nodes. The total wall-clock time is about 320 hours; thus, the computational cost of constructing the set of 40 images is 11520 node-hours. A slalom-line section of the obtained 3D seismic image is presented in Fig. 2.

Conclusions
We presented an original algorithm of seismic migration based on the solution of the one-way wave equation. The algorithm combines MPI, OpenMP, and CUDA technologies. The dataflow is parallelized via MPI so that each node deals with a single dataset. The computations of the Green's functions and the images are performed by GPU; after that, the images are passed between the nodes using MPI. Additional computations and I/O are implemented via OpenMP.