Data-driven Seismic Waveform Inversion: A Study on the Robustness and Generalization

Acoustic- and elastic-waveform inversion is an important and widely used method for reconstructing subsurface velocity images. Waveform inversion is a typical non-linear and ill-posed inverse problem. Existing physics-driven computational methods for solving waveform inversion suffer from cycle skipping and local minima, and they are also computationally expensive. In recent years, data-driven methods have become a promising way to solve the waveform inversion problem. However, most deep learning frameworks suffer from generalization and over-fitting issues. In this paper, we develop a real-time data-driven technique, which we call VelocityGAN, to accurately reconstruct subsurface velocities. VelocityGAN is built on a generative adversarial network (GAN) and trained end-to-end to learn a mapping function from the raw seismic waveform data to the velocity image. Different from other encoder-decoder based data-driven seismic waveform inversion approaches, VelocityGAN learns regularization from data and further imposes the regularization on the generator so that inversion accuracy is improved. We further develop a transfer learning strategy based on VelocityGAN to alleviate the generalization issue. A series of experiments is conducted on synthetic seismic reflection data to evaluate the effectiveness, efficiency, and generalization of VelocityGAN. We not only compare it with existing physics-driven approaches and data-driven frameworks but also conduct several transfer learning experiments. The experimental results show that VelocityGAN achieves state-of-the-art performance among the baselines and can improve the generalization results to some extent.


I. INTRODUCTION
Seismic full-waveform inversion techniques are commonly used in geophysical exploration to determine site geology, stratigraphy, and rock quality. These techniques provide information about subsurface layering and rock geomechanical properties. In particular, seismic full-waveform inversion infers a 2D/3D map of seismic velocity from observations (Fig. 1). The seismic velocity depends on, and therefore predicts, subsurface material properties. There are two primary ways of solving this problem, depending on the complexity of the forward model that is used. The simpler approach is travel-time inversion [41], which has a linear forward operator but provides results of inferior accuracy and resolution [27]. Acoustic- and elastic-waveform inversion (AEWI) techniques [8,12,16,45] provide superior solutions by modeling the wave propagation in the subsurface, but the forward operator is non-linear and computationally expensive to simulate, and the problem is ill-posed, without a unique solution [45]. AEWI can be solved in either the time domain or the frequency domain [11,16,44]. The major challenges of solving AEWI are threefold: ill-posedness, cycle skipping, and high computational cost. Similar to other geophysical exploration methods, AEWI suffers from limited data coverage, which results in extremely under-constrained inverse problems. Because AEWI is highly non-linear and sensitive to the initial guess, a naive approach to the AEWI problem typically converges to a local minimum, which is also referred to as cycle skipping. Having low-frequency components in the inversion is critical to alleviating this cycle skipping issue. To make matters worse, solving AEWI problems is also computationally expensive. Most of the existing approaches to solving AEWI rely on iterative nonlinear optimization techniques. At each iteration, it costs $O(n^3)$ to obtain the gradient for a 2-D $n \times n$ subsurface model.
To mitigate these issues, many regularization approaches have been proposed and developed in recent years, including Tikhonov-like regularization [5,16,34], total-variation regularization [2,11,24,25,26], high-order regularization techniques [42], and prior-based methods [30,52]. Most of these regularization and prior-based techniques are hand-crafted, meaning that they are loosely (if at all) related to the physical problem at hand. Furthermore, all of these solutions are developed under the physics-driven AEWI framework. Hence, they inherit its expensive computational cost.
More recently, with the successes of deep learning in the computer vision community [17,54,56], researchers have developed various data-driven seismic AEWI techniques [3,32,35,38,46,47]. Data-driven frameworks take the waveform data as input and directly output the corresponding velocity image. In this work, we study a method based on the generative adversarial network (GAN) [10]. GANs have proved effective in photo inpainting [17,56], image denoising [48], super-resolution [22], image deblurring [21], and so forth. Motivated by these successes, we solve the AEWI problem using a GAN. Specifically, our model consists of two parts: a generator and a discriminator. The generator is an encoder-decoder structure that maps the raw seismic waveform data to a velocity image. The discriminator is a convolutional neural network (CNN) designed to distinguish real velocity images from fake ones. There are two major benefits of using a GAN to solve seismic waveform inversion problems. Firstly, our model learns a regularization term directly from data through the discriminator and imposes the learned regularization term on the generator. The regularization term is used to differentiate between ground-truth velocity maps and generated velocity maps. This type of GAN-based regularization has recently been discussed in Li et al. [23] and yields superior results for computer-vision tasks. Secondly, our GAN-based inversion model is an end-to-end framework, similar to the image-to-image translation problem in computer vision [9,18], which means that the velocity map can be output in real time once training is completed.
Compared with physics-driven methods, the major disadvantage of data-driven methods is that they suffer from robustness and generalization issues. A deep neural network trained on a specific dataset tends to perform worse on another dataset. To alleviate this issue, we combine our data-driven method with network-based deep transfer learning. Network-based deep transfer learning reuses a network that is pre-trained in the source domain and transfers its parameters and structure to the target domain. In our work, we apply the fine-tuning strategy, in which all of the model's parameters are updated for the new dataset.
To summarize, the main contributions of our work are:
• To the best of our knowledge, we are the first to apply a conditional adversarial network to AEWI. Our model transforms the inversion process of physics-driven methods into an image mapping problem. As a result, it can alleviate the local minima and low computational efficiency issues.
• We develop a modified encoder-decoder structure that is more suitable for AEWI. In addition, we combine the mean absolute error (mae) loss with the mean square error (mse) loss to further improve the quality of the velocity images. Compared with the other deep learning baselines, our model generates more accurate velocity images.
• We show that the GAN-based regularization technique can yield better reconstruction accuracy than the encoder-decoder-based inversion method.
• We perform a series of experiments to demonstrate the robustness of our model, validating that it does not just "memorize" the training data but learns the intrinsic physical laws from the training set.
• We conduct additional experiments to demonstrate that a data-driven method combined with transfer learning is a feasible way to alleviate the generalization issue.
In the following sections, we first briefly review related work in Section II. We then describe the fundamentals of physics-driven versus data-driven methods and deep neural networks (Section III), and develop and discuss our novel inversion method, VelocityGAN. Section IV describes the data we tested on, the experimental setup, and the experimental results we obtained. Finally, concluding remarks are presented in the Conclusions section.

II. RELATED WORK

A. Data-driven Inverse Problems
Acoustic- and elastic-waveform inversion (AEWI) is a specific type of inverse problem. We first review relevant literature on solving inverse problems in other domains. In particular, we focus on techniques based on deep neural networks [1,14,19,49,50,55]. In general, deep-learning based methods for solving inverse problems can be categorized into four types [29]: 1) learning an end-to-end regression with a vanilla convolutional neural network (CNN), 2) learning a higher-level representation, 3) gradually refining the inversion procedure, and 4) incorporating analytical methods and learning a denoiser. An interesting work in the first category is AUTOMAP, which was recently developed by Zhu et al. [55]. The authors developed an end-to-end reconstruction algorithm for MRI imaging, in which the encoder consists of three fully connected networks that read in sensor-domain data and the decoder consists of three additional convolutional and de-convolutional layers that yield the image-domain output. A common use of CNNs is to learn a compressed representation prior to constructing an output image. Several existing works exploit the effectiveness of autoencoders in learning relevant features to solve inverse problems in imaging. As an example, Zeng et al. [50] employ the autoencoder's representation-learning capability to learn useful representations of low-resolution and high-resolution images. A shallow neural network is then trained to learn a correspondence between the learned low-resolution representation and the high-resolution representation. In the third category, CNNs are used to learn a residual between two or more layers through a skip connection from the input of the residual block to its output. This network structure is particularly well suited to inverse problems such as image restoration, where the input and output images share similar content. The works of Yao et al. [49] and Kim et al. [19] both belong to this category.
Another type of research effort to solve inverse problems using neural networks incorporates analytical solutions. The work of Hammernik et al. [14] falls into this category. Hammernik et al. [14] reformulate a generalized compressed sensing reconstruction as a variational model, which is embedded in an unrolled gradient descent scheme. Key parameters, such as those used in the activation functions, are learned through an offline training procedure. In the inference stage, the previously learned model is applied online to unseen data. Another example in this category is the work of Adler and Öktem [1]. They unrolled a proximal primal-dual optimization method, replaced the proximal operators with CNNs, and successfully applied the resulting method to the CT image reconstruction problem.

B. Data-driven Acoustic-and Elastic-waveform Inversion
In seismic waveform inversion in particular, there have been some recent developments in data-driven techniques, which can be categorized into two groups: end-to-end learning [3,35,46,47] and low-wavenumber learning [32,38]. The end-to-end strategy directly learns a mapping from the seismic data domain to the velocity model domain. The low-wavenumber strategy learns the low-wavenumber components from data, followed by traditional full-waveform inversion iterations. Comparing these two strategies, end-to-end learning is more aggressive and usually requires much more complex network structures to account for the nonlinear nature of full-waveform inversion. Encouraging results have recently been reported in Wu and Lin [47], owing to the large amount of training data used.

C. Deep Transfer Learning
A great number of deep transfer learning methods have been developed in recent years. There are four main types of deep transfer learning approaches: instance-based, mapping-based, network-based, and adversarial-based deep transfer learning [39]. Our work belongs to network-based deep transfer learning. Two types of network-based deep transfer learning are widely used in practical applications: fine tuning and feature extraction. Feature extraction reuses a pre-trained model and updates only a few layer weights for the target domain. For example, the authors of [31] reuse front layers trained on ImageNet to compute intermediate image representations for images in other datasets. Fine tuning means that all of the model parameters are updated for the new task.

III. THE INVERSION MODELS
We first present an overview of the governing physics equations (the acoustic- and elastic-wave equations), the physics-driven AEWI method, and the data-driven inversion method in Section III-A. In Section III-B, we provide details on VelocityGAN and its network structure. In Section III-C, we provide the loss functions of VelocityGAN. In Section III-D, we discuss the connections to inverse and regularization theory.
A. Acoustic-and Elastic-Waveform Inversion: Physics-Driven Approach

1) Governing Physics -Wave Equation:
Mathematically, the seismic acoustic-wave equation (or "forward model") is
$$\left[\frac{1}{K(r)}\frac{\partial^2}{\partial t^2} - \nabla\cdot\left(\frac{1}{\rho(r)}\nabla\right)\right] p(r,t) = s(r,t), \tag{1}$$
where $\rho(r)$ is the density at spatial location $r$, $K(r)$ is the bulk modulus, $s(r,t)$ is the source term, $p(r,t)$ is the pressure wavefield, and $t$ represents time. The elastic-wave equation is written as
$$\rho(r)\,\ddot{u}(r,t) - \nabla\cdot\left[C(r):\nabla u(r,t)\right] = s(r,t), \tag{2}$$
where $C(r)$ is the elastic tensor and $u(r,t)$ is the displacement wavefield. To simplify the expression, the forward modeling problems in Eqs. (1) and (2) can be written as
$$P = f(m), \tag{3}$$
where $P$ is the pressure wavefield for the acoustic case or the displacement wavefields for the elastic case, $f$ is the forward acoustic- or elastic-wave modeling operator, and $m$ is the velocity model parameter vector, including the density and the compressional- and shear-wave velocities. We use a time-domain staggered-grid finite-difference scheme to solve the acoustic- or elastic-wave equation. Inference of unknown subsurface properties relies on indirect and limited geophysical measurements taken at or near the surface. Therefore, seismic full-waveform inversion is extremely under-constrained and can be severely ill-posed. Various explicit regularization techniques have been developed to stabilize the computation of seismic inversion. This regularized physics-driven seismic inversion can be posed as
$$E(m) = \min_{m}\left\{ \|d - f(m)\|_2^2 + \lambda R(m) \right\}, \tag{4}$$
where $d$ is the recorded seismic data, $\|d - f(m)\|_2^2$ is the data misfit, $\|\cdot\|_2$ stands for the $L_2$ norm, $\lambda$ is a regularization parameter, and $R(m)$ is the regularization term. The regularization term measures the "complexity" of the model $m$, so that the minimization in Eq. (4) favors simple solutions that are consistent with the data. Explicit regularization techniques such as the $L_1$-norm [2,11,24,25] or the $L_2$-norm [8,34,43] have been developed for seismic inversion, but these regularizers do not fully express an expert's prior knowledge.
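As an illustration of the regularized objective in Eq. (4), the following minimal sketch evaluates a data misfit plus a weighted regularizer. It is not the wave-equation solver used in this paper: the linear operator `G` stands in for the non-linear forward operator f, and the total-variation regularizer, the function names, and all numbers are assumptions chosen only to keep the example small.

```python
def forward(m, G):
    """Toy linear forward operator f(m) = G m (hypothetical stand-in for wave modeling)."""
    return [sum(g_ij * m_j for g_ij, m_j in zip(row, m)) for row in G]


def data_misfit(d, m, G):
    """Squared L2 data misfit ||d - f(m)||_2^2."""
    return sum((d_i - p_i) ** 2 for d_i, p_i in zip(d, forward(m, G)))


def total_variation(m):
    """L1-type regularizer R(m): sum of absolute jumps between neighbouring cells."""
    return sum(abs(b - a) for a, b in zip(m, m[1:]))


def objective(m, d, G, lam):
    """Regularized objective: data misfit + lam * R(m)."""
    return data_misfit(d, m, G) + lam * total_variation(m)


# Hypothetical 3-cell model and operator, for illustration only.
G = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
m_true = [2.0, 2.0, 3.0]
d = forward(m_true, G)          # noise-free synthetic data
m_rough = [2.5, 1.5, 3.0]       # a rougher candidate model

print(objective(m_true, d, G, lam=0.1))
print(objective(m_rough, d, G, lam=0.1))
```

The rougher candidate is penalized twice: it fits the data less well and it has a larger total variation, so the regularized objective prefers the simpler model.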
2) Data-driven Approach: Different from the physics-driven methods, end-to-end data-driven methods turn the minimization process into a mapping problem. The model parameters $m$ are directly learned from
$$m = g(d), \tag{5}$$
where $g = f^{-1}(\cdot)$ is the inverse operator of $f(\cdot)$. We can therefore obtain the loss function as
$$\hat{g} = \arg\min_{g} \sum_{i=1}^{N} \mathcal{L}\big(g(d_i), m_i\big), \tag{6}$$
where $(m_i, d_i)$ are $N$ pairs of velocity maps and the corresponding seismic data, and $\mathcal{L}$ measures the mismatch between the predicted and the true velocity maps. Most of the existing end-to-end data-driven AEWI methods use an encoder-decoder structure to learn the mapping function $g$ in Eq. (6) [3,35,46,47].

B. VelocityGAN: Data-Driven Approach
The overall architecture of VelocityGAN is shown in Fig. 2. It consists of an image-to-image generator and a CNN discriminator. We discuss the structure of the generator and the discriminator below.

Fig. 2. The overall architecture of VelocityGAN. We apply an encoder-decoder structure as the generator ("Inversion Generator") and a convolutional neural network as the discriminator.

1) Generator:
To better understand the network structure, we first recall the governing physics of our input seismic shot-gather imagery (Fig. 3) and output velocity map (Fig. 4). In this paper, the input seismic shot-gather imagery is the combination of the acoustic waves recorded by different receivers; an exemplar 2D seismogram is visualized in Fig. 3. Specifically, there are 3 source functions and 32 receivers, which correspond to s(r, t) and p(r, t) in Eq. (1) (or u(r, t) in Eq. (2)), respectively. The source function may contain both P-waves and S-waves. Correspondingly, the collected seismic waveform data is a tensor of dimension 32 × 1000 × 6, where the first dimension corresponds to the 32 receivers, the second dimension of 1000 is the length of each waveform trace in time samples, and the third dimension of 6 accounts for the 2 channels of each of the 3 source functions. As shown in Fig. 3, the 2D seismogram consists of 32 1D time-series signals, and each signal contains a pulse that carries information about the subsurface structure. We also present several velocity images, which are the output of our model, in Fig. 4. The dimension of the output velocity map is (m, n), where the first dimension m stands for depth and the second dimension n stands for horizontal offset. The value of each pixel in the velocity image is the absolute velocity at that location. The grid spacing between pixels is 5 meters. Therefore, the physical size of the velocity map is 5m × 5n meters. The linear geologic feature shown in the velocity maps in Fig. 4 is the geologic fault.
As discussed above, there is no direct spatial similarity between the input seismic shot-gather imagery and the output subsurface velocity map (see Figs. 3 and 4). We therefore do not penalize the mismatch between the input and output as in [36,56]. In addition, since the height (1000) and width (32) of the input are unbalanced, we apply several convolutional layers with k × 1 kernels, where k is the kernel length along the height dimension. The detailed structure of the generator is shown in Table I. To extract the waveform features of each receiver, 9 convolutional layers with k × 1 kernels are deployed first. Each convolutional layer is followed by a BatchNormalization layer and a LeakyReLU layer. After the height dimension is reduced to 32, 3 × 3 convolutional kernels with stride 2 are added to encode the extracted features. In the last layer of the encoder, 8 × 8 convolutional kernels are used to eliminate the influence of spatial information. The decoder consists of 5 upconv blocks, a center cropping layer, and a convolutional layer. Each upconv block consists of a transposed convolutional layer, a batch normalization layer, and an activation layer. The transposed convolutional layers increase the height and width of the feature maps and decode the extracted features. The convolutional layer maps the features to the same dimension as the ground-truth labels. The center cropping layer crops the feature maps to the desired dimension. To limit the output values to a specific range, the center cropping layer is followed by a Tanh layer. We have also tried replacing the transposed convolutions with upsampling layers, as in Ronneberger et al. [36] and Shelhamer et al. [37]. This yields worse performance, because the geologic fault, i.e., the linear geologic feature of the velocity map, is not reconstructed as well with that structure.
2) Discriminator: Similar to Radford et al. [33], we adapt our discriminator from a CNN architecture. Specifically, it consists of five convolutional blocks, a global average pooling layer, and fully connected layers. Each convolutional block combines Convolutional, BatchNormalization, LeakyReLU, and MaxPooling layers. We apply a "PatchGAN" classifier [17] in the discriminator to capture local style statistics. We set the patch size to 4 and calculate the mean loss value over all patches in an image. Since we would like to generate accurate velocity models, especially around geological faults and interfaces, "PatchGAN" is more suitable than "GlobalGAN" for our task.
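The size bookkeeping behind the generator layers (strided k × 1 convolutions on the 1000 × 32 input, transposed convolutions in the decoder, and the final center crop) can be sketched with the standard output-size formulas. Table I is not reproduced here, so every kernel size, stride, padding, and starting size below is a hypothetical choice for illustration; only the 1000 × 32 input and the 150-pixel target dimension come from the text.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def tconv_out(size, kernel, stride=1, padding=0):
    """Output length of a transposed convolution along one axis (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def center_crop_bounds(size, target):
    """Start and stop indices that keep the central `target` entries."""
    start = (size - target) // 2
    return start, start + target


# A k x 1 kernel (assumed k = 7, stride 2, padding 3 along height) halves the
# time axis and leaves the receiver axis untouched.
height = conv_out(1000, kernel=7, stride=2, padding=3)
width = conv_out(32, kernel=1)
print(height, width)

# Five upconv blocks (assumed kernel 4, stride 2, padding 1) each double the size.
size = 5  # hypothetical size entering the decoder
for _ in range(5):
    size = tconv_out(size, kernel=4, stride=2, padding=1)
print(size)

# Center crop the decoder output to a 150-pixel dimension of the velocity map.
print(center_crop_bounds(size, 150))
```

With these assumed settings the decoder overshoots the target size slightly, which is the situation a center cropping layer is meant to handle.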

C. Loss Function
Wasserstein GAN (WGAN) with gradient penalty [13] has been shown to be robust across a wide variety of generator architectures. Considering the modified structure of our generator, we use the Wasserstein loss with gradient penalty to distinguish the real velocity maps from the generated ones. The loss function of the discriminator is formulated as
$$L_d = \mathbb{E}_{\tilde{x}\sim P_g} D(\tilde{x}) - \mathbb{E}_{x\sim P_r} D(x) + \lambda\, \mathbb{E}_{\hat{x}\sim P_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right], \tag{7}$$
where $P_g$ is the distribution of the velocity maps predicted by the generator of VelocityGAN, $P_r$ is the distribution of the ground-truth velocity maps, and $P_{\hat{x}}$ is the distribution of random samples drawn between points from $P_g$ and $P_r$.
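To make the structure of the Wasserstein loss with gradient penalty concrete, here is a minimal sketch with a toy linear critic. This is an assumption made purely so that the input gradient is available in closed form without an autograd library; the discriminator in this paper is a CNN, and the names `critic`, `critic_grad_norm`, and `wgan_gp_critic_loss` are hypothetical. The penalty weight of 10 is the value of λ reported for Eq. (7).

```python
import math
import random


def critic(x, w):
    """Toy linear critic D(x) = w . x (hypothetical stand-in for the CNN discriminator)."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def critic_grad_norm(x, w):
    """L2 norm of dD/dx; for a linear critic the gradient equals w for every x."""
    return math.sqrt(sum(w_i ** 2 for w_i in w))


def wgan_gp_critic_loss(real, fake, w, lam=10.0, rng=random):
    """mean D(fake) - mean D(real) + lam * mean (||grad D(x_hat)||_2 - 1)^2."""
    n = len(real)
    wasserstein = (sum(critic(f, w) for f in fake)
                   - sum(critic(r, w) for r in real)) / n
    penalty = 0.0
    for r, f in zip(real, fake):
        eps = rng.random()  # interpolation weight in [0, 1)
        x_hat = [eps * r_i + (1.0 - eps) * f_i for r_i, f_i in zip(r, f)]
        penalty += (critic_grad_norm(x_hat, w) - 1.0) ** 2
    return wasserstein + lam * penalty / n


real = [[1.0, 0.0], [2.0, 0.0]]   # stand-ins for ground-truth velocity maps
fake = [[0.0, 0.0], [0.0, 1.0]]   # stand-ins for generated velocity maps

print(wgan_gp_critic_loss(real, fake, w=[1.0, 0.0]))  # unit gradient norm: no penalty
print(wgan_gp_critic_loss(real, fake, w=[2.0, 0.0]))  # gradient norm 2: penalized
```

Scaling the critic weights widens the gap between real and fake scores, but the gradient penalty grows faster, which is what keeps the critic from becoming arbitrarily steep.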
For the generator, we want the predicted velocity map not only to fool the discriminator but also to reveal accurate information about the geological structure. Therefore, the loss function is a combination of an adversarial loss and a content loss. Consistent with Eq. (7), the adversarial loss is $-\mathbb{E}_{\tilde{x}\sim P_g} D(\tilde{x})$.
The content loss is a combination of the mean absolute error (mae) and the mean square error (mse). In our experiments, we observe that the mse loss is good at capturing the geological faults, while the mae loss performs better at revealing the geological interfaces. Therefore, the loss function of the generator is formulated as
$$L_g = -\mathbb{E}_{\tilde{x}\sim P_g} D(\tilde{x}) + \frac{\lambda_1}{wh}\sum_{i=1}^{w}\sum_{j=1}^{h}\left|v(i,j) - \tilde{v}(i,j)\right| + \frac{\lambda_2}{wh}\sum_{i=1}^{w}\sum_{j=1}^{h}\left(v(i,j) - \tilde{v}(i,j)\right)^2, \tag{8}$$
where $w$ and $h$ are the width and height of the velocity map, respectively, $v(\cdot)$ represents the real pixel value of the velocity map, and $\tilde{v}(\cdot)$ is the predicted pixel value. $\lambda_1$ and $\lambda_2$ are hyper-parameters that control the relative importance of the two loss terms. In our experiments, we pick $\lambda_1$ and $\lambda_2$ by balancing the mae and mse losses during the training process. The specific values of $\lambda_1$ and $\lambda_2$ are discussed in Section IV-A2.
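The combination of adversarial and content losses for the generator can be sketched for a single velocity map as follows. This is a minimal illustration rather than the training code: the critic score is passed in as a plain number, the function names are hypothetical, and pairing the weight 50 with the mae term and 100 with the mse term is an assumption based on the order in which the two losses are listed.

```python
def mae(v_true, v_pred):
    """Mean absolute error over all pixels of a 2D velocity map."""
    total = sum(abs(t - p) for row_t, row_p in zip(v_true, v_pred)
                for t, p in zip(row_t, row_p))
    return total / (len(v_true) * len(v_true[0]))


def mse(v_true, v_pred):
    """Mean square error over all pixels of a 2D velocity map."""
    total = sum((t - p) ** 2 for row_t, row_p in zip(v_true, v_pred)
                for t, p in zip(row_t, row_p))
    return total / (len(v_true) * len(v_true[0]))


def generator_loss(v_true, v_pred, critic_score, lam1=50.0, lam2=100.0):
    """Adversarial term (-D(v_pred)) plus weighted mae and mse content terms."""
    return -critic_score + lam1 * mae(v_true, v_pred) + lam2 * mse(v_true, v_pred)


v_true = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical 2 x 2 ground-truth map
v_pred = [[1.0, 2.0], [3.0, 5.0]]   # prediction with one wrong pixel

print(mae(v_true, v_pred), mse(v_true, v_pred))
print(generator_loss(v_true, v_pred, critic_score=0.5))
```

Because the mse term squares the residual, a single large error (for example, a missed fault) weighs more heavily there than in the mae term, which is one way to read the complementary behaviour of the two losses reported above.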

D. Connection to Regularization Theory
There is a close connection between GANs and the regularization techniques used in inverse problems [23]. To see the connection, we can rewrite Eq. (8) as
$$L_g = \underbrace{\lambda_1\,\mathrm{mae}(v, \tilde{v}) + \lambda_2\,\mathrm{mse}(v, \tilde{v})}_{\text{content loss}} \;\underbrace{-\,\mathbb{E}_{\tilde{x}\sim P_g} D(\tilde{x})}_{\text{adversarial loss}}, \tag{9}$$
where the predicted velocity map $\tilde{v}$ is the output of the target mapping $g$, and the last term is not only an adversarial loss term but also plays the role of a regularization term that is learned from training data. The content-loss terms (mae and mse) in Eq. (8) or Eq. (9) promote velocity-map consistency. In particular, we use GANs to learn a classifier that discriminates between the distribution of the ground-truth velocity maps and the distribution of the generated velocity maps. This discriminator effectively penalizes velocity models that do not "look like" the velocity models used for training. The usual approach to alleviating the ill-posedness of inverse problems is to incorporate prior knowledge through a regularization term that penalizes solutions inconsistent with this prior knowledge. Most existing regularization techniques employ generic functions (e.g., $L_1$-norm or $L_2$-norm penalties on coefficients) that are loosely (if at all) related to the physical problem at hand. In contrast, regularization learned from data can be more effective and customized for the problem at hand.

IV. EXPERIMENTS
We introduce the datasets and training details in Section IV-A. We discuss the experiment settings in Section IV-B. Following that, we compare and analyze the results of different methods. Last but not least, we present generalization experiments and provide a feasible way to solve the generalization issue.
A. Datasets and Training Details
1) Datasets: In practical applications, velocity models are estimated by physics-driven methods (usually an optimization algorithm). It can be unrealistic and expensive to obtain a large-scale dataset consisting of seismic waveforms and velocity models. To verify the efficacy of VelocityGAN, we therefore generate a dataset of velocity images and the corresponding seismic waveform data, simulated using Eq. (1). The generated velocity images vary in tilting angle, layer thickness, layer velocity, etc., and can be a good representation of real velocity images [28]. Although VelocityGAN is validated using the seismic acoustic-wave equation, the method developed in this paper can be directly adapted to the elastic scenario in Eq. (2) as well.
We create a main dataset to evaluate the efficiency and effectiveness of VelocityGAN. The dataset contains 50,000 velocity models of dimension 150 by 100, along with their corresponding seismic waveforms. The velocity images in this dataset contain complicated geological layers, and most of the layer interfaces are curved. We name this dataset "CurvedData". For this dataset, 3 common-shot gathers of synthetic seismic data, each with 32 receivers, are placed at the top surface. We use a Ricker wavelet with a center frequency of 50 Hz as the source time function and a staggered-grid finite-difference scheme with a perfectly matched layer absorbing boundary condition to generate 2D synthetic seismic reflection data [40,51]. The synthetic trace at each receiver is a time series of length 1,000. Hence, the input size is (32, 1000, 6), where 32 is the number of receivers, 1000 is the length of the time sequence, and 6 is the number of channels.
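For readers unfamiliar with the source time function, the sketch below evaluates a Ricker wavelet with a 50 Hz center frequency over 1,000 time samples. The wavelet formula is the standard one; the sampling interval `dt` and the onset `delay` are not given in the text and are assumptions chosen for illustration.

```python
import math


def ricker(t, f0=50.0):
    """Ricker wavelet with center frequency f0 (Hz), peaking at t = 0."""
    a = (math.pi * f0 * t) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


def source_time_function(n_samples=1000, dt=0.001, f0=50.0, delay=0.02):
    """Sampled wavelet; dt (s) and delay (s) are assumed values, not from the paper."""
    return [ricker(i * dt - delay, f0) for i in range(n_samples)]


wavelet = source_time_function()
peak_index = max(range(len(wavelet)), key=lambda i: wavelet[i])
print(len(wavelet), peak_index)
```

With the assumed 1 ms sampling and 20 ms delay, the peak of the wavelet falls at the 21st sample, and the remainder of the 1,000-sample trace leaves room for the reflected arrivals.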
We also create two small datasets for the generalization experiments. The velocity images of these two datasets are similar to those in CurvedData except for the number of faults. We set the number of geological faults to zero or two in order to evaluate VelocityGAN under more general conditions. We refer to these two datasets as 0-Fault CurvedData and 2-Fault CurvedData in the following. There are 2,000 pairs of velocity images in 2-Fault CurvedData and 1,000 pairs of velocity images in 0-Fault CurvedData.
2) Training Details: For each dataset, we randomly select 20% of the data as the testing set and 10% as the validation set, which is used to adjust the hyperparameters. The remaining data are used as the training set. The input of our model is normalized to the range [−1, 1]. Constrained by GPU memory, we set the mini-batch size to 50. Following the optimization strategy of [4], we perform 5 gradient descent steps on the discriminator and then one step on the generator. The learning rate of our Adam [20] optimizer is set to $10^{-4}$ in the first epoch and is linearly decayed to 0 over the remaining epochs. For the discriminator loss (Eq. (7)), we set λ to 10. For the generator loss (Eq. (8)), λ1 and λ2 are set to 50 and 100, respectively, on CurvedData. All of our models are implemented in the PyTorch framework and run on a single GTX 1080-Ti.
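The training recipe above can be summarized in a few helper functions. This is a sketch under stated assumptions rather than the authors' code: the min-max form of the normalization, the exact shape of the linear decay (reaching zero right after the final epoch), the total number of epochs, and all function names are hypothetical; the 20%/10% split, the base learning rate, and the 5:1 update ratio come from the text.

```python
import random


def split_indices(n, test_frac=0.2, val_frac=0.1, seed=0):
    """Randomly split sample indices into train / validation / test subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]


def normalize(values):
    """Min-max scale a flat list of samples to [-1, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def lr_at_epoch(epoch, total_epochs, base_lr=1e-4):
    """base_lr in epoch 0, then linear decay that reaches 0 after the last epoch."""
    return base_lr * (1.0 - epoch / total_epochs)


def update_schedule(n_generator_steps, n_critic=5):
    """Yield 'D' n_critic times before each 'G' update."""
    for _ in range(n_generator_steps):
        for _ in range(n_critic):
            yield "D"
        yield "G"


train, val, test = split_indices(50000)
print(len(train), len(val), len(test))
print(normalize([0.0, 5.0, 10.0]))
print([lr_at_epoch(e, total_epochs=4) for e in range(4)])
print("".join(update_schedule(2)))
```

In a PyTorch implementation the same decay could be expressed through a learning-rate scheduler, but the plain function keeps the schedule explicit.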

B. Experiment Settings
Velocity image generation experiments are conducted to evaluate the effectiveness of VelocityGAN. We choose the following algorithms from both physics-driven and data-driven methods as our baselines:
• AEWI-Pre [53]: A wave-energy-based preconditioning method is applied to reduce the artifacts in the gradients caused by geometrical spreading and defocusing effects.
• AEWI-MTV [24]: A modified total-variation regularization (MTV) is used as a regularization term in AEWI optimization process. MTV is designed to preserve sharp interfaces in piecewise constant structures.
• U-Net [36]: Based on a typical encoder-decoder structure, U-Net adds skip connections between mirrored layers in the encoder and decoder stacks.
• FCN [37]: FCN consists of an encoder network and a corresponding decoder network. We apply upsampling layers in the decoder network and adjust the dimension of convolutional layers to make it work on our project.
Similar to existing works on depth estimation [6], [7], we adopt the following metrics to evaluate the accuracy of velocity image reconstruction:
• mean absolute error (mae): $\mathrm{mae} = \frac{1}{wh}\sum_{i=1}^{w}\sum_{j=1}^{h}\left|v(i,j) - \tilde{v}(i,j)\right|$, where $v$ and $\tilde{v}$ are the real and predicted pixel values and $w$ and $h$ are the width and height of the velocity map.
For qualitative experiments, we present several velocity image samples and vertical velocity profiles to provide an intuitive comparison. It is worth mentioning that in CurvedData, we include a small geologic fault in the velocity model. Geologic faults play an important role in siting wells in subsurface applications because of their high permeability. However, it can be technically challenging to image a geologic fault zone due to the limited imaging resolution and data coverage. We compare our method with the others not only in overall reconstruction quality but also in local regions such as the fault zone. In addition, we compare the running time of the physics-driven and data-driven methods.

C. CurvedData
In the real world, the geological layers are usually yields irregular shape. To address the curved layer estimation, we create a challenging dataset -CurvedData. Using this data, geological faults will disappear with the constraint of mae loss. Mse loss is good at revealing geological faults but does not perform well on reconstructing the layer interfaces. Therefore, we use a combination of mae and mse loss to generate more accurate velocity images. In our quantitative experiments, we not only compare VelocityGAN with physics-driven models but also do ablation study on the combination of loss.
1) Quantitative Results: Table II shows the quantitative results of our ablation study on CurvedData. We can see that our proposed models still perform much better than the physics-driven models. Among our proposed models, we conduct the ablation study. We observe that VelocityGAN with a combination of mae and mse loss can get better prediction results than a single loss. Though the VelocityGAN with a single loss achieve relatively higher scores in some measurements such as rel, acc. (t=1.10), the VelocityGAN with a combination of mae and mse loss obtains a better tradeoff under all the metrices. The quantitative experiments on CurvedData validate that the generator structure, adversarial training strategy, and the combination of losses are all useful improvements and can boost the reconstructed accuracy.
2) Qualitative Results: We illustrate the velocity images reconstructed by the different methods in Fig. 5. Consistent with our discussion of the loss function, VelocityGAN with mae loss handles the boundaries of geological layers well; however, it misses the geological faults and high-velocity areas. VelocityGAN with mse loss can generate geological faults, but the boundaries of the geological layers are fuzzy. VelocityGAN with a combination of mae and mse achieves a better tradeoff between the quality of the geological layer interfaces and that of the faults. In addition, the physics-driven methods do not perform well on CurvedData: their reconstructions contain many oscillations in the deep region and in high-velocity areas. An interesting finding in our experiments comes from comparing Generator with mae (L1) loss and VelocityGAN with mae loss. For Generator with mae loss, the geological fault disappears from the velocity images. VelocityGAN with mae loss, however, still reveals the correct geological fault, though it is fuzzy in some areas. This observation further demonstrates the effectiveness of the adversarial training strategy.
We present the vertical velocity profiles of CurvedData in Fig. 6. VelocityGAN still outperforms the physics-driven methods in the accuracy of the vertical velocity. Among the VelocityGAN variants with different loss functions, we observe that the framework with a combination of mae and mse performs well in both low-velocity and high-velocity regions. VelocityGAN with a single loss sometimes misses the geological fault zones in low-velocity regions. For example, in the second row and the middle column of Fig. 6, both VelocityGAN-L1 and VelocityGAN-L2 fail to reconstruct the geological fault zone between positions 0 and 10. In contrast, VelocityGAN with a combined L1 (mae) and L2 (mse) loss reveals the geological fault well. In the second row of Fig. 6, we also compare VelocityGAN-L1 with Generator-L1. Although neither performs well in the low-velocity regions, VelocityGAN-L1 is able to reveal the geological fault in high-velocity regions, which Generator-L1 cannot. We attribute this difference to the adversarial training strategy.
Based on the above comparisons, we conclude that VelocityGAN yields more accurate reconstructions of the velocity image, capturing both global and local geological features.
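To make the role of the two loss terms concrete, the following is a minimal sketch of a combined mae (L1) and mse (L2) reconstruction loss in plain Python. The function names and the weights `lambda_l1` and `lambda_l2` are assumptions introduced for illustration; they are not the implementation or the values used in the experiments above. The toy profiles show one well-known property of the two norms: mae is indifferent to whether the same total error is spread out or concentrated, while mse penalizes a concentrated error more heavily.

```python
def mae(pred, target):
    """Mean absolute error (L1) between two equally sized flat lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse(pred, target):
    """Mean squared error (L2) between two equally sized flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(pred, target, lambda_l1=1.0, lambda_l2=1.0):
    """Weighted sum of mae and mse; the weights are hypothetical."""
    return lambda_l1 * mae(pred, target) + lambda_l2 * mse(pred, target)


# Toy 1-D velocity profile (arbitrary units), for illustration only.
target = [1.5, 1.5, 3.0, 3.0]
small_everywhere = [1.6, 1.4, 3.1, 2.9]  # error of 0.1 at every sample
one_large = [1.5, 1.5, 3.0, 2.6]         # error of 0.4 at a single sample

for name, pred in [("small_everywhere", small_everywhere),
                   ("one_large", one_large)]:
    print(name, mae(pred, target), mse(pred, target),
          combined_loss(pred, target))
```

Both toy predictions have essentially the same mae, but the one with a single large error has the larger mse and therefore the larger combined loss.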

D. Generalization Experiments
In machine learning theory, the test error of a trained model on unseen data is given as [15] $\mathrm{Error}_{\mathrm{test}} = \mathrm{Error}_{\mathrm{train}} + \mathrm{Error}_{\mathrm{generalization}}$, where $\mathrm{Error}_{\mathrm{train}}$ is the training error and $\mathrm{Error}_{\mathrm{generalization}}$ is the generalization error. With a large amount of training data and a reasonable loss function, we can usually control the training error, so the generalization error then dominates the test error. It is well known that deep neural networks are over-parameterized, meaning that the number of parameters is significantly larger than the number of training samples. Minimizing the same loss function may therefore lead to multiple global minima, all of which minimize the training error, but some of which may not generalize well. In the machine learning community, cross-validation techniques are conventionally used to estimate the test error. In our problem, however, cross-validation may be misleading because, no matter how the data are split, the training, validation, and test data all come from the same distribution. Therefore, we analyze the generalizability of VelocityGAN by studying its performance on specially designed test sets, which are inspired by actual field experiments.
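The argument above can be illustrated with a toy example that is unrelated to seismic data: a straight line is fitted to samples of a quadratic function, then evaluated on held-out points from the same input range and on points from a shifted range. All names and the choice of function are assumptions made for illustration. The sketch only shows why a split drawn from the training distribution can report a small error while the error on a differently distributed test set is much larger.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def mean_sq_error(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    """Hypothetical ground-truth mapping used only for this toy example."""
    return x * x


train_x = [i / 10 for i in range(11)]        # inputs in [0, 1]
val_x = [(i + 0.5) / 10 for i in range(10)]  # held out, same range
shift_x = [2 + i / 10 for i in range(11)]    # shifted range [2, 3]

w, b = fit_line(train_x, [truth(x) for x in train_x])

train_err = mean_sq_error(w, b, train_x, [truth(x) for x in train_x])
val_err = mean_sq_error(w, b, val_x, [truth(x) for x in val_x])
shift_err = mean_sq_error(w, b, shift_x, [truth(x) for x in shift_x])

# Generalization gap measured on each test set: test error minus training error.
gap_same_distribution = val_err - train_err
gap_shifted = shift_err - train_err
print(train_err, val_err, shift_err)
```

In this toy setting the held-out error from the same range stays close to the training error, whereas the error on the shifted range is orders of magnitude larger, which is the situation the specially designed test sets are meant to expose.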
In CurvedData, all velocity images contain one fault. To conduct the generalization experiments, we generate additional velocity models and their corresponding seismic data as our transfer learning data. Specifically, the transfer learning data include 0-Fault CurvedData and 2-Fault CurvedData. On these two datasets, we compare the reconstruction results of the physics-driven methods, VelocityGAN trained on CurvedData only (VelocityGAN-org), VelocityGAN with transfer learning (VelocityGAN-TF), and VelocityGAN without transfer learning, i.e., trained only on the new dataset (VelocityGAN). The transfer learning procedure is as follows: we first train VelocityGAN on CurvedData and save the model weights; we then fine-tune these weights on 0-Fault CurvedData or 2-Fault CurvedData.
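The two-step procedure described above (train and save the weights, then reload and fine-tune) can be sketched with a toy linear model in plain Python. This is an illustration under stated assumptions, not the VelocityGAN training code: the model, the datasets `source` and `target`, the learning rate, and the step counts are all hypothetical, and a small step budget stands in for the limited amount of transfer learning data.

```python
import json


def mse(w, b, data):
    """Mean squared error of y = w * x + b on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(w, b, data, steps, lr=0.1):
    """Plain gradient descent on the mean squared error."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


# Hypothetical stand-ins: a larger source set and a small, related target set.
source = [(i / 20, 2 * (i / 20) + 1.0) for i in range(21)]
target = [(i / 4, 2 * (i / 4) + 1.5) for i in range(5)]

# Step 1: train on the source data and save the weights.
w_src, b_src = train(0.0, 0.0, source, steps=2000)
checkpoint = json.dumps({"w": w_src, "b": b_src})

# Step 2: reload the saved weights and fine-tune on the target data.
state = json.loads(checkpoint)
w_tf, b_tf = train(state["w"], state["b"], target, steps=20)

# Baseline: train on the target data only, with the same small budget.
w_scratch, b_scratch = train(0.0, 0.0, target, steps=20)

err_org = mse(w_src, b_src, target)              # source weights, no fine-tuning
err_tf = mse(w_tf, b_tf, target)                 # fine-tuned weights
err_scratch = mse(w_scratch, b_scratch, target)  # trained from scratch
print(err_org, err_tf, err_scratch)
```

In this toy setting the fine-tuned weights reach a lower target error than either the unchanged source weights or the weights trained from scratch under the same budget. The relative ordering of the last two is specific to the toy problem and should not be read as a statement about VelocityGAN-org and VelocityGAN.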
The quantitative results of the generalization experiments are presented in Table III and Table IV. In Table III, VelocityGAN-TF outperforms AEWI-Pre, VelocityGAN, and VelocityGAN-org. AEWI-MTV obtains better results than VelocityGAN-TF on 0-Fault CurvedData. On 2-Fault CurvedData, VelocityGAN-TF achieves the best quantitative results of all the models. We attribute this difference to the size of the training dataset. For 0-Fault CurvedData, the training set contains 700 pairs of velocity images and seismic data. In contrast, the training set of 2-Fault CurvedData contains 1,400 pairs, which is twice as large. These experiments suggest that the performance of data-driven methods depends on the size of the training set: a larger training set tends to yield better results. Overall, on our datasets, VelocityGAN-TF is competitive with AEWI-MTV and outperforms all the other methods.
In addition, we compare the visual appearance of the reconstructions on 0-Fault CurvedData in Fig. 7. Consistent with the quantitative results, VelocityGAN trained entirely on 0-Fault CurvedData performs worst among the data-driven models, because 0-Fault CurvedData is not large enough to train a good deep neural network. However, the visual appearance of the VelocityGAN-org and VelocityGAN-TF results is much better than that of AEWI-Pre and AEWI-MTV. For the physics-driven approaches, especially AEWI-Pre, there are many oscillations in the deep region and in high-velocity areas, which can be observed in the second and third rows of Fig. 7. The geological interfaces and faults generated by the data-driven methods are cleaner and sharper. The quantitative results may be misleading here, because for the physics-driven methods we compute the average loss over only four selected images. In contrast to VelocityGAN-TF, we can clearly observe that the VelocityGAN-org reconstructions still contain parts of geological faults. This comparison further substantiates that the transfer learning strategy can fine-tune the deep neural network effectively.
We randomly select six velocity images from 2-Fault CurvedData and present their reconstruction results in Fig. 8. As the figure shows, VelocityGAN-TF performs slightly better than VelocityGAN-org and much better than the other baselines, including AEWI-MTV. On both 0-Fault CurvedData and 2-Fault CurvedData, VelocityGAN-org demonstrates some generalization ability. For example, when the distance between the two faults is relatively large (the first four rows of Fig. 8), VelocityGAN-org and VelocityGAN-TF are able to locate both faults correctly, though some of them are fuzzy and unclear. When the distance is relatively small (the fifth and sixth rows of Fig. 8), VelocityGAN-org and VelocityGAN-TF cannot reconstruct the faults correctly. In this case, the physics-driven approaches, AEWI-Pre and AEWI-MTV, do not perform well either.
To summarize the experiments on 0-Fault CurvedData and 2-Fault CurvedData, we conclude that VelocityGAN has some generalization ability. For instance, VelocityGAN trained only on velocity images with one fault can also output velocity images with zero or two faults. With the transfer learning strategy, VelocityGAN further improves its generalization performance, which becomes competitive with that of the physics-driven methods.

V. CONCLUSION
We develop a data-driven method, which we call "VelocityGAN," to solve the seismic full-waveform inversion problem. We design a modified encoder-decoder structure as the core block for the image-to-image mapping. Based on the encoder-decoder structure, a conditional adversarial training strategy with an improved loss function is applied to further boost the reconstruction of velocity images. Compared with physics-driven methods, VelocityGAN is a more promising tool for subsurface velocity estimation because it alleviates the local minima and high computational cost issues. We conduct quantitative and qualitative experiments to demonstrate the effectiveness and efficiency of VelocityGAN from various aspects. The results substantiate that our model outperforms both the physics-driven methods and the selected deep learning baselines. Furthermore, we provide extensive experiments on the generalization of VelocityGAN. According to the results, we conclude that VelocityGAN has a basic generalization ability, which can be improved by the transfer learning strategy.

Fig. 5: Examples of different methods on CurvedData. The columns from left to right are the ground truth and the reconstruction results using AEWI-Pre [53], AEWI-MTV [24], Generator-L1, Generator-L2, Generator, VelocityGAN-L1, VelocityGAN-L2, and VelocityGAN. Our VelocityGAN yields the most accurate reconstructed velocity images among both the physics-driven methods and the data-driven baselines. The experiment results substantiate the effectiveness of the adversarial training strategy and the combined loss.