The potential of self-supervised networks for random noise suppression in seismic data

Noise suppression is an essential step in any seismic processing workflow. A portion of this noise, particularly in land datasets, presents itself as random noise. In recent years, neural networks have been successfully used to denoise seismic data in a supervised fashion. However, supervised learning always comes with the often unachievable requirement of having noisy-clean data pairs for training. Using blind-spot networks, we redefine the denoising task as a self-supervised procedure in which the network uses the surrounding noisy samples to estimate the noise-free value of a central sample. Based on the assumption that noise is statistically independent between samples, the network struggles to predict the noise component of a sample due to its randomness, whilst the signal component is accurately predicted due to its spatio-temporal coherency. Illustrated on synthetic examples, the blind-spot network is shown to be an efficient denoiser of seismic data contaminated by random noise with minimal damage to the signal, therefore providing improvements both in the image domain and in down-the-line tasks, such as inversion. To conclude the study, the suggested approach is applied to field data and the results are compared with two commonly used random denoising techniques: FX-deconvolution and the Curvelet transform. By demonstrating that blind-spot networks are an efficient suppressor of random noise, we believe this is just the beginning of utilising self-supervised learning in seismic applications.


Introduction
Noise consistently appears as an unwanted companion to the desired signal in seismic recordings. As such, noise suppression is a fundamental step in all seismic processing workflows [1]. Arising from local site conditions, as well as being excited by a seismic source, the total noise field can be seen as the sum of many noise components arising from different sources, each with their own characteristics [2]. Typically, noise suppression procedures identify a defining property that easily distinguishes the targeted noise from the desired signal and leverage that to separate the former from the latter. In this paper, we consider the random component of the noise field and leverage its non-predictable nature to build a suppression procedure.
Random noise suppression has been extensively investigated by the seismic community, with the majority of the proposed techniques falling into one of the following categories: prediction-, transformation-, and decomposition-based. Prediction-based approaches typically employ prediction filters which aim to leverage the predictable nature of the coherent signal and therefore act as noise suppressors. Examples of such approaches include t-x predictive filtering and FX-deconvolution, both of which can be applied in a stationary or non-stationary manner (e.g., [3,4,5,6]). Transformation-based approaches transform the data into a domain in which the signal and noise can be easily distinguished due to their individual characteristics. By exploiting the sparse nature of seismic data in the Curvelet domain, the Curvelet transform is an example of a commonly used transformation-based denoising procedure (e.g., [7,8,9]). Similarly, other transformation-based methods have been proposed in the literature that use different transforms, for example, the wavelet- [10], shearlet- [11], and seislet-transforms [12], among others. Finally, decomposition-based procedures express the seismic data as the composition of weighted basis functions and suppress those associated with the noise components. Such decomposition procedures utilise the likes of spectral decomposition [13], Empirical Mode Decomposition [14], and Singular Value Decomposition [15], among others.
With the increased interest in the use of Machine Learning (ML) in geophysics, a new class of random noise suppression procedures has been proposed. The majority of these approaches fall into the realm of Deep Learning (DL) and use a supervised training approach, which requires clean data to accompany the noisy input data during training. As is prevalent across seismic applications of DL, a number of studies have considered the use of synthetic seismic datasets for training a Convolutional Neural Network (CNN) (e.g., [16,17,18]). Whilst these experiments have shown promising denoising results on synthetic datasets, they struggle on field data [19]. Alternatively, others have used conventional denoising procedures to create the 'clean' counterparts to the noisy input data for their CNN denoising procedures (e.g., [20]). However, as no suppression procedure is perfect, noise residual is likely to exist within the clean counterparts, hampering the network's denoising capabilities. Unsupervised DL procedures have no such requirement of noisy-clean pairs for training. Recently, [19] illustrated how an encoder-decoder network could be trained on noisy seismic data for random noise attenuation, whilst [21] detailed how an alternative convolutional network architecture can be used without any requirement of windowing the data. Both these approaches were shown to outperform the FX-deconvolution noise suppression procedure.
Considering the broader scientific community, most DL approaches for random noise suppression of images, or image-like data, are typically supervised and therefore have the requirement of paired noisy-clean datasets for training [22]. This is often an unrealistic requirement, not just in seismology but across many other fields where there is no monitoring technique with which a clean dataset can be collected. In 2018, [22] proposed Noise2Noise, which illustrated how, under the assumption of a stationary signal, a Neural Network (NN) could be trained to denoise an image based on training over two noisy samples. Whilst this removes the requirement of noisy-clean pairs, it requires noisy-noisy pairs in which the signal is consistent but the noise varies within each pair, a problematic requirement for many monitoring applications. Building on this, [23] proposed Noise2Void (N2V), which requires only a single noisy image for training. Under the assumption that noise is statistically independent between samples, a blind-spot network is used to predict a central sample's value based on neighbouring samples. As the noise is independent between samples, the noise's contribution to the sample's value cannot be predicted and therefore only the signal's contribution is predicted, resulting in an efficient denoising procedure. Whilst N2V is an ML approach, it can also be considered a prediction-based approach, in that it leverages the ability to predict the signal and the inability to predict the noise, resulting in a denoised image.
Previously applied to natural images and microscopy data among others, in this paper we investigate the adaptation of the N2V workflow to handle the highly oscillatory seismic signal and pseudo-random noise. Through an extensive hyper-parameter analysis, we identify the optimum hyper-parameters for the seismic denoising scenario considering both immediate improvements in the image domain and those observed for down-the-line tasks, such as seismic inversion. The paper concludes by illustrating the potential of N2V through an application to a field dataset.

Theory: Blind-spot networks
Noise2Void [23] is based on the concept of blind-spot NNs, which aim to predict a central pixel value based on neighbouring pixels. Operating on patches x^j of a single image x, N2V works by replacing a set of non-adjacent pixels x_i^j (i = 1, 2, ..., N_p) from each patch, herein referred to as active pixels, with randomly selected neighbouring pixels that pertain to the receptive field of the chosen network, \Omega_i^j, as illustrated in Figure 1. The corrupted patches become the input to a NN whilst the corresponding original patches represent the target values. In theory, the NN architecture, denoted as f_\theta where \theta refers to the trainable parameters, could be anything that can realistically map between the input and target values. In this paper, we follow the original N2V NN architecture: a 2-layer UNet styled after [24], as illustrated in Figure 1. As opposed to standard NN image processing tasks, the loss function here is not computed on every pixel in the image; instead, it is only evaluated for the active pixels, i.e., those that were corrupted in the input image:

\hat{\theta} = \arg\min_\theta \sum_{j=1}^{N_s} \sum_{i=1}^{N_p} \left| f_\theta(\tilde{x}^j)_i - x_i^j \right|^p ,    (1)

where \tilde{x}^j denotes the corrupted version of patch x^j, p = 1, 2 refers to the norm used in the loss (Mean Absolute Error (MAE) and Mean Squared Error (MSE), respectively), and N_s is the number of available training samples (i.e., patches extracted from the image). [23] illustrated how MSE is the preferred choice for denoising additive WGN with N2V. However, MAE is a prevalent choice for seismic deep learning applications. In Appendix 1, we provide a mathematical formulation that explains under which circumstances MSE and MAE should each be used. Finally, we also highlight that no theoretical guarantee can be provided in the case of correlated noise; therefore, we decided to experiment with both loss functions in the following numerical examples.
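The active-pixel corruption and masked loss described above can be sketched as follows. This is a minimal NumPy illustration; the function names and the neighbourhood-sampling details are our own, not the authors' implementation:

```python
import numpy as np

def n2v_mask(patch, n_active, radius=2, rng=None):
    """Corrupt a patch N2V-style: replace n_active randomly chosen
    'active' pixels with values drawn from their local neighbourhood.
    Returns the corrupted patch and the active-pixel coordinates."""
    rng = np.random.default_rng(rng)
    corrupted = patch.copy()
    nt, nx = patch.shape
    # pick non-repeating active-pixel locations
    flat = rng.choice(nt * nx, size=n_active, replace=False)
    coords = np.stack(np.unravel_index(flat, (nt, nx)), axis=1)
    for it, ix in coords:
        # draw a random neighbour within the given radius (not the pixel itself)
        while True:
            dt, dx = rng.integers(-radius, radius + 1, size=2)
            jt = int(np.clip(it + dt, 0, nt - 1))
            jx = int(np.clip(ix + dx, 0, nx - 1))
            if (jt, jx) != (it, ix):
                break
        corrupted[it, ix] = patch[jt, jx]
    return corrupted, coords

def masked_loss(pred, target, coords, p=1):
    """Loss evaluated only at the active pixels (p=1: MAE, p=2: MSE)."""
    diffs = np.abs(pred[coords[:, 0], coords[:, 1]]
                   - target[coords[:, 0], coords[:, 1]]) ** p
    return diffs.mean()
```

During training, `corrupted` would be fed to the network and `masked_loss` evaluated between the prediction and the original patch at the active pixels only.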
Once trained, the model is applied directly to the full seismic data. Note that at this stage, windowing of the seismic data is not required due to the ability of CNNs to handle dynamically varying input sizes. However, in the scenario where the data dimensions are not compatible with the down-/up-sampling of the UNet, then the input data are zero-padded to achieve an acceptable input data size.
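The zero-padding step can be sketched as follows (a hypothetical helper, assuming a UNet whose spatial dimensions must be divisible by 2^depth):

```python
import numpy as np

def pad_for_unet(data, depth=2):
    """Zero-pad a 2D section so each dimension is divisible by 2**depth,
    as required by the UNet's down-/up-sampling path. Returns the padded
    array and the padding amounts (needed to crop the output back)."""
    factor = 2 ** depth
    pads = [(0, (-s) % factor) for s in data.shape]
    return np.pad(data, pads, mode="constant"), pads
```

After inference, the padded samples are simply cropped away so the output matches the original data size.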

Performance metrics
Whilst theoretically, N2V can be applied at any processing stage where random noise is observed in the seismic data, in this paper we focus on seismic images after time migration. A common challenge with many denoising procedures is that as part of the noise suppression process not only is the noise suppressed but also the signal is significantly damaged; something that may have a negative effect on down-the-line tasks. With this in mind, three performance metrics are considered: 1. Image Peak Signal-to-Noise Ratio (PSNR), 2. Frequency correlation, and 3. Post-stack inversion PSNR.
The image PSNR is calculated as

PSNR = 10 \log_{10} \left( \frac{\max(x)^2}{\frac{1}{n_t n_x} \sum_{i=1}^{n_t} \sum_{k=1}^{n_x} \left( x_{i,k} - \hat{x}_{i,k} \right)^2} \right) ,    (4)

where x represents the clean data, \hat{x} the modified data (either noisy or denoised), and n_t and n_x represent the number of time samples and receivers, respectively. To quantify the effect of the N2V denoising, we compute the percent change in the PSNR between the noisy and denoised images. This can be written as

\Delta PSNR = 100 \cdot \frac{PSNR_{n2v} - PSNR_{noisy}}{PSNR_{noisy}} ,

where PSNR_{noisy} and PSNR_{n2v} are the PSNR values computed from the noisy and denoised images, respectively.
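Under the standard PSNR definition, the metric and its percent change could be computed as follows (helper names are illustrative):

```python
import numpy as np

def psnr(x, xhat):
    """Peak signal-to-noise ratio (dB) between clean data x and
    modified data xhat (noisy or denoised)."""
    mse = np.mean((x - xhat) ** 2)
    return 10.0 * np.log10(np.max(np.abs(x)) ** 2 / mse)

def psnr_gain(x, noisy, denoised):
    """Percent change in PSNR between the noisy and denoised images."""
    p_noisy, p_dn = psnr(x, noisy), psnr(x, denoised)
    return 100.0 * (p_dn - p_noisy) / p_noisy
```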
The change in frequency content is quantified using the sample Pearson's correlation coefficient, r_{XY}, where the aim is to return the noisy data's amplitude spectrum to that of the clean data. This is computed as

r_{XY} = \frac{\sum_{i=1}^{n_f} \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right)}{\sqrt{\sum_{i=1}^{n_f} \left( X_i - \bar{X} \right)^2} \sqrt{\sum_{i=1}^{n_f} \left( Y_i - \bar{Y} \right)^2}} ,

where X and Y are the amplitude spectra of the clean and denoised data, respectively, averaged over the spatial axis, \bar{X} and \bar{Y} are the sample means of X and Y, respectively, and n_f is the number of samples in the spectra. Similarly to the image PSNR, to analyse the effect of N2V we compute the percent change in the sample Pearson's correlation coefficient when computed with the noisy versus the denoised data:

\Delta r = 100 \cdot \frac{r_{cl,N2V} - r_{cl,n}}{r_{cl,n}} ,

where r_{cl,n} is the correlation coefficient between the clean and noisy data and r_{cl,N2V} is the correlation coefficient between the clean and denoised data.
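A minimal sketch of this metric, assuming the amplitude spectra are computed with a real FFT along the time axis and averaged over the spatial axis:

```python
import numpy as np

def spectra_correlation(clean, denoised):
    """Pearson correlation between the amplitude spectra of two 2D
    sections (time on axis 0), averaged over the spatial axis."""
    X = np.abs(np.fft.rfft(clean, axis=0)).mean(axis=1)
    Y = np.abs(np.fft.rfft(denoised, axis=0)).mean(axis=1)
    Xc, Yc = X - X.mean(), Y - Y.mean()
    return np.sum(Xc * Yc) / np.sqrt(np.sum(Xc ** 2) * np.sum(Yc ** 2))
```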
As a final metric of comparison, the denoised data are used as input for a standard down-the-line task, namely post-stack inversion [25]. By doing so, we can inspect the effect of denoising and possible signal damage on our ability to estimate an acoustic impedance model of the subsurface. Post-stack inversion assumes that each trace of the post-stack seismic data can be represented by a simple convolutional model as follows:

d(t) = w(t) * \frac{1}{2} \frac{d}{dt} \ln \left( AI(t) \right) ,

where AI(t) is the acoustic impedance profile in the time domain and w(t) is the time domain seismic wavelet. We rewrite this expression for the entire set of traces in the post-stack data in a compact matrix-vector notation, d = WDm, where d and m are the vectorized seismic data and the natural logarithm of the acoustic impedance model, W is a convolution operator, and D is a first derivative operator. The model vector is then estimated by means of L2-regularized inversion using the PyLops computational framework [26]. The PSNR values for the inverted models are calculated as in equation 4, where x is the true model and \hat{x} is the inverted model. As with the above two performance metrics, for the inversion PSNR we calculate the percent change between the noisy and denoised inversions.
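As an illustration of this forward model and its regularized inversion, the sketch below builds dense W and D matrices for a single trace and solves a Tikhonov-regularized least-squares problem with plain NumPy (the paper uses PyLops operators instead; matrix sizes are kept small for clarity):

```python
import numpy as np

def conv_matrix(w, nt):
    """Dense convolution matrix for a zero-phase wavelet w of odd length."""
    h = len(w) // 2
    W = np.zeros((nt, nt))
    for i in range(nt):
        j = np.arange(max(0, i - h), min(nt, i + h + 1))
        W[i, j] = w[i - j + h]
    return W

def invert_poststack(d, wavelet, eps=1e-4):
    """L2-regularised (Tikhonov) post-stack inversion of a single trace:
    solve (G^T G + eps I) m = G^T d with G = W D and m = ln(AI)."""
    nt = len(d)
    W = conv_matrix(wavelet, nt)
    D = np.eye(nt, k=1) - np.eye(nt)   # forward-difference derivative
    G = W @ D
    return np.linalg.solve(G.T @ G + eps * np.eye(nt), G.T @ d)
```

In practice the damping eps trades data fit against model roughness; the matrix-free PyLops implementation scales this idea to the full 2D section.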
Finally, for the field dataset there is no 'clean' dataset for comparison and as such, the above metrics cannot be easily computed. In this situation, we perform a qualitative comparison of the raw and denoised images, frequency content, and inversion products. In addition to this, we benchmark the N2V approach against two commonly used techniques for random noise suppression: FX-deconvolution and sparsity-promoting inversion with the Curvelet transform. FX-deconvolution [27] is based on the concept that the coherent part of a seismic trace can be predicted in the FX domain from the previous traces via spatial predictive filters. Our results are based on the Madagascar implementation of FX-deconvolution [28] with the same parameters used by [29] for this dataset, namely a window of 20 traces with a filter length of 4 traces. The second method is instead based on the principle that seismic data have a sparse representation in the Curvelet domain [8] whilst random noise maps incoherently across the Curvelet domain. We employ sparsity-promoting inversion with the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) solver [30] and soft-thresholding to attenuate random noise in our seismic data.
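The sparsity-promoting denoising idea can be sketched as follows. For brevity, a 2D FFT stands in for the Curvelet transform (for an orthogonal transform the FISTA iterations collapse to a single soft-thresholding of the coefficients); the paper's actual implementation uses the Curvelet transform with a FISTA solver:

```python
import numpy as np

def soft(c, thresh):
    """Complex soft-thresholding: shrink coefficient magnitudes by thresh."""
    mag = np.abs(c)
    return np.where(mag > thresh, (1.0 - thresh / np.maximum(mag, 1e-30)) * c, 0.0)

def sparse_denoise(data, thresh):
    """Denoise by soft-thresholding in a sparsifying transform domain
    (2D FFT used here as a stand-in for the Curvelet transform)."""
    C = np.fft.fft2(data)
    return np.real(np.fft.ifft2(soft(C, thresh)))
```

Coherent events concentrate into few large coefficients that survive the threshold, whilst incoherent noise spreads thinly across all coefficients and is suppressed.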

Data
To be able to perform a quantitative analysis of our denoising procedure, the majority of this study is performed on synthetically generated data. Utilising the SEG Hess VTI model and a 30 Hz Ricker wavelet, 30 2D slices are created in order to mimic the multiple in-lines or cross-lines that would be available from a 3D survey. Two different synthetic datasets are created with this base waveform data: one with White Gaussian Noise (WGN), as illustrated in Figure 2(a,d), and one with 5-100 Hz band-passed noise, as illustrated in Figure 2(b,e).
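A sketch of how such synthetic ingredients could be generated (Ricker wavelet and band-passed noise; parameter choices follow the text, helper names are our own):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def ricker(f0, dt, nt=None):
    """Zero-phase Ricker wavelet with peak frequency f0 (Hz)."""
    if nt is None:
        nt = int(2 / (f0 * dt)) | 1        # odd length centred on the peak
    t = (np.arange(nt) - nt // 2) * dt
    a = (np.pi * f0 * t) ** 2
    return (1 - 2 * a) * np.exp(-a)

def bandpassed_noise(shape, dt, fmin=5.0, fmax=100.0, rng=None):
    """WGN band-passed to fmin-fmax Hz along the time axis (axis 0)."""
    rng = np.random.default_rng(rng)
    n = rng.standard_normal(shape)
    b, a = butter(4, [fmin, fmax], btype="band", fs=1.0 / dt)
    return filtfilt(b, a, n, axis=0)       # zero-phase filtering
```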
The paper concludes with an application of the N2V denoising procedure on a field dataset from a land acquisition in China that is heavily contaminated by random noise. Previously analysed by [6] and [29], a single 2D line from the post-stack volume was released under the Madagascar framework as part of a continued effort towards reproducible research. Figure 2(c,f) illustrates the seismic image and its respective amplitude spectra. The sampling rate for the datasets (synthetic and field) is 2 ms.

Data preparation
Following the procedure of N2V, patches are randomly extracted from the seismic lines to form the training dataset. To further increase the size of this dataset, the common data augmentation techniques of rotation and polarity reversal are applied.


Numerical Examples

Synthetic with WGN
The initial example presents a straightforward application of N2V to seismic data, utilising the same noise properties (i.e., WGN) and hyper-parameters as detailed in the original N2V study on natural images [23] and displayed in Table 1. Figure 4 shows the progression of the loss (equation 1) during the training period for all the active pixels in the 4500 training patches and the 500 validation patches. An application of the trained model to a 2D line from the synthetic dataset is illustrated in Figure 5. The training took 12.5 minutes on an Nvidia Quadro RTX 4000. In the image domain, the PSNR has increased by 73%, whilst in the frequency domain there is a much higher similarity between the spectra of the noise-free wavefield (black) and the denoised data (green), as opposed to the noisy data (red). When inversion is performed on the data, it is clear that the inversion on the denoised data produces an acoustic impedance model that is closer to the true model than that of the noisy data, with significantly fewer artefacts. Overall, for seismic data contaminated by WGN, it is shown that N2V can accurately learn to reproduce the seismic signal from surrounding samples without recreating the noise. This significantly improves the data quality both for current tasks (in the image domain) and down-the-line tasks, such as inversion.

Synthetic with band-passed noise
The second example focuses on providing a more realistic scenario, using band-pass filtered noise to identify the potential of N2V for seismic applications. In this example, we have performed a hyper-parameter sweep to identify the optimum hyper-parameters for denoising of band-pass filtered random noise. Over 860 parameter combinations were considered from the values detailed in Table 2. Due to the computational cost of training, only one model is generated per hyper-parameter combination; however, 100 additional synthetic datasets are generated to analyse each model's performance. Figure 6 illustrates the performance of a subset of the hyper-parameters for a fixed window size (32-by-32) and fixed loss function (MAE).
As detailed above, the models are evaluated on the PSNR gain in both the image and acoustic impedance domains, as well as on the increase in correlation with the amplitude spectra of the noise-free data. Considering Figure 6, the general trend of the experiments shows that as training progresses, i.e., as the number of epochs increases, the PSNR gain in the image domain (top row) slightly decreases whilst the PSNR gain in the acoustic impedance domain (bottom row) moderately increases. Finally, the batch size is shown to have a limited influence in comparison to the other hyper-parameters. The optimum hyper-parameter combination is selected as the one with the best sum of rankings across all three scoring criteria. This combination, as detailed in the middle column of Table 1, is used to train the band-passed N2V model, with Figure 7 illustrating the loss function progression during training. Figure 8 shows the result of the trained network applied to a synthetic slice contaminated by band-pass filtered noise. Similar to the WGN results, a PSNR increase is observed in the image domain, alongside an increase in the similarity with the amplitude spectra of the noise-free data. However, more substantial signal leakage is observed around the central salt body in comparison to the WGN results, as well as some noise residual remaining in the denoised result. Despite this, the inversion on the denoised image results in a cleaner subsurface model than that from the noisy image.
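The rank-sum selection of the optimum hyper-parameter combination can be sketched as follows, assuming each combination has been scored on the three metrics (higher is better for all):

```python
import numpy as np

def rank_combinations(scores):
    """Rank hyper-parameter combinations by summing their per-metric ranks.
    scores has shape (n_combinations, n_metrics), higher is better;
    returns combination indices ordered from best to worst."""
    order = np.argsort(-scores, axis=0)        # best combination first, per metric
    ranks = np.empty_like(order)
    for m in range(scores.shape[1]):
        ranks[order[:, m], m] = np.arange(scores.shape[0])  # rank 0 = best
    return np.argsort(ranks.sum(axis=1))
```

Summing ranks, rather than raw scores, prevents any single metric with a larger numeric range from dominating the selection.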

Field data application
To conclude, the N2V workflow is applied to a land dataset, which is known to be contaminated by random noise. Trained using 4500 patches, the model is applied to the full 2D line. The resulting denoised image is shown in Figure 9 and we compare it with results from the conventional FX-deconvolution and Curvelet denoising procedures. Figure 10 provides a zoomed-in comparison for three areas of interest spanning the model's depth range. Considering the difference between the original and denoised datasets (bottom row of Figure 9), the Curvelet approach has removed the most noise; however, we argue that it has possibly over-smoothed the data (effectively reducing the resolution) as well as introduced some linear artefacts (particularly noticeable in the top row of Figure 10). This is likely because the Curvelet transform explains an image as the superposition of localised oriented wave packets, and the denoising process may have slightly corrupted the relative weighting of the different wave packets whilst trying to suppress the noise. On the other hand, while the FX approach has also removed more energy than the N2V approach, the resulting denoised image is still heavily contaminated by noise, which is particularly observable in the close-ups of Figure 10. In addition to this, all approaches have resulted in a certain amount of signal leakage, albeit of a different nature from method to method.
Supporting what is observed in the image domain, Figure 11 illustrates the differences in the amplitude spectra of the different denoised datasets. Both the FX-deconvolution and Curvelet results show reduced energy across the whole bandwidth, with the Curvelet approach outperforming the FX approach between 60-100 Hz. The N2V results show less reduction in the frequency bands in which the signal is expected, with a reduction in energy observed above 85 Hz. However, even at higher frequencies, where signal is likely not present, the N2V approach does not reduce the amplitude spectrum to the levels of the other two procedures.
Finally, Figure 12 shows the inversion products for the original data and the denoised results. Similar to the observations in the image domain, the N2V results seem to have more details than the overly-smoothed Curvelet results. Conversely, the partial attenuation of genuine signal in the FX data leads to lower contrasts between features in the inverted acoustic impedance model.

Discussion
Blind-spot networks offer a solution to the predicament of requiring noisy-clean pairs of training images for deep learning denoising procedures. Previously utilised in applications on the likes of natural images [23], Computed Tomography (CT) images [31] and Synthetic Aperture Radar (SAR) images [32], we have shown that under the right circumstances, N2V can also be a powerful denoiser for seismic data. N2V relies on the assumption that noise is statistically independent between pixels, or, as in the seismic case, between spatio-temporal samples. In reality, noise in seismic data is always correlated to some extent. Despite this, we have shown that, whilst providing the best results on WGN, N2V can still efficiently denoise both synthetic band-pass filtered noise and recorded noise from a field acquisition. It was observed that for seismic denoising the number of epochs had to be significantly reduced in comparison to the initial N2V applications, whilst the number of active pixels had to be increased. The reduction of epochs hinders the network from learning to replicate mildly correlated noise whilst still providing adequate training time to learn the dominant signals in the data. Increasing the number of active pixels, on the other hand, acts as a regulariser to the training procedure by introducing additional corruption into the training dataset.
Typically in seismic applications, denoising is performed either prior to interpretation or as preparation for down-the-line tasks such as inversion. In this paper, we took a balanced approach and chose a hyper-parameter combination that compromised between the three performance metrics. This resulted in a PSNR gain of 39.28% in the image domain and 1.32% in the inversion domain for the realistic synthetic example (Hess VTI model with band-pass filtered noise). However, if the denoising were being performed on data only for direct interpretation, we could have selected the best hyper-parameters for this task, which would have resulted in an image-domain PSNR gain of 50.24%. The same can be said for inversion, where the inversion PSNR gain would have been 6.27% for the optimum hyper-parameter combination (as illustrated in Figure 6). Typically, DL procedures are accompanied by a lengthy training time, often rendering them significantly more computationally expensive than conventional procedures [33]. However, due to the small number of epochs required, the N2V approach can be trained and subsequently applied within a matter of minutes (7 minutes of training for our field data experiment). Where post-stack volumes are available, an extension to 3D denoising would be possible by adapting the 2D convolutional layers in the NN to 3D convolutional blocks. This would likely further improve the denoising procedure at the price of increasing the computational cost and memory requirements of the network.

[Figure 12: Comparison of the effect of the different random noise suppression procedures when the denoised datasets are fed into an L2 inversion.]
The potential of N2V was illustrated using post-stack seismic data, however there is no limitation on the processing stage at which blind-spot networks could be applied for denoising. The post-stack scenario was used due to the availability of a field dataset that is known to be contaminated by random noise and that has been extensively investigated by others as a benchmark dataset for random noise suppression procedures (e.g., [6,29]). However, in theory the technique could equally be applied to shot gathers, receiver gathers, or even passive seismic data, assuming each of these is contaminated by random noise. One known limitation of N2V is the assumption of statistical independence between samples. [34] proposed an extension to the N2V workflow to adapt the approach for structured noise suppression. Structured N2V utilises selective masking to minimise any contribution of correlated noise to the prediction of the active pixel's value. This extension suggests the potential of utilising self-supervised networks for the suppression of correlated noise signals in seismic data.
Finally, in the initial publication of N2V, the authors acknowledge that: "Intuitively, N2V cannot be expected to outperform methods that have more information available during training." In other words, N2V is likely to be outmatched by well-trained, supervised denoising networks. However, as noted above, creating seismic datasets with the noisy-clean pairs required for training traditional supervised procedures is not trivial. When noisy-clean pairs are generated using a previous denoising technique, such as the Curvelet transform, the inclusion of DL can only serve to speed up the original denoising procedure: as the network learns from the training samples provided, it cannot outperform the denoising technique which was used to generate the training data. Alternatively, when synthetic data are generated to act as the training dataset, the clean image is guaranteed to contain no noise residual. However, generating synthetic data that accurately represent field data is a well-known challenge [35]. Therefore, whilst certain steps can be taken to reduce the synthetic-field data gap for DL applications [36], there is no guarantee that a synthetically trained network will be as effective when applied to field data. Recently, [37] proposed the first blind-spot network procedure that was shown to perform on par with, and sometimes outperform, supervised denoising approaches for natural images contaminated by independent and identically distributed additive Gaussian noise. Future studies will consider circumstances under which self-supervised networks, of varying architectures and training procedures, can outperform supervised networks trained on synthetic seismic data, and vice versa.

Conclusion
We have shown how blind-spot networks can be applied to accurately predict seismic signals without replicating noise and, as such, provide a powerful random noise suppression procedure. As a self-supervised procedure, no additional data are required for training, removing the common barrier of most deep learning denoising procedures that often require a 'clean' training dataset. The Noise2Void method has been successfully applied to two synthetic datasets and one field dataset. Whilst originally developed for random, additive white noise, our numerical results show that such networks can be successfully trained to also remove partially correlated noise, provided that the number of training iterations is reduced whilst the number of corrupted pixels is increased. A numerical validation of the correspondence between statistical models for the additive noise and the choice of the training loss function is finally provided in Figure 13: MSE and MAE provide the best denoising performance in terms of PSNR for Gaussian and Laplace noise, respectively.
Finally, as noise in seismic data is generally correlated in time, space, or both, we observe that neither of the above defined models is strictly correct. On the other hand, if we assume the noise to have a certain correlation length in either time and/or space, we can express the noise within the correlation window as n ~ N(0, \Sigma), where \Sigma is the covariance matrix of the noise. To take such correlation into account we must therefore write x = f_\theta(\Omega_x) + n ~ N(f_\theta(\Omega_x), \Sigma), where we have grouped the nearby correlated pixels to form the vectors x and n. The corresponding probability density function becomes:

p(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp \left( -\frac{1}{2} \left( x - f_\theta(\Omega_x) \right)^T \Sigma^{-1} \left( x - f_\theta(\Omega_x) \right) \right) ,

and the training loss can be written as:

\hat{\theta} = \arg\min_\theta \sum_j \left( x_j - f_\theta(\Omega_{x_j}) \right)^T \Sigma^{-1} \left( x_j - f_\theta(\Omega_{x_j}) \right) .

We observe that if the covariance matrix is unknown, neither MAE nor MSE can correctly approximate this loss function. In this case, the empirical evidence in Figure 13 supports our choice of using MAE in the case of mildly correlated noise, although more sophisticated denoising models that take noise correlation into account will be investigated in the context of seismic data denoising.
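For completeness, the correlated-noise loss above can be evaluated numerically as a Mahalanobis-type distance (an illustrative sketch; in practice \Sigma is unknown and would have to be estimated):

```python
import numpy as np

def correlated_gaussian_loss(x, pred, cov):
    """Negative log-likelihood (up to a constant) for a window of
    correlated pixels: (x - f)^T Sigma^{-1} (x - f). Reduces to a
    scaled MSE when Sigma = sigma^2 I."""
    r = x - pred
    return float(r @ np.linalg.solve(cov, r))
```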