Using a Deep Neural Network and Transfer Learning to Bridge Scales for Seismic Phase Picking

The important task of tracking seismic activity requires both sensitive detection and accurate earthquake location. Approximate earthquake locations can be estimated promptly and automatically; however, accurate locations depend on precise seismic phase picking, which is a laborious and time‐consuming task. We adapted a deep neural network (DNN) phase picker trained on local seismic data to mesoscale hydraulic fracturing experiments. We designed a novel workflow, transfer learning‐aided double‐difference tomography, to overcome the 3 orders of magnitude difference in both spatial and temporal scales between our data and data used to train the original DNN. Only 3,500 seismograms (0.45% of the original DNN data) were needed to retrain the original DNN model successfully. The phase picks obtained with transfer‐learned model are at least as accurate as the analyst's and lead to improved event locations. Moreover, the effort required for picking once the DNN is trained is a small fraction of the analyst's.


Introduction
Seismic monitoring plays a significant role in the oil and gas industry, underground mines, carbon capture and storage, and the geothermal industry due to its value for both reservoir management and for risk mitigation. Valuable information, such as fracture development and elastic properties of the subsurface, can be recovered from data recorded with seismic monitoring systems. For example, the spatial dimensions and temporal evolution of hydraulic and/or reactivated natural fractures are usually estimated by tracking seismic events. The location and origin time of these microseismic events are determined by arrival times of seismic phases at multiple seismic sensors. These same arrival times of primary (P) and secondary (S) waves are also used for subsurface seismic imaging to measure elastic properties of the subsurface. Manually picking arrival times of seismic phases is a very time-consuming task especially for small-scale projects since high temporal sampling rate is required. Therefore, reliable automatic phase pickers are essential for these projects. Traditional automatic pickers such as short-term average/long-term average (STA/LTA; Allen, 1978) and autoregression-Akaike information criterion (AR-AIC; Sleeman & van Eck, 1999) pickers require intensive human involvement and refinement, and they do not benefit from knowledge of previous picks because they treat each measurement individually. When applied to seismic data, the accuracy of traditional automatic pickers may not be satisfactory, particular for noisy data. Recent applications of machine learning-based/deep learning-based automatic seismic phase pickers (e.g., Y. Chen, 2018;Ross et al., 2018;Zhou et al., 2019;W. Zhu & Beroza, 2018) have shown remarkable accuracy and processing speed for seismic signals originated from natural earthquakes. L. Zhu et al. (2019) trained a deep learning picker for the 2008 M w 7.9 Wenchuan earthquake sequence and fine-tuned the deep learning model for an Oklahoma data set. However, whether these deep learning phase pickers can be used for industrial seismic monitoring remains unclear, and training such phase pickers from scratch requires a huge amount of labeled data.
We use seismic data from Experiment 1 of the enhanced geothermal system (EGS) Collab project to test whether one of the deep learning-based automatic phase pickers, PhaseNet (W. Zhu & Beroza, 2018), is useful for mesoscale monitoring systems. The experiment was conducted at the 4,850-ft level of the Sanford Underground Research Facility (SURF) located in Lead, South Dakota (Kneafsey et al., 2019). The test bed consists of one injection, one production, and six 60-m-long monitoring boreholes. The seismic monitoring system was equipped with multiple types of geophysical instruments including 24 hydrophones and 12 accelerometers. An eight-core workstation with an automated processing flow was deployed at the experiment site. The processing scripts are capable of detecting seismic events (triggered), finding initial P-wave phase picks, and inverting for initial seismic event locations and origin times. Seismic event locations and origin times were improved with human reviewed and refined phase picks. The original seismic catalog was processed by Schoenball et al. (2020). Several hydraulic stimulations were performed since May 2018. We focused on seismic signals associated with stimulations between 22 May and 21 December 2018. These seismic signals have 3 orders of magnitude difference in spatial and temporal scales from the original training data used by the deep learning models mentioned earlier.
In this paper, we directly applied the PhaseNet model (W. Zhu & Beroza, 2018) to the seismic data from Experiment 1 of the EGS Collab project. Although the results are reasonable, we show that retraining the PhaseNet model significantly boosts performance. The process, called transfer learning (TL), has been successfully applied to image (e.g., Y. Zhu et al., 2011), text (e.g., Do & Ng, 2005), and time series (e.g., Ismail Fawaz et al., 2018) classifications. TL requires only a few thousand seismograms because the weights of the DNN were trained initially by a different data set of 0.7 million seismograms (from natural earthquakes). The performance of the resulting TL model was compared with a traditional automatic picker, the original PhaseNet model, and human analysts. We then applied the TL model to all the seismograms from the triggered seismic events. We used the resulting TL-derived phase picks and double-difference tomography (tomoDD; Zhang & Thurber, 2003, 2006 to constrain subsurface seismic velocities and update seismic event locations. The results were compared with those using manual picks.

Data
Our data consist of seismograms from triggered microseismic events between May 2018 and December 2018 at the Experiment 1 site of the EGS Collab project, manually picked P-wave and S-wave arrival times, and the original seismic catalog from Schoenball et al. (2020). We used 35 seismic sensors (one hydrophone was defective) with a 100-kHz sampling rate that were deployed in six 60-m-long monitoring wells (supporting information Figure S1). We detected and located the microseismic events using a standard STA/LTA routine, the PhasePAPy package (C. Chen & Holland, 2016), and a modified version of Hypoinverse (Klein, 2002). We cut the triggered seismograms to 0.11-s-long segments around the P-wave arrival times and filtered with a band-pass filter between 3 and 20 kHz to remove system noise introduced by the recording system both at low and high frequencies. The filtered seismograms show clear similarity with those used to train the original PhaseNet ( Figure S2). We used a total of 69,444 waveform segments (from 1,932 seismic events). Initial P-wave arrival times were automatically measured with the PhasePAPy package. We manually reviewed and refined the P-wave arrival times. We picked all S-wave arrival times manually. Additional details about the monitoring system and data preprocessing procedures of the original seismic catalog can be found in Schoenball et al. (2020).

Method
We designed a workflow (Figure 1), TL-aided double-difference tomography (TADT), which takes advantage of two existing technologies, deep neural networks (DNNs) and seismic double-difference tomography. We started with the pretrained DNN model, PhaseNet (W. Zhu & Beroza, 2018), which was trained with over 0.7 million seismic recordings from in and around Northern California for natural earthquakes. The PhaseNet model was trained using 30-s-long seismograms sampled at 100 Hz. The earthquake-station distance for these data is on the order of tens of kilometers. Our monitoring system for our data samples at 100 kHz and the source-sensor distance is on the order of 10 m. Despite the 3 orders of magnitude differences in both sampling rate and source-sensor distance between the PhaseNet data and our data (Table S1), we found that PhaseNet produced acceptable results when applied to our data. To improve the performance further, we updated the PhaseNet model with a subset of seismic data that meets the training data requirements (three-component seismograms with both P-and S-wave picks) for PhaseNet. We then applied the resulting TL model to all the triggered seismograms (30 ms long) to obtain TL-derived P-and S-wave phase picks. We used the tomoDD package (Zhang & Thurber, 2003, 2006 to update the seismic catalog and simultaneously image the subsurface tomographically. The TADT workflow allows us to reduce the human effort significantly.

Transfer Learning
During the TL process, we use the same network architecture as PhaseNet and initialize the weights with the PhaseNet model. We visually inspected the selected seismograms and excluded 343 (9%) incorrect phase picks. The remaining 3,478 seismograms belong to 1,872 distinct seismic events. The total number of seismograms we used is only 0.45% of that used by the original PhaseNet. We randomly divided the seismic events into training, validation, and test sets. The training set (2,443 waveforms) was used to retrain the DNN model, the validation set (345 waveforms) was used to select the optimal model from different training runs, and the test set (690 waveforms) was used to evaluate performance ( Figure S3). We used a Gaussian distribution with a standard derivation of 0.1 ms centered on the manual picks to represent manual pick uncertainty. We allowed the entire neural network to change during TL and used the Adam optimizer (Kingma & Ba, 2014). We used a learning rate (determines the step size of each iteration) of 0.01, a batch size (number of training samples used each time) of 20, and 100 epochs ( Figure S4). Our tests indicate that using filtered data leads to better performance than raw data. We applied a band-pass filter with corner frequencies of 3 and 20 kHz. Main seismic energy is contained in this wide passband . The frequency filtered seismograms were fed into the neural network for training, which is different from W. Zhu and Beroza (2018). For a fair comparison, the band-pass filter was applied to all the seismograms throughout this study.

Double-Difference Tomography
The original seismic catalog was processed with a homogeneous seismic velocity model (Schoenball et al., 2019). Here we used the double-difference tomography package tomoDD (Zhang & Thurber, 2003, 2006 to simultaneously minimize the uncertainty in seismic event locations due to spatial seismic velocity variations and to constrain the 3-D subsurface P-wave and S-wave velocity model for the seismically active region. Since the tomoDD package was originally designed for kilometer-scale problems, we made some modifications (e.g., input and output format, and coordinate system) specifically for meter-scale projects. We estimated both P-and S-wave seismic velocity models. We relocated 1,743 seismic events and discretized a 3-D volume of 77 m (easting), 83 m (northing), and 40 m (vertical) with nodes 1 m apart in each direction (then interpolated to 0.1 m by the tomoDD package). Following the formula from Yilmaz (2001), the node size is within the range of the Fresnel zone (Text S2). The initial model was homogeneous with a P-wave speed of 5.9 km/s and an S-wave speed of 3.5 km/s. These two velocities were obtained from curve fitting of traveltime observations (traveltime versus distance). Numerous previous studies (e.g., Chai et al., 2019;Syracuse et al., 2016) have shown that appropriate inversion parameters are required for a  (Zhang & Thurber, 2003, 2006.

Geophysical Research Letters
well-constrained seismic velocity model. We used an L-curve analysis (similar to Hansen, 1992) to find the optimal set of inversion parameters. An optimal weight of 10 was used for smoothing and 200 for damping (see Zhang & Thurber, 2003). We obtained the final velocity models and updated seismic catalog after eight iterations. The final models fit the observations better than the starting homogeneous model ( Figures S5 and S6).

Results
Our results are new phase picks, updated seismic event locations, and 3-D seismic velocity models. Hyper Text Markup Language (HTML)-based interactive visualizations (similar to Chai et al., 2018) were used to inspect seismic event locations and seismic velocity models.

Phase Picks
The TL model we obtained measures phase picks from seismograms with high accuracy. Some randomly selected waveforms and associated phase picks from the test data set are shown in Figure S7. We can see that the TL results agree with manual picks even when the background noise level is high. Inspecting the data when TL results differ from manual picks, we noticed that the TL model is able to correct some human errors or skip difficult-to-pick signals (more often for P waves than for S waves). The difference between TL results and manual picks is just slightly larger than the threshold (0.1 ms), for many cases in Figure S7. The TL model is more prone to error when signals are very complex.
We compared the TL results with those using the ObsPy (Beyreuther et al., 2010) implementation of an AR picker (Akazawa, 2004), the original PhaseNet, and human analysts (Figure 2). We use precision, recall, and F1 score (Text S1) to quantify and compare the performance. We estimated the performance of human analysts by having three analysts manually pick the phase arrival times from the same 100 three-component seismograms. For each seismogram, we considered the median of the three manual picks as the ground truth. We also measured the human performance for each analyst by comparing results from each analyst against the ground truth. The human performance in Figures 2a and 3b was the average of the three analysts. The original PhaseNet produced much better results than the AR picker for both P and S waves. The TL model outperformed the original PhaseNet with an improvement of roughly 0.1 in precision and 0.3 in recall, highlighting the importance of retraining the DNN with our data. The TL model is the only one among the three automatic pickers that has a performance comparable to human analysts. The TL model performed slightly better on S waves than human analysts, which could be due to the larger SNRs compared to P waves. The TL model achieved human performance in a fraction of the time.
We used a fivefold cross-validation to measure the uncertainties in the performance matrices. We divided all the available waveforms (including training, validation, and test set) into fivefold (equal parts) based on seismic events. Five TL models were trained using one of the five combinations of fourfold for each training and validated with the other fold. The performance of these five TL models was used to compute the uncertainty of the performance. The performance improvement is larger than the measurement uncertainty (Figure 2c). We also trained TL models with different number of training waveforms and tested the TL models with the

10.1029/2020GL088651
Geophysical Research Letters same test data set. As expected, F1 scores of the TL models for both P and S waves improve as more training data were included (Figure 2d).
When we apply the TL model to all the triggered seismograms, the TL model finds more S-wave picks than the human expert. We performed 3-D double-difference tomography using manual picks and TL-derived picks with the same inversion parameters. Although fewer P-wave picks were obtained by the TL model compared to the human analyst, the updated seismic locations using TL-derived picks show a more compact distribution compared to that using manual picks (see the next section for details). Specifically, we found 18,543 acceptable P-wave picks and 8,935 S-wave picks from the human expert using a total of 69,444 seismograms. The TL model identified 12,050 acceptable P-wave picks (20% of which were included in the training data set) and 13,297 S-wave picks (18% of which were included in training data set).

Updated Seismic Locations
We examine seismic locations associated with the May 2018 stimulations (22-25 May), June 2018 stimulation (25 June), and December 2018 stimulations (21 and 22 December) in Figure 3. Details of these stimulations can be found in Schoenball et al. (2020). Compared to the original locations, the updated locations from double-difference tomography using either manual picks or the TL-derived picks show more detailed geometry of the fractures. For the May 2018 stimulations (Figures 3a-3c), the updated locations show two parallel fractures that are not obvious in the original locations. Since these two fractures intercepted one monitoring borehole, we were able to confirm these fractures with independent temperature data recorded in the borehole using distributed temperature sensing with 0.25-m spatial resolution (Fu et al., 2020). Using the TL-derived picks leads to more tightly clustered locations. For the June 2018 stimulation (Figures 3d-3f), both the original and updated locations show two fractures. Compared to the original, the updated locations show a slightly tighter pattern delineating the activated fractures. For the December 2018 stimulations (Figures 3g-3i), the original locations show two intersecting fractures, but the geometry of these fractures was not well constrained, especially near the two ends of the fractures. When we relocated the events using the manual picks, these two fractures showed a tighter pattern. When TL-derived picks were used for relocation, these two fractures were constrained even better and we can see the two ends of the fractures more clearly. As indicated by the updated locations, the TL-derived picks are equivalent to or better than the manual picks in imaging the activated fractures. The average uncertainty of the seismic event locations is 0.2 m ( Figure S8).

3-D Seismic Velocity Model
We constrained both P-wave and S-wave seismic velocities with double-difference tomography. The P-wave velocity model shows significant spatial heterogeneity when either manual picks or TL-derived picks were used in the tomographic inversion. Slices of the 3-D P-wave and S-wave velocity models obtained using TL-derived picks are shown in Figure 4. The P-wave velocity model contains some small-scale high-velocity anomalies. To first order, the P-wave velocity is lower at a greater depth. The S-wave velocity model shows a

Geophysical Research Letters
smoother pattern with a low-velocity zone imaged at an elevation below 105 m. The average P-wave velocity agrees with that obtained from active source surveys .
To identify the volume that we can reliably image, we performed checkerboard tests using data simulated according to both manual picks and TL-derived picks. For the checkerboard tests, we started with an artificial model with alternating high and low velocities. Synthetic P-wave and S-wave phase picks were computed using the artificial model matching the actual observations between seismic event and seismic sensor pairs. The synthetic phase picks were then used in tomography with a uniform starting model. The recovered (inverted) model is compared to the true model to identify the volume that is well constrained by the data. To measure the volume, we first compute the absolute difference between the recovered model and the true model at each grid cell. A grid cell is considered well constrained when the recovered seismic velocities are less than 0.1 km/s away for P waves or 0.06 km/s for S waves from the ground truth. The well-constrained volume is smoothed by applying a 3-D spatial Gaussian filter with a standard derivation of 1 m in each direction to all of the well-constrained grid cells. Slices of the recovered P-and S-wave velocity models using manual picks and TL-derived picks are shown in Figures S9 and S10. For the P-wave velocity model, the well-constrained volume is 2,678 m 3 for manual picks and 2,465 m 3 (8% decrease) for TL-derived picks. For the S-wave velocity model, the well-constrained volume is 815 m 3 for manual picks and 1,895 m 3 (133% increase) for TL-derived picks. The checkerboard tests suggest that the uncertainty of the well-constrained volume is less than 0.1 km/s for P waves and 0.06 km/s for S waves.

Discussion and Conclusions
We present a workflow that integrates TL and double-difference tomography. As demonstrated with the EGS Collab data, the workflow can produce better seismic event locations, improve subsurface imaging capabilities, and reduce the overall time cost compared to the original labor-intensive workflow. Our results also show that the TL model obtained by retraining the PhaseNet DNN leads to human-level performance despite the significant differences in the study area size, sensor geometry, and sampling rate between the data used to develop and train PhaseNet and our data. Other types of geophysical observations such as Pwave receiver functions, surface-wave measurements, and gravity observations (e.g., Chai et al., 2015;Maceira & Ammon, 2009;Syracuse et al., 2016) can be inverted together with TL-derived phase picks when data are available.
Since phase picks are the basis for both locating seismic events and imaging the subsurface, it is valuable to determine seismic phase picks quickly and reliably. The PhaseNet model leads to better picks than many traditional autopickers such as the ObsPy implementation of the AR picker (W. Zhu & Beroza, 2018). A TL model initialized with the PhaseNet model and retrained with only around 3,600 three-component seismograms and associated manual picks outperforms the original PhaseNet model by over 10% in terms of precision and recall. The TL model performs equally to or slightly better than a human expert. The TL model found fewer (32%) P-wave picks but more (48%) S-wave picks than the human expert. Since the double-difference tomography results that used these TL-derived phase picks show better seismic event locations compared to those using manual picks, it is likely that the TL model removed low-quality P-wave picks and added high-quality S-wave picks ( Figure S11). The speed of the TL model (or PhaseNet) is about 1,900 times (excluded training time) faster than the human expert. Weights of the TL model show first-order similarities and small differences compared to the PhaseNet model ( Figures S12-S14). A comparison (Figures S15 and S16) of convolutional features of hidden layers for an example input using the PhaseNet model and the TL model suggests that these small changes lead to better hidden features (more impulsive peaks). Similar to the original PhaseNet model and some other deep learning models (e.g., Liu et al., 2019), the TL model can handle noisy data reasonably well ( Figure S17) as demonstrated using noise simulated with the technique of Chai et al. (2017).
Double-difference tomography tests using manual picks and TL-derived picks show that the latter lead to better seismic event locations and a larger (133% increase) well-constrained volume for the S-wave velocity model. Even though we obtained fewer P-wave picks with the TL model compared to the human expert, the well-constrained volume for the P-wave velocity model only decreased slightly. The improved seismic event locations allow us to see detailed structures of the fracture planes, which in turn will help us better constrain 10.1029/2020GL088651 Geophysical Research Letters the fracture geometry. Two parallel fracture planes were confirmed with independent borehole observations (Fu et al., 2020).
Our results show that we can reduce the time cost significantly, and improve results, by adding TL into the proposed workflow. Seismic phase picking is labor intensive and thus expensive. It took on the order of several days to determine all the seismic phase picks from the 69,444 seismograms recorded. For the presented workflow, the analyst would only need to manually pick around 3,500 high-quality seismograms. Retraining the PhaseNet model once took around 1 hr (using thirty-two 2.1-GHz Intel Xeon cores). We retrained the PhaseNet model around 20 times. Processing all the seismograms with the TL model took only 9 min on a laptop computer (with six 2.9-GHz Intel i9 cores). Even including the retraining time, the presented workflow takes much less time than human labor. The speed can be increased with greater computational power. Moreover, the TL model can be directly used on future seismic data from the same recording system without retraining. The proposed workflow is an economical way to monitor subsurface fracture evolution and image subsurface seismic structure with high resolution.
The workflow is also applicable to new study areas. When applying this workflow to new sites, we suggest screening the data set before training to reduce erroneous labels, splitting the data based on seismic events to avoid data leakage, and performing cross-validation to assure that the refined model is not biased toward a subset of seismic events. To achieve human performance, thousands of training seismograms from a diverse set of seismic events may be needed.