Deep‐Learning‐Based Phase Picking for Volcano‐Tectonic and Long‐Period Earthquakes

The application of deep‐learning‐based seismic phase pickers has surged in recent years. However, the efficacy of these models when applied to monitoring volcano seismicity has yet to be fully evaluated. Here, we first compile a data set of seismic waveforms from various volcanoes globally. We then show that the performances of two widely used deep‐learning pickers deteriorate systematically as the earthquakes' frequency content decreases. Therefore, the performances are especially poor for long‐period earthquakes often associated with fluid/magma movement. Subsequently, we train new models which perform significantly better, including when tested on two data sets where no training data were used: volcanic earthquakes along the Cascadia subduction zone and tectonic low‐frequency earthquakes along the Nankai Trough. Our model/workflow can be applied to improve monitoring of volcano seismicity globally while our compiled data set can be used to benchmark future methods for characterizing volcano seismicity, especially long‐period earthquakes which are difficult to monitor.


Introduction
Detecting and identifying onsets of seismic phases is fundamental to locating seismicity. Manual inspection by experienced analysts is viewed as the gold standard but is extremely laborious and time-consuming. This makes it difficult to handle the ever-increasing volumes of seismic data and periods with extremely high seismicity rates, such as during volcanic unrest. On the other hand, early automatic methods, such as the short-term average over long-term average method (STA/LTA) (Allen, 1978), suffer from low accuracy and require a number of parameters to be tuned carefully. Over the past two decades, the matched-filter technique has been shown to be an effective method (Chamberlain et al., 2017; Gibbons & Ringdal, 2006) to search for repeating or near-repeating earthquakes based on waveform similarity. However, this method is only capable of detecting earthquakes in the vicinity of known template events. In recent years, deep-learning-based phase pickers/event detectors (e.g., Kriegerowski et al., 2019; Mousavi et al., 2020; Ross et al., 2018; Soto & Schurr, 2021; Zhu & Beroza, 2019) have been gaining increasing attention due to their picking accuracy being comparable to human analysts (Chai et al., 2020) and their high efficiency. Their application has surged in recent years, including for delineating seismicity in fault zones, subduction zones, oceanic transform faults, and volcanoes (e.g., Chen et al., 2022; Garza-Girón et al., 2023; Gong et al., 2023; Jiang et al., 2022; Liu et al., 2023; Liu et al., 2024; Tan et al., 2021; Wilding et al., 2023; Zhang et al., 2024). However, it can be difficult to predict deep-learning models' performance for out-of-distribution data that are not well represented by training data (Teney et al., 2022; Wenzel et al., 2022).
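The STA/LTA method mentioned above can be sketched in a few lines. This is a minimal illustration only; the window lengths and trigger threshold below are placeholder values, not those recommended by Allen (1978) or used by any study cited here.

```python
import numpy as np

def sta_lta(x, fs, sta_win=1.0, lta_win=10.0):
    """Ratio of a short-term to a long-term average of squared amplitude.

    An impulsive onset raises the short-term average faster than the
    long-term one, so the ratio spikes; a threshold on the ratio then
    declares a trigger. Emergent onsets produce weak spikes, which is
    one reason STA/LTA struggles with long-period events.
    """
    sq = np.asarray(x, dtype=float) ** 2
    csum = np.concatenate(([0.0], np.cumsum(sq)))
    n_sta, n_lta = int(sta_win * fs), int(lta_win * fs)
    ratio = np.zeros(len(sq))
    for i in range(n_lta, len(sq) + 1):
        sta = (csum[i] - csum[i - n_sta]) / n_sta
        lta = (csum[i] - csum[i - n_lta]) / n_lta
        if lta > 0:
            ratio[i - 1] = sta / lta
    return ratio
```

For production use, optimized implementations of this and related characteristic functions exist in standard seismology packages.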
Seismicity, which often correlates with magmatic/volcanic processes and sometimes represents eruption precursors (Acocella et al., 2023; White & McCausland, 2019), is an important monitoring observable at volcanoes. Two types of earthquakes are commonly observed in volcanic regions: volcano-tectonic earthquakes (VTs) and long-period earthquakes (LPs), which are classified mainly based on their waveform frequency content but may imply different source processes (e.g., Chouet & Matoza, 2013; Matoza & Roman, 2022; Saccorotti & Lokmer, 2021, and references therein). VTs share common spectral characteristics with regular tectonic earthquakes and have impulsive onsets. They mostly originate from shear fractures in the solid part of an edifice or the underlying crust, and hence only indirectly indicate magmatic activity. In comparison, most conceptual source models of LPs involve fluids, for example, resonating fluid-filled cracks (Chouet & Matoza, 2013), thermal stresses in cooling magmas (Aso & Tsai, 2014), pressurization of exsolved volatiles from stalled magmas (Wech et al., 2020), and rapidly growing bubbles in ascending magmas (Melnik et al., 2020). Therefore, LPs are often interpreted as more direct evidence of fluid movement (e.g., Song et al., 2023). However, compared to VTs, LPs are more difficult to detect and catalog because they are depleted of high-frequency content and have emergent phase onsets (Pitt et al., 2002; Shapiro et al., 2017), which also hinders the study of their source mechanisms. Some recent studies have applied existing deep-learning phase pickers, which were trained using regular tectonic earthquake waveforms, to monitor volcano seismicity (e.g., Bannister et al., 2022; Garza-Girón et al., 2023; Li et al., 2023; Mittal et al., 2022; Retailleau et al., 2022; Suarez et al., 2023; Wilding et al., 2023). However, there is currently no large-scale, systematic evaluation of the efficacy of these existing models for volcano monitoring. For instance, their performances
for volcanic earthquakes may be impaired by different waveform characteristics, emergent onsets of long-period events, and high/different background noise in volcanic regions (Lapins et al., 2021). While there have been a few deep-learning studies using seismic data near volcanoes (Armstrong et al., 2023; Kim et al., 2023; Lapins et al., 2021; Manley et al., 2022; Titos et al., 2020), limited data distributions (individual volcanoes) make these models less generalizable to other volcanic regions. In addition, most deep-learning studies that involved long-period earthquakes focused on event classification (e.g., Manley et al., 2022; Titos et al., 2020), and none of the existing phase-picking models explicitly included long-period earthquakes in their training and tests (e.g., Armstrong et al., 2023; Kim et al., 2023; Lapins et al., 2021).
In this study, we first compile a data set of seismic waveforms from various volcanic regions. We then show that the performances of two widely used deep-learning pickers, PhaseNet (Zhu & Beroza, 2019) and EQTransformer (Mousavi et al., 2020), deteriorate when applied off-the-shelf to volcanic seismic data, especially for long-period earthquakes. Finally, we train new models that achieve significantly better performances for monitoring volcano seismicity.

Data Set of Seismic Waveforms From Volcanic Regions
We assemble a data set of 157,082 LP waveforms (35,181 events), 157,308 VT waveforms (38,335 events), and 20,000 noise waveforms recorded by 261 seismic stations around 34 volcanoes in Alaska from 1997 to 2017 (Power et al., 2019). In Alaska, about 75% of the LPs are from Shishaldin, Spurr, Gareloi, and Iliamna, and about 50% of the VTs are from Spurr, Martin, Makushin, Augustine, and Redoubt. More than 80% of the LPs and VTs in Hawaii are from Kīlauea. In Japan, about 30% of the LPs are from Mount Meakan, Mount Fuji and Mount Yakedake, and about 10% of the VTs are from Mount Kurikoma. In Northern California, ∼70% of the LPs are from Mammoth Mountain, while ∼45% of the VTs are from the Clear Lake Volcanic Field. The LPs in the Cascade Range are dominated by Mount St. Helens, Mount Baker and Newberry (∼70%). See Table S1 in Supporting Information S1 for data set splitting, Figure S1 in Supporting Information S1 for the event locations, Figure S2 in Supporting Information S1 for the recording stations, Figure S3 in Supporting Information S1 for the distribution of all volcanoes, Figure S4 in Supporting Information S1 for the time span of the data set, and Figures S5-S9 in Supporting Information S1 for the distributions of signal-to-noise ratios, source depths, epicentral distances and earthquake frequency content. All the event waveforms have both manually picked P and S phase arrivals. The LPs had already been labeled by analysts. We choose the noise waveforms from the same stations as the event waveforms by visual inspection. Most waveforms contain three components (77%) (Figure S10 in Supporting Information S1) and are from earthquakes located within 50 km of an active volcano (95%) (Figure S11 in Supporting Information S1). Since there are far more available VTs than LPs, we only include a similar number of VT waveforms as the number of available LP waveforms. We remove data with large spikes and errors (e.g., events with the S pick prior to the P pick). For waveforms from Japan, we
download event waveforms whose length may vary for different events and different stations. For waveforms from the US, we download event waveforms starting 60 s before the P pick and ending 60 s after the S pick. Hence, waveforms in our data set have different lengths, and they are trimmed in the subsequent processing stages. Compared with previous data sets, for example, STEAD (Mousavi et al., 2019) and INSTANCE (Michelini et al., 2021), our data set has, by design, a wider distribution of frequency index (Figure 2b and Figures S6 and S7 in Supporting Information S1), a measure of the dominant frequency content of an earthquake (Buurman & West, 2010; Matoza et al., 2014) (Text S1 in Supporting Information S1), suggesting it includes a greater variety of seismic events, although the number of waveforms in our data set (334,390) is about one fourth that of STEAD (1,265,657) or INSTANCE (1,291,537). While a small amount of data from a single volcano may be enough to tune methods that lean on expert knowledge to study that volcano, a large and global data set is necessary for a deep neural network to learn generalizable knowledge from analyst picks. To the best of our knowledge, this is the first data set of seismic waveforms compiled from various volcanic regions globally for machine learning.
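The frequency index used throughout this paper is the base-10 logarithm of the ratio of mean spectral amplitudes in an upper and a lower frequency band (Buurman & West, 2010). A minimal sketch follows; the band limits here are illustrative placeholders, and the bands actually used are described in Text S1 of the Supporting Information.

```python
import numpy as np

def frequency_index(waveform, fs, low_band=(1.0, 2.0), high_band=(10.0, 20.0)):
    """FI = log10(mean spectral amplitude in upper band / mean in lower band).

    Negative values indicate energy concentrated at low frequencies (LP-like);
    positive values indicate events rich in high frequencies (VT-like).
    """
    spec = np.abs(np.fft.rfft(waveform))
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / fs)
    low = spec[(freqs >= low_band[0]) & (freqs <= low_band[1])].mean()
    high = spec[(freqs >= high_band[0]) & (freqs <= high_band[1])].mean()
    return float(np.log10(high / low))
```

This single scalar per waveform is what allows event classification to be deferred to a postprocessing step rather than learned by the picker.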

Evaluation of Existing Deep-Learning Phase Pickers
We use 15,078 LP waveforms and 15,057 VT waveforms from Alaska, Hawaii and Japan to evaluate the two most widely used models: PhaseNet (Zhu & Beroza, 2019) and EQTransformer (Mousavi et al., 2020), which are the best-performing architectures in a recent benchmark study (Münchmeyer et al., 2022). PhaseNet is a U-Net with 1D convolutional layers originally trained on earthquakes from Northern California. EQTransformer is a stack of convolutional layers, long short-term memory (LSTM) units, and self-attentive layers originally trained on the global data set STEAD (Mousavi et al., 2019). We divide the testing waveforms into subsets according to frequency index values to evaluate how the model performance varies with the dominant frequency content. We randomly extract 30 s windows around the manual picks of the testing waveforms. For each waveform, the same window is used to test different models. Since EQTransformer operates on a 60 s window, we only focus on the 30 s target window of the output (Münchmeyer et al., 2022). We use precision, recall and F1-score to evaluate

Geophysical Research Letters
10.1029/2024GL108438

the results. Precision is the fraction of output picks that are actually correct. Recall is the fraction of manual picks that are correctly identified by the model. F1 score is the harmonic mean of precision and recall (Text S2 in Supporting Information S1). The original versions of PhaseNet and EQTransformer are tested here since they are the most widely used. Considering that the original EQTransformer and PhaseNet were trained under the TensorFlow framework (Abadi et al., 2015), which differs from the platform we use (PyTorch), and that they were not trained on the same data set, we also include for comparison the variants of EQTransformer and PhaseNet trained on the INSTANCE data set (Michelini et al., 2021), which were trained by Münchmeyer et al. (2022) and are available in the SeisBench package (Woollam et al., 2022). The model output is a time series of "probability" of P and S arrivals. To get predicted picks from the probability time series output by the models, we first extract segments of the probability curves above a given threshold; the peak positions of these segments are taken as pick times. The model-specific threshold is tuned (Figure S12 in Supporting Information S1) on the validation set (Table S1 in Supporting Information S1).
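The threshold-and-peak trigger described above can be sketched as follows. This is a minimal illustration; SeisBench and the original model codes implement their own, more elaborate versions, and the threshold below is a placeholder rather than a tuned value.

```python
import numpy as np

def picks_from_probability(prob, threshold=0.3, fs=100.0):
    """Extract pick times (in seconds) from a model's probability curve.

    Contiguous segments of the curve above the threshold are found, and
    the time of each segment's maximum probability is reported as a pick.
    """
    prob = np.asarray(prob, dtype=float)
    above = np.concatenate(([False], prob > threshold, [False]))
    edges = np.flatnonzero(np.diff(above.astype(int)))
    starts, ends = edges[0::2], edges[1::2]  # rises and falls pair up
    return [(s + int(np.argmax(prob[s:e]))) / fs for s, e in zip(starts, ends)]
```

Because the picker's output is thresholded this way, the choice of threshold trades precision against recall, which is why it is tuned per model on the validation set.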
The recalls, precisions and F1 scores of the original models decrease systematically with decreasing frequency index (Figure 2). For example, the F1 score of PhaseNet decreases from ∼0.9 to ∼0.6 for P picking and from ∼0.85 to ∼0.35 for S picking as the frequency index decreases from ∼0.5 to ∼−1.7. Compared with precision, the recall exhibits a greater deterioration, falling as low as 0.3 for P picking and 0.2 for S picking, indicating that most LPs in the test set have been overlooked. We observe a similar trend for the models trained on INSTANCE (Münchmeyer et al., 2022). This is unlikely to be related to changes in the signal-to-noise ratio distribution since we do not observe significant systematic changes in signal-to-noise ratio with frequency index (Figure 2c and Figure S14 in Supporting Information S1). In addition, the performances on both LPs and VTs decrease with decreasing signal-to-noise ratio, and these models show lower performances for LPs than for VTs at the same signal-to-noise ratio (Figure S15 in Supporting Information S1). Our results suggest that these existing models will likely underreport LPs compared to VTs when directly applied to monitoring volcano seismicity (Bannister et al., 2022; Garza-Girón et al., 2023; Li et al., 2023; Mittal et al., 2022; Suarez et al., 2023; Wilding et al., 2023), which is not ideal since LPs often indicate fluid/magma movements (Chouet & Matoza, 2013; Matoza & Roman, 2022; Song et al., 2023). Therefore, we decided it would be valuable to train a new phase picker specifically for volcano seismicity.

Training Deep-Learning Phase Pickers for Volcano Seismicity
From our data set, 151,431 LP waveforms, 151,657 VT waveforms and 20,000 noise waveforms from Alaska, Hawaii and Japan, corresponding to 70,352 events, are grouped into a training set (83.64%), a validation set (5.49%) and a test set (10.87%) (Table S1 in Supporting Information S1). Here, the earthquake waveforms in the test set are the same as those presented in the previous section. An extra test set comprising 5,651 waveforms from 1,295 LP events and 5,651 waveforms from 1,869 VT events near 18 volcanoes along the Cascadia subduction zone in the Western United States (Figure 1d) is used to test how our model generalizes to a region from which no training data have been used. In addition, 6,224 waveforms of 2,356 tectonic low-frequency earthquakes (LFEs), which also have lower frequency indices, along the Nankai Trough in Japan are used as another test set to investigate whether our model works for tectonic LFEs associated with shear slip on the subduction zone plate interface (Obara & Kato, 2016). We use our data set to train two new models based on the PhaseNet and EQTransformer architectures implemented in the SeisBench package (Woollam et al., 2022). All the waveforms are resampled to 100 Hz. We normalize each component of a waveform by removing the mean and dividing by the maximum value. We perform data augmentation by randomly modifying the waveforms at each step of training. The modifications include randomly shifting waveforms, adding gaps to waveforms, adding Gaussian noise, and superimposing a training example on the shifted and normalized version of another training example that is randomly multiplied by a factor. Each type of augmentation is performed with a given probability. Normalization is performed before and after data augmentation. The labels for phase arrivals are Gaussian functions with peaks aligned with the manual picks. Event types are not included as labels because the models are aimed only at phase picking, while event classification can be conducted in a
postprocessing stage using frequency index. At each step of training, a batch of waveform examples is randomly selected, normalized, randomly augmented, labeled, and fed to the network, whose weights are adjusted using the Adam optimization algorithm (Kingma & Ba, 2015).
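The per-component normalization and Gaussian pick labels described above amount to only a few lines. This is a sketch; the label width `sigma` is an illustrative choice, not necessarily the value used in training.

```python
import numpy as np

def normalize(traces):
    """Remove the mean of each component, then divide by its peak amplitude.

    `traces` has shape (components, samples). Flat (all-zero) components
    are left as zeros rather than producing a division by zero.
    """
    traces = traces - traces.mean(axis=-1, keepdims=True)
    peak = np.abs(traces).max(axis=-1, keepdims=True)
    return traces / np.where(peak > 0, peak, 1.0)

def gaussian_label(n_samples, pick_index, sigma=10.0):
    """Training target: a Gaussian probability curve peaking at the manual pick."""
    t = np.arange(n_samples)
    return np.exp(-0.5 * ((t - pick_index) / sigma) ** 2)
```

Using a smooth Gaussian target rather than a one-sample spike tolerates small uncertainties in the manual picks and gives the network a usable gradient near the true onset.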

The validation set is used to tune hyperparameters. We try various learning rates (0.0001, 0.0005, 0.001) and batch sizes (512, 1024) to obtain a series of models. Each model is trained for 400 epochs. The loss on the validation set is monitored for each epoch, and the model snapshot at the epoch with the lowest validation loss is used as the final model. For each model, we test different decision thresholds and choose the one with the highest F1-score as the optimal threshold. Then we evaluate each model on the validation set and choose the one with the highest F1-score (Tables S2 and S3 in Supporting Information S1). The preferred learning rate and batch size are 0.0005 and 512 for PhaseNet, and 0.001 and 1024 for EQTransformer, respectively. We also compare two initialization strategies: using (a) random weights and (b) the network weights pre-trained on the INSTANCE data set (Michelini et al., 2021; Münchmeyer et al., 2022), and we choose the resulting model with the highest F1-score on the validation set (Table S4 in Supporting Information S1).
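The model-selection loop described above is a small grid search over learning rates and batch sizes, keeping the configuration with the best validation F1-score. A minimal sketch, in which `train_fn` is a hypothetical stand-in for the full training run (400 epochs, lowest-validation-loss snapshot, tuned threshold):

```python
from itertools import product

def grid_search(train_fn, learning_rates=(1e-4, 5e-4, 1e-3), batch_sizes=(512, 1024)):
    """Train one model per (learning rate, batch size) pair and keep the
    configuration with the highest validation F1-score."""
    best = None
    for lr, bs in product(learning_rates, batch_sizes):
        f1 = train_fn(lr, bs)  # validation F1 of this configuration's best snapshot
        if best is None or f1 > best[2]:
            best = (lr, bs, f1)
    return best
```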
We first test our models on subsets with different frequency index values as described in the previous section. Our models trained for volcano seismicity show significant performance improvements for waveforms with low frequency index values compared to existing models, with F1 scores for P and S picking of ∼0.9 and ∼0.8, respectively (Figure 2). There is also a slight improvement for waveforms with high frequency index. The overall performances of the various models on the whole test set are shown in Figure 3a, where our models show the best performances for both LPs and VTs for both P and S picking. For the LPs, the EQTransformer-based network trained in this study achieves an F1 score of 0.9 for P picking and 0.86 for S picking, which are 0.42 (P picking) and 0.16 (S picking) higher than those of the original EQTransformer. The performance improvement is smaller for the VTs: the retrained EQTransformer achieves F1 scores 0.11 and 0.05 higher than the original EQTransformer model for P and S picking, respectively. The EQTransformer trained on INSTANCE has similar performance to the original EQTransformer. A similar amount of improvement is obtained by the PhaseNet-based network trained on our data set. Furthermore, our models give lower picking residuals, as indicated by the narrower histograms of residuals (Figures S18 and S19 in Supporting Information S1), with 2σ (i.e., two standard deviations) of the retrained EQTransformer being 0.38 s for P residuals and 0.46 s for S residuals for LPs, and 0.21 s (P) and 0.32 s (S) for VTs. The retrained EQTransformer shows only a marginally higher F1 score than the retrained PhaseNet, suggesting that the data set plays a more important role than the network architecture in the differences in model performances.
Subsequently, we use the test set from Northern California and the Cascade Range in the Western United States (Figure 1d) to investigate how our models generalize to regions where no training data are used (Figure 3b). All the models show great performance for VTs, with F1 scores for P picking larger than 0.9 and F1 scores for S picking larger than 0.86, and our models achieve the highest F1 scores (0.95). Notably, the existing pickers perform poorly for LPs, with F1 scores ranging from 0.37 to 0.61. Although all the models experience some performance degradation for LPs compared with the previous test, our retrained models still perform significantly better than the existing models, with F1 scores ranging from 0.71 to 0.76. The performance variation with frequency index for this test set (Figure S16 in Supporting Information S1) also suggests that our models have better generalization abilities when applied to a new region. The poorer performances for LPs could be partly explained by the LP waveforms in this test set having lower signal-to-noise ratios than the VT waveforms (Figures S5 and S16 in Supporting Information S1).
Finally, we investigate whether our models also work for tectonic LFEs, since tectonic LFEs and volcanic LPs appear to have similar frequency content, though they are often inferred to reflect different source processes (Aso et al., 2013). Our training set does not explicitly include any tectonic LFEs. Here we test the models on LFEs along the Nankai Trough in Japan. The result is shown in Figure 3c. Our retrained models outperform the original models and the INSTANCE-based models by 0.22-0.49 for P picking and 0.08-0.31 for S picking, with F1 scores of ∼0.8. We further confirmed that our models also work for regular tectonic earthquakes, since they achieve F1 scores of 0.89 and 0.75 for P and S picking, respectively, when tested on the INSTANCE data set (Michelini et al., 2021), which is slightly better than the original EQTransformer and PhaseNet but unsurprisingly inferior to the models trained on the INSTANCE data set (Figure S28 in Supporting Information S1).

Comparison With Existing Methods
Deep-learning-based pickers have higher accuracy and require fewer manually tuned parameters than traditional pickers, for example, STA/LTA (Allen, 1978) and the Baer-Kradolfer picker (Baer & Kradolfer, 1987), as demonstrated in previous studies (e.g., Mousavi et al., 2020; Münchmeyer et al., 2022; Zhu & Beroza, 2019). Also, deep-learning-based pickers have greater flexibility than template matching, as they are not limited by the availability of suitable template events, and they can contribute to generating more templates. Compared with previous deep-learning models aimed at tectonic earthquakes, our models can better pick volcano seismicity and thus are a step toward improving volcano monitoring. Our compiled data set can also be used to benchmark future methods for monitoring volcanic earthquakes.
Our study differs from a few recent studies that have also trained phase pickers on volcanic earthquakes (Armstrong et al., 2023; Kim et al., 2023; Lapins et al., 2021) in two aspects. First, the previous studies focused exclusively on one volcano, and thus it is unclear how well these models can generalize to other volcanoes, while we use data around 146 active volcanoes from different regions. Further efforts in the community might be necessary to cultivate a better data set that covers the wide variety of volcanic signals observed. Second, LPs were not considered in the previous studies despite being an important form of volcano seismicity, while we explicitly included LP earthquakes to build a data set with a more balanced distribution of frequency content. We subsequently demonstrated that our models perform well for both LPs and VTs and can be generalized to other volcanoes.
Finally, our study differs from recent studies which focused on tectonic LFEs (Lin et al., 2023; Münchmeyer et al., 2024; Thomas et al., 2021) in terms of training data and targets. These studies focused on tectonic LFEs, which are a manifestation of creep or slow fault slip (Behr & Bürgmann, 2021), while our target is to pick volcano seismicity including both VTs and LPs. The capability of our models to pick tectonic LFEs is a side benefit and demonstrates that (a) our models are generalizable to other tectonic environments and (b) tectonic LFEs and volcanic LPs have relatively similar waveform characteristics.

Different Ways of Performance Evaluation
The presented evaluation results for different models depend on the metrics used and how they are calculated, which may vary among studies. Therefore, it might not be appropriate to directly compare the values reported in different papers. For instance, some studies calculate true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) based on waveform traces, so that one of the four outcomes TP/FP/TN/FN is assigned to each testing waveform (e.g., Mousavi et al., 2020; Zhu & Beroza, 2019). In this case, a waveform is considered a true positive as long as there is a predicted pick sufficiently close to the manual pick, even if there are also falsely predicted picks for the same waveform. Hence, false predictions may be underreported. In contrast, the definition of positive and negative in this paper is based on sampling points, where one of TP/FP/TN/FN is assigned to each sampling point of a waveform rather than to the whole waveform (Text S2 in Supporting Information S1). The different definitions of FP and FN lead to different values of recall and precision. We have also calculated the model performances using the definition of positive/negative based on waveform traces (Mousavi et al., 2020; Zhu & Beroza, 2019), and the results (Figures S25 and S26 in Supporting Information S1) show similar trends as those presented in the previous section (Figures 2 and 3), except that the absolute values are slightly higher.
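As one concrete variant of these differing conventions, pick-level matching within a time tolerance can be sketched as below. This is illustrative only; the paper's own evaluation is sample-based, as defined in Text S2 of the Supporting Information, and the tolerance here is a placeholder.

```python
def pick_metrics(pred_picks, true_picks, tol=0.5):
    """Precision, recall and F1 with pick-level matching within tol seconds.

    A manual pick with at least one prediction within +/-tol is a true
    positive; predictions matching no manual pick are false positives;
    unmatched manual picks are false negatives.
    """
    tp = sum(any(abs(p - t) <= tol for p in pred_picks) for t in true_picks)
    fp = sum(all(abs(p - t) > tol for t in true_picks) for p in pred_picks)
    fn = len(true_picks) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Note how this counting penalizes the extra spurious pick that a purely trace-based definition would ignore, which is the underreporting issue raised above.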
Alternatively, Münchmeyer et al. (2022) decomposed the evaluation into three tasks: event detection, phase identification and onset time picking. This evaluation workflow avoids the ambiguity in the definition of positive/negative for phase picking. However, it uses the maximum probability value within the tested window as the prediction result, which may be inconsistent with the practical application of a deep-learning picker, where a trigger algorithm is used to retrieve picks from an output probability curve. Nevertheless, our models also show better performances than existing models when evaluated on the three tasks following Münchmeyer et al. (2022)'s workflow (Figures S20-S22 and Tables S5-S6 in Supporting Information S1), although existing models also perform well on the task of event detection, which is easier than phase picking. Therefore, our models show consistently better performances than existing models regardless of the method of performance evaluation.

Conclusion
In this study, we first compile a data set of seismic waveforms from various volcanic regions globally, which has a wider distribution of frequency index than previous data sets of tectonic earthquakes. We then show that existing deep-learning-based phase pickers do not generalize well to volcanic earthquakes, with their performances deteriorating as the earthquakes' frequency content decreases; hence, directly applying them to monitoring volcano seismicity is suboptimal due to potential biases. Finally, we train and test new models using our data set. The test results show that our models better pick P and S phases of VTs and LPs, and generalize to regions not included in our training data set, including to tectonic LFEs. Therefore, our results can benefit future efforts to improve monitoring of volcano seismicity.

We use the network architectures implemented in the SeisBench package (Woollam et al., 2022). We train the networks under the PyTorch framework (Paszke et al., 2019) using the pytorch-lightning package (github.com/Lightning-AI/pytorch-lightning).

Figure 1. Geographical distribution of active volcanoes (filled semicircles) with seismic events used in our data set. The volcanoes are colored by the number of VTs (left halves, in blue) and LPs (right halves, in red) within 50 km. The seismic waveforms of volcano-tectonic earthquakes and volcanic long-period earthquakes from Japan (a), Alaska (b) and Hawaii (c) are split into a training set, a validation set and a test set, while the data from volcanoes along the Cascadia subduction zone in the Western United States (d) and the tectonic low-frequency earthquakes (LFEs) (purple circles in a) from Japan are only used for testing.

Figure 2. (a) Performances of various models on subsets of testing waveforms with different frequency index values. The tick labels on the x axis represent the centers of frequency index bins with a width of 0.2. The F1 scores here are slightly higher than those in Figure 3a because noise waveforms, to which frequency index is not applicable, are not included in this test. (b) Comparison of the frequency index distributions of INSTANCE (Michelini et al., 2021), STEAD (Mousavi et al., 2019) and our data set. (c) The distributions of signal-to-noise ratio for testing waveforms in different frequency index bins.

Figure 3. F1 scores of different models evaluated on the testing waveforms from (a) the same regions as the training data, (b) volcanoes along the Cascadia subduction zone in the western United States, where no data have been used in the training, and (c) tectonic LFEs in Japan. The precision and recall are given in Figures S23 and S24 in Supporting Information S1.