Simultaneous Classification and Location of Volcanic Deformation in SAR Interferograms Using a Convolutional Neural Network

With the evolution of interferometric synthetic aperture radar into a tool for active hazard monitoring, new methods are sought to quickly and automatically interpret the large number of interferograms that are created. We present a convolutional neural network (CNN) that is able to both classify the type of deformation and locate the deformation within an interferogram in a single step. We achieve this through building a “two headed model,” which returns both outputs after one forward pass of an interferogram through the network. We train our model by first creating a data set of synthetic interferograms, but find that our model's performance is improved through the inclusion of real Sentinel‐1 data. We also investigate how model performance can be improved by organizing interferograms such that they can exploit the three channel nature of computer vision models trained on very large databases of labeled color images, but find that using different data in each of the three input channels degrades performance when compared to the simple case of repeating wrapped or unwrapped phase across each channel. We also release our labeled Sentinel‐1 interferograms as a database named VolcNet, which consists of ∼500,000 labeled interferograms. VolcNet comprises time series of unwrapped phase and labels of the magnitude, location, and duration of deformation, which allows for the automatic creation of interferograms between any two acquisitions, and greatly increases the amount of data available compared to other labeling strategies.


Introduction
In recent years, work to extend volcano monitoring to all of the world's ∼1,400 subaerial volcanoes has resulted in the application of several machine learning methods to ground deformation maps produced by interferometric synthetic aperture radar (InSAR). Convolutional neural networks (CNNs) have been used in Anantrasirichai et al. (2018, 2019a), Valade et al. (2019), and Bountos et al. (2021) to determine if individual interferograms contain deformation. This approach has been extended to more subtle deformation signals that are not visible in a short temporal baseline interferogram, both through using cumulative time series (Anantrasirichai et al., 2019b), and through calculating velocity maps from time series (Beker et al., 2023). Time series have also been used by Sun et al. (2020) to detect subtle deformation, with independent component analysis (ICA) by Gaddes et al. (2018) to detect signs of unrest relative to a baseline stage of a volcano's behavior, and with the CUSUM algorithm by Albino et al. (2020) to detect signs of unrest. However, in all of the examples detailed above, each algorithm demonstrates very limited knowledge of the diverse types of deformation that may be measured at volcanoes. The algorithms presented in Anantrasirichai et al. (2019a) and Beker et al. (2023) assign all data containing deformation to one label, whilst the algorithms presented in Gaddes et al. (2018) and Albino et al. (2020) alert users to changes in the signals present, but do not identify the type of deformation present. Consequently, we seek to improve upon these approaches by developing a CNN that is able to differentiate between different types of deformation, and to detect its spatial extent.
Figure 1a shows the hierarchy of computer vision object/signal identification methods. The algorithm presented in Anantrasirichai et al. (2018) contains a model that performs classification and, by breaking larger images into smaller tiles that are each classified, the algorithm as a whole is able to perform localization. This approach has the limitation that the deep learning model used in this algorithm does not need to learn how to determine the location or size of the object (or signal) of interest, and at a more fundamental level, remains a classification and not localisation model. However, in the field of computer vision, CNNs have been developed that are able to perform both classification and localisation on images that contain either single or multiple objects. The location of an object is either indicated through encompassing it in a rectangle (e.g., localisation or object detection, Simonyan and Zisserman (2014), Redmon et al. (2016)) or, in more complex algorithms, indicating the exact outline of an object by identifying which pixels comprise it (e.g., instance segmentation, He et al., 2017). These approaches should provide more detailed information on the spatial extent of a signal of interest than a classification model that is repeatedly used on different areas of the representation. Consequently, we endeavor to advance the state of the art through developing a CNN that is able to both localize deformation within an interferogram, and to classify different types of deformation (the hierarchy of which we show in Figure 1b).
When constructing a CNN to perform both classification and localisation with data derived from SAR satellites, a new CNN could be designed before all the parameters within it are trained. However, this approach fails to utilize both the successful structures and the learned parameters of CNNs that have been successfully applied to other computer vision problems (e.g., the classification of natural images in Krizhevsky et al. (2012) and Simonyan and Zisserman (2014), the instance segmentation of biomedical images in Ronneberger et al. (2015), or the detection of buildings in satellite imagery in Zhang et al. (2016)). In order to describe how we can utilize these successes, we must first introduce the structure of a CNN in more detail, which we do with the use of Figure 1c. In this figure, a CNN can be seen to comprise a convolutional part and a fully connected part. The convolutional part comprises filters that are convolved across an image to extract deep representations, whilst downsampling is performed simultaneously to reduce the spatial size of the features as their depth increases. In the case of the example network shown in Figure 1c, a three channel (color) image of size (224 × 224 × 3) pixels is transformed into a spatially smaller but deep (7 × 7 × 512) representation by this process. In the second part, this 3D representation is flattened into a vector (which in this example would be of size 7 × 7 × 512 = 25,088), before a traditional neural network comprising interconnected neurons is used to create the desired model outputs. The size of the last layer of this second part is dependent on features such as the number of different classes present in the data and, in this example case with two neurons in the last layer, would be used in a case in which there were only two different classes.
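The shape arithmetic in this example can be checked with a short sketch. It assumes, as is the case for VGG16, that each of the five convolutional blocks preserves the spatial size through padded convolutions and ends with a 2 × 2 max pooling step that halves it; the function name is hypothetical and is used only for illustration.

```python
def conv_output_shape(height=224, width=224, n_blocks=5, final_depth=512):
    """Return the (H, W, C) shape after n_blocks, each ending in 2x2 max pooling.

    Assumes padded convolutions (no change in spatial size) and a pooling
    stride of 2, so each block halves the height and width.
    """
    for _ in range(n_blocks):
        height //= 2
        width //= 2
    return height, width, final_depth


h, w, c = conv_output_shape()
flattened_size = h * w * c  # length of the vector passed to the fully connected part
print((h, w, c), flattened_size)
```

Five halvings take 224 to 7, which is why the flattened vector has 7 × 7 × 512 = 25,088 elements.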
Consequently, when using an existing model on a new problem, any change in the output classes will require redesigning and retraining the fully connected part of the network. Therefore, it is common to retain the structure of the convolutional layers (i.e., part one of the model) but design a new fully connected network (i.e., part two of the model) that outputs the classes required by the new problem. Both parts of this model could then be trained using the data available in the domain that the new model is being built for. However, as the filters contained in the convolutional blocks of the first part of a CNN detect features within images such as edges, these are generally similar across domains. Therefore, an approach termed transfer learning reduces the number of parameters to be trained by retaining not only the structure of the convolutional part of the network, but also the weights learned within it when it was applied to the very large labeled data sets available in the field of computer vision. In the field of volcano monitoring, where examples of deformation are relatively scarce in the available labeled data, transfer learning has been used successfully by other studies (e.g., Anantrasirichai et al., 2018).
The weights learned in the convolutional filters of a CNN are of great importance to a network's ability to detect features, as the filters must be sensitive to the patterns that these features present in an image. As networks such as AlexNet (Krizhevsky et al., 2012) and VGG16 (Simonyan and Zisserman (2014), named after the University of Oxford Visual Geometry Group) were originally developed to compete in the ImageNet competitions (Deng et al., 2009), the filters have been trained to detect the type of features present in natural images (e.g., photographs of a person, or car). When performing transfer learning, it is these filters that must be sensitive to the patterns presented in a deformation signal if the network is to correctly classify and locate it.
To develop a model that can both differentiate between different types of deformation and detect its spatial extent, we first explore which of the formats in which an interferogram can be expressed (e.g., wrapped or unwrapped) allows the filters in models trained on natural images to excel. We then explore whether the performance of a model that uses the chosen data format can be improved through using real data in addition to synthetic data during training. To produce our final model, we also determine which of a selection of common pre-trained encoders maximizes the performance of our model.

Classification With Different Data Formats
As the most common CNNs for computer vision are trained on images comprising a channel for each of the red, green, and blue values for each pixel, other data that are to be used with the network would also ideally be three channel. However, when considering an image of interferometric phase, these images contain only a single value for each pixel, and so consist of only one channel. This difference in the number of channels can be circumvented through duplicating the one channel interferogram in each of the three input channels of a CNN, or by discarding parts of the filters of the first convolution (e.g., a filter of size (5 × 5 × 3) could be reduced to (5 × 5 × 1)). However, in this section of our study we wish to determine if this approach can be improved upon by utilizing the three channel structure of many pre-trained CNNs to input more data to the model.

Synthetic InSAR Data
When two SAR images are combined to form a single interferogram, the resulting image is a 2D array of complex numbers (Hanssen, 2001). Whilst the magnitude of each of these complex numbers relates to the underlying brightness and coherence of a given pixel, it is common for only the argument to be displayed, as these phase values can be used to infer ground movement. However, the phase values of an interferogram are wrapped in the range [−π, π] as only the fractional part of the phase value can be measured, but this ambiguity can be estimated to produce an unwrapped interferogram (Chen & Zebker, 2001). We postulate that in addition to the use of either wrapped or unwrapped data duplicated to fill three channels, the original complex numbers of an interferogram could be used in two channels, and so allow the network to use interferometric amplitude as an indicator of the reliability of the phase.
However, we can also consider external data to feed into the CNN. When a human observer interprets an interferogram, they are likely to use data such as a digital elevation model (DEM) as this can be used to help determine if a signal is due to deformation, or due to a topographically-correlated atmospheric phase screen (APS). This problem is of particular importance at stratovolcanoes, as the cones typical of these volcanoes can be several kilometres high, and therefore be capable of creating large and spatially stationary signals in interferograms. The body of literature that covers the application of InSAR to volcanic deformation is replete with studies that consider which of the two mechanisms is responsible for the observed signals, and examples include Beauducel et al. (2000), Rémy et al. (2015), and Yip et al. (2019). When considering previous attempts at the automatic detection of deformation signals in Sentinel-1 interferograms, Anantrasirichai et al. (2019a) also reported that many of the false positives recovered by their algorithm were caused by signals correlated with topography. Consequently, we postulate that the inclusion of a DEM in the inputs to our CNN will improve its ability to differentiate between deformation signals and atmospheric signals that are correlated with topography, and therefore seek to investigate its use as an input into a multichannel model.
To perform this analysis, we first synthesize a data set of labeled interferograms. To achieve this, we have created an open source Python3 package named SyInterferoPy, which we make freely available to the community via GitHub: (https://github.com/matthew-gaddes/SyInterferoPy). The collection of enough labeled data to train a CNN is commonly time consuming or expensive, and we find that the addition of localisation labels to our data makes it more time consuming than in previous studies. Additionally, due to the large number of data that are required to train CNNs and our expansion to classification of different types of deformation, procuring enough real data to do this may not be possible. Consequently, we perform this analysis using only synthetic data. Following the hierarchy proposed in Figure 1b, we create interferograms that contain either no deformation, deformation due to an opening dyke, or deformation due to a sill or point source. These sources were chosen after reviewing the database of volcanic deformation events measured using InSAR in Biggs et al. (2014) as we believe they cover the majority of the observed signals that are of importance for volcano monitoring (i.e., we disregard signals due to processes such as the cooling of lava flows). This approach advances the state-of-the-art by dividing deformation into different types, but future work should take further complexity into account by adding more complex deformation patterns. Model parameters were chosen to be both physically realistic (e.g., dykes have near vertical dips), and for the resulting deformation patterns to have absolute magnitudes in the range [0.05, 0.3] m, which ensured that the signals are visible over the synthetic atmospheric signals. We model the dykes as vertical dislocations with uniform opening in an elastic half space (Okada, 1985) with strikes in the range [0, 359°], dips in the range [75, 90°], openings in the range [0.1, 0.7] m, top depths in the range [0, 2] km, bottom depths in the range [0, 8] km, and lengths in the range [0, 10] km. We model the sill/point sources as horizontal dislocations with uniform opening in an elastic half space (Okada, 1985) with strikes in the range [0, 359°], dips in the range [0, 5°], openings in the range [0.2, 1] m, depths in the range [1.5, 3.5] km, and widths and lengths in the range [2, 6] km. It should be noted that our proposed hierarchy of volcanic deformation signals also includes processes that could be modeled as a point pressure source (commonly referred to as a "Mogi" source (Mogi, 1958)) within the sill/point category, but given that we do not envisage that a deep learning model using satellite data from only one look angle (i.e., ascending or descending) would be able to differentiate between these two models, we generate our synthetic data using only one of them for simplicity.
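The parameter ranges above can be illustrated with a minimal sampling sketch. This is not the SyInterferoPy interface: the dictionary keys, function names, and the use of uniform sampling are assumptions made for illustration, as is the check that a dyke's bottom lies below its top. The additional constraint that the resulting surface deformation falls in [0.05, 0.3] m requires a forward model and is not reproduced here.

```python
import random

# Ranges quoted in the text (angles in degrees, opening in m, depths/lengths in km).
DYKE_RANGES = {
    "strike_deg": (0.0, 359.0),
    "dip_deg": (75.0, 90.0),
    "opening_m": (0.1, 0.7),
    "top_depth_km": (0.0, 2.0),
    "bottom_depth_km": (0.0, 8.0),
    "length_km": (0.0, 10.0),
}
SILL_RANGES = {
    "strike_deg": (0.0, 359.0),
    "dip_deg": (0.0, 5.0),
    "opening_m": (0.2, 1.0),
    "depth_km": (1.5, 3.5),
    "width_km": (2.0, 6.0),
    "length_km": (2.0, 6.0),
}


def sample_uniform(ranges, rng):
    """Draw one value per parameter, uniformly within its range (assumed)."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


def sample_dyke(rng):
    """Sample dyke parameters, redrawing until the bottom is below the top.

    The redraw is an assumed sanity check, not something stated in the text.
    """
    while True:
        dyke = sample_uniform(DYKE_RANGES, rng)
        if dyke["bottom_depth_km"] > dyke["top_depth_km"]:
            return dyke


rng = random.Random(0)  # fixed seed so the sketch is repeatable
example_dyke = sample_dyke(rng)
example_sill = sample_uniform(SILL_RANGES, rng)
```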
These deformation patterns are then combined with a topographically correlated APS, and a turbulent APS, which we discuss generating in more detail in Gaddes et al. (2018). We calculate the topographically correlated APS using the Shuttle Radar Topography Mission (SRTM) 90 m DEM (Farr et al., 2007), and use the coastline information contained within the product to mask areas of water. We also synthesize areas of incoherence and decorrelation within our interferograms, which we mask in order for our synthetic interferograms to be as similar as possible to the Sentinel-1 interferograms automatically created by the LiCSAR processor (Lazeckỳ et al., 2020). Figure 2 shows the results of mixing these different elements to create our synthetic interferograms. This process creates unwrapped data, which can be converted to wrapped data through finding modulo 2π of the unwrapped phase. However, to synthesize both the real and imaginary part of a complex interferogram requires knowledge of both the brightness of a pixel and its phase. To achieve this, we again use the SRTM DEM, and calculate the intensity of reflected electromagnetic radiation at the angles of incidence used by the Sentinel-1 satellites (29.1–46.0°), before adding speckle noise, and calculating the interferometric amplitude between two images (i.e., the product of the two amplitudes). As inputs to CNNs that are to be trained using transfer learning must be rescaled to the inputs used in the original training data, we use only relative values in the range [−1, 1] for the synthetic intensities. With knowledge of the modulus (relative intensity) and argument (wrapped phase) of each pixel of our synthetic interferogram, the real/imaginary components are simply the products of the modulus and cosine/sine of the argument, respectively. Figure 3 shows five different ways we can represent an interferogram using the three channels available. Whilst this is not an exhaustive list of possible combinations or data sources, we believe that these five types are able to fully explore our hypothesis on the use of three channel data, yet are not so numerous as to be too computationally expensive to train.
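The conversion from unwrapped phase to wrapped phase, and from modulus and argument to real and imaginary components, can be sketched for a single pixel as follows. The function names are hypothetical, and the sketch wraps into the half-open interval [−π, π), which differs from the closed range quoted in the text only at the single value π.

```python
import math


def wrap_phase(unwrapped_phase):
    """Wrap a phase in radians into [-pi, pi) using modulo 2*pi."""
    return (unwrapped_phase + math.pi) % (2.0 * math.pi) - math.pi


def real_imaginary(modulus, wrapped_phase):
    """Return (real, imaginary) = modulus * (cos, sin) of the wrapped phase."""
    return modulus * math.cos(wrapped_phase), modulus * math.sin(wrapped_phase)


# One pixel with 1.5*pi radians of unwrapped phase and a relative intensity of 0.8.
wrapped = wrap_phase(1.5 * math.pi)      # equivalent to -0.5*pi after wrapping
real, imag = real_imaginary(0.8, wrapped)
```

In practice the same operations would be applied element-wise to whole arrays, and the resulting layers stacked to fill the three input channels.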

Deformation Classification Model
The CNN we build to classify the synthetic interferograms uses the five convolutional blocks of VGG16 (Simonyan & Zisserman, 2014), with our own fully connected network after this. This network was chosen as, when used in the field of computer vision for classifying natural images, it outperformed older models such as AlexNet (Simonyan & Zisserman, 2014) (which is used in the algorithm presented in Anantrasirichai et al. (2018)). Additionally, due to its uniform structure built solely from convolution and max pooling operations, VGG16 remains more intuitive to work with than newer models such as ResNet (He et al., 2016), and Inception (Szegedy et al., 2015). Unlike many other CNNs that are used only for classification, VGG16 was used by Simonyan and Zisserman (2014) to perform localisation of items it classifies, and therefore aligns with our goals.
Figure 4 (caption fragment; the beginning was not recovered): [...] These are flattened to create a large fully connected layer featuring 25,088 neurons, which is connected to both the upper branch/head, which performs classification, and the lower branch/head, which performs localisation. We find the localisation problem more complex than classification, and consequently our localisation branch/head features more layers, each with more neurons. The output of the localisation head is a vector of four values determining the position and size of the deformation, whilst the output of the classification head is a vector of three values that indicate the probability for each class, and sum to one.
Figure 4b shows an overview of the model, in which interferograms of shape (224 × 224 × 3) are passed through the five convolutional blocks of VGG16 to create a tensor of shape (7 × 7 × 512). This is flattened to make a vector of size 25,088, before being passed through fully connected layers of size 256, 128, and an output layer of size three (i.e., dyke, sill/point, or no deformation). This network architecture was chosen as the best performing after a parameter search in which we varied both the number of layers, and the number of neurons within them. The localisation output shown in the figure is not used in our preliminary exploration of which channel format to use (Section 2), but is used in Section 3. To produce a set of outputs that can be used as probabilities, we use a softmax activation for the last layer (Bridle, 1990), but on the remaining layers we use rectified linear units to reduce computation time (Agostinelli et al., 2014). As our model seeks to solve a classification problem, we use categorical cross entropy for the loss function, which we seek to reduce using the Nadam optimizer as this has a broadly applicable default learning rate of 0.001 (Dozat, 2016), and so avoids the potentially computationally costly search for an applicable learning rate that is common with other optimizers (e.g., stochastic gradient descent). To train the model using the five different types of synthetic data, we perform what is termed "bottleneck learning" in machine learning literature (e.g., Yu & Seltzer, 2011). This method of training a CNN is used when only the weights within the fully connected layer are updated (i.e., transfer learning is being performed on the convolutional filters), and comprises first computing the results from passing our entire data set through the first five blocks of VGG16, before then training only the fully connected parts of our network (i.e., the classification output). When a three channel image is passed through the first five blocks of VGG16, a tensor of shape (7 × 7 × 512) and termed a bottleneck feature is created, which we illustrate in Figure 4a. This method is highly efficient as we do not generally wish to update the weights in the convolutional blocks of VGG16, yet passing the data through these blocks is computationally expensive. By passing the data through the convolutional blocks just once, we can then repeat only the relatively inexpensive passes of the data through the fully connected parts of our network as we update the weights contained within these layers. This method is of particular use for practitioners who do not have access to high power computing facilities or GPUs.
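The softmax activation and categorical cross-entropy loss used for the classification output can be written out explicitly. The sketch below is a plain Python illustration of the two formulas rather than the code used to train the model; the example scores and the class ordering are hypothetical.

```python
import math

CLASSES = ("dyke", "sill/point", "no deformation")  # ordering assumed for illustration


def softmax(scores):
    """Convert raw output scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_label, probabilities, eps=1e-12):
    """Negative log of the probability assigned to the labeled class."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_label, probabilities))


probabilities = softmax([2.0, 0.5, -1.0])  # hypothetical scores for one interferogram
loss = categorical_cross_entropy([1, 0, 0], probabilities)  # labeled as "dyke"
```

The loss shrinks toward zero as the probability assigned to the labeled class approaches one, which is the quantity the optimizer seeks to reduce.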
A common problem of CNNs that are used for classification can be overfitting of the training data, which results in a model that generalizes to new data poorly (Krizhevsky et al., 2012). Overfitting is commonly caused by insufficient training data, but can also be caused by issues such as using a model with too much complexity for the desired task, or training a model for too many epochs (Chollet, 2017). We endeavor to limit overfitting through the use of dropout (Srivastava et al., 2014) before both the 256 and 128 neuron layers, as through randomly removing some connections during each pass of the data through our model, this method aims to ensure that our model is forced to learn more robust representations of the training data. As we use synthetic data, we are not limited by the usual cost of collecting labeled data, and therefore are able to generate 20,000 unique interferograms that are evenly distributed between classes without the use of data augmentation.

Model Results With Different InSAR Data Formats
Figure 5 shows the results of training five models with each of the data formats previously discussed. We also experiment with changing the order of the channels within each of the five data formats (e.g., moving the DEM from channel three to two), but find this makes no difference to model performance. The highest classification accuracy achieved is ∼0.95, which is achieved when the models are trained with either wrapped or unwrapped data repeated across the three input channels. However, it should be noted that the unwrapped phase model takes the full 20 epochs to achieve this performance, which contrasts with the wrapped phase model, which shows little change after the eighth epoch. Inclusion of the DEM as the third channel appears to reduce classification accuracy, whilst very low accuracies are achieved in the real and imaginary channel case. We discuss these results in more detail in Section 4, but for the remainder of the paper we choose to work with data that is unwrapped and repeated across the three input channels. We choose this approach as no significant differences are seen between the classification accuracy ultimately achieved with either wrapped or unwrapped data, but the use of unwrapped data may allow for a model to be used with unwrapped time series, and so detect subtle signals produced by low strain rate processes. Additionally, a model that works with unwrapped data may also provide the opportunity to be expanded to locate and classify unwrapping errors automatically.

Using Synthetic Data
In the previous section, we demonstrated that, when using VGG16 with convolutional weights learned on ImageNet data, roughly optimal performance for classifying synthetic interferograms is achieved when either the wrapped or unwrapped phase is repeated across the three input channels. We choose to progress with only the unwrapped phase model, as the computational cost of unwrapping is often already met by automatic processing systems (e.g., LiCSAR, Lazeckỳ et al., 2020), and the development of models that use unwrapped phase may lead to benefits such as the ability to classify and locate unwrapping errors. In this section, we build on the model used to perform classification by adding localisation output. We also endeavor to ascertain if the expense of collecting labeled data can be avoided entirely through the continued use of synthetic data when training our model.
Figure 5. Accuracy of classifying validation data (10% of the total) during training using three channel data arranged in different formats. "u": unwrapped data, "w": wrapped data, "d": digital elevation model (DEM), "r": real component of interferogram, "i": imaginary component of interferogram. Low accuracy is seen for the "rid" data, and in both the wrapped and unwrapped cases inclusion of the DEM in the third channel is seen to degrade classification accuracy. At the end of the 20 epochs of training, only a small difference is seen in accuracy between wrapped and unwrapped data, with both classifying ∼95% of the validation data correctly, though the wrapped phase model is seen to achieve this level of accuracy more quickly (requiring only eight epochs of training). Whilst we see slight changes in the accuracy at the end of each of the latter epochs, we interpret the lines as having broadly plateaued and conclude that 20 epochs were sufficient for training these models.
We achieve both classification and localisation through dividing the fully connected section of our model to produce two distinct outputs. One output returns the class of the input data in the manner described in Section 2, whilst the second returns the location and size of any deformation within the scene. In machine learning parlance, models of this type are termed double headed, and we subsequently refer to either of the outputs and their corresponding preceding layers as either the classification head or localisation head. Figure 4b shows the structure of the two heads, and how they diverge after the output of the fifth block of VGG16 has been flattened. The localisation head is structured in a similar manner to the model described in Simonyan and Zisserman (2014), in which the model conveys the location of any deformation through outputting a column vector containing four values. Two of these values determine the center of the deformation pattern and two display its horizontal and vertical extent. Together, these four values can be used to construct a box encompassing a deformation pattern. However, we find that an acceptable level of localisation performance cannot be achieved with a fully connected network with the same complexity as the classification head, and were required to increase both the number and size of layers in the localisation head's fully connected network. A simple network architecture search finds that the simplest model capable of achieving good performance has five layers consisting of 2,048, 1,024, 512, 128, and 4 neurons.
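How four localisation values define a box can be shown with a short sketch. It assumes the vector is ordered (center x, center y, horizontal extent, vertical extent) in pixels and that the extents are full widths rather than half widths; neither convention is stated in the text, and the function name is hypothetical.

```python
def box_corners(center_x, center_y, extent_x, extent_y, image_size=224):
    """Convert a (center, extent) vector into (x_min, y_min, x_max, y_max).

    The box is clipped to the image so that it can be drawn directly.
    Extents are treated as full widths (an assumption).
    """
    x_min = max(0.0, center_x - extent_x / 2.0)
    y_min = max(0.0, center_y - extent_y / 2.0)
    x_max = min(float(image_size), center_x + extent_x / 2.0)
    y_max = min(float(image_size), center_y + extent_y / 2.0)
    return x_min, y_min, x_max, y_max


# A hypothetical prediction: deformation centered at (112, 100), 40 x 20 pixels.
corners = box_corners(112, 100, 40, 20)
```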
We use the mean squared error between the predicted location vector and the labeled location vector as our localisation loss function, which we seek to minimize. When using three arc second pixels (∼90 m) with a loss function of this type, a mean squared error of 400 pixels² would correspond to the localisation being incorrect by around √400 = 20 pixels, or ∼2 km. However, when using a double headed network, training is complicated by the fact that the model's overall loss is now a combination of the classification and localisation loss, which must be balanced using a hyperparameter commonly termed loss weighting (Chollet, 2017). In contrast to the localisation loss, we use categorical cross-entropy for the classification loss and, as the value produced by this is generally several orders of magnitude lower than the localisation loss, we find that a weighting of 1,000 for the classification loss and 1 for the localisation loss produces a model which trains well as the losses are approximately balanced.
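The arithmetic behind the localisation loss and the loss weighting can be made concrete with a minimal sketch. The function names and the example values are hypothetical; only the 1,000:1 weighting, the ∼90 m pixel size, and the mean squared error of 400 are taken from the text.

```python
import math


def mean_squared_error(predicted, labeled):
    """Mean of squared differences between two location vectors (pixels squared)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, labeled)) / len(labeled)


def weighted_total_loss(classification_loss, localisation_loss,
                        classification_weight=1000.0, localisation_weight=1.0):
    """Combine the two head losses using the weighting quoted in the text."""
    return (classification_weight * classification_loss
            + localisation_weight * localisation_loss)


def mse_to_distance_km(mse_pixels_squared, pixel_size_m=90.0):
    """Convert a mean squared error into an approximate distance in km."""
    return math.sqrt(mse_pixels_squared) * pixel_size_m / 1000.0


# Each of the four predicted values is wrong by 20 pixels, giving an MSE of 400.
loc_loss = mean_squared_error([100, 100, 40, 40], [120, 80, 60, 20])
error_km = mse_to_distance_km(loc_loss)  # 20 pixels at 90 m is 1.8 km (~2 km)
total = weighted_total_loss(0.3, loc_loss)  # hypothetical cross-entropy of 0.3
```

With a cross-entropy of order 0.1 to 1 and a localisation loss of order hundreds, the factor of 1,000 brings the two terms to a comparable size.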
To increase the performance of our classification and localisation model, we train it using a two step process. In the first, we train it in a similar manner to that described in the previous section, and update only the parameters within the fully connected network. In the second step, we unfreeze the parameters in the fifth block (i.e., the last convolutional filters of VGG16), and continue to train both these parameters and those contained within the fully connected network. As the second step starts with parameters that are already approximately correct, optimizers that adaptively change the learning rate cannot be used, as any initial large updates can destroy a model's performance. Instead, we use the "Adam" optimizer (Kingma & Ba, 2014) and, after experimentation, find a learning rate of 1 × 10⁻⁵ neither destroys previous model performance, nor is too slow to train. As the updates to the fifth block performed in the second step of our training preclude the use of bottleneck features, we instead train our classification and localisation model on an Nvidia GTX 1070 GPU.

Application to Real Data: The VolcNet Database
Whilst the model described in the previous section achieved good performance when classifying and locating deformation in synthetic interferograms, for use in automatic detection algorithms we require our CNN to work with Sentinel-1 data. These data are of particular importance for volcano monitoring, as the European Space Agency's data policy ensures that Sentinel-1 data are available quickly and at no cost, whilst the low revisit times ensure that the majority of sub-aerial volcanoes are imaged approximately every 12 days, with this set to fall for some volcanoes when Sentinel-1C is launched. We therefore create a database of labeled Sentinel-1 interferograms, which we term VolcNet and make freely available via GitHub: https://github.com/matthew-gaddes/VolcNet.
To populate our database, we chose a selection of volcanoes for which deformation is known, and for which the LiCSAR automatic interferogram processor (https://comet.nerc.ac.uk/COMET-LiCS-portal/) had created networks of interferograms with no gaps and of long temporal duration (e.g., multiple years). This resulted in our use of Campi Flegrei, Vesuvius, Agung, Wolf, Sierra Negra, Cerro Azul, Erta Ale, La Palma, and Domuyo. We filtered the interferograms with a Goldstein filter (Goldstein & Werner, 1998), unwrapped using SNAPHU (Chen & Zebker, 2001), and masked pixels with an average coherence below 0.7, before creating time series using LiCSBAS (Morishita et al., 2020).
To label our database, we develop an approach in which we create generic labels that describe the duration, magnitude, and spatial extent of deformation for each volcano. In contrast to traditional labeling approaches that  2022)), our approach allows us to create labeled interferograms between any two Sentinel-1 acquisitions. Consequently, with relatively few labels, time series with N acquisitions can be quickly converted into sets of N² − N labeled interferograms. We define two types of deformation label: transient deformation, which is relatively short lived and would be imaged by a syneruptive interferogram, and persistent deformation, which is generally of low rate but spans multiple acquisitions. A choice of threshold is also required for the deformation predicted by the label to be considered as visible in an interferogram, as in the cases of persistent deformation of low rate, we do not want our short temporal baseline interferograms (e.g., 12 days) to be labeled as containing deformation. Figure 6 shows the VolcNet data and label for Sierra Negra, as this contains both persistent deformation (inflation prior to the 2018 eruption), and transient deformation (the 2018 eruption).
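The pair counting and the role of the visibility threshold can be sketched as follows. This does not reproduce the VolcNet label format: the representation of a persistent label as a start day, end day, and constant rate, the 0.05 m threshold, and all function names are assumptions made for illustration.

```python
from itertools import permutations


def all_interferogram_pairs(acquisition_days):
    """Every ordered pair of distinct acquisitions: N**2 - N pairs for N days."""
    return list(permutations(acquisition_days, 2))


def deformation_between(day_a, day_b, start_day, end_day, rate_m_per_day):
    """Magnitude of deformation (m) accrued between two acquisitions.

    Assumes a constant-rate label active between start_day and end_day.
    """
    first, last = sorted((day_a, day_b))
    overlap_days = max(0, min(last, end_day) - max(first, start_day))
    return overlap_days * rate_m_per_day


def contains_visible_deformation(pair, start_day, end_day, rate_m_per_day,
                                 threshold_m=0.05):
    """Label a pair as deforming only if the accrued signal exceeds a threshold."""
    return deformation_between(*pair, start_day, end_day, rate_m_per_day) >= threshold_m


days = [0, 12, 24, 36, 120]  # hypothetical acquisition days
pairs = all_interferogram_pairs(days)
labels = {pair: contains_visible_deformation(pair, 0, 360, 0.001) for pair in pairs}
```

Under these assumed values, the 12 day pair accrues only 0.012 m and is labeled as containing no deformation, whereas the 120 day pair accrues 0.12 m and is labeled as deforming.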
For the vast majority of time series in the collection, labeling was performed by drawing on the results of previous studies in which inversions had been performed to fit the signals observed in the interferograms, using Albino et al. (2019) for Agung, Xu et al. (2016) for Wolf, Gaddes et al. (2018) for Sierra Negra, Moore et al. (2019) for Erta Ale, and Galetto et al. (2019) for Cerro Azul. For the remaining time series, labeling was performed through inspection of the signals present. Additionally, several of the studies from which labels were created contain independent validation data in the form of ground truth measurements made using global navigation satellite systems (e.g., Global Positioning System time series are available at Sierra Negra). These data ensure that signals present in time series that are interpreted as being due to physical processes (such as the inflation of a sill or point source) are not actually of atmospheric origin, and are in fact due to deformation of the volcano. However, in some examples assigning a single class to a complex deformation pattern is difficult, and we instead assign what we deem the dominant class to be, whilst expecting that the network should assign some probability to other classes. This is most evident at Wolf, in which signals were attributed to both the deflation of a sill and the opening of a dyke (Novellis et al., 2017; Xu et al., 2016). Figure 7 details the results of labeling each of these time series, and then creating all possible interferograms between all Sentinel-1 acquisitions.
As we envisage that real data is likely to improve the performance of our classification and localisation model, we do not test our model on all of the VolcNet data. Instead, we first create a class-balanced set of 1,500 interferograms, before assigning 80% of these to training, 10% to validation, and 10% to testing. Class balancing is limited by the relative scarcity of interferograms labeled as "dyke," for which there are only 500 examples; with three classes, this limits our total data set size to 3 × 500 = 1,500 interferograms. For the "sill" and "no deformation" interferograms, we randomly select 500 from each. We choose this method over the alternative approach of assigning different volcanoes into the train/validate/test sets (e.g., all interferograms imaging Sierra Negra are in the training set, whilst all interferograms imaging Wolf are in the test set), as our labeled data set is currently too small for this. In the case of interferograms that are labeled as containing deformation due to an inflating dyke, these come from only two volcanoes, leading to the challenging case in which the model would be trained on only the interferograms imaging Agung, yet be tested on those imaging Wolf. Should more labeled data become available, experimenting with this approach may yield insights into how deformation generalizes between different volcanoes.
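The class balancing and 80/10/10 split can be illustrated as follows. This is a sketch under stated assumptions, not the code used in the study: the function name `balanced_split` is hypothetical, the split is a simple random one over the pooled balanced set, and the "sill" and "no deformation" counts in the usage lines are invented purely for illustration (only the 500 "dyke" examples come from the text).

```python
import random
from collections import defaultdict

def balanced_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Undersample every class to the size of the rarest class, then split.

    labels: one class label per interferogram.
    Returns three lists of indices into `labels`: train, validate, test.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # The rarest class sets the number of examples kept from every class.
    n_per_class = min(len(indices) for indices in by_class.values())
    selected = []
    for indices in by_class.values():
        selected.extend(rng.sample(indices, n_per_class))
    rng.shuffle(selected)

    n_train = int(round(fractions[0] * len(selected)))
    n_validate = int(round(fractions[1] * len(selected)))
    train = selected[:n_train]
    validate = selected[n_train:n_train + n_validate]
    test = selected[n_train + n_validate:]
    return train, validate, test

# 500 dykes as in the text; the other two counts are illustrative only.
labels = ["dyke"] * 500 + ["sill"] * 2000 + ["no deformation"] * 3000
train, validate, test = balanced_split(labels)
```

With three classes limited to 500 examples each, the balanced set holds 1,500 interferograms, of which 1,200 go to training and 150 each to validation and testing.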
Column one of Figures 8 and 9 shows the results of applying our trained classification and localisation model to a selection of the 150 (i.e., 10% of 1,500) testing Sentinel-1 interferograms from the VolcNet database. Whilst the previous paragraph details how these are randomly selected from the entire data set, we choose our subset of 10 of the 150 to showcase different volcanoes and interesting results. Interferograms such as Interferogram 2 of Figure 9 show a very clear inflation signal at Sierra Negra, and are correctly classified by the CNN ("sill/point"), whilst the localisation is broadly correct. Other promising results include the labeling of the Wolf coeruptive interferograms (Interferogram 4, Figure 9) as containing a dyke ("dyke"), which is also localized well. However, some interferograms are wrongly classified, such as the subtle signal seen at Vesuvius (Interferogram 0, Figure 8), and the strong atmospheric signals at La Palma (Interferogram 3, Figure 8) and Campi Flegrei (Interferogram 1, Figure 8). At Vesuvius, the deformation signal is both small and surrounded by incoherence and atmospheric signals, and is therefore unlikely to be labeled by a human observer as deformation without inspection of the complete time series. At Campi Flegrei and La Palma, the strong atmospheric signals juxtapose positive and negative signals in a manner somewhat similar to a dyke (our model's label), and this misclassification is likely to be due to our synthetic atmospheric signals not being complex enough to allow our CNN to learn to differentiate between them and deformation. The divergent nature of our CNN's two heads also leads to outputs that show disagreement between them. Interferogram 0 of Figure 9 demonstrates this, in which deformation at Erta Ale is localized approximately but the label is incorrect, although "dyke" has been assigned a probability of 0.48. We again attribute this misclassification to a lack of complexity in our synthetic data limiting what our CNN can learn, as the synthetic dykes we use for training (e.g., Figure 2, Interferograms 2 and 3) are generally more elongate and less complex than the signal seen at Erta Ale.

Augmentation of Training Data With the VolcNet Database of Sentinel-1 Data
To increase the performance of our model further, we incorporate real Sentinel-1 data from our VolcNet database into the training. We initially tried to simply fine-tune our previous model using real data, but found that better performance was achieved when the entire training process was repeated using both synthetic and real data. Of the class-balanced set we introduced in the previous section, 0.8 × 1,500 = 1,200 are available for training, with the remaining 300 split between validation and testing. However, 20,000 synthetic interferograms were used to train the previous model, and the inclusion of 1,200 new interferograms is unlikely to impact the model significantly as these could still be classified poorly with minimal increase in the loss function. We therefore apply data augmentation, which involves creating random flips, rotations, and translations of the interferograms to extend the real training data to feature 20,000 unique, though often highly correlated, Sentinel-1 interferograms. With the exception of including real data, we train our model in the same manner as described in the previous section.
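Because each interferogram carries a localisation label, any augmentation must transform the bounding box together with the image. The sketch below shows this for left-right flips and 90° rotations only; it is an illustration under assumptions and not the augmentation code used in the study. The box format (x centre, y centre, width, height, in pixel indices) and all function names are hypothetical, and translations and arbitrary-angle rotations, which require padding or interpolation, are omitted for brevity.

```python
import numpy as np

def flip_left_right(ifg, box):
    """Mirror an interferogram about its vertical axis and update the box."""
    x, y, w, h = box
    nx = ifg.shape[1]
    return np.fliplr(ifg), (nx - 1 - x, y, w, h)

def rotate_90(ifg, box):
    """Rotate 90 degrees counterclockwise (np.rot90) and update the box."""
    x, y, w, h = box
    nx = ifg.shape[1]
    # A pixel at (row y, column x) moves to (row nx - 1 - x, column y),
    # and the box width and height swap.
    return np.rot90(ifg), (y, nx - 1 - x, h, w)

def augment(ifg, box, rng):
    """Apply a random flip and a random number of quarter turns."""
    if rng.random() < 0.5:
        ifg, box = flip_left_right(ifg, box)
    for _ in range(rng.integers(0, 4)):
        ifg, box = rotate_90(ifg, box)
    return ifg, box

# Toy interferogram with one marked pixel at the centre of its bounding box.
rng = np.random.default_rng(0)
ifg = np.zeros((8, 10))
ifg[2, 5] = 1.0          # row (y) = 2, column (x) = 5
box = (5, 2, 1, 1)
augmented_ifg, augmented_box = augment(ifg, box, rng)
```

Repeated calls with different random states produce many variants of each real interferogram, which are unique but, as noted above, highly correlated.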

Earth and Space Science
Our new model has a lower combined loss for all class types, and column two of Figures 8 and 9 shows the results of applying it to the same test set of 150 real Sentinel-1 VolcNet interferograms used in Section 3.2. Inspection of this figure shows that our model is now better able to handle interferograms with strong atmospheric signals, with Interferograms 1 and 3 of Figure 8 now correctly classified as "no deformation." Our model is also able to better classify deformation (e.g., Interferogram 0 of Figure 9 is correctly labeled as "sill/point" rather than no

Varying the Encoder to Improve Model Performance
To increase the performance of our model further, we vary the encoder used to create our deep representation of the input data. As all the encoders we use were trained on ImageNet, we do not envisage that varying the channel format, the fully connected classification and localisation heads, or the data used for training (synthetic and real) is likely to change the model performance. Therefore, to avoid the high computational cost of repeating all our previous experiments, we hold these factors constant, and vary only the encoder used.
We choose InceptionV3 (Szegedy et al., 2015), ResNet152V2 (He et al., 2016), and EfficientNetB0 (Tan & Le, 2019) as our encoders. The InceptionV3 encoder contains inception modules, which perform convolutions with different sized filters in parallel, and allow the model to encode features of different spatial scale (Szegedy et al., 2015). It contains ∼21 million trainable parameters, and achieves a top-5 accuracy on ImageNet of 93.3% (Hoeser & Kuenzer, 2020). The ResNet152V2 encoder uses skip (or residual) connections to propagate gradients more effectively during training, and so allows deeper architectures to be trained. It contains ∼58 million trainable parameters, and achieves a top-5 accuracy on ImageNet of 95% (Hoeser & Kuenzer, 2020). The EfficientNetB0 encoder is part of a family of scalable encoders (B0-B7), and we choose the member with the lowest number of trainable parameters to investigate the performance of models that are likely to be easier to train from scratch in the future. It contains ∼4 million trainable parameters, and achieves a top-5 accuracy on ImageNet of 93% (Hoeser & Kuenzer, 2020).
Figure 10 shows the results of evaluating different subsets of the real Sentinel-1 test data from the VolcNet database with VGG16 and the three new encoders. When considering all 150 of the test data, each of the three new encoders outperforms VGG16 at determining deformation type (higher classification accuracy), but only EfficientNetB0 outperforms it at determining location and size of deformation (lower localisation loss). When the data is divided by deformation type (dyke, sill, or no deformation), EfficientNetB0 consistently plots in the lower right corner, indicating that it performs well at both classification and localisation on all the data classes.
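The two quantities compared between encoders can be computed from the outputs of the two heads as sketched below. This is illustrative only: the function names are hypothetical, and the use of mean squared error between predicted and labeled box parameters as the localisation loss is an assumption, as the loss function is not restated in this section.

```python
import numpy as np

def classification_accuracy(probabilities, true_classes):
    """Fraction of interferograms whose most probable class matches the label."""
    predicted = np.argmax(np.asarray(probabilities), axis=1)
    return float(np.mean(predicted == np.asarray(true_classes)))

def localisation_loss(predicted_boxes, true_boxes):
    """Mean squared error over the four box parameters (assumed loss)."""
    difference = np.asarray(predicted_boxes, dtype=float) - np.asarray(true_boxes, dtype=float)
    return float(np.mean(difference ** 2))

# Toy outputs for four interferograms and three classes
# (0: dyke, 1: sill/point, 2: no deformation; this ordering is illustrative).
probabilities = [[0.10, 0.80, 0.10],
                 [0.50, 0.30, 0.20],
                 [0.20, 0.20, 0.60],
                 [0.48, 0.50, 0.02]]
true_classes = [1, 0, 2, 0]
accuracy = classification_accuracy(probabilities, true_classes)

predicted_boxes = [[10, 10, 4, 4]]
true_boxes = [[12, 10, 4, 4]]
loss = localisation_loss(predicted_boxes, true_boxes)
```

An encoder that performs well on both tasks has a high accuracy and a low loss, which corresponds to the lower right corner of Figure 10.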
Column three of Figures 8 and 9 shows the results of our final EfficientNetB0 based model when applied to the same test set of 150 real Sentinel-1 VolcNet interferograms used in Sections 3.2 and 3.3. Examples of the continued reduction in loss are particularly evident in the increased accuracy of the boxes that bound deformation signals (e.g., Interferogram 2, Figure 9).

Discussion
From the analysis performed in Section 2 we conclude that the incorporation of a DEM into our CNN could not be achieved through the relatively simple step of using it as one channel in multichannel data. This is likely because the weights in the first five convolutional blocks of our model were transferred from VGG16 and, as VGG16 was trained using natural images, inputs which are broadly similar across all three channels are required. It should be noted that we rescaled our training data to lie in the same range as the data that VGG16 was trained on (described further in Section 2), and therefore the lack of similarity across channels we refer to is not due to different magnitudes, but rather, different spatial patterns. However, an approach where the weights within the convolutional blocks of a classification and localisation model were trained from scratch may easily allow for the incorporation of extra data in the different input channels. Whilst this approach was considered during the design of this study, we do not expect that training a CNN from scratch (i.e., training both the convolutional filters and the fully connected network) is feasible with only ∼1,500 real Sentinel-1 interferograms (i.e., the subset of data in which we balance our three data classes), and we did not have the resources available to create a larger database of labeled data.
When one datum is repeated across the three input channels, our finding that unwrapped data slightly outperforms wrapped data is surprising. We speculate that whilst the fringes that are usually present in wrapped data can make deformation signals more apparent, these images do not generally look like those that we see in daily life. As the filters contained in the initial convolutional layers of our model were trained on images that are seen in daily life (e.g., images of pedestrians, to be used by models powering autonomous vehicles), we postulate that images of unwrapped phase may be more similar to these than wrapped ones, and so be best suited for use with convolutional filters that were trained on them.
Additionally, our finding that model performance is increased by training with both synthetic and real data suggests that the scarcity of real data cannot be circumvented by simply generating large amounts of synthetic data. However, we envisage that synthetic data could be used to train a model with InSAR specific data in the channels (e.g., a DEM) from scratch until reasonable performance is achieved, before switching to real data to complete the final training. By including data such as the DEM, it is likely that a model built in this way would outperform our model, but this is beyond the remit of our current study.
The results presented in Sumbul et al. (2019) explore this theme further, and they find that when using ∼600,000 labeled Sentinel-2 images they are able to train a shallow CNN with a channel for each of Sentinel-2's 13 spectral bands that outperforms a deeper model that was pre-trained using ∼1.2 million ImageNet images and used only the three visible Sentinel-2 spectral bands. Therefore, we expect that through further development of the VolcNet database that we introduce in this work (and make freely available to the community), models that are able to use disparate data in the different channels may become trainable, with resulting increases in performance over both the model presented here and models that are initially trained with large amounts of synthetic data.
However, should the development of a larger training database continue to be problematic, information such as the DEM may be best incorporated through the use of a two input model, in which one set of convolutional filters is applied to the phase information, whilst a second is applied to the DEM. These two networks could then be merged at the fully connected stage, in much the same way as our fully connected model diverges into two outputs. Should this be successful, it may also provide a method to add further inputs to a model, such as those outputted by a weather model, which may reduce false positives due to occurrences such as a strong topographically correlated APS. However, training the weights of a model from scratch and exploring more complex multi-input model architectures remains beyond the remit of this study.
The results presented in column one of Figures 8 and 9 show that a model trained only with synthetic data is able to classify and locate deformation signals in Sentinel-1 data. However, it is only successful in cases with particularly clear deformation patterns, and in cases with more subtle signals generally erroneously resorts to labeling these as not containing deformation. Additionally, strong atmospheric signals are often misclassified as deformation. It is possible that these limitations may be overcome through the use of more realistic synthetic data, as our result suggests that our current methodology does not describe processes well enough to be used without real data. The generation of more realistic deformation patterns may be achieved through steps such as more intelligent sampling of the parameters used in the forward models used to generate the deformation patterns, the use of different types of deformation models such as penny-shaped cracks (Fialko et al., 2001) or point/Mogi sources (Mogi, 1958), and the superposition of multiple deformation patterns in a single interferogram such as was observed prior to the 2005 eruption of Sierra Negra (Jónsson, 2009). The generation of more realistic atmospheric signals could be achieved through increasing the complexity of synthetic data, such as through the use of phase-elevation ratios that are non-linear or spatially variable, or through using data from different sources. Interferograms that image regions with little deformation could be used to increase the complexity of the set of "no deformation" data, or combined with synthetic deformation patterns to produce more complex semi-synthetic data.
The results presented in column two of Figures 8 and 9 show the benefit of incorporating real data. The correct classification of strong atmospheric signals suggests that our synthetic data does not produce complex enough atmospheres, and that the inclusion of real data is required to address this. In future work, generative deep learning methods could be used to generate more complex synthetic atmospheric signals to circumvent this.
The results presented in column three of Figures 8 and 9 show that more advanced encoders such as EfficientNetB0 can be used to improve our model further, with the localisation of deformation seen to improve significantly from the previous model. However, scope for improvement remains, with clear classification and localisation errors remaining (e.g., coeruptive deformation at Wolf shown in Interferogram four of Figure 9 is neither classified nor located accurately). To improve our model further, we believe a change in architecture to one that was built with more emphasis on localisation may be required. Models that fit this description such as R-CNN (Girshick et al., 2013) are likely to require more real data from the VolcNet database for training, which may be addressed through incorporating more time series, and by addressing the large disparity in the number of data per class (i.e., the scarcity of dykes) that limits the number of other interferograms that we are able to use in this study. We are not currently aware of any other freely available labeled unwrapped data that we could use to train or evaluate our models, with similar data sets such as "Hephaestus" (Bountos et al., 2022) containing only wrapped data.
The divergent nature of the two heads (classification and localisation) of our network also allows for discrepancies between their outputs. This is seen in Interferogram four of Figure 9, in which the localisation head produces a bounding box of the correct size, but the signal is incorrectly labeled as "no deformation." However, we postulate that it may be possible to avoid errors of this type by using more complex model architectures. Models such as YOLO (Redmon et al., 2016) produce bounding boxes and classifications in one step, and have the added bonus of being able to work with images that contain multiple signals. If successfully applied to interferograms, a model of this complexity may avoid the discrepancy errors we encounter, and be able to handle interferograms that contain multiple deformation patterns. In the case that multiple signals do exist in a single interferogram, we do not envisage these to be difficult to label as it is likely that these would be considered interesting events by the scientific community and therefore be the subject of detailed study (e.g., the multi-signal interferograms used in this study are analyzed in detail in Xu et al. (2016)).
Our approach to localisation avoids the need for repeated classification using a sliding window approach, and allows for our network to reason using the entire image. Whilst this approach is beneficial in terms of advancing the state-of-the-art toward that of a human interpreter, one caveat remains in that building a network that is able to utilize large interferograms can be complex. In our model, we use pixels of three arc second size and, with an input size of 224 × 224, the resulting model is able to "see" an approximately 20 km square around a volcano. If we wish to proceed at this resolution, our model's visual field could be increased through changing the input size to around 400 × 400, which would not impact our ability to use VGG16's filters (or convolutional blocks), but would increase the size of the first layer of the fully connected part of our network.
At present, an input with side length 224 is reduced to a feature map with side length 7 (shown in Figure 4) which, combined with a depth of 512, produces a flattened layer of size 7 × 7 × 512 = 25,088. However, doubling the input side length would double the feature map side length, increasing the flattened layer to a size of 14 × 14 × 512 = 100,352. Whilst our model contains millions of free parameters, connecting this layer to a subsequent layer would produce a significant increase in the total, and is likely to require either more ingenuity or more data to be trained successfully. Analysis of the offsets of deformation patterns at volcanic centers by Ebmeier et al. (2018) finds that 8% of signals are located more than 10 km from a volcanic edifice, and would therefore be missed by our current model. Future models that wish to perform localisation using a global approach may therefore require slight increases in size in order to capture all signals of interest. Alternatively, as per the approach of Anantrasirichai et al. (2018), CNNs can themselves be convolved across larger images (such as those routinely captured by the TOPSAR mode of the Sentinel-1 satellites) to create repeat classifications, and this may provide a way to apply our current model to images for which the number of parameters in the first layer of the fully connected network is prohibitive. However, for application to large scenes that capture non-volcanic deformation, a network similar to the fully convolutional network presented in Rouet-Leduc et al. (2020) may be more suitable, as this contains no fully connected network and so can be applied to an input of near arbitrary size.
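The arithmetic behind the field of view and the flattened layer size can be checked with a short calculation. The sketch below assumes that the encoder halves the side length five times (as VGG16's five pooling stages do) and that one arc second corresponds to roughly 30.9 m on the ground, which is an approximation that holds for latitude and varies with latitude for longitude; the function names are hypothetical.

```python
def flattened_size(input_side, n_halvings=5, depth=512):
    """Size of the flattened encoder output for a square input."""
    feature_side = input_side // 2 ** n_halvings
    return feature_side * feature_side * depth

def field_of_view_km(input_side, pixel_arcsec=3.0, metres_per_arcsec=30.9):
    """Approximate ground width imaged by a square input, in kilometres."""
    return input_side * pixel_arcsec * metres_per_arcsec / 1000.0

current = flattened_size(224)       # 7 x 7 x 512
doubled = flattened_size(448)       # 14 x 14 x 512
width_km = field_of_view_km(224)    # roughly 20 km
```

Doubling the input side length quadruples the flattened layer, so the number of weights connecting it to the first fully connected layer also grows by a factor of four.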
The smallest deformation signals that our model can accurately label are of approximately 5 cm in magnitude, which is a product of us choosing this threshold as the minimum deformation required for a VolcNet interferogram to be considered as containing deformation. A benefit of our novel labeling approach is that through decreasing this parameter, we can produce single interferograms that span persistent deformation that is not visible to the human observer (e.g., 1 cm of deformation is not likely to be visible through the atmospheric noise of several to tens of centimeters commonly encountered at volcanoes). Relabeling the VolcNet database could therefore be done at increasingly lower deformation thresholds, and provide a route to train deep learning models that outperform human domain experts. Through computing cumulative displacements in the manner described in Anantrasirichai et al. (2019b), our existing method could also be extended to extremely low rate signals, providing a long enough time series is present. Alternatively, for signals that are not persistent in time, which therefore cannot be made more apparent over noise terms through the use of time series, the use of atmospheric correction methods such as the Generic atmospheric correction model for InSAR observations (GACOS, Yu et al. (2018)) could be used to increase the signal-to-noise ratio. In doing so, deep learning models could then be trained with, and used on, transient lower magnitude deformation signals than are presented in this work. However, experiments into the effects of atmospheric correction methods on the data used in deep learning models are beyond the scope of our current study.

Conclusion
We find that either wrapped or unwrapped data are approximately equally suited for use with the weights of VGG16's filters trained on ImageNet data. We also find that incorporating extra information that a human interpreter may use (such as a DEM) in the two otherwise unused channels of a model trained in this way acts to degrade model performance, and we postulate that this is a result of the disparate nature of the signals contained within a DEM and the phase of an interferogram. However, we expect this will not be the case if the weights within VGG16's filters are trained from scratch, as additional data such as a DEM should help to separate deformation from noise.
We combine the encoders of four high performing CNNs with two fully connected networks to perform both classification and localisation of deformation, which allows our network to reason using the whole interferogram (i.e., avoiding a sliding window approach), and therefore move a step closer to interpreting InSAR data in a manner similar to a human expert. Of the four encoders we use, we find that EfficientNetB0 is best able to determine both deformation signal location and size, and deformation type. Additionally, our network is able to differentiate between several different forms of deformation, and advances the state-of-the-art. We expect that further work may build on the results presented in this manuscript and use the same method to increase the number of deformation signals that a model is able to identify. For use with volcano monitoring, this may include models that are able to classify signals such as those due to cooling lava flows, or those due to unstable volcano flanks. For use in the broader remote sensing community, this three class model could be adapted to perform tasks such as differentiating between strike-slip, thrust, and normal fault earthquakes in single interferograms.
As Sentinel-1 interferograms are being automatically created for the majority of the world's subaerial volcanoes every 6 or 12 days, our algorithm provides a method to search through this vast and regularly changing database for signs of deformation that may indicate that a volcano has entered a period of unrest. Through doing this, the algorithm could facilitate monitoring of many currently unmonitored volcanoes. Additionally, as our model is able to localize any deformation it does encounter, this allows the model to determine the spatial extent of a signal (i.e., the area of the bounding box it creates), and so provide information that is likely to be useful when determining how urgently interferograms that it flags should be inspected by a human expert.
To minimize the costly nature of labeling data, we initially train our model using only synthetic data. We find that our model generalizes well to some cases of Sentinel-1 data, but errors remain in cases such as subtle deformation signals, or unusual atmospheric signals. We alleviate this problem through building a database of Sentinel-1 data, which we term VolcNet, that uses a novel labeling strategy to create ∼500,000 labeled interferograms from relatively few labels. The inclusion of a small amount of this real data during the training phase improves model performance drastically, and we present a model that is able to both classify and locate deformation within interferograms of ∼20 km side length. For other practitioners seeking to train similar models, we make our code for generating synthetic interferograms (SyInterferoPy), our labeled Sentinel-1 data (VolcNet), and our two models (VUDL-NET21) freely available via GitHub.

Figure 1. (a) Introduction to the hierarchy of computer vision object/signal identification methods. The upper and lower rows show 12-day descending Sentinel-1 interferograms of Sierra Negra and Wolf volcano (Galapagos Archipelago, Ecuador), respectively. The Sierra Negra interferogram contains only one signal (an inflating sill), whilst the Wolf interferogram contains two signals (a deflating sill and an opening dyke). (b) Proposed hierarchy for signals of interest in interferograms at volcanic centers. We propose a model that is able to classify interferograms into one of the three classes shown in blue: "no deformation," "Dyke," and "Sill/Point." We envisage that future studies may add further classes which we mark in gray, such as those that differentiate between sills and point sources. (c) Overview of a traditional convolutional neural network (CNN), showing how convolving filters and downsampling create a small but deep representation of an image ((224 × 224 × 3) to (7 × 7 × 512)), which is then flattened and passed through a traditional neural network.
of sizes of deforming regions that the different deformation model parameters produce (e.g., Interferogram 2 vs. Interferogram 3).

Figure 2. An example of the constituent parts of seven synthetic interferograms. Interferogram 5 does not feature deformation, interferograms 1, 4, and 6 feature deformation due to a sill/point source, and interferograms 2 and 3 feature deformation due to an opening dyke. These signals are geocoded and areas of water masked, before being combined with a topographically correlated atmospheric phase screen (APS), and a turbulent APS. Areas of incoherence are also synthesized, and these are used to mask the combination of the three signals to create the final synthetic interferograms.

Figure 3. Organization of an interferogram into three channel form.Columns one and two feature unwrapped data that is repeated, and in column two the digital elevation model (DEM) is included as the third channel.In column three the real and imaginary elements of the complex values of each pixel of an interferogram occupy channels one and two, whilst the DEM is included in the third.Columns four and five feature wrapped data that is repeated, and in column five the DEM is included as the third channel.

Figure 4. (a) Overview of our approach to creating a data set of synthetic interferograms, arranging these into the five different three channel formats, computing the bottleneck features for each piece of data, and training the fully connected layers of a convolutional neural network (CNN). (b) Structure of our classification and localisation CNN. Input interferograms are first passed through the first five convolutional blocks of VGG16 to transform them from size (224 × 224 × 3) to size (7 × 7 × 512). These are flattened to create a large fully connected layer featuring 25,088 neurons, which is connected to both the upper branch/head, which performs classification, and the lower branch/head, which performs localisation. We find the localisation problem more complex than classification, and consequently our localisation branch/head features more layers, each with more neurons. The output of the localisation head is a vector of four values determining the position and size of the deformation, whilst the output of the classification head is a vector of three values that indicate the probability for each class, and sum to one.

Figure 6. Demonstration of VolcNet labeling for the time series that images Sierra Negra prior to and during the 2018 eruption. Upper: A subset (every 12th) of the interferograms that can be made between all possible acquisitions, showing the increasing deformation in longer temporal baseline interferograms. Interferograms along the diagonal are omitted as they contain only zeros. Lower left: Example of the longest temporal baseline interferogram that can be created, which features both inflation of the caldera floor (persistent deformation), and complex syneruptive deformation propagating to the north west (transient deformation), for which a single bounding box is automatically created. Lower right: Graphical representation of the labeling, which shows an approximation of the increase in inflation rate that was observed approximately 1 year before the eruption as an increase in the height of the orange line, and the large but short-lived syneruptive signals in blue.

Figure 7. Summary of the VolcNet database. Top: Number of interferograms that can be created at each volcano divided into label type (sill/dyke/no deformation), showing the scarcity of time series that contain deformation attributed to dykes (Agung and Wolf, in blue). Many volcanoes are imaged in both ascending and descending orbits (e.g., 128D and 106A for Sierra Negra), and some volcanoes feature in two frames (e.g., 124D and 022D for Campi Flegrei). Bottom: Number of interferograms that can be created of each label type, showing the scarcity of interferograms that contain deformation attributed to a dyke.
real training data to feature 20,000 unique, though often highly correlated, Sentinel-1 interferograms.

Figure 8. Results of each of our three classification and localisation convolutional neural networks on our testing set of Sentinel-1 interferograms.Column one shows results from our VGG16 based model when trained with synthetic data only, column two shows results from our VGG16 based model trained on both synthetic data and real data from the VolcNet database, and column three shows results from our EfficientNetB0 based model trained on both synthetic data and real data from the VolcNet database.Rows depict different example data, with the corresponding volcano name shown on the extreme right.Model predictions are shown in red (including classification probabilities as decimals), and VolcNet labels are shown in black, with deformation shown in centimeters.Interferograms one and three show strong atmospheric signals that could be misclassified, and the remaining interferograms show deformation signals.

Figure 9. As per Figure 8, but a second selection of the test data showing different deformation signals.

Figure 10. Localisation loss (y axis) and classification accuracy (x axis) for the four encoders used with different subsets of the real Sentinel-1 test data from the VolcNet database. Models that plot in the lower right corner accurately determine the spatial extent of the deformation (low localisation loss), and accurately determine the type of deformation (high classification accuracy).