Comparing Adjoint Waveform Tomography Models of California Using Different Starting Models

Adjoint waveform tomography (AWT) sits at the cutting edge of seismic tomography on local, regional, and global scales. However, the choice of starting model may have a significant impact on the final inversion results. In this paper, we present three AWT models of California that are based on different starting models. We chose three starting models that were inverted at different scales: SPiRaL, a global travel-time tomography model (Simmons et al., 2021, https://doi.org/10.1093/gji/ggab277); CSEM_NA, a regional adjoint tomography model of North America and the North Atlantic (Krischer et al., 2018, https://doi.org/10.1029/2017JB015289); and WUS256, a regional adjoint tomography model of the western US (Rodgers et al., 2022, https://doi.org/10.1029/2022JB024549). We then ran three AWT inversions using the same source and receiver set. Each inversion proceeded over three period bands: 30–100 s, 25–100 s, and 20–80 s. Once the iterations were finalized, we used five methods to test model similarity in both the model and data space. We conclude that the choice of starting model has minimal impact on long-wavelength models if an appropriate multi-scale inversion approach is used.

picks and the computation of model resolution matrices. In practice, model performance in FWI is evaluated in three ways: qualitative waveform fits; performance of the model on validation events (i.e., events not included in the inversion; Tape et al., 2010); and qualitative resolution analysis through methods such as checkerboard resolution tests, which are known to have significant drawbacks (e.g., Lévêque et al., 1993; Rawlinson & Spakman, 2016; VanDecar & Crosson, 1990), and Hessian vector product calculations (Fichtner & van Leeuwen, 2015; Y. Gao et al., 2021). Uncertainty remains difficult to quantify in FWI models, mainly due to the computational cost of computing deterministic and probabilistic uncertainties (e.g., Gebraad et al., 2020; Liu & Peter, 2019; Tarantola, 1987; Virieux & Operto, 2009).
As for model comparison, quantitative comparisons are made difficult by the different parameterizations of models by different researchers. The misfit function chosen, events and stations included in the inversions, attenuation model used, and regularization employed all affect model results (Tromp, 2020; Zhou et al., 2022). Though global models show similarity at long wavelengths (Lekic et al., 2012; Ritsema et al., 2011), models diverge as shorter wavelength information is incorporated. In past studies, models are often presented as updates of previous models, using the last version of the model as the starting model for the current model (e.g., Bozdag et al., 2016; French et al., 2013; French & Romanowicz, 2014; Lei et al., 2020). Recent studies have also begun to use 1-D velocity models such as the Preliminary Reference Earth Model (PREM; Dziewonski & Anderson, 1981) to avoid artifacts from the starting model (e.g., Boehm et al., 2022; Métivier et al., 2016).
In this paper, we use a multi-scale, iterative approach to compute three AWT models of California that each start from a different starting model. All three models reveal known tectonic features of the region. We find that the models obtained from this approach are broadly equivalent in terms of their shear wavespeeds and waveform fits. We conclude that, with the methodology presented herein, the starting model has minimal effect on the final model results.

Data
We collected waveform data from 103 regional earthquakes recorded within our domain (Figure 1). The events used occurred between 1 January 2000 and 31 October 2020 with moderate moment magnitudes (Mw) between 4.5 and 6.5. The lower magnitude threshold is dictated by signal-to-noise ratios at long periods, while the upper threshold is chosen such that we can use the point-source approximation. Some larger magnitude earthquakes (Mw 6–6.5) had complex waveforms indicative of finite fault ruptures that we could not model, so those events were excluded during quality control. Focal mechanisms for the events were taken from the Global Centroid Moment Tensor catalog (Dziewonski & Woodhouse, 1983; Ekström et al., 2012). The events were recorded at over 1,300 unique stations from more than 40 permanent and temporary networks (Figure 1; for a list of networks, please refer to Table S1 in Supporting Information S1). Waveforms were collected from the International Federation of Digital Seismograph Networks web services and stored in the Adaptable Seismic Data Format (ASDF; Krischer et al., 2016).
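As an illustration of this collection step, the sketch below requests long-period waveforms for a single event through FDSN web services with ObsPy. The data center, network/station codes, origin time, and channel selection are placeholders, not the actual requests used in the study.

```python
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

# Hypothetical example: fetch one hour of broadband data for one event
# from the IRIS FDSN web service (codes and times are illustrative only).
client = Client("IRIS")
origin = UTCDateTime("2020-05-15T11:03:27")  # placeholder origin time
st = client.get_waveforms(network="CI", station="PASC", location="*",
                          channel="BH?", starttime=origin,
                          endtime=origin + 3600)
st.detrend("linear")  # basic pre-processing before storage
print(st)
```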

Inversion Methodology
AWT is computation- and data-intensive and requires efficient workflow software to automatically manage the costly computations and volume of data required to run AWT iterations. Several software packages have become available in the past few years to enable AWT (Chow et al., 2020; Krischer et al., 2015; R. T. Modrak et al., 2018; Thrastarson et al., 2021). In this study, we used Salvus, an end-to-end FWI software from Mondaic Ltd. (mondaic.com). Salvus manages data processing, forward simulations, window picking, misfit calculations, and inversions, including communication between the user's local machine and the high-performance computing cluster used. Salvus also builds visualization and quality control directly into Jupyter Notebooks (Kluyver et al., 2016), allowing the user to check results at any stage of an inversion. For a more detailed description of Salvus and its use, we point the reader to Rodgers et al. (2022) and Wehner et al. (2022).

[Figure 1 caption, fragment: focal mechanisms are from the Global Centroid Moment Tensor catalog (Ekström et al., 2012); the inverted black triangles indicate the location of all stations with data used in the inversion and/or validation.]
We begin our inversions with three different starting models that cover our study area: SPiRaL (Simmons et al., 2021), CSEM_NA (Krischer et al., 2018), and WUS256 (Rodgers et al., 2022). These three models were chosen because each includes radial anisotropy, which is essential to fit regional surface waves. The different scales and datasets used to infer these starting models, described below, provide variability that allows us to test the effect of starting models on final inversion results. SPiRaL is a global model obtained from a joint inversion of P and S wave arrival times, parameterized with vertical transverse isotropy. CSEM_NA is an AWT model of the North Atlantic, including both the North American and European continents. WUS256 is an AWT model of the western US. It is important to note here that WUS256 uses SPiRaL as its starting model, so we expect that it will show comparatively improved fits prior to any inversions. CSEM_NA is independent of the other two models.
To make our comparison as fair as possible, we use the same regularization scheme, described below, for all three models. Figure 2 provides a flowchart of the steps followed in the inversion methodology and is annotated with key details of the data and methodological choices. We performed multi-scale inversions over three period bands: 30–100 s, 25–100 s, and 20–80 s (Figure 2). We began iterations at 30 s minimum period because of our small domain size (∼1,000 × 1,000 km) and focus on resolving crustal structure. We decreased the minimum period in small increments of 5 s to be conservative in our descent down the misfit surface and to avoid getting trapped in local minima (Bunks et al., 1995; Fichtner, 2010). At 20 s, we chose to decrease our maximum period to 80 s to improve the signal-to-noise ratio for smaller magnitude events (Mw 4.5–5). We solved for four wavespeeds: VPV, VPH, VSV, and VSH, the vertically and horizontally polarized compressional and shear wavespeeds, respectively.
Forward simulations were calculated using the spectral element method on unstructured meshes (Afanasiev et al., 2019; Komatitsch & Tromp, 2002a, 2002b). Window picking is done automatically on the large data set using a modified version of the method proposed by Krischer et al. (2015). The choice of window location and duration is controlled mainly by three parameters: the cross-correlation of the observed and synthetic data, the phase shifts between the observed and synthetic data, and the envelope similarity between the observed and synthetic data. If a portion of the waveform is above a given threshold for all parameters, then a window is chosen. We chose restrictive thresholds, such as high correlation values, a minimum window length of 1.5 times the minimum period, and small allowed time shift values, to avoid fitting noise and cycle skipping. We point the reader to Section 4.2.1 for a more detailed description of the window picking parameters used in this study. There is one key difference in the model setup that arises during window picking. Because windows are picked based on the similarity between the observed and simulated data, each model has a unique window set. We chose to pick windows separately for each model so as not to bias the results by using the window set from one model (giving an advantage to the model whose window set is chosen) or to include noise, cycle skipping, or other artifacts in standardized body/surface wave windows. Comparing the windowing statistics of the final model results shows that, in the end, all three models pick windows in similar station-event pairs that are also of similar length. Windowing statistics are discussed in depth in Section 4.2.1.
We used the time-frequency phase misfit function (Fichtner et al., 2008) to quantify the misfit between the observed and synthetic data and to compute adjoint sources. The time-frequency phase misfit function converts waveforms into the time-frequency (wavelet) domain (Kristeková et al., 2006, 2009) to compute weighted phase and envelope differences. The time-frequency phase misfit function works well for regions where the wavefield is spatially undersampled and the background model is not well known (Fichtner et al., 2009). After the misfit is calculated, we compute event sensitivity kernels from the interaction of the forward and adjoint wavefields. The gradients computed are the sum of the sensitivity kernels calculated for each event (Fichtner, 2010; Fichtner et al., 2006; Tape et al., 2007; Tarantola, 1988; Tromp et al., 2005). Model updates are preconditioned with anisotropic smoothing based on the local wavelength at the shortest period; the gradient is smoothed at 0.2 wavelengths in the vertical direction and 0.5 wavelengths in the lateral directions. To avoid source and receiver effects in the event-dependent gradients, we applied cutouts of 75 km around the hypocenter and 35 km around each receiver in the 30–100 s and 25–100 s period bands; at 20–80 s, we removed the receiver cutouts so that we can begin to resolve upper crustal structure near the receivers.
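To make the measurement concrete, the sketch below illustrates the core idea of a time-frequency phase misfit: transform both traces, take the phase difference, and downweight it where the observed signal has little energy. This is a simplified stand-in for the Fichtner et al. (2008) formulation, not the Salvus implementation; the STFT parameters and the water level `eps` are our own illustrative choices.

```python
import numpy as np
from scipy.signal import stft

def tf_phase_misfit(obs, syn, fs, eps=0.05):
    """Simplified time-frequency phase misfit (after Fichtner et al., 2008).

    `obs` and `syn` are equal-length traces sampled at `fs` Hz. The phase
    difference is weighted by the normalized observed amplitude spectrum
    so that low-energy regions of the time-frequency plane do not
    contribute. Parameters here are illustrative, not the study's values.
    """
    _, _, O = stft(obs, fs=fs, nperseg=256)
    _, _, S = stft(syn, fs=fs, nperseg=256)
    dphi = np.angle(S * np.conj(O))   # phase difference, in (-pi, pi]
    w = np.abs(O) / np.abs(O).max()   # amplitude-based weighting
    w[w < eps] = 0.0                  # ignore low-energy regions
    return np.sqrt(np.sum((w * dphi) ** 2))
```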
Finite frequency sensitivity kernels for waveform misfits computed with adjoint methods have outsized amplitudes in near-source and near-receiver regions, which are generally near the Earth's surface. This could lead to updates in shallow structure that trade off with structure along the path away from the source and receiver. To gradually update crustal structure through the inversion iterations, we instituted a "region of interest" in our first two inversion stages (30–100 s and 25–100 s) to suppress sensitivities at shallow depths. The region of interest allows us to focus on fitting deeper structure in earlier iterations by minimizing updates above a specified depth (Rodgers et al., 2022). At 30 s minimum period, we used a region of interest of 50 km; the region of interest is reduced to 15 km at 25–100 s before being removed at 20–80 s.
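One simple way to realize such a region of interest is to multiply the gradient by a depth-dependent mask that goes smoothly to zero above the cutoff. The sketch below is our own minimal version, assuming a cosine taper of user-chosen width; the paper specifies only the cutoff depths, so the taper is an assumption.

```python
import numpy as np

def region_of_interest_mask(depths_km, cutoff_km, taper_km=10.0):
    """Weight for a gradient as a function of depth: 1 below `cutoff_km`,
    tapering smoothly to 0 above it. `taper_km` is an assumed taper width;
    the study states only the cutoff depths (50 km, then 15 km)."""
    mask = np.ones_like(depths_km, dtype=float)
    shallow = depths_km < cutoff_km
    # Fraction of the way from the cutoff up into the tapered zone.
    t = np.clip((cutoff_km - depths_km[shallow]) / taper_km, 0.0, 1.0)
    mask[shallow] = 0.5 * (1.0 + np.cos(np.pi * t))  # 1 at cutoff, 0 above
    return mask

# Example: suppress updates above 50 km in the 30-100 s stage.
depths = np.linspace(0.0, 100.0, 201)
gradient_weight = region_of_interest_mask(depths, cutoff_km=50.0)
```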
To compute model updates, we use the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) approach (e.g., Nocedal & Wright, 2006). We specifically use the trust-region L-BFGS method proposed in Boehm et al. (2018), which uses a quasi-Newton approximation of the inverse Hessian to calculate model updates. This method allows us to compute both step size and direction for each update by minimizing the quadratic approximation of the given objective function. Trust-region L-BFGS approaches have been shown to be more efficient than line search methods in cases where the gradient of the misfit cannot be easily evaluated (G. Liu et al., 2022;R. Modrak & Tromp, 2016).
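For reference, the standard L-BFGS application of the inverse-Hessian approximation (the two-loop recursion of Nocedal & Wright, 2006) is sketched below; the trust-region machinery of Boehm et al. (2018), which also selects the step length, is omitted, and all variable names are ours.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion (Nocedal & Wright, 2006): apply the L-BFGS
    approximation of the inverse Hessian, built from recent model-update
    pairs s_k = m_{k+1} - m_k and gradient-difference pairs
    y_k = g_{k+1} - g_k, to the current gradient `grad`."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        q -= a * y
        alphas.append((a, rho))
    # Scale by an initial inverse Hessian from the most recent pair.
    s, y = s_list[-1], y_list[-1]
    q *= np.dot(s, y) / np.dot(y, y)
    for (a, rho), (s, y) in zip(reversed(alphas), zip(s_list, y_list)):
        b = rho * np.dot(y, q)
        q += (a - b) * s
    return -q  # quasi-Newton descent direction
```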
The key aspects of the method presented above that influence the results of this study are the gradient treatments, the large number of iterations performed, the use of the trust-region L-BFGS optimization approach that checks for good alignment between the misfit predicted from the curvature approximation and the actual misfit, and the multi-scale inversions that allow us to fit long wavelength structure before moving to shorter periods. Combined, these factors give us confidence in the robustness (i.e., independence of the starting model) of our methodology. These methodological choices may also lead to smaller steps in model space than contemporary regional-scale FWI work.
We believe that cutting off inversions early, before the model has converged on its own, is the main factor that could lead to different results in the final models. In our inversions, we see that broad changes to the structure happen in the first few iterations; after the elbow point of the misfit curve, the remaining iterations define smaller-scale structures and constrain the amplitudes of model features. These later iterations are also key to fitting shorter period, late arriving dispersed surface waves in our models. Stopping before the elbow point of the misfit reduction curve would lead to differing model attributes and waveform fits, likely causing the models to diverge. Given the conclusions of this paper, we believe the inversion approach taken in this study has, in part, allowed us to successfully navigate the necessarily complex misfit surface, and was key in removing the influence of any given starting model.

Results
Before we begin comparing the three models, we first present the results of the three inverted models. Figures 3 and 4 show the isotropic shear wavespeeds at 5 and 55 km depth, respectively. We focus on interpreting the models in the crust and uppermost mantle (<60 km), as the period band considered (20–100 s) has minimal sensitivity below about 50 km (Pasyanos, 2005). Since the crux of this study is comparing model results, we defer tectonic interpretation of the models to a later study. The plots shown below (Figures 3 and 4) focus on the lithosphere, but we will also refer to deeper structure; plots for other depths can be found in Supporting Information S1 (Figures S4 and S5 in Supporting Information S1). All figures are also masked to show model features only in California; since most of our ray coverage is in California (Figure S1 in Supporting Information S1), we limit our interpretations to that region. Figure 3 shows the relative Voigt-averaged VS anomalies in the three starting models (Figures 3a–3c), the three final models (Figures 3d–3f), and the natural logarithm of the ratio of the absolute VS velocities between the final models and the starting models (Figures 3g–3i) at 5 km depth. The logarithmic ratio plots show the location and magnitude of updates made relative to the values of the starting model. The updates required to reach the final models differ considerably depending on the smoothness of the starting model. This indicates that the iterative waveform inversion methodology updates each model according to the structure required by the data.
The three starting models show a large variation in structure. SPiRaL (Figure 3a) shows high velocities over most of California, with the lowest velocities off the coast of Cape Mendocino and the highest velocities under the central Sierra Nevada mountains. WUS256 (Figure 3b) shows the highest amplitude and finest scale structure of the three models. The root of the Sierra Nevada mountain range is visible, as well as lower velocities related to the Central Valley. The lowest velocities are under Cape Mendocino, around where the proposed slab window in the Gorda slab is located (Furlong & Schwartz, 2004), and The Geysers, a geothermal field in central Northern California. Low velocities are also seen under the Long Valley caldera region (e.g., Flinders et al., 2018). Finally, CSEM_NA (Figure 3c) has the smoothest structure, with higher velocities in Southern California and low velocities in northern and northeastern California.
Despite the differences in the starting models, the final models resolve similar features in the upper crust (Figure 3). All three final models have low velocities along the western coast in the upper crust, corresponding (from north to south) to the Gorda slab window (Box 1, Figure 3e; Furlong & Schwartz, 2004), the Geysers geothermal field (Box 2, Figure 3e; Lin & Wu, 2018; Majer & McEvilly, 1979; Zucca et al., 1994), and the Central Valley sedimentary basin (Box 3, Figure 3d; Li et al., 2022; Lin et al., 2010). CANV_WUS, which starts with WUS256, has the lowest velocities and the clearest separation between the Gorda slab window and the Geysers geothermal field. Velocities also appear lower in the Ventura Basin in CANV_WUS (Box 4, Figure 3e), though the basin is smeared because we only resolve long-wavelength structure. Low velocities in eastern California correspond to volcanic centers. In the northeastern part of the state, lower velocities are seen under the Mount Shasta and Lassen volcanic centers (Box 5, Figure 3f).

The starting models at 55 km depth are also quite different (Figures 4a–4c). However, the final models also agree with other studies at uppermost mantle depths (Figure 4). Beginning in the north, the subducting Gorda slab is visible in the north-central part of the state in all three final models. At depths below 60 km, the curvature of the Gorda slab toward the southwest at its southernmost extent is clearly visible in all three final models (Figure S5 in Supporting Information S1). The extent and curvature of the slab in the CANV models follows the contours of Slab 2.0 (Hayes et al., 2018); the rotation of the slab is thought to be connected to the slab tear in the Juan de Fuca slab under central Oregon (Hawley & Allen, 2019). Among the starting models, WUS256 shows the Gorda slab at the same depths as the final models, while the slab is visible in SPiRaL only at greater depths (>75 km), and CSEM_NA shows anomalous low velocities throughout the extent of the Gorda slab (Figures 4a–4c). Therefore, regardless of the slab's expression in the starting models, the final models all resolve the slab, though with varying amplitudes. All three final models also resolve the Isabella Anomaly in central California (Box 1, Figure 4e). This high-amplitude fast velocity feature is thought to be the detached lithospheric root of the southern Sierra Nevada mountains (Dougherty et al., 2020; Jiang et al., 2018; Zandt et al., 2004). The lithospheric drip is offset to the west from the location of the Sierra Nevada, which is also observed in other studies (e.g., Rodgers et al., 2022; Stanciu & Humphreys, 2021). The high velocity anomaly associated with the Transverse Ranges is also visible (e.g., Kohler, 1999), though not as prominently as the Isabella Anomaly (Box 2, Figure 4f). The continuity of the Transverse Range anomaly is interrupted in some of the models by a low velocity anomaly under the Salton Trough (Box 3, Figure 4d), which is related to the thin lithosphere and high heat flow in this region (Han et al., 2016; Lekic et al., 2012). These low velocities are likely smeared at these depths, since our period bands of interest give us limited resolution at upper mantle depths.

Model Comparisons
In this section, we compare the three resulting models described in the previous section. Models were compared in both the model space and the data space. Comparing the models in model space allows us to view the spatial patterns of the shear wavespeeds and the tectonic structures that the models resolve. Comparing the models in data space allows us to determine how well each model fits the observed data given the model space differences. Both comparison spaces are vital to understanding whether the models are converging on a similar final result given the different characteristics of each starting model. The comparison methods presented below are not meant to be an exhaustive list of ways to compare seismic tomography models. Some methods, such as K-means clustering (e.g., Lekic & Romanowicz, 2011) and event-based misfits (e.g., Rodgers et al., 2022), have been used to compare the performance of tomography models before; the structural similarity index (SSIM; Z. Wang et al., 2004), though broadly applied in the signal processing community, has not been applied to seismic tomography to our knowledge.
We have focused on methods that are computationally inexpensive, generalizable, and cover a broad range of ways to compare model results. When discussing the comparison results below, we choose not to interpret them in terms of which model performs best. Which model is "best" will vary depending on the intended application of the model and the chosen metrics. The discussion below is meant to highlight quantifiable ways to compare tomography model results through the lens of model convergence. Any discussion of the models beyond their comparison to each other is outside the scope of this paper.

K-Means Clustering
We begin comparing our models using k-means clustering (Lloyd, 1982). K-means clustering is frequently used in seismic tomography both to identify tectonic structures in models (e.g., Chai et al., 2015; Cottaar & Lekic, 2016; Eymold & Jordan, 2019; Lekic et al., 2012; Xiong et al., 2021) and to compare morphologic features between tomography models (Lekic et al., 2012). The shape and characteristic velocities of the clusters provide a qualitative method to determine the models' ability to resolve large-scale tectonic features.
K-means clustering is an unsupervised machine learning algorithm that separates n data points into a user-defined number of clusters k. Data points are assigned to clusters by their squared Euclidean distance (i.e., intra-cluster distance) from a given centroid, or mean, cluster value. This process is then repeated multiple times starting from different initial means. The final clusters are chosen from the run that has the smallest sum of squared Euclidean distances between the centroids and the data points within their clusters. To avoid sub-optimal clustering, we used the k-means++ approach (Arthur & Vassilvitskii, 2007) to choose the initial cluster centroid locations. Unlike traditional k-means, which chooses k random centroid locations, k-means++ chooses one random centroid and then selects each of the remaining k − 1 centroids with probability proportional to its squared distance from the nearest already-chosen centroid. This method of choosing initial centroid locations reduces the error at the onset, leading to faster convergence times.
We use the k-means++ algorithm as implemented in scikit-learn (Pedregosa et al., 2011) to calculate the clusters in our models. We input profiles of absolute isotropic VS in the crust (down to 30 km) calculated every 0.1° × 0.1° in latitude and longitude. We run the clustering algorithm on the dataset for k = 2, 3, and 4 clusters. The resulting clusters are shown in Figure 5.
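A minimal sketch of this clustering step is shown below, assuming the VS depth profiles have already been extracted from each model into a NumPy array (the file name and array layout are our own illustrative choices).

```python
import numpy as np
from sklearn.cluster import KMeans

# profiles: (n_gridpoints, n_depths) array of absolute isotropic Vs,
# one crustal profile (0-30 km) per 0.1 x 0.1 degree grid cell.
profiles = np.load("vs_crustal_profiles.npy")  # hypothetical file

cluster_labels = {}
for k in (2, 3, 4):
    # init="k-means++" is the seeding strategy described above; n_init
    # repeats the clustering from different seeds and keeps the run with
    # the smallest within-cluster sum of squared distances.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    cluster_labels[k] = km.fit_predict(profiles)
# The labels can then be mapped back to the lat/lon grid for plotting.
```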
At k = 2 clusters (Figures 5a–5c), all three models have one cluster in the oceanic crust and the second cluster in the continental crust. Since the Moho depth varies drastically between the oceanic and continental lithosphere, this clustering behavior is expected (Lekic et al., 2012). At k = 3 clusters (Figures 5d–5f), the characteristics of the clusters begin to differ between the three models. CANV_SP (Figure 5d) and CANV_CS (Figure 5f) differentiate the continental shelf from the rest of the ocean as their third cluster (see Figure 1 for a bathymetric map). CANV_WUS also differentiates the continental shelf in its third cluster but includes in that cluster the Coastal Ranges/Central Valley and a small area under the Salton Sea. This is due to the stronger amplitude low velocities in the crust under the coast and the Salton Sea in the CANV_WUS model (and WUS256; see Figures 3c and 3d). WUS256 also has lower velocities in the crust under the continental shelf compared to CSEM_NA and SPiRaL.
Finally, at k = 4 clusters, the clusters align again. All three models retain the ocean, continent, and continental shelf clusters. The fourth cluster in each model gathers lower velocities under the continent. The extent of these low velocity regions varies between models. CANV_SP has the low-velocity cluster with the largest extent, covering the Great Valley (Figure 3d, Box 3), the extensional Basin and Range province of Nevada and easternmost California, and the Salton Sea area (see Figure 3e, Box 8); the high velocity root of the Sierra Nevada is visible (dark blue region of eastern California in Figure 5g), surrounded by the low velocity cluster described above. These low velocity features have been described in other tomography models of the region (e.g., Chai et al., 2015; Lin et al., 2010; Rodgers et al., 2022). CANV_CS similarly clusters the Great Valley and some of the Basin and Range province; this cluster, however, seems to track the artificial low velocities from the starting model in this region. Finally, CANV_WUS only clusters along the Coastal Ranges and Great Valley, not in the Basin and Range province like the other two models. Though the structure of the CANV_WUS model in the Basin and Range province resembles the CANV_SP model, the amplitude of the low velocities along the coast is much stronger in the CANV_WUS model, so the clustering algorithm prefers to cluster these very low velocities together instead of including the smaller low velocity perturbations in the Basin and Range province.
Overall, k-means clustering provides qualitative evidence that the three models resolve similar tectonic features. Though variations exist in both the geographic extent and characteristic velocity of the clusters between the three models, these can all be attributed either to variations in the amplitude of velocity features between the three models or to artifacts that are carried over from the starting models in regions where ray path coverage is poor. Regardless of the variations between the models, we can conclude that the models are able to resolve large-scale tectonic structure.

Structural Similarity Index (SSIM)
Though k-means clustering can provide a qualitative analysis of model similarity, we were interested in a metric to quantify the differences between the models. Since it is difficult to compute formal uncertainties for FWI models (see Introduction for discussion on these difficulties), we turned to the structural similarity index (SSIM; Z. Wang et al., 2004) to quantify model similarities and differences. SSIM is an "objective image quality measure" (Z. Wang et al., 2004) that is used to quantify the degradation of a target image in comparison to a reference image. Though other image quality measures exist, SSIM stands out because it considers the human visual perception system's ability to analyze structural information of an image.
The SSIM value between two images x and y is calculated as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \tag{1}$$

where $\mu_x$ and $\mu_y$ are the average pixel values, or luminance, in a 7 × 7 pixel window in the reference and target images, respectively, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the pixel values, or contrast, in the same window for the reference and target images, respectively. The covariance $\sigma_{xy}$ is defined as

$$\sigma_{xy} = \frac{1}{N - 1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y), \tag{2}$$

where $x_i$ and $y_i$ are the pixel values. $C_1$ and $C_2$ in Equation 1 are constants; these values are assumed to be very small, so they can be ignored (Z. Wang et al., 2004). The SSIM index is calculated for each 7 × 7 pixel window, creating a 2D array of SSIM values for the given image; a single SSIM value is obtained by averaging these values over space. Applying SSIM locally and averaging the values, instead of running the calculation globally, is advantageous for multiple reasons, but most importantly it provides a spatial map with which to compare the (dis)similarity of the two images. SSIM values range from 0 to 1, with 1 meaning that the images are identical and 0 meaning that the images are completely different (e.g., comparing an apple to a building). Figure S10 in Supporting Information S1 provides an example of images with different SSIM values to contextualize the results of comparing the starting and final models below.
To determine the SSIM values of the respective tomography models, we extract the absolute VS values for each starting and final model from 5 to 75 km depth in 5 km intervals. The VS values were plotted using PyGMT (Uieda et al., 2023) in grayscale over our domain. Once the images were created, the SSIM value was calculated for each model pair at each depth; the results with depth are shown in Figure 6, grouped by starting models, starting versus final models, and final models in panels a, b, and c, respectively.
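scikit-image ships an implementation of this measure; the sketch below shows how such a comparison could be run on two gridded VS images, assuming they are available as 2-D arrays (the file names are placeholders).

```python
import numpy as np
from skimage.metrics import structural_similarity

# Grayscale images of absolute Vs for two models at one depth slice
# (in our workflow these were rendered with PyGMT; loading is schematic).
img_a = np.load("canv_sp_vs_5km.npy")   # hypothetical arrays
img_b = np.load("canv_wus_vs_5km.npy")

# win_size=7 matches the 7 x 7 pixel window described above; full=True
# also returns the per-window SSIM map used for spatial comparison.
data_range = max(img_a.max(), img_b.max()) - min(img_a.min(), img_b.min())
ssim_mean, ssim_map = structural_similarity(
    img_a, img_b, win_size=7, data_range=data_range, full=True
)
```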
The SSIM values for the starting models compared to each other are shown in Figure 6a. For the starting models compared to the final models (Figure 6b), there is larger variability between the SSIM values. WUS256 shows the highest average SSIM values compared to all three final models (0.9–0.95), especially in the crust. Since WUS256 is already well resolved, the model space updates made to create CANV_WUS are smaller than those for the other final models (see Figures S2 and S3 in Supporting Information S1 for misfit plots), which results in higher SSIM values. The SSIM values for CSEM_NA compared to CANV_CS are the lowest, indicating that large changes were made to the CSEM_NA model to arrive at CANV_CS (consistent with Figures 3i and 4i). CSEM_NA is only iterated down to a minimum period of 30 s, and so is unlikely to resolve the smaller-scale structures that the three models produced in this study can resolve. The SSIM values of CSEM_NA compared to the final models are also far lower at Moho depths because of the lack of Moho topography in CSEM_NA. The SSIM values for SPiRaL compared to CANV_SP fall between those of CSEM_NA and WUS256 compared to the final models. The SSIM values of SPiRaL compared to CANV_SP drop significantly below 35 km, then increase again toward 75 km. Though SPiRaL and CANV_SP show similar tectonic structure in the upper mantle, the range of velocities in the two models varies (Figures 4a, 4d, and 4g). Therefore, the low SSIM values are attributable to the amplitude difference between SPiRaL and CANV_SP. The SSIM values increase with depth since we have limited resolution below 60 km, which means that the models at greater depths more closely resemble the starting model. This trend can also be seen at upper mantle depths when comparing CSEM_NA and CANV_CS. The SSIM values for WUS256 compared to CANV_WUS are relatively flat below the Moho, which can be attributed to WUS256's high resolution in the uppermost mantle.
The SSIM values of the final models compared to each other show far less variability with depth than either the starting models compared to each other or the starting models compared to the final models. All final model comparisons have SSIM values above 0.9, with the lowest values around Moho depths. These lower values are likely due to residual artifacts from CSEM_NA's lack of Moho topography. The SSIM values for the final models are higher in the crust than in the uppermost mantle. Because we start our iterations at 30 s minimum period and our maximum period extends only to 100 s, the surface wave sensitivity is concentrated in the upper 60 km of our models. Therefore, the lower SSIM values and larger variability below the Moho are inherited from the starting models.
Overall, the SSIM analysis allows us to quantify the differences between the models in model space based on their visual characteristics. It shows that all three final models are more visually similar to each other than they are to the models they started from. The SSIM values are particularly high in the crust of the final models, where we have the highest resolution, with values above 0.95 for all three final model permutations in the upper 25 km. Therefore, we can quantitatively confirm that the final models strongly resemble each other in model space.

Data Space Comparisons
With k-means clustering and the SSIM, we have shown that the models are qualitatively and quantitatively similar in model space. However, model space comparisons do not give any information on waveform fits or data-space convergence of the models. Therefore, we propose three methods of data-space comparison to examine the goodness of fit of the starting and final models against the observed data. All three of these methods have been used in other studies to determine improvements in fit in tomography models, and we believe that these three measures (windowing statistics, event-based misfits, and variance reduction of waveform differences) provide well-rounded insight into the convergence of the three models in the data space, irrespective of the differences seen in the model space.

Windowing Statistics
As discussed in the Methods section, each inversion picks an independent window set based on the synthetics of the given model. We chose not to standardize the windows for each inversion for two main reasons: to avoid incorporating cycle skipping and noise if using windows based on body/surface wave travel times and to avoid biasing the results toward one model by picking the window set of one model and standardizing it across the rest of the models. By not standardizing the windows across all models, we will inevitably include different data in the inversions as each starting model and interim model will correlate to the observed data differently. If the models reach convergence, however, we expect that we will have a similar set of windows, as the simulated waveforms will have the same response, therefore correlating to the observed data in the same way.
To test the similarity of the final waveforms in the data space, we re-ran our window picking algorithm on the final models as well as the starting models using a mesh resolving 20 s minimum period. Before analyzing the results, we will briefly comment on the window picking methodology. The window picking algorithm used by Salvus is adapted from the Large-scale Seismic Inversion Framework (LASIF; Krischer et al., 2015).
Window selection is broken into three stages: using travel times to find the bounds where we can pick windows, determining global rejection criteria that would prevent windows from being picked for a given waveform, and a set of acceptance/rejection parameters that determine what parts of the waveform are suitable to be windowed. For the first step of calculating the time window for which we can pick windows, the first arrival time is calculated using ak135 (Kennett et al., 1995) and a minimum surface wave speed is set by the user (2.4 km/s in this study). A buffer of half the minimum period is added to both ends to determine the time period over which windows can be picked. Any part of the waveform before the first arrival − buffer or after the minimum surface wave arrival + buffer is excluded. Once the area over which we can window is calculated, the algorithm then moves on to determining whether the data quality is good enough to pick windows for that waveform. Data quality is determined using a normalized cross-correlation value of the observed and synthetic data (see Equation 1, Krischer et al., 2015) and the relative noise level of the data, which is calculated by taking the ratio of the maximum amplitude of the waveform prior to the first arrival and the maximum amplitude of the waveform in the travel time-bounded window. The minimum normalized cross-correlation value and maximum noise level are determined by the user. If the waveform has a high enough cross-correlation value and low enough noise level, we can then move on to picking the windows for that waveform.
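The first stage can be sketched in a few lines with ObsPy's TauP bindings; the function below is our own illustration of the scheme just described, using the 2.4 km/s minimum surface wave speed from this study and a buffer of half the minimum period.

```python
from obspy.taup import TauPyModel

def pickable_bounds(dist_deg, depth_km, min_period=20.0, vmin_kms=2.4):
    """Stage 1 of window selection: the time span in which windows may be
    picked, from the ak135 first arrival to the slowest surface waves,
    padded by half the minimum period. A sketch, not Salvus code."""
    model = TauPyModel("ak135")
    arrivals = model.get_travel_times(source_depth_in_km=depth_km,
                                      distance_in_degree=dist_deg)
    t_first = min(a.time for a in arrivals)   # first body-wave arrival (s)
    t_surface = dist_deg * 111.19 / vmin_kms  # slowest surface waves (s)
    buffer = 0.5 * min_period
    return t_first - buffer, t_surface + buffer

# Example: bounds for a station 5 degrees from a 10 km deep event.
t_min, t_max = pickable_bounds(dist_deg=5.0, depth_km=10.0)
```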
The built-in window picking function in Salvus uses nine parameters to constrain windows. Three parameters control the initial window candidates: the cross-correlation values across the waveform calculated using a sliding window cross-correlation approach (see Equations 2 and 3, Krischer et al., 2015), the time shift between the observed and synthetic data in the sliding window, and the envelope amplitude similarity in the sliding window (see Figure S11 in Supporting Information S1 for visual representation). The thresholds are set by the user. Candidate windows are then refined by minimum window length, time difference between matching peaks, minimum number of peaks and troughs in the window, and the energy ratio of the maximum amplitude within the window and the absolute noise prior to the first arrival. If the proposed window fits all the above criteria to the user-defined thresholds, then a window is picked.
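To make the candidate-selection stage concrete, the sketch below flags the samples where the three local measures all exceed their thresholds; the threshold values, the wrap-around correlation, and the simple peak-to-peak envelope proxy are our own simplifications of the Salvus/LASIF scheme, not the actual implementation.

```python
import numpy as np

def candidate_mask(obs, syn, dt, min_period,
                   cc_min=0.7, max_shift_s=5.0, env_ratio_min=0.5):
    """Flag samples eligible for window candidates by sliding a window of
    1.5x the minimum period along both traces and testing local
    cross-correlation, time shift, and envelope-amplitude similarity.
    Thresholds are illustrative, not the values used in the study."""
    n = int(1.5 * min_period / dt)    # window length in samples
    max_lag = int(max_shift_s / dt)   # largest allowed time shift
    mask = np.zeros(len(obs), dtype=bool)
    for i in range(len(obs) - n):
        o, s = obs[i:i + n], syn[i:i + n]
        # Best normalized cross-correlation within the allowed time shift.
        ccs = [np.corrcoef(o, np.roll(s, lag))[0, 1]
               for lag in range(-max_lag, max_lag + 1)]
        # Simple envelope-similarity proxy: peak-to-peak amplitude ratio.
        ratio = min(np.ptp(o), np.ptp(s)) / max(np.ptp(o), np.ptp(s))
        if max(ccs) >= cc_min and ratio >= env_ratio_min:
            mask[i:i + n] = True
    return mask
```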
With the details of the window picking algorithm in mind, we turn to the results of the window picking. Table 1 shows summary statistics for the windows picked from the starting and final models. We chose to analyze two different metrics of the final window sets: the total number of windows picked by each model and their time-bandwidth product. The time-bandwidth product (total window duration times the frequency bandwidth) provides a measure of the information content in the signals that is explained by a given seismic model. The total duration of selected windows based on a common event, path and receiver data set and the same window picking algorithm provides a quantitative metric of waveform fit performance.
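Under our reading of this definition, the product is simply

$$\mathrm{TB} = T_{\mathrm{total}} \times (f_{\max} - f_{\min}) = T_{\mathrm{total}} \times \left(\frac{1}{T_{\min}} - \frac{1}{T_{\max}}\right).$$

For example, in the 20–80 s band the bandwidth is $1/20 - 1/80 = 0.0375$ Hz, so a hypothetical window set totaling 30 days ($\approx 2.6 \times 10^6$ s) would have a time-bandwidth product of roughly $9.7 \times 10^4$; these numbers are illustrative, not the values in Table 1.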
The starting models have different characteristics in terms of the number of windows picked and the time-bandwidth product. CSEM_NA picks the fewest windows overall and has the smallest time-bandwidth product (i.e., the least information included in the window set). Even though CSEM_NA is an adjoint tomography model that is optimized to fit waveforms, it is only iterated down to 30 s minimum period, and so is very smooth when meshed at 20 s. We therefore do not expect as many windows from CSEM_NA as from SPiRaL and WUS256, which both contain shorter period data. WUS256 contains the largest number of windows and includes the most data of all three starting models. Compared to CSEM_NA and SPiRaL, WUS256's window set contains 8,500 more windows and 19 more days of data, more than 50% of the data contained in CSEM_NA's window set. SPiRaL has roughly 3,000 more windows than CSEM_NA but a similar time-bandwidth product, indicating that SPiRaL picks shorter windows than CSEM_NA and WUS256.
Compared to the differences between the starting models, the final models show remarkable similarity in the number and duration of windows. All three models pick around 74,000 windows (an average of ∼2 windows per station-event pair), with CANV_SP picking the fewest windows and CANV_WUS picking the most. CANV_WUS performs the best of the three models both in terms of the number of windows picked and the amount of information contained in those windows. WUS256 has shown better data fits than both SPiRaL and CSEM_NA (Rodgers et al., 2022), leading CANV_WUS to include more surface wave dispersion data in its windows than the other final models. CANV_WUS includes 14% more data than CANV_SP and 8% more data than CANV_CS. Compared to the spread in time-bandwidth product between the starting models, the variation among the final models is minimal.
To test our models' fit on independent data, we created a validation dataset of events that were not included in the inversions. Our validation dataset is composed of 15 events that occurred between 1 November 2020 and 28 February 2022. We use the same domain and magnitude constraints as our inversion dataset for the validation dataset. We chose not to include events that occurred prior to 1 January 2000 in our validation dataset because recent events have nearly 3 times as many stations contributing data compared with events in the 1990s due to the explosion of broadband stations in California for earthquake early warning (e.g., Strauss & Allen, 2016).
The trends seen in the windowing statistics for the inversion dataset also apply to the validation events. WUS256 picks the largest number and longest duration of windows of the three starting models, amounting to 29% more information in its window set compared to CSEM_NA. As for the final models, both CANV_WUS and CANV_CS contain nearly identical amounts of data in the east component, though CANV_WUS still picks more data in the vertical and north components. CANV_SP picks the least data in the validation dataset, but still has statistics similar to CANV_CS in terms of the number of windows picked and the time-bandwidth product.
To summarize, window picking statistics serve as a proxy for waveform fit in both our inversion and validation datasets. The window picking statistics of the final models closely resemble each other in both the absolute number of windows and the time-bandwidth product of the windows for both datasets. Therefore, we infer that the three final models are converging on the same window set independently. It is also important to note that all three final models include more data than any of the starting models. This means that all three final models fit the waveforms better than our best-fitting starting model, regardless of the resolution of their respective starting models.

Inversion and Validation Event Misfits
While window picking statistics provide a proxy for the similarity of waveform fits in the time domain, they do not provide a quantitative measure of the misfit between the observed data and our models. Therefore, we compute both the time-frequency phase misfit (Fichtner et al., 2008) and a normalized L2 misfit for all models for both the inversion and validation datasets. Comparing the mean misfit values for each event allows for a more detailed understanding of the fit of the final models to the observed data.
Because the traditional L2 misfit is heavily biased toward high amplitude, and therefore near-source, receivers, we use a normalized L2 misfit, which compares the similarity in waveform shape between the observed and synthetic data. Normalizing the L2 misfit allows for an equal contribution of far-field stations to the total misfit value for a given event, providing a better proxy for how each model fits the data at all stations. To compare the models fairly, we calculated both the time-frequency phase misfit and normalized L2 misfit values for each model using the window set of WUS256. We chose to use the windows from WUS256 because it has the highest number of windows and the best waveform fits of the three starting models. We did not use a window set from one of the final models so as not to bias the comparison of the final models. The results of the event-based misfit calculations are shown in Figure 7. SPiRaL and CSEM_NA have higher misfit values than WUS256 for most events in both the inversion and validation datasets. The misfit values are highest for CSEM_NA, indicative of a poorer fit to the data than SPiRaL or WUS256. SPiRaL also has consistently higher misfit values than WUS256; as with the other statistics discussed, we expect WUS256 to have a lower misfit than SPiRaL because WUS256 uses SPiRaL as its starting model. The large spikes in misfit for SPiRaL and CSEM_NA do not seem to follow a trend by location. The misfit for SPiRaL and CSEM_NA tends to be higher for larger events (Mw > 6.0). This likely reflects a breakdown of the point source approximation for larger magnitude events, so that the simulations cannot capture the complexity of these waveforms.
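A sketch of a normalized L2 misfit consistent with this description is given below; the exact normalization used by Salvus may differ, so treat this as one common shape-comparison form rather than the study's precise definition.

```python
import numpy as np

def normalized_l2(obs, syn):
    """Shape-based L2 misfit: normalize each windowed trace to unit
    energy before differencing, so high-amplitude near-source stations
    do not dominate the event misfit. One common definition; assumed,
    not taken from the Salvus implementation."""
    o = obs / np.linalg.norm(obs)
    s = syn / np.linalg.norm(syn)
    return 0.5 * np.sum((o - s) ** 2)
```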
The misfit values for the three final models follow the same trend for both the inversion and validation datasets. The misfits for CANV_SP and CANV_CS tend to be higher than those of CANV_WUS, especially for events where the starting model has spikes in the misfit values. All three models have misfit values lower than WUS256 for 40% of the inversion events and 67% of the validation events. The smaller percentage of events where all three models have lower misfits than WUS256 in the inversion dataset is driven mainly by larger misfit values for CANV_SP and CANV_CS. These higher misfit values correlate with events where SPiRaL and CSEM_NA have anomalously high misfit values.
The most striking feature of the misfit values across the inversion and validation datasets is their consistency across the final models. This consistency points to the similarity of the final model waveforms regardless of the misfits of the starting models, providing further evidence that the final models are converging toward the same model in data space.
Calculating the event-based misfit for the starting and final models confirms the results of the window picking statistics comparison: regardless of the misfit values of the starting models, the final models have lower misfits than their respective starting models, and often perform better than WUS256. Therefore, the event-based misfits provide quantitative confirmation of the similarity in waveform fits of the three final models, and thus of their convergence toward the same model.

Envelope Similarity of Synthetic Versus Observed Data
The previous two sections covered waveform fits based on the goodness of fit between the observed and synthetic waveforms for a given model. To get an objective view of fit across the whole waveform, we compared the envelope of the observed data with the envelopes of the synthetics from both our starting and final models. Comparing envelopes allows us to assess the similarity of both the phase and amplitude of the waveforms, providing an unbiased measure of waveform fit.
To calculate the envelopes, we began by removing any station-event pair for which the signal-to-noise ratio (SNR) of the observed data was less than 10. We removed these data to ensure that noisy records would not skew the interpretation of the results. After removing noisy data, we windowed our data over the whole waveform. We chose this window to include wave speeds between 8 km/s and 2 km/s for each station-event pair. This section of the waveform captures most of the seismogram and corresponds to the high and low wave speed limits of our models. Windowing by wave speed assures an unbiased comparison of the synthetic data regardless of the similarity of the synthetic and observed data. Once the data are windowed, we compute the envelope of the observed data as well as of the synthetics for the starting and final models. Each envelope is computed as the modulus of the analytic signal, that is, the square root of the sum of the squared data and its squared Hilbert transform (Kanasewich, 1981). To determine the similarity of the envelopes, we calculate the Pearson correlation coefficient (Pearson, 1895) between the envelope of the observed data and that of the synthetic data:

$$r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}},$$

where $x_i$ is the ith value of the observed envelope, $\bar{x}$ is the mean value of the observed envelope, $y_i$ is the ith value of the synthetic envelope for a given model, and $\bar{y}$ is the mean value of the synthetic envelope. Cross-correlation values range from −1.0 (anti-correlated) to 1.0 (perfectly correlated). For the purposes of this study, we only consider positive cross-correlation values; station-event pairs with negative values account for less than 2% of the total, so they do not affect the interpretation of the results. Even with slightly lower values in the highest cross-correlation bins, the majority of CANV_SP's waveforms have cross-correlation values above 0.6, which suggests that the waveforms are still well correlated with the observed data.
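The computation is compact enough to sketch in full; the snippet below, our own minimal version, assumes the traces have already been windowed to the 8–2 km/s group-velocity span described above.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_correlation(obs, syn):
    """Pearson correlation of observed and synthetic envelopes over the
    full windowed trace. The envelope is the modulus of the analytic
    signal (Kanasewich, 1981). A sketch, assuming pre-windowed inputs."""
    env_obs = np.abs(hilbert(obs))
    env_syn = np.abs(hilbert(syn))
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal.
    return np.corrcoef(env_obs, env_syn)[0, 1]
```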
From the comparison of the final models against each other, we can infer that the waveforms of the final models are similar to each other over the whole waveform window, not just in windows chosen based on the similarity between observed and synthetic data. It is important to emphasize that, based on the cross-correlation values, all areas of the waveform, not just the sections included in the automated window set, are highly correlated with the observed data. The cross-correlation values for the starting models are variable (see Figure S12 in Supporting Information S1), yet the final models all converge toward similar values regardless of where they start from. This is a particularly powerful indicator that the three final models are converging toward the global minimum.
Comparing the envelopes of the observed and synthetic data over the whole waveform window provides an unbiased confirmation of improved waveform fits for the CANV* models. The envelope cross-correlation, combined with the misfit values and window picking statistics, provides strong evidence that all three CANV* models are moving toward the same Earth model. Adjoint waveform tomography is dependent on a starting model, and we have shown that regardless of the smoothness or artifacts seen in the starting model, a multi-scale approach can resolve discrepancies in tectonic structure and waveform fits.

Conclusions
Adjoint waveform tomography relies on a starting model to begin inversions. However, because of computational cost, previous studies have considered only a single starting model beyond the forward modeling stage. In this study, we used an efficient AWT workflow to investigate the impact of the starting model on the resulting Earth model and waveform fit. We considered three starting models for a broad region covering California. We found that the models resulting from a conservative, multi-scale inversion approach show good agreement by measures of Earth model similarity (model space) and waveform simulation performance (data space). The three models were able to overcome smoothness and artifacts in their respective starting models to produce three models that contain similar tectonic features and comparable waveform fits in California. This is the first AWT study to explicitly consider three different starting models and demonstrate the robustness of AWT to obtain consistent results.
To compare the results of the three models produced, we were interested in providing quantitative methods of determining similarity in both the model space and data space. Comparing seismic velocity models has often relied on qualitative comparisons, such as the similarity of tectonic structure to other models or visual comparison of waveform fits for select source-receiver pairs. In this paper, we proposed five methods of comparison, both qualitative and quantitative, for determining the similarity of different seismic models. We used k-means clustering and structural similarity (SSIM) index analysis to show that the three final models resolve large-scale tectonic features to a high degree of similarity both qualitatively and quantitatively. To determine the similarity of waveform fits, we studied windowing statistics, event-dependent misfit values, and envelope similarity. All three methods produce the same result: the CANV* models perform equally well at fitting the data, even outside of the windowed portions of the data.
The performance of the CANV* models suggests that we can converge toward a similar model regardless of the starting model used. However, there are important caveats to consider with these results:

• Station/Event Coverage: California provides impressive station coverage and good azimuthal coverage for events, and so is an ideal environment for AWT. In an environment with poor station coverage or azimuthally restricted sources, the choice of starting model will likely have a bigger influence.

• Methodology: We chose to take a careful multi-scale approach to our inversions, both in terms of the data that we include and the inversion steps we make. We allowed the L-BFGS algorithm to choose the step size at each iteration, closely monitoring and confirming that the predicted and actual misfit reductions were consistent. This approach led to more iterations than are commonly seen in contemporary work, but we believe that this careful path through misfit space is a key factor in starting model independence.

• Wavelength: The final models we present are iterated down to 20 s minimum period, which is low resolution for a regional velocity model. Artifacts related to the starting models are likely to have a greater effect at shorter periods, where data are more complex and affected by small-scale structures. Because of the computational cost of running three California-scale models to short periods, we suggest using a smaller region where many starting models are available (e.g., Southern California) to test the effect of starting models down to 5 s or below. We defer this question to a later study.
With these caveats in mind, we argue that the choice of starting model has minimal effect on the final inversion results at long periods. This implies that adjoint methods can overcome artifacts in starting models, and that we can be confident we are marching toward a global minimum with conservative iteration approaches. This also suggests that all starting models lie in the locally convex region of the misfit functional, which justifies uncertainty quantification using a Gaussian approximation and information derived from the Hessian (Bui-Thanh et al., 2013). The minimal effect of the starting model also means that starting with a one-dimensional model (e.g., Boehm et al., 2022; Métivier et al., 2016) is not required to minimize starting model artifacts in areas with dense ray coverage. 1D starting models require significantly more iterations due to their layered structure and lack of lateral heterogeneity; using a 3D starting model minimizes computational time while still not contributing artifacts.

Data Availability Statement
All the starting and final models mentioned in this paper are available on Zenodo (Doody, 2023a). The codes used for figure generation are also available on Zenodo (Doody, 2023b). Maps were created using PyGMT (Uieda et al., 2023), a Python wrapper for the Generic Mapping Tools version 6. The facilities of IRIS Data Services, and specifically the IRIS Data Management Center, were used for access to raw waveforms, related metadata, and/or derived products used in this study. IRIS Data Services are funded through the Seismological Facilities for the Advancement of Geoscience (SAGE) Award of the National Science Foundation under Cooperative Support Agreement EAR-1851048.