A Multi-Objective Optimization Framework for Joint Inversion

Abstract: Different geophysical data sets such as receiver functions, surface wave dispersion measurements, and first arrival travel times, provide complementary information about the Earth structure. To utilize all this information, it is desirable to perform a joint inversion, i.e., to use all these datasets when determining the Earth structure. In the ideal case, when we know the variance of each measurement, we can use the usual Least Squares approach to solve the joint inversion problem. In practice, we only have an approximate knowledge of these variances. As a result, if a geophysical feature appears in a solution corresponding to these approximate values of variances, there is no guarantee that this feature will still be visible if we use the actual (somewhat different) variances. To make the joint inversion process more robust, it is therefore desirable to repeatedly solve the joint inversion problem with different possible combinations of variances. From the mathematical viewpoint, such solutions form a Pareto front of the corresponding multi-objective optimization problem.


Introduction
In this paper, we describe how to combine multiple geophysical datasets to better determine physical properties of the Earth structure. The need for combining different datasets comes from the fact that different datasets provide complementary information about the Earth structure. By jointly inverting multiple geophysical datasets, we combine these complementary pieces of information and thus get a more accurate description of the Earth structure; see, e.g., (Vozoff and Jupp 1975). Specifically, we combine receiver functions, surface wave dispersion measurements, and P-wave travel times. The need to use different datasets comes from the fact that there are two types of seismic waves that travel through the Earth: body waves and surface waves. The two types of waves provide different sensitivities and information about the Earth structure, since they sample, correspondingly, the interior and the near-surface of the Earth. The information carried by body waves, which travel deep into the Earth, is captured in teleseismic P-wave receiver functions. Information about the shallow Earth is obtained by analyzing surface waves, in our case, by means of surface wave dispersion. Receiver functions resolve discontinuities (impedance contrasts) in seismic velocities and provide a good measurement of crustal thickness, but they do not provide a good average of shear wave velocity. On the other hand, surface (Love and Rayleigh) waves, whose energy is concentrated near the Earth's surface, provide a good average of absolute shear wave velocity, but do not resolve shear-wave velocity contrasts in layered structures well (Julia et al. 2000, Obrebski et al. 2010). Therefore, these two data sets provide complementary information about the Earth structure. Seismic first-arrival travel times complement the other datasets because travel times enable us to recover the causative slowness of the Earth structure (Lees and Vandecar 1991).
For each dataset, we usually know the relative variance (data uncertainty) of the different measurement results in this dataset, and thus, we can use the Least Squares method to find the corresponding Earth model. For multiple datasets, we can sometimes still use the Least Squares method to process all these datasets, provided that we know the variances of different measurements from different datasets. In practice, however, we usually only have an approximate knowledge of these variances. So, instead of producing a single model, we generate several models corresponding to different possible variances. If all these models, corresponding to different possible values of the variances, contain a certain geophysical feature, then we can be confident that this feature is also present in the actual Earth model (which corresponds to the actual, unknown, values of the variances).
From the mathematical viewpoint, the task of computing all these models is equivalent to computing the Pareto front of the corresponding Multi-Objective Optimization problem (Kozlovskaya 2000), where different objective functions correspond to different datasets.
In addition to producing all these models, it is also desirable to produce a "typical" model, so that we first look for features which are present in this typical model. In this paper, we use methods for selecting such a typical model as described in (Sambridge 1999a, Sambridge 1999b, Kozlovskaya 2000). This paper has the following structure. In Section 2, we describe, in detail, the need for multi-objective optimization. In Section 3, we show how to solve the corresponding optimization problems. In Section 4, we briefly describe the corresponding geophysical datasets. The results of applying the multi-objective optimization technique to these datasets are shown and discussed in Section 5. Finally, Section 6 contains conclusions.

Inverse Problems: Brief Reminder
In many real-life situations, we are interested in the values of quantities x_1, x_2, ..., which are difficult (or even impossible) to measure directly. For example, in geophysics, one of our main objectives is to find the shear velocities x_i at different 3-D points i (i.e., at different depths and at different geographic locations). To find these values, we measure easier-to-measure quantities y_1, y_2, etc., which are related to x = (x_1, ..., x_n) by a known relation y = F(x), and then use the measured quantities y = (y_1, ..., y_m) to find the desired values x.
In particular, in geophysics, to find the shear velocities x_i, we can measure teleseismic receiver functions y_RF, surface wave dispersion velocities y_SW, travel times y_TT, etc. For each of these types of data, if we know the velocity model x, then we can predict the Earth's response by applying the corresponding (known) operator F. Measurements are never absolutely accurate: there is always some measurement inaccuracy, always some level of noise preventing us from measuring the corresponding quantities exactly. A usual way to estimate parameters in the presence of noise is to use the Least Squares method, i.e., to find the values x that minimize the expression

  Σ_i (F_i(x) − y_i)² / σ_i²,   (1)

where σ_i is the standard deviation of the noise (measurement inaccuracy) of the i-th measurement.
In some cases, all the available data points come from measurements of the same type, obtained by using the same methodology and similar instrumentation. For example, we may only have travel times, or only surface wave dispersion velocities. In such cases, it is reasonable to assume that all these measured values have the same accuracy, i.e., that σ_i = const. Under this assumption, minimizing the expression (1) is equivalent to minimizing the sum of the squares

  Σ_i (F_i(x) − y_i)².   (2)
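As a quick illustration of the expressions (1) and (2), here is a small numerical sketch with a hypothetical linear forward operator F(x) = Ax; the matrix A, the model x_true, and the noise level are made-up stand-ins, not the paper's geophysical operators:

```python
import numpy as np

# Toy forward operator F(x) = A @ x: recover x from noisy measurements y.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
x_true = np.array([1.0, -2.0, 0.5])
sigma = np.full(50, 0.1)                 # known per-measurement std. deviations
y = A @ x_true + rng.normal(scale=sigma)

# Weighted Least Squares: minimize sum_i (F_i(x) - y_i)^2 / sigma_i^2.
# With equal sigma_i this reduces to ordinary least squares, expression (2).
W = np.diag(1.0 / sigma)
x_hat, *_ = np.linalg.lstsq(W @ A, W @ y, rcond=None)
print(np.round(x_hat, 2))
```

With equal σ_i, as here, the weighted and ordinary least squares solutions coincide, which is exactly the equivalence of (1) and (2) noted above.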

Need to Take Constraints into Account
Sometimes, the models x obtained by such a minimization are not physically meaningful. For example, in geophysics, some models x predict higher velocities in the crust and lower velocities in the mantle, contrary to known geophysical models. In other cases, when we expect a smooth dependence of a signal on time, the reconstructed signal x can exhibit abrupt non-physical changes.
To avoid such non-physical solutions, it is desirable to explicitly take into account the corresponding constraints. For example, in most practical problems, there are known physical bounds on the values of the quantities x i . In particular, in geophysics, for each depth, we can approximate the lower bound and the upper bound on possible values of shear velocity x i at this depth.
In precise terms, for every i, we know bounds a_i and b_i such that a_i ≤ x_i ≤ b_i. Restrictions on smoothness can be described as known bounds ∆_ij on the difference between the values x_i and x_j at nearby points of time or at nearby spatial locations: |x_i − x_j| ≤ ∆_ij. Under the corresponding constraints, the optimization problem (2) takes the form

  min_x Σ_i (F_i(x) − y_i)²  subject to  g(x) ≥ 0,   (3)

where g(x) is a vector consisting of the corresponding constraints. For example, to describe the bounds a_i and b_i on the values x_i, we use constraints x_i − a_i ≥ 0 and b_i − x_i ≥ 0.
Constraints corresponding to smoothness can also be expressed in the form g(x) ≥ 0, with the corresponding components of the vector g(x) having the form ∆_ij − (x_i − x_j) ≥ 0 and ∆_ij + (x_i − x_j) ≥ 0.

Comment. Traditionally, researchers avoid non-physical non-smooth velocity models by adding a regularization term λ‖Lx‖² to the minimized function; see, e.g., (Tikhonov and Arsenin 1977). The problem with this term is that it is not clear how to select λ, and different values of λ lead to different solutions; see, e.g., (Hansen 2010, Vogel 2002). Because of this problem, in this paper, instead of using regularization, we explicitly formulate the constraints that need to be satisfied. For example, the desired smoothness is described as a bound on the differences x_i − x_j.
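The constrained formulation (3) can be sketched as follows; the toy operator, the bounds, and the smoothness limit ∆ are all hypothetical, and a general-purpose solver stands in for the interior point method described later in the paper:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 5
A = rng.normal(size=(20, n))
x_true = np.array([1.0, 1.2, 1.3, 1.5, 1.6])   # smooth target profile
y = A @ x_true + rng.normal(scale=0.05, size=20)

# Bounds a <= x_i <= b and smoothness |x_{i+1} - x_i| <= delta.
a, b, delta = 0.0, 3.0, 0.5

def misfit(x):
    r = A @ x - y
    return r @ r

# Each smoothness restriction contributes two components of g(x) >= 0.
cons = []
for i in range(n - 1):
    cons.append({"type": "ineq", "fun": lambda x, i=i: delta - (x[i + 1] - x[i])})
    cons.append({"type": "ineq", "fun": lambda x, i=i: delta + (x[i + 1] - x[i])})

res = minimize(misfit, np.full(n, 1.0), method="SLSQP",
               bounds=[(a, b)] * n, constraints=cons)
print(np.round(res.x, 2))
```

Since the target profile already satisfies the bounds and the smoothness restrictions, the constraints are inactive at the solution and the result stays close to the unconstrained least squares estimate.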

Joint Inversion: Idealized Case
As noted earlier, measurements of different types usually provide complementary information, and it is therefore beneficial to use measurement results of all types.
When we use measurements of different types t, t', etc., then, while it is reasonable to assume that all the measurements i of the same type t have the same standard deviation σ_i = σ_t, standard deviations of measurements of different types are, in general, different: σ_t ≠ σ_{t'}. Let us first consider the idealized case, when we assume that we know the accuracy σ_t of measurements of each type t.
In this case, we can still use the Least Squares expression (1) to find the desired model x. By grouping together measurements of the same type, we can rewrite the expression (1) in the following form:

  Σ_{t=1}^{T} (1/σ_t²) · Σ_{i∈t} (F_i(x) − y_i)²,   (4)

where T is the total number of different types of measurements, and the notation i ∈ t indicates that the i-th measurement is of type t.
We can rewrite this expression as

  Σ_{t=1}^{T} c_t² · ‖F_t(x) − y_t‖²,   (5)

where c_t is defined as 1/σ_t, y_t is the list (tuple) consisting of all measurements of type t, and F_t(x) is the list consisting of all the corresponding values F_i(x).
To find the desired solution x, we must minimize this expression under the constraint g(x) ≥ 0.

Reformulation of the Problem
The more measurements of a given type t we have, the larger the contribution of these measurements to the solution. In general, all the terms in the sum (4) are of approximately the same size, so we can conclude that the joint contribution of all the measurements of type t is proportional to the number m_t of measurement results of this type.
To compare the importance of measurements of different types, it is useful to define the relative importance of a type t as the ratio of its importance m_t to the overall value Σ_{t'} m_{t'}, i.e., as η_t = m_t / Σ_{t'} m_{t'}. These influence parameters η_t are non-negative numbers that add up to 1.
To better understand the minimized expression (5), it makes sense to show the explicit dependence on the influence parameters. This can be done, e.g., by making sure that each term in the new formula is proportional to η_t; this way, the larger the influence parameter, the larger the influence of the corresponding term. To do that, we use the fact that η_t · Σ_{t'} m_{t'} = m_t and replace each term

  (1/σ_t²) · ‖F_t(x) − y_t‖²   (6)

by the equal term (Σ_{t'} m_{t'}) · η_t · (1/(σ_t² · m_t)) · ‖F_t(x) − y_t‖². Thus, the whole expression (5) takes the form

  C · Σ_{t=1}^{T} η_t · (1/(σ_t² · m_t)) · ‖F_t(x) − y_t‖²,  where C = Σ_{t'} m_{t'}.   (7)

The location of the minimum does not change if we divide all the values of a function by the same positive coefficient C. Therefore, minimizing the expression (7) is equivalent to minimizing the simpler expression

  Σ_{t=1}^{T} (η_t/(σ_t² · m_t)) · ‖F_t(x) − y_t‖².   (8)
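The rescaling argument above can be checked numerically; the residual norms, measurement counts m_t, and accuracies below are made-up values:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical values for T = 3 data types.
m = np.array([40, 25, 10])              # number of measurements per type
sigma = np.array([0.2, 0.05, 0.1])      # per-type standard deviations
r2 = rng.uniform(1.0, 5.0, size=3)      # squared residual norms ||F_t(x) - y_t||^2

eta = m / m.sum()                       # influence parameters, add up to 1
expr5 = np.sum(r2 / sigma**2)           # expression (5)
expr8 = np.sum(eta / (sigma**2 * m) * r2)   # expression (8)

# The two expressions differ only by the constant C = sum_t m_t,
# so they have the same minima.
print(round(expr5 / expr8, 6))   # -> 75.0
```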

General Case: A Description
In the previous sections, we considered the ideal case, when we know the exact variance σ_i² of each measurement i. In this case, we can use the usual Least Squares approach to solve the joint inversion problem.
In practice, we only have an approximate knowledge of these variances. For example, for each measurement type t, we only know an approximate value σ̃_t of the corresponding standard deviation σ_t.
A traditional approach in such situations is to use these approximate values σ̃_t and solve the corresponding optimization problem. The problem with this approach is that, if a geophysical feature appears in the solution corresponding to these approximate values of the variances, there is no guarantee that this feature will still be visible if we use the actual (somewhat different) variances.
It is desirable to separate artifacts that are due to the specific choice of variances from the phenomena that occur no matter what variances we use. For this purpose, we wish to repeatedly solve the joint inversion problem with different possible combinations of variances. If a certain geological feature is visible in all these solutions, then we can be confident that this feature is also present in the actual solution corresponding to the actual (unknown) values of the variances -i.e., that it is the feature of the actual Earth structure.

General Case: Analysis of the Problem
In the previous sections, we described the need to minimize the expression

  Σ_{t=1}^{T} (1/σ_t²) · ‖F_t(x) − y_t‖²   (9)

under the condition g(x) ≥ 0, where σ_t are the known standard deviations. We showed that this is equivalent to minimizing the expression (8). In situations when we only know approximate values σ̃_t, the traditional approach would be to use these approximate values, i.e., to minimize the expression

  Σ_{t=1}^{T} (1/σ̃_t²) · ‖F_t(x) − y_t‖²,   (10)

or, equivalently, to minimize the expression (8), in which the coefficients are the weights

  w_t² = η_t/(σ̃_t² · m_t),   (11)

with the same values η_t = m_t / Σ_{t'} m_{t'} of the influence parameters.
As we have mentioned, a more appropriate approach is to minimize the expressions (9) corresponding to all possible combinations of standard deviations σ_t. Let us show that:
• each such constrained minimization problem can be equivalently reformulated in the form (8), with the weights (11) corresponding to appropriately selected values of the influence parameters η_t > 0, and
• for each combination of influence parameters η_t > 0 with Σ_t η_t = 1, there exist values σ_t > 0 for which the corresponding optimization problem (8) is equivalent to the original optimization problem (9).
Indeed, for each combination of values σ_t, let us take

  η_t = c · (σ̃_t² · m_t)/σ_t²,   (12)

where the normalization coefficient c is chosen so that the values η_t add up to 1:

  c = 1 / Σ_{t'} ((σ̃_{t'}² · m_{t'})/σ_{t'}²).   (13)

In this case, η_t > 0 and Σ_t η_t = 1. For the corresponding weights (11), we get w_t² = η_t/(σ̃_t² · m_t) = c/σ_t². Thus, minimizing the expression (8) is equivalent to minimizing the expression

  c · Σ_{t=1}^{T} (1/σ_t²) · ‖F_t(x) − y_t‖²,

which, in its turn, is equivalent to minimizing the expression (9).
Vice versa, for each combination of influence parameters η_t > 0 for which Σ_t η_t = 1, let us take

  σ_t² = (σ̃_t² · m_t)/η_t.   (14)

For these standard deviations σ_t, the expression (9) takes the form

  Σ_{t=1}^{T} w_t² · ‖F_t(x) − y_t‖²,   (15)

where we denoted w_t² = 1/σ_t². From the formula (14), we conclude that w_t² = η_t/(σ̃_t² · m_t), i.e., that these are exactly the weights (11). Thus, the expression (15) coincides with the expression (8), and the minimization of the expression (9) corresponding to the standard deviations (14) is indeed equivalent to minimizing the expression (8).
The equivalence is proven.
Comment. The above analysis covers the case when we know the approximate values σ̃_t of the corresponding accuracies σ_t, but we know no guaranteed bounds on these accuracies. In some practical situations, in addition to the approximate values σ̃_t, we also know lower and upper bounds σ_t⁻ and σ_t⁺ for which σ_t⁻ ≤ σ_t ≤ σ_t⁺. In this case, instead of all possible values η_t of the corresponding influence parameter, we only need to consider the values within the corresponding range, and we can use the above formulas relating η_t with σ_t to transform the bounds on σ_t into the corresponding bounds on η_t.

General Case: Resulting Optimization Problem
So, we arrive at the following equivalent reformulation of our problem: for every possible combination of influence parameters η_t > 0 for which Σ_t η_t = 1, we compute the weights

  w_t = sqrt(η_t/(σ̃_t² · m_t)),

and then minimize the expression

  Σ_{t=1}^{T} w_t² · ‖F_t(x) − y_t‖²

under the corresponding constraints g(x) ≥ 0. This minimization problem can be reformulated as

  min_x ‖W · (F(x) − y)‖²  subject to  g(x) ≥ 0,

where W is a diagonal matrix whose elements will be called weights: the diagonal entry corresponding to a measurement of type t is equal to w_t.
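A small sketch of assembling the diagonal weight matrix W from the influence parameters; the sizes p, q, r, the accuracies, and the chosen η_t are hypothetical:

```python
import numpy as np

# Hypothetical sizes for the three data types used in the paper:
# p receiver-function, q surface-wave, r travel-time measurements.
p, q, r = 4, 3, 2
m = np.array([p, q, r])
sigma_approx = np.array([0.2, 0.05, 0.1])   # approximate per-type accuracies
eta = np.array([0.5, 0.3, 0.2])             # one point on the simplex, sums to 1

# Per-type weights with w_t^2 = eta_t / (sigma_t^2 * m_t), formula (11).
w = np.sqrt(eta / (sigma_approx**2 * m))

# Diagonal weight matrix: every measurement of type t gets the weight w_t.
W = np.diag(np.repeat(w, m))
print(W.shape)   # -> (9, 9)
```

Sweeping η over the simplex and re-solving the weighted problem for each W is exactly what maps out the set of possible solutions.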
In particular, when we reconstruct the values of shear velocities from Receiver Functions (RF), Surface Waves (SW), and Travel Times (TT), the corresponding minimized functional J(x) takes the form

  J(x) = (η_RF/(σ̃_RF² · p)) · ‖F_RF(x) − y_RF‖² + (η_SW/(σ̃_SW² · q)) · ‖F_SW(x) − y_SW‖² + (η_TT/(σ̃_TT² · r)) · ‖F_TT(x) − y_TT‖²,

where σ̃_t is the approximate standard deviation of the measurements of the corresponding type, and p, q, and r are the numbers of RF, SW, and TT observations (Sosa et al. 2013).

Relation to Multi-Objective Optimization
When we have observations of only one type t, then, to find the corresponding model, we minimize the function ‖F_t(x) − y_t‖². Minimizing this function is equivalent to minimizing the expression f_t(x) = const · ‖F_t(x) − y_t‖². In situations when we have observations of different types, and when we only know the approximate values σ̃_t of the corresponding accuracies, we need to minimize the expressions (8), i.e., equivalently, expressions of the type

  Σ_{t=1}^{T} η_t · f_t(x)   (19)

corresponding to all possible combinations η_t > 0 for which Σ_t η_t = 1. It is known that, under reasonable conditions, the resulting set of solutions can be described in terms of the corresponding multi-objective optimization problem (MOP), namely, the problem of optimizing the tuple of objective functions f(x) = (f_1(x), ..., f_T(x)). For solving multi-objective problems, a natural idea is to generate the Pareto optimal set (also known as the Pareto front), i.e., the set of all the values x for which it is not possible to improve all the criteria f_i(x) at once. In precise terms, the Pareto optimal set P(x) consists of all x ∈ Ω for which there is no x' ∈ Ω with f_t(x') ≤ f_t(x) for all t and f_t(x') < f_t(x) for at least one t, where Ω is the set of all possible solutions that satisfy the corresponding constraints. Under reasonable conditions, elements of the Pareto set can be obtained by finding the minima of all the functions (19) corresponding to all possible weights η_t adding up to 1, and, vice versa, each such minimum is an element of the Pareto set P(x).
In these terms, we can say that what we want in the general case, when we only know the approximate values of the corresponding accuracies, is to find the Pareto set of the multi-objective problem in which we minimize the criteria f_t(x) = const · ‖F_t(x) − y_t‖² corresponding to measurements of different types t.
In particular, in our geophysical example, we want to minimize the three criteria f_RF(x) = const · ‖F_RF(x) − y_RF‖², f_SW(x) = const · ‖F_SW(x) − y_SW‖², and f_TT(x) = const · ‖F_TT(x) − y_TT‖².
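The connection between the weighted sums (19) and the Pareto front can be illustrated with two toy one-dimensional criteria, chosen so that each weighted minimum has a closed form:

```python
import numpy as np

# Two toy objectives standing in for per-dataset misfits f_t(x):
f1 = lambda x: (x - 0.0) ** 2
f2 = lambda x: (x - 1.0) ** 2

# Scan the weights eta in (0, 1); each weighted minimum is Pareto optimal.
pareto = []
for eta in np.linspace(0.01, 0.99, 50):
    # Minimizer of eta*f1 + (1 - eta)*f2 in closed form: x* = 1 - eta.
    x_star = 1.0 - eta
    pareto.append((f1(x_star), f2(x_star)))

pareto = np.array(pareto)
# No front point dominates another: as f1 decreases, f2 increases.
print(np.all(np.diff(pareto[:, 0]) < 0), np.all(np.diff(pareto[:, 1]) > 0))
```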

How to Generate a "Typical" Solution
When we have data of several types, and we only know approximate values of the corresponding accuracies, our recommendation is to generate solutions corresponding to all possible combinations of actual accuracies. The number of such solutions is huge, and meaningfully analyzing all these solutions is difficult. It is therefore desirable to select, among these solutions, a solution which is, in some sense, typical.
The expectation is that, in general, this "typical" solution will only have features that all other solutions have. Thus, when we look for features common for all possible solutions, a good idea is to first analyze this typical solution, and then to check whether the features that we found on this solution are indeed present in all other solutions as well.
In multi-objective optimization, there are several possible ways of generating such a "typical" solution; see, e.g., (Sambridge 1999a, Sambridge 1999b, Kozlovskaya 2000). For example, once we have computed the smallest possible values f_t^min of each of the objective functions f_t(x), we can form the "ideal point" (f_1^min, ..., f_T^min). We then select a solution x which is the closest to this ideal point. Since the objective functions may have different orders of magnitude, we first normalize the differences f_t(x) − f_t^min, and then we minimize the corresponding normalized distance. In other words, we select a solution x for which this distance is the smallest possible.
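The ideal-point selection can be sketched as follows; the table of objective values for the candidate models is hypothetical:

```python
import numpy as np

# Hypothetical objective values for candidate Pareto-front models
# (rows: candidate models, columns: the T = 3 per-dataset misfits f_t).
F = np.array([[1.0, 4.0, 3.0],
              [2.0, 2.0, 2.5],
              [4.0, 1.0, 2.0]])

f_min = F.min(axis=0)          # ideal point (f_1^min, ..., f_T^min)

# Normalized distance to the ideal point; the closest model is "typical".
d = np.sqrt((((F - f_min) / f_min) ** 2).sum(axis=1))
typical = int(np.argmin(d))
print(typical)   # -> 1
```

Here the middle model, which is reasonably good on all three criteria, is chosen over the models that excel on one criterion but do poorly on another; normalizing by f_t^min is one of several plausible scalings.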

Solving the Corresponding Constraint Optimization Problems
In the proposed approach, we need to solve several minimization problems, corresponding to different combinations of the influence parameters η_t. For each such combination, we need to minimize the value ‖F(x) − y‖² under the constraint g(x) ≥ 0. Let us show how to solve the corresponding constrained optimization problems.

Linearization
In most practical situations, we know approximate values x_0 of the corresponding quantities x, i.e., values for which x ≈ x_0. Since these values are close to each other, the difference x − x_0 is small, and thus, we can expand the functions F(x) and g(x) into Taylor series in this difference, safely ignore terms which are quadratic (or of higher order), and keep only the first order Taylor approximation. Once we solve this first-order approximation problem, we get a better approximation x_1. We can use this approximation as the basis of a new linearization and get an even more accurate approximation, etc.
On each of these iterations, we start with an approximate model x_k, and then we use the first order Taylor approximation of the operators F and g around x_k:

  F(x) ≈ F(x_k) + F'(x_k) · (x − x_k),   (23)
  g(x) ≈ g(x_k) + ∇g(x_k) · (x − x_k),   (24)

where F'(x_k) is the matrix formed by the partial derivatives of F, and ∇g(x_k) is the matrix formed by the partial derivatives of the different components of g(x). Substituting the expressions (23) and (24) into the corresponding constrained optimization problem, we get the following linearized constrained optimization problem:

  min_δ ‖F(x_k) + F'(x_k) · δ − y‖²  subject to  g(x_k) + ∇g(x_k) · δ ≥ 0,

where δ = x − x_k.
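The resulting iteration can be sketched as a plain Gauss-Newton loop on a made-up operator F, with the constraints omitted here to isolate the linearization step (23):

```python
import numpy as np

# Hypothetical smooth nonlinear forward operator F and its Jacobian F'(x).
def F(x):
    return np.array([x[0] ** 2 + x[1], x[0] + x[1] ** 2, x[0] * x[1]])

def jac(x):
    return np.array([[2 * x[0], 1.0],
                     [1.0, 2 * x[1]],
                     [x[1], x[0]]])

x_true = np.array([1.0, 2.0])
y = F(x_true)                    # noise-free data for the demo
x = np.array([0.5, 1.5])         # starting model x_0

for _ in range(20):
    J = jac(x)
    # Linearized subproblem: minimize ||F(x_k) + J (x - x_k) - y||^2.
    step, *_ = np.linalg.lstsq(J, y - F(x), rcond=None)
    x = x + step

print(np.round(x, 6))
```

Each pass solves the linearized least squares subproblem exactly and re-linearizes at the improved model, which is the iteration described above.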
To solve this linearized problem, we use a primal-dual interior point method. The corresponding optimality conditions imply, in particular, that z − w = 0, and thus z = w, where w is the slack variable associated with the inequality constraints and z is the corresponding auxiliary variable. Hence, in terms of the step δ, the slack s, and the dual variable z, the perturbed Karush-Kuhn-Tucker (KKT) system can be written in the standard primal-dual form

  Aᵀ · (A · δ − r) − Gᵀ · z = 0,
  G · δ + g(x_k) − s = 0,
  S · Z · e = μ · e,

where A = F'(x_k), r = y − F(x_k), G = ∇g(x_k), S and Z are the diagonal matrices with entries s_i and z_i, e = (1, ..., 1)ᵀ, and μ > 0 is the barrier parameter. Applying Newton's method to this system leads to a linear system with a 3 × 3 block Jacobian in the unknowns (∆δ, ∆s, ∆z). This system can be simplified further by eliminating the third block of equations: from the last block, we can express ∆s in terms of ∆z; substituting this expression into the remaining equations, we arrive at a reduced linear system for ∆δ with the symmetric positive definite matrix Aᵀ · A + Gᵀ · S⁻¹ · Z · G, which we solve at each interior point iteration.
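The block elimination described above can be checked numerically on a small system with random data; the block layout below is one standard arrangement of the Newton/KKT equations, assumed for illustration rather than taken from the original computation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 4, 3
H = 2.0 * np.eye(n)                       # Hessian block (A^T A in our setting)
G = rng.normal(size=(k, n))               # constraint Jacobian
s = rng.uniform(1.0, 2.0, size=k)         # slacks,  s > 0
z = rng.uniform(1.0, 2.0, size=k)         # duals,   z > 0
r1, r2, r3 = rng.normal(size=n), rng.normal(size=k), rng.normal(size=k)

# Full Newton system in the unknowns (dx, ds, dz):
#   H dx           - G^T dz = r1
#   G dx -    ds            = r2
#          Z ds   +   S dz  = r3
KKT = np.block([
    [H,                 np.zeros((n, k)), -G.T],
    [G,                 -np.eye(k),        np.zeros((k, k))],
    [np.zeros((k, n)),  np.diag(z),        np.diag(s)],
])
dx_full = np.linalg.solve(KKT, np.concatenate([r1, r2, r3]))[:n]

# Reduced system after eliminating ds and dz, as in the text:
lhs = H + G.T @ np.diag(z / s) @ G
rhs = r1 + G.T @ ((r3 + z * r2) / s)
dx_red = np.linalg.solve(lhs, rhs)
print(np.allclose(dx_full, dx_red))       # -> True
```

The reduced matrix H + GᵀS⁻¹ZG is symmetric positive definite whenever s and z stay positive, which is what makes the reduced solve cheaper and more stable than factoring the full indefinite block system.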

Receiver Functions
A receiver function is simply a time series representation of the Earth's response to an incoming P-wave propagating near a recording station. A representation of a receiver function is shown in figure 2, with different incoming P-to-S converted waves and a time series representation of the Earth response underneath a seismic station. Positive or negative spike amplitudes represent positive or negative seismic velocity contrasts. The receiver function technique can model the structure of the Earth by using seismograms from three-component (vertical, north, and east) seismic stations recording teleseismic earthquakes. The technique takes advantage of the fact that part of the energy of seismic P waves is converted into S waves at discontinuities along the ray path (Bashir et al.). Receiver functions were first applied in the late 1970s at solitary stations to obtain local one-dimensional structural estimates (Langston 1981). Since then, there has been an increase in the number of stations deployed in seismic experiments. It is now possible to generate detailed two- or three-dimensional images of structures, such as the Moho and the upper mantle transition zone discontinuities near 410 km and 670 km depth; see, e.g., (Wilson 2003).
Receiver functions are derived using deconvolution (Ligorría and Ammon 1999), a mathematical method used to filter a signal and isolate the superimposed harmonic waves. Specifically, receiver functions are calculated by deconvolving the vertical component of a seismogram from the radial component, resulting in the identification of converted phases where there is an impedance contrast, e.g., at the crust-mantle boundary (Shearer 2009).
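A toy frequency-domain (water-level) deconvolution illustrates the idea on synthetic signals; this simple spectral-division scheme is a stand-in for the iterative time-domain method cited above, and all signals and parameters are made up:

```python
import numpy as np

n = 256
t = np.arange(n)
source = np.exp(-0.5 * ((t - 30) / 3.0) ** 2)   # incoming wavelet ("vertical")
earth = np.zeros(n)
earth[0], earth[40] = 1.0, 0.5                  # direct P plus one Ps conversion
radial = np.real(np.fft.ifft(np.fft.fft(source) * np.fft.fft(earth)))
vertical = source

# Spectral division with a water level to stabilize small denominators.
V, R = np.fft.fft(vertical), np.fft.fft(radial)
denom = np.maximum(np.abs(V) ** 2, 1e-3 * np.abs(V).max() ** 2)
rf = np.real(np.fft.ifft(R * np.conj(V) / denom))

print(int(np.argmax(rf)))   # -> 0  (direct arrival; the Ps spike appears near lag 40)
```

Dividing the radial spectrum by the vertical spectrum removes the common source wavelet, so the recovered series peaks at the lags of the direct and converted arrivals, with relative amplitudes reflecting the impedance contrast.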

Surface Wave Dispersion
Surface waves in general differ from body waves in many respects: they travel more slowly, have lower frequencies and larger amplitudes, and their velocities depend on frequency (Shearer 2009). The surface wave velocities vary with the depth range sampled by each period of the surface wave; this dependence of velocity on period is known as dispersion (Shearer 2009). Valuable information can be inferred by measuring surface wave dispersion, because it constrains the average shear wave velocity at the depths sampled by each period. The dispersion curves for surface waves are extracted from station records of three-component seismograms for different frequencies and distances, by using reduction algorithms that rely on spectral analysis techniques. The important fact here is that, based on Rayleigh's principle, surface wave velocities are most sensitive to S wave velocity, although they are also theoretically sensitive to P wave velocity and density. Figure 4 provides an example of teleseismic Rayleigh waveforms for the 12/01/2010 event. Rayleigh's principle states that the relative phase velocity perturbation δc/c can be expressed through the sensitivity coefficients K_α, K_β, and K_ρ for P wave velocity, S wave velocity, and density, respectively, integrated over depth:

  (δc/c)(T) = ∫ [K_α(T, z) · (δα/α) + K_β(T, z) · (δβ/β) + K_ρ(T, z) · (δρ/ρ)] dz,

where T is the period and z is the depth. By investigating how the sensitivity functions vary with depth, the relative contribution of each property to dispersion can be shown. This subject is beyond the scope of our work; we just mention here that such an analysis allows geophysicists to show that the relative contributions of P wave velocity and density to dispersion are smaller than that of S wave velocity (Julia et al. 2000). That is, surface wave dispersion is much more sensitive to S wave velocity, and this establishes the dependence of this data set on shear wave velocity.
For this research, a Matlab-based software package that automatically downloads, analyzes, and measures the phase as well as the amplitude of surface waves to generate surface-wave tomography maps was used to construct figure 5, which shows tomographic images of the Texas region. This package, the Automated Surface-Wave Phase-Velocity Measuring System (ASWMS), was developed by (Jin and Gaherty 2014), who used this automated cross-correlation based method to generate surface-wave tomography of the entire U.S. The ASWMS tool was used to see what geological signatures we can resolve using surface wave phase data.

Delay Travel Times
Figure 5. Example plot shows P-delay times for different USArray stations that were used for this research, from an earthquake during 09/15/2011. The second plot shows the difference between the predicted and measured travel times and illustrates the P-wave delay.

For this research, we used the TauP toolkit (τ(p)) and the Antelope (BRTT) database program to acquire the delayed travel times from the Array Network Facility (ANF) seismic catalog. Figure 6 shows an example of some of the delayed travel time data used for this study that were acquired from the ANF seismic catalog. The TauP toolkit program that we used to acquire the delayed travel times takes the spherical Earth geometry into account in the computation. The TauP toolkit uses the relation given by Snell's law,
according to which, for each ray k, the ratio sin(i_kj)/v_j remains constant along the k-th ray, i.e., does not depend on the layer index j; here, i_kj is the incidence angle of the k-th ray in the j-th layer, and v_j is the velocity in this layer. This constant ratio is known as the ray parameter, and is usually denoted by p_k. We use this law to compute the delayed travel times ∆T_k (i.e., the components of the vector F_TT(x)):

  ∆T_k = Σ_j h_j / (v_j · cos(i_kj)),   (32)

where h_j is the thickness of the j-th layer. By using Snell's law as mentioned earlier, the incidence angle i_kj can be rewritten as

  sin(i_kj) = p_k · v_j.   (33)

Equation (32) can thus be rewritten as

  ∆T_k = Σ_j h_j / (v_j · sqrt(1 − p_k² · v_j²)).   (34)

Differentiating the formula (34), we obtain the partial derivatives which are needed to apply the primal-dual interior point method mentioned earlier:

  ∂(∆T_k)/∂v_j = −h_j · (1 − 2 · p_k² · v_j²) / (v_j² · (1 − p_k² · v_j²)^(3/2)).   (35)

When the ray paths between the source and the receiver are short enough, the Earth's curvature is negligible, which justifies using the flat-layer expression (35) in our computation of the partial derivatives of T.
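Equations (32)-(35) can be sketched numerically; the layer thicknesses, velocities, and ray parameter below are hypothetical, and the derivative is taken at a fixed ray parameter:

```python
import numpy as np

# Delay travel time through flat layers: the ray parameter p = sin(i_j)/v_j
# is constant along the ray, and each layer of thickness h_j contributes
# h_j / (v_j * cos(i_j)) to the travel time.
h = np.array([2.0, 5.0, 10.0])        # layer thicknesses, km (hypothetical)
v = np.array([3.0, 4.5, 6.0])         # layer velocities, km/s (hypothetical)
p = 0.1                               # ray parameter, s/km: sin(i_j) = p * v_j

cos_i = np.sqrt(1.0 - (p * v) ** 2)
T = np.sum(h / (v * cos_i))

# Partial derivatives dT/dv_j at fixed ray parameter, used in the Jacobian.
dT_dv = -h * (1.0 - 2.0 * (p * v) ** 2) / (v**2 * (1.0 - (p * v) ** 2) ** 1.5)

print(round(float(T), 3))   # -> 4.026
```

The finite-difference check below confirms that the closed-form derivative matches equation (34) term by term.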

Results and Discussion
Our joint inversion results show that, when the multiple geophysical datasets are compatible, they provide better estimates of the target model; this is confirmed by numerical experiments with the datasets and the synthetic rift model. In order to illustrate how receiver functions and surface wave dispersion velocities complement each other, we present in figures 6-8 the inversion results for the data sets created from the rift velocity model.
In figure 6, the joint inversion of the RF and TT data sets gives a better approximation to the target model, as expected based on the compatibility between the two datasets. The single inversion of receiver functions (top left) identifies the velocity interfaces (fig 6), while the single inversion of surface waves (fig 7) gives information on the average velocities at different depths. In figure 7, the joint inversion of the SW and TT data sets also provides a better approximation of the target model, indicated in red.
As shown in figure 8, the joint inversion of all three geophysical datasets (RF, SW, and TT) provides the most accurate results.
A synthetic rift velocity model that we developed was used as the initial test for the MOP joint inversion scheme: we compared the inversion results with the rift velocity model and with the initial velocity test model. Several different synthetic Earth velocity models were tried, and overall, the rift model proved to be the best velocity model for this joint inversion scheme. Numerous tests were performed to assess the compatibility and complementary nature of the multiple geophysical datasets, with the results shown in figures 9-11. The algorithm used to perform the joint inversion of the multiple geophysical datasets using the multi-objective joint inversion scheme was written in FORTRAN 77 and coupled with a C code that performs the Multi-Objective Optimization method, based on the work of [25].

Conclusion
We apply an inversion scheme that expands a joint-inversion constrained least-squares (LSQ) algorithm used to characterize a one-dimensional Earth structure. We utilize the Multi-Objective Optimization technique to perform joint inversion of multiple data sets (receiver functions, surface wave dispersion, and travel times) to develop 3-D shear wave models like the one in figure 12 (e.g., Thompson et al., submitted for publication). By jointly inverting these three geophysical data sets, we improve the model's accuracy. In the ideal situation, when we know the relative accuracy of the different datasets, we can formulate the joint inversion problem as a (single) least-squares optimization problem. In practice, however, we only know approximate values of these accuracies; so, for inversion, we use the Multi-Objective Optimization Problem (MOP) approach. Different combinations of weights were incorporated in the MOP inversion scheme in order to map the Pareto set (solution space) corresponding to receiver functions, surface wave dispersion measurements, and travel times. From the Pareto set, the MOP technique uses a direct search method that selects the most feasible solution from a set of alternative solutions from the model space (Sambridge 1999a, Sambridge 1999b, Kozlovskaya 2000). For future work, we plan to incorporate gravity into our inversion scheme to obtain a more constrained Earth structure and to better determine the physical properties of the Earth.