A priori estimates of attraction basins for velocity model reconstruction by time-harmonic Full Waveform Inversion and Data Space Reflectivity formulation

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. A priori estimates of attraction basins for velocity model reconstruction by time-harmonic Full Waveform Inversion and Data Space Reflectivity formulation Hélène Barucq, Henri Calandra, Guy Chavent, Florian Faucher


Introduction
The Full Waveform Inversion (FWI) approach to seismic inversion consists in minimizing the misfit between the recorded and synthetic data with respect to the Earth parameters. It uses the full waveform of the recorded waves as data, which makes it free of horizon picking, and hence a good candidate for automatic inversion. It was first introduced by Bamberger et al. [5,6] for one-dimensional model, and followed, in two dimensions, by Lailly [37] and Tarantola [61]. In particular, [37] made the link between the gradient of the misfit functional, computed by the adjoint state method, and the well-known migration imaging operator used by geophysicists. Early numerical experimentations in [62,32] already show the strength and the weakness of FWI: it can produce very clear and detailed images of the sub-subsurface (strength), under the imperative condition that a good background velocity model is available (weakness). In fact, the misfit functional with respect to background parameters shows local minima due to phase shift/cycle skipping, which hamper the reconstruction by local algorithms, unless the minimization starts inside the attraction basin. FWI is extended to elastic waves in [43], with implementation in [26], to resolve the short wavelengths of P-and S-impedance. Initially carried out in the time-domain, FWI can also be used in the frequency domain, as promoted by Pratt and collaborators in [51,48,50]. The use of complex frequencies was later introduced by [53,54], giving rise to the so-called Laplace and Laplace-Fourier domain FWI.
In the recent years, with the increase of the computational power, the solution of the FWI approach to seismic imaging by local optimization algorithms has been extensively developed. However, this approach does not solve the inherent problem of local minima: the determination of the full velocity model, including the low spatial frequency background, requires data with very low frequency (as unrealistic as 1 Hz), coupled with a resolution from low to high frequency to avoid 'cycle-skipping', as for example in [16,59] (similar behaviours are studied in electromagnetism in [7,8]). One could also resort to random optimization techniques e.g., [33], which have the ability to find the global minimum even in presence of many parasitic local minima. But they require a very large number of misfit evaluations for the determination of a relatively small number of parameters: they are not adapted for large number of unknowns (in our experiments, we consider several hundred thousands of parameters). Different misfit functionals have been considered to alleviate the phase shift/cycle skipping problem, such as the logarithmic norm, explored by [62] and [57,58]. [13] uses a criterion based on the envelope of the signal; [41,52,70] are based upon an optimal transport distance. Then, Newton type algorithms provide a natural framework to solve the nonlinear minimization problem in FWI, cf. [49]. However, they require the computation and inversion of the Hessian, which may be overwhelming due to the numerical cost. Thus, some algorithms simply account for the diagonal of the Gauss-Newton Hessian, e.g., [55]; multi-scale Newton methods are studied in [1] for the same reason. Conjugate gradient method with Hessian-vector computation is implemented in [42]. The estimation of the step length is another challenge for the iterative minimization problem and is often conducted using the line search method (e.g., [44]). Alternatives for a more precise step exist, such as the trust region method [29], the Maximum Projected Curvature method [19]. For Landweber iteration scheme, this step is further given in [27]. The choice of search direction and line search method naturally influences the performance of FWI algorithm, comparisons are given in [9] in the context of inverse scattering. Yet, none of these local optimization methods can avoid local minima.
In order to overcome the presence of local minima when solving the FWI problem by local optimization algorithms, alternative parametrizations have been proposed early. They are based on a decomposition of the Earth parameter into separate background and reflectivity parameters, and result in an increase of the computational complexity. The Differential Semblance Optimization (DSO) of Symes [60] extends the depth reflectivity model to account for the various illuminations in the data, defining a semblance criterion whose minimization produces the desired background model [60]; The Migration Based Travel Time (MBTT) approach [21,24,22] parameterizes the depth reflectivity by a data-space reflectivity, which is used to store the reflectivity during the update of the background. Note that the so-called Reflection FWI [69] retains from MBTT the migration-demigration process, but differs in that it does not use the data-space reflectivity concept, and uses as search direction the gradient of the classical FWI data misfit.
In this paper, we investigate the optimizability of the FWI problem associated with the time-frequency domain wave equation, and compare it with that of the MBTT reformulation (in terms of background and data-space reflectivity). The objective is to quantify the size of "attraction basins" and "tolerable level errors" which ensure that a local algorithm with an initial guess inside the basin will converge to the global minimum, provided the error on the data is less than the tolerable error. By construction, these quantities depend solely on the (parameter-to-synthetic) forward map, but not on the data, and can be computed without any minimization of the data misfit. Section 2 presents the time-harmonic wave equation to be inverted, together with the FWI minimization problem and its MBTT reformulation employed for the model parameters reconstruction. Section 3 presents the required geometrical tools of [19] for optimizability study. Sufficient and necessary optimizability conditions are given, and estimates of the size of the attraction basin in the parameter space, and of the tolerable error level in the data space are derived. Section 4 analyses the influence of (complex) frequency and search geometry on the size of attraction basins and tolerable error levels for the classical FWI formulation. In Section 5, we compare the optimizability of classical FWI and its MBTT reformulation. We expect that the migration-demigration included in MBTT eliminates phase shift/cycle skipping problems, leading to larger attraction basins for the minimum with respect to background, and investigate the effectiveness of this approach. In Section 6 we analyze the optimizability of least squares minimization algorithm for other context (elasticity, boundary conditions). In Appendix A, we review the gradient computation in the frequency domain, emphasizing the specificity of complex valued fields for the adjoint state method. The MBTT decomposition is specified in Appendix B. Appendix C provides numerical experiments of acoustic FWI to validate our estimates.

Time-harmonic inverse wave problem
We formulate the seismic inverse problem associated with the wave equation, for the recovery of acoustic or elastic materials from reflection data. We consider a bounded domain Ω of R 2 which represents the region of interest in the subsurface (note that we restrict ourselves to two dimensions for the numerical experiments but the analysis holds similarly in three dimensions). The boundaries are denoted ∂Ω = Γ 1 ∪ Γ 2 , where we distinguish the upper free surface (Γ 1 ) from the numerical boundary (Γ 2 ). We assume that partial data are acquired on a portion Σ of the domain, which possibly coincides with a part of the boundary. Figure 1 illustrates the configuration.

Time-harmonic wave equations
The propagation of waves is defined by the displacement field u, solution to the following partial differential equation, We have introduced the (possibly complex) angular frequency ω, the medium density ρ and the (Cauchy) stress tensor of order two σ. The source is represented by g. We disregard boundary conditions for now. The stress tensor encompasses the different material properties and reduces, for example, when considering isotropy (see [64,65]).
In the case of acoustic propagation (i.e. propagation in fluid), the wave equation is established for the scalar pressure field p instead. It is based upon the Euler equations (see, for example, [25,35]). If we further assume a homogeneous density, then the wave equation coincides with the Helmholtz equation (see, e.g., [31]), and the pressure field p is solution to where c(x) is the velocity (wave speed) and h the scalar source. The upper part of the boundary, Γ 1 (see Figure 1), corresponds with a physical free surface, i.e. the interface between the air and the medium, the field satisfies Due to the numerical truncation of the real domain (the Earth), appropriate conditions are imposed on Γ 2 (see Figure 1(b)). One possibility is the consideration of absorbing boundary condition (see [30]) to ensure that waves that reach Γ 2 are not reflected back to the domain, it is given by where ν indicates the normal direction. As an alternative, one can consider instead Perfectly Matched Layers (PML, see [10]) on the sides and bottom boundary (see Figure 1(a)) where the derivatives become and analogously for the other direction, on Ω Γz . In our implementation, the damping function σ is defined following the work of [66,68]. The inverse problem aims the recovery of the physical parameter c(x) in (2), using observations of the wave phenomenon p(x) at some location. In the following, we mainly work with this acoustic inverse problem but we also offer some analysis for the isotropic elastic situation in Subsection 6.1 and for other boundary conditions in Subsection 6.2. Note that the methodology developed below is not restricted to inverse wave problem or geophysical setup, and can be applied in any context involving least-squares minimization schemes.

Model and data spaces
We denote by m ∈ M the subsurface model parameter to be recovered, here m = c −2 in (2), and by M ⊂ E the admissible parameter set, which is a subset of the parameter space E. Here, E = R n where n indicates the number of unknowns in the representation of m. We assume that M is a closed convex subset of E equipped with the L 2 norm . E : where m i is the i th component of m.
Remark 1 (Model representation). For the computational resolution of the inverse problem in the discrete settings, it is natural to consider the parameter to be recovered with a piecewise constant representation, such that Here, m i , for i = 1 to n, are scalar coefficients (which represent the model) and χ(D i ) represents the characteristic function on the domain region D i . The subdomains D i (for i = 1 to n) are nonoverlapping and verify Inria Therefore the model is represented via n coefficients, and n is the number of unknowns in the FWI inverse problem. Practically, the representation usually corresponds with the space discretization employed for the resolution of the wave equation: one coefficient per cell for discretization based upon mesh (e.g., Finite Element, Discontinuous Galerkin) or one value per node for Finite Differences discretization. Moreover, the piecewise constant model representation (7) is also fundamental to show properties of stability, see [3,11] and Remark 4.
The data are recorded by the receivers on a portion Σ of the domain Ω, they contain both direct and reflected energy. Because one aims to image the deep structures of the earth, the direct arrivals need to be eliminated from the recorded data before they can be used in the inversion procedure. This is automatically done when a 'linearized' model is used, but in general, for the full Helmholtz problem (2), it requires subtracting the direct arrivals. So we shall denote by d the data deprived from their first arrivals, and by D the associated data space. In the time-harmonic case, d is a vector of q complex numbers and D = C q , with q = n f requency × n source × n receiver . The data space D is equipped with the norm . D , where d i is the i th component of d and denotes the complex conjugate.

Inversion via classical FWI
The essence of FWI ( [61,51,67]) is to try to reconstruct the subsurface image by minimizing a misfit functional defined as the difference between the observations and simulations, starting from an initial model. In order to focus on reconstruction of deeper structures, we have to remove the direct arrivals from the forward model. Let us denote by p s,ω the solutions of (2) for m and m s , where m s is a smooth version of m. For the source h at frequency ω, we can define the forward operator at frequency ω for a source h by: where R denotes the restriction operator to the discrete receivers location. When computing derivatives of F, it will be necessary to remember that m s depends also on m. FWI amounts to solve the non linear least squares minimization problem: where the data d (h) ω are made of the observations at frequency ω for a source h, deprived from first arrivals. For sake of conciseness, we use the vector notation and write the FWI minimization problem as: As mentioned in the introduction, different norms can be used to build the objective function, and the minimization is usually performed by a Quasi-Newton algorithm, which requires only the gradient of the cost function: where DF stands for the Fréchet derivative of F and * is the adjoint. This gradient can be efficiently computed by the adjoint method, which does not require the formation of the Jacobian matrix. Appendix A describes a careful adaptation of this method to the case of complex variables (contrarily to the time-domain formulation, the data and the wavefields are, in the harmonic formulation, complex).
It is well known that the determination of the background (low spatial frequencies) of m by (13) is hampered by the many local minima of J , caused by phase shifts in the synthetics (see Figure 18). This can be overcome only if the data contain extremely low frequencies. Appendix C shows a few reconstructions of velocity models via FWI, which illustrate the difficulties inherent to this approach. These difficulties are a motivation for alternative techniques such as the MBTT reformulation of FWI below, and for the optimizability study developed in this paper.

Inversion via MBTT-FWI (background/data-space-reflectiv ity decomposition)
In the MBTT (Migration-Based Traveltime) approach, see [24], the Earth model m is parameterized by a smooth background p and a data-space reflectivity s using a migration operator: where r is the depth reflectivity associated to s and p; W is a scaling operator (which possibly depends on the frequency) and * denotes the adjoint. The weight W is meant to compensate for the lower amplitude of deep migrated events, see [46]. In our experiments we use a simple scaling proportional to the square root of the depth. where (x, y, z) are the 3D space coordinates with z the depth, and β is a scalar coefficient. We refer to Appendix B for more details regarding the computational aspects of the decomposition. Note that the solution of the linearized version (Born approximation) of the FWI problem (13) is of the form (15), in which case parameterization (15) is not underparameterizing.
Hence, when the full model (2) is used, the parameterization by the data space reflectivity s will be able to generate all primary events of the data, but maybe not all multiple events (i.e., it will miss the events associated to multiple reflections involving at least one reflector which generates no primary reflection), cf. [20].
When this decomposition is employed, the natural choice for the smooth version m s of m in (10) (to account for the direct arrivals in the forward map F) is simply m s = p. With this change of parameter, the forward map given by (10) (12) rewrites: with F given by (10) and m by (15).
By construction, F contains no direct arrivals, which implies that, for a smooth enough background p, F satisfies: The motivation for this parameterization is to eliminate phase shifts induced in the synthetics by changes in the background p: the events in the synthetic section F(p, s) are obtained from the dataspace reflectivity s by migration followed by simulation with the same kinematic, and hence are expected to have the same phase as those of s, as illustrated in Figure 18, Section 5.1. Besides controlling the phase, this migration-demigration process has the additional property that the stack involved in the migration turns, for a fixed data space reflectivity, the data misfit into a coherency measure for the current background [20]. The MBTT-FWI minimization problem is then: where M s ⊂ M is the set of admissible smooth backgrounds and D is the data space. The iterative reconstruction follows alternative updates of the two components of the model: 1. Start the reconstruction with the recovery of the reflectivity part s, from the lowest available frequency.
2. Using the updated reflectivity, which we fix, we can iterate to reconstruct the background part of the model, p.
3. Continue to perform the alternative updates, in a succession of increasing frequency.
This approach has been shown successful in [22], for the inversion of synthetic Marmousi data starting from a lower frequency of 5 Hz. Hence another motivation for the optimizability study of this paper is to quantify how far the MBTT reformulation of FWI succeeds in overcoming in step 2 the local minima problem inherent to classical FWI.

Optimizability and Attraction Basins
In this section, optimizability conditions are defined for general least-squares minimization problems, such as (13) and (18). By optimizability, we refer to the possibility for a local (deterministic) optimization algorithm to converge to a global minimum, i.e. there is no local minimum to hamper the search of the solution. This analysis follows the work of [19]. In addition to the notations of Subsection 2.2, we shall refer to F(M) as the attainable set of the least squares problem.
where m 1 , m 2 are two models of M.
The following set of hypotheses is required to derive the optimizability conditions (cf. [19]): -the forward map F : M → D is twice differentiable along segments of M, where D t stands for the derivative with respect to t.
Note that F is indeed twice differentiable because our study is conducted in the finite (discrete) dimensional setting.
Definition 2 (Velocity and acceleration). P is twice differentiable and we denote by V and A (velocity and acceleration along P ) its two first derivatives: For simplicity, we shall consider only paths for which V (t) = 0 for all t, so we can define the unit tangent velocity v, and the normal acceleration a by where ·, · is the real inner product in D = C q : Let us point out that, due to the limited accuracy of the recording devices, model error and noise, the observed data d do not belong in general to the attainable set F(M). Therefore, it is important that the least squares misfit function does not have parasitic local minima for data d which are "not too far" from the attainable set. This property is made precise by the following definition. -convergence: if d ∈ V, any minimizing sequence d n ∈ F(M) of the distance to d is a Cauchy sequence for both the norm · D and the arc length distance (P ) along the path P defined by (19). Hence d n converges in F to the unique projection d † of d onto F(M).
Therefore, an optimizable least squares problem (13) exhibits rightful dispositions for its resolution by a local gradient algorithm: uniqueness of the projection and absence of local minimum ensure that the algorithm will converge to a global minimizer whatever the initial guess in its basin of attraction M.

Global Radius of Curvature and Deflection
Following [19, pp. 167-172 and 300-308], we define the global radius of curvature and the deflection along a path P , and further give in Propositions 1 and 2 a characterization and a sufficient condition of optimizability.
Definition 4 (Radius of curvature). The (possibly infinite) radius of curvature R(t) of a path P at t is given by: 1 The radius of curvature of the whole path P is then defined as: It is straightforward to see that Definition 5 (Global radius of curvature). The (possibly infinite) global radius of curvature R G (t, t ) of a path P at t seen from t , with t = t , is given by: where v(t), v(t ) are the normalized velocities along P defined by (21), and The global radius of curvature of the path P is then defined as: The interest of global radius of curvature comes from the following proposition.
The least squares problem (13) is optimizable -or equivalently M is an attraction basin for (13)if and only if it exists R G > 0 such that R G (P ) ≥ R G > 0 for all path P of F(M). The associated neighborhood V is defined by: The proofs can be found in [19]. The global radius of curvature can be computed numerically using (26) and (27), as will be done in following numerical sections. It can also be estimated via the usual radius of curvature depending on the value of the deflection, which we define now, and which is illustrated Figure 2(a).
Definition 6 (Deflection). The deflection between two points t and t of the curve P is the angle between the two velocities V (t) and V (t ) (see Figure 2(a)). It is given by: The deflection Θ(P ) of the curve P is defined as the largest angle Θ(t, t ) ∈ [0, π] between any two tangent vectors V (t) and V (t ) for any two points t and t of [0, 1]. An infinitesimal variation of the deflection dΘ satisfies

Inria
Denoting t 1 and t 2 the values of t for which the deflection is maximum, the deflection Θ(P ) along the curve P satisfies This majoration is sharp, but it is very conservative: equality holds only when P is an arc of circle with constant velocity V (t) : "when the path P turns always in the same direction with a constant radius". Therefore, the estimation of the deflection Θ(P ) of the path using its upper bound in (32) is the worst case estimate.
The relation between global and local radii of curvature is then given by the following proposition. there exists R > 0 such that: for a.e. t ∈ [0, 1] and all paths P , for all paths P .
From Definition 7, a FC/LD problem verifies that, using (33): which shows that FC/LD problems (also referred to as weakly nonlinear inverse problem in [23]) are necessarily optimizable. Notice that Proposition 1 gives a characterization of optimizable problems, whereas Definition 7 provides only a sufficient condition.

Directional Attraction Basins
Numerical application of previous section to check whether or not a given least squares problem is optimizable becomes quickly intractable when the number of parameters n increases, as it is the case in seismic inversion. So we limit ourselves to directional (or one-dimensional ) parameter sets of the form: Here, m 0 is a nominal model, u a normalized perturbation direction, and ∆ gives the size of the domain of investigation. The associated attainable set is constructed from the path P defined by: We refer to directional optimizability when the problem is optimizable for an interval such as (37), and this interval is a directional attraction basin. Directional optimizability is only a necessary condition for optimizability, but it will allow to analyze the behavior of seismic inverse problems and to compare formulations: the size ∆ of a directional attraction basin in a descent direction tells us how far one can move away in this direction without being stopped by parasitic local minima. Our objective now is to determine (see illustration Figure 2): 1. the size ∆ u m0 of the directional attraction basin centered at m 0 . The larger ∆ u m0 is, the better the least squares problem is amenable to minimization by local algorithm, because we allow a larger area for investigation. In our numerical experiments, we shall scale the estimate with the norm of the nominal model, m 0 E , to provide relative (to the model) quantity.
2. the associated tolerable error level R u G,m0 . It is the largest tolerable error on the data d which ensures the absence of parasitic local minima for the least squares objective function over [0, 1]. The larger R u G,m0 is, the better is the robustness of the minimization procedure to noise in the data. In our numerical experiments, we divide the estimates with the norm of the synthetic data d 0 = F(m 0 ) to provide relative (to the data) quantity. Figure 2: A one-dimensional setup for least squares problems. The figure lives in the data space, the attainable set is the path P image of the interval M(m0, u, ∆) of the model space. (a) Illustration of the computation of the deflection Θ(t, t ) between two arbitrary points t and t . (b) The path has a finite curvature and the deflection is smaller than π/2, so the FC/LD Property 7 is satisfied, and RG = R > 0 by Proposition 2. Hence the "distance to d" function cannot have local minimum over P provided the data d is at a distance of the attainable set P = F(M(m0, u, ∆)) smaller than R.
We shall use two types of estimate: • Θ-estimates of ∆ u m0 , where the optimizability over M is obtained by satisfying the sufficient condition Θ(P ) ≤ π/2 of Definition 7. In this case, R G (P ) = R(P ), so the tolerable error level is given by the minimum over the [0, 1] interval of R(t) given by (23).
• R G -estimates of ∆ u m0 , where optimizability is obtained by satisfying the R G > 0 characterization of optimizability of Proposition 1. In this case, the associated tolerable error level R G has to be computed by evaluating numerically the infimum in (28) using (26) and (27).
Remark 2. The size of attraction basins depends only on the choice of the forward map to be inverted, but not on the optimization algorithm used (e.g. Newton method, gradient descent, etc). The choice of the method naturally affects the rate of convergence and the speed at which the final solution is eventually reached, but it has no influence on the presence or absence of local minimum.

Local
We provide here a local Θ-estimate ∆ of the attraction basin, in the sense that it is based only on the velocity V and acceleration A at m 0 in the direction u. In order to ensure that the deflection of the path P defined by (38) is smaller than π/2, we use the upper bound (32) of Θ(P ), which, according to the optimizability condition of definition 7, ensures in turn that M(m 0 , u, ∆) is an attraction basin.
With the notations of [17] for the directional derivative, the chain rule differentiation gives: where u acts as the direction of derivation. Then we use a rectangle approximation in equation (32), which gives the approximate upper bound Θ u m0 to the deflection Θ(P ): Inria This gives immediately a local Θ-estimate of the size ∆ of an attraction basin at m 0 in the direction u: This estimate is an approximate (because of the rectangle approximation of the integral) lower bound (because it is based on the upper bound (32)) to the size of the largest attraction basins at m 0 in the direction u. It is computationally cheap, and we will use it in Section 4 to analyze the effect of frequency and perturbation direction on the optimizability of the FWI least squares problem. The associated tolerable error level R u m0 is then the minimum of the radius of curvature along P = F(M(m 0 , u, ∆)), which is approximated simply by its value at m 0 , that is R(t = 1/2) given by (23): where V (t) and A(t) have been defined in (40).
Remark 3. One may use the directional attraction basins M(m 0 , u, ∆ u m0 ) to construct an approximate multidimensional attraction basin M(m 0 , ∆ m0 ) as follows: suppose that ∆ u m0 ≥ ∆ m0 ≥ 0 for all u for some ∆ m0 , and define: where B denotes a ball in the model space. The size ∆ m0 of an approximate multidimensional attraction basin M(m 0 , ∆ m0 ) follows immediately from the previous results by computing an upper bound Θ m0 to Θ u m0 using (41): where λ DF min denotes the lowest singular value of DF(m 0 ), and we use the Frobenius norm for the matrix norm of the numerator. Estimation of ∆ m0 follows: Note that M(m 0 , ∆ m0 ) is only an approximate attraction basin, because condition R G > 0 (Proposition 1) or Θ ≤ π/2 (Definition 7) holds only for segments passing through its center m 0 . Then, (43) provides a lower bound R m0 independent of u to the radius of curvature of M(m 0 , ∆): 3.4 Exact Θ-and R G -estimates of ∆ u m 0 and associated tolerable error levels The determination of the exact Θand R G -estimates of the attraction basin centered at m 0 in a direction u inside an interval M(m 0 , u, ∆) of given size ∆ requires the numerical computation of the deflection Θ(t, t ) and the global radius of curvature R G (t, t ) between any two points of the path P , which is the image by F of the investigated interval M(m 0 , u, ∆). For this purpose,deflection maps and global radius maps are computed, and they display the values of Θ(t, t ) (Definition 6) and of R G (t, t ) (Definition 5) between the points of M(m 0 , u, ∆). On the diagonal of the maps, where t = t , R G is not defined by (26) (27), and we indicate instead the values of R(t) given by (21) (23), which represent the limits of R G (t, t ) when t → t. One can then read on these maps the exact Θ-estimate (resp. the exact R G -estimate) of ∆ u m0 by searching for the largest During this process, when ∆ increases from 0 to its exact Θ-estimate, the associated exact tolerable error R u m0 = inf −∆≤t≤∆ R(t) decreases from its value R 0 at m 0 to the tolerable error R u m0 of the Θattraction basin. When ∆ increases further to its exact R G -estimate, the tolerable error is R G = inf −∆≤t,t ≤∆ R G (t, t ), which continues to decrease, until it reaches the value 0 of the R G -attraction basin.
These computationally intensive exact estimates will be used in Section 4 on the analysis of FWI, for comparison purpose with the local Θ-estimates. They will also be used in Section 5 to compare more precisely the optimizability properties of classical FWI and MBTT-FWI.

FWI: Numerical determination of convergence basins and tolerable error levels
Our objective in this section is to give a comprehensive understanding of the behavior of the seismic FWI problem (13) with respect to parameters such as the frequency, the geometry of the target, the parametrization, etc. Therefore, we select a nominal model m 0 and compute, according to Subsection 3.2, the size ∆ u m0 of an attraction basin and the associated tolerable error level R u m0 for different (normalized) directions u. When m 0 is smooth, the direction u can be seen as the search direction, or the geometry of the unknown parameter to be reconstructed in the framework of inversion.

Influence of frequency ω and search direction u
We consider the two-dimensional geophysical setup of Figure 1 for the acoustic Helmholtz equation (2), and select the squared slowness m := c −2 as unknown parameter, which is expressed in s 2 m −2 . The forward map F ω at frequency ω for parameter m is a vector of C (nrcv×nsrc) with n rcv the number of receivers per source and n src the number of sources. In this experiment, we take 19 sources and 183 receivers associated with each source. Both are located near the surface (accordingly to Figure 1), note that the receivers remain in the same position for all sources. Thus, we work with reflection data obtained from a one side (the surface) illumination. The nominal model m 0 is the smooth 9.2 × 3 km background velocity pictured in Figure 3, where the velocity varies linearly with depth, according to the fact that no information is known on the structures to reconstruct.
We investigate the behavior of attraction basins for different directions u, which are chosen to be based on straight reflectors, or extracted from the Marmousi model (probably the most popular velocity model in geophysics, synthetically designed by the Institut Français du Pétrole in 1988). They are shown in Figure 4: u 1 has a single, slim reflector ( Figure 4(a)), u 2 has two slim reflectors ( Figure 4(b)), u 3 has a single large reflector (Figure 4(c)) and u m is obtained by high-pass filtering from the Marmousi model of Figure 4(d). The amplitude for the directions are selected such that u k is normalized: Given a perturbation size ∆, one can think of m 1 = m 0 +∆u as being the target velocity, which one tries to retrieve starting from the smooth model m 0 by minimization of the FWI objective function in direction u by a local algorithm (it is analogous to the update formula in Newton-type algorithm). The question is: how large can ∆ be while still guaranteeing that the algorithm will not stop before the global minimum, provided that the error in the data is smaller than some R or R G ?
The determination of ∆ u m0 with the local Θ-estimate, (42), and R u m0 with (43), requires the computation of the first and second order directional derivatives in the direction u, and we refer to Appendix A.3 for the computational details. Figure 5 shows the evolution of ∆ u m0 (top) and R u m0 (bottom) for frequencies between 0.5 to 15 Hz, for the four directions u of Figure 4. We observe the following.
-In all cases, the size ∆ u m0 of the directional attraction basin decreases with increasing frequency, meaning that the largest basin of attraction is obtained for the lowest frequencies. This is the    expected behaviour that indicates that lower frequencies give a better chance for the iterative minimization to converge, especially when no prior information is known for the initial model.
-Concerning the effect of the perturbation direction on attraction basins, Figure 5(a) shows that their size decreases in the order u 1 → u 2 → u m → u 3 . This can be understood as follows: Directions u 1 , u 2 and u m are essentially reflective, and contain no low frequency components. Hence increasing ∆ will not change the phase of backscattered data, but because of multiple reflections some nonlinearity will occur, which limits the size of the attraction basins. The directions u 1 and u 2 with the less multiple reflections have the largest basins of attraction. Then comes u m which has many reflectors and hence many multiples. On the contrary, direction u 3 has a strong low frequency component besides its reflectivity. Hence increasing ∆ will generate a phase shift in the backscattered data, which is likely to create local minima more quickly, and one expects the size ∆ u m0 of the attraction basin in the direction u 3 to be much smaller than in the three other directions.
-Concerning the largest tolerable error, Figure 5(b) shows that R u m0 increases with the frequency, meaning that the low frequencies are more likely to be affected in the presence of noise. We notice the exception of u 3 for which the allowed error remains relatively similar (and worse than the others) with frequency.
These results clearly show that it is harder to converge towards salt dome than it is to recover    Figure 4(a)), the green circles ( ) two reflectors (u2, Figure 4(b)) the red stars ( ) one large reflector (u3, Figure 4(c)) and the black squares ( ) the reflectors extracted from the Marmousi model (um, Figure 4(d)).
Marmousi-like structures. Because of its low spatial frequency component, the former reduces the size of the attraction basin and is less tolerant to noise in the data. This situation is further illustrated in the context of model reconstruction in Appendix C. Note that the order of magnitude in Figure 5 are around 10 −1 of the norms of the model (top) and data (bottom).
Exact Θand R G -estimates of ∆ u m0 and R u m0 As an alternative to confirm these preliminary observations, we compute ∆ u m0 using the exact Θestimate and R G -estimate. It requires the computation of deflection and global radius maps on the investigated interval, from which the estimate is extracted, cf. Subsection 3.4. In Figures 6 and 7, we show the resulting maps in the direction of the Marmousi reflectors u m of Figure 4(d) and of the single large reflector u 3 of Figure 4(c), using frequencies 4 and 7 Hz. In Table 1, we show the numerical values of the interval size centered around m 0 , depending on the estimation.
The maps confirm the results of the previous experiment: the exact Θ-estimates of ∆ u m0 (top of the figures) are of the same order of magnitude, but slightly larger than the local Θ-estimates of Figure 5, cf. Table 1. Similarly, the exact R G -estimates of ∆ u m0 ( bottom of the figures), are larger (Table 1), but at the price of a smaller tolerable error on the data. We also note a much more consistent pattern in the deflection for u m , compared to u 3 . Comparing the formulas, we observe that the inexpensive local Θ-estimates already provide accurate values for classical FWI problems, close to the more refined (and computationally more intensive) exact Θ-and R G -estimates. Therefore, we should only use the local Θ-estimate in the following of this section.
(a) Using the Marmousi direction um and frequency 4Hz.
Using the Marmousi direction um and frequency 7Hz.   Table 1: Size ∆ of attraction basins centered at m0 using directions um and u3 (see Figure 4) and corresponding maximal tolerable error RG, at 4 and 7 Hz. By construction, the RG-estimates correspond to the limit case of a zero tolerable error, the other values are extracted from Figures 5, 6 and 7.

Influence of the initial model
We now investigate the influence of the initial model m 0 on the size of attraction basins. We consider two initial models: the acoustic Marmousi medium m m and a model m c encompassing objects of high contrasts, given in Figure 8. The local Θ-estimate of ∆ u mj given by (42) is plotted in Figure 9, for the two directions u 1 (one reflector) and u 3 (one large reflector). We see that ∆ u mj is barely affected by the choice of initial model for the computation. We still observe the decrease in the size of the interval with increasing frequency, and the deterioration in the direction of the large reflector u 3 . Clearly, when comparing amplitudes of our estimates, the direction u has much more influence than (a) Using the direction u 3 and frequency 4Hz.
Using the direction u 3 and frequency 7Hz.

Approximate n-dimensional attraction basins
We compare the directional local Θ-estimates of Subsection 4.1, evaluated at the smooth model m 0 of Figure 3 in the Marmousi direction u m of Figure 4(d), with the approximate multidimensional attraction basins centered in m 0 introduced in Remark 3. The determination of the radius ∆ m0 of such a spherical attraction basin by (46) and of the associated level of tolerable error R m0 by (47) requires the computation of the lowest singular value of the Jacobian of F, and the computation of its full second order derivative matrix. In order to reduce the computational burden associated with this computation, we develop a simple model reduction process where the wave speed in adjacent cells is averaged to generate a coarse representation. Note that this approach is also employed to improve the stability of the inverse problem, cf. [11]. When applied to the smooth acoustic model of     Figure 10(a), where the number of coefficients has been reduced to 56. Using this reduced representation, we can compute all quantities to effectively obtain the lower bounds ∆ m0 and R m0 (i.e., we compute the matrix D 2 F and the singular values of DF). Figure 10(b) shows an important reduction of the size ∆ m0 of the n-dimensional basin compared to the directional estimates in the Marmousi direction u m using the local Θ-estimate (six orders of magnitude). Yet, it confirms the main property, which is the decrease of the attraction basin size with increasing frequencies. A similar observation holds for the tolerable data error level R m0 , see Figure 10(c), which suffers an important reduction compared to the directional estimates. In addition, we observe a different pattern between R m0 , which decreases with frequency, and the directional estimates R u m0 , which increases with frequency. As illustrated in Figure 5, R u m0 depends more on the direction than ∆ u m0 , and it is reflected on the n-dimensional estimates, which takes the worst case.    From this first set of experiments, we clearly see that the frequency should evolve from low to high values in order to facilitate the convergence. When no prior information is known for the initial model, one should take advantage of the larger interval size ∆ u m0 given by low frequencies.
Here we motivate the well-known frequency progression (e.g., [16,59]) from quantitative estimates of the size of the basin of attraction, which allows the quantification of the benefits. However, the low frequencies require more accuracy regarding the data, as illustrated with the estimates of R u m0 . In addition, we mention the fact that seismic data are particularly affected by noise at low frequencies, which are sometimes unusable. This highlights where resides the difficulty: the low frequencies that are required to converge may be unusable due to the noise, or the inaccuracy may violate the curvature condition.

Frequency bandwidth data
We have seen that the sequential progression in frequency must be conducted from low to high regime. We now investigate the case of frequency bandwidth: when the forward problem encompasses not only one, but several frequencies. This approach is particularly natural for the time domain inverse problem where several frequencies are simultaneously contained in the data. The forward problem is now a vector of C (nrcv×nsrc×nω) , where n ω is the number of frequencies taken in the subgroup.
We remain with the smooth background of Figure 3 for m 0 . We select groups of ten consecutive frequencies. The initial group encompasses the frequencies from 0.1 to 1 Hz using 0.1 Hz increment. Similarly, we design fifteen groups so that the last one is from 14.1 to 15 Hz, employing identical increment. The local Θ-estimate of the attraction basin size is shown Figure 11, together with the largest tolerable error R u m0 . We investigate two directions: the Marmousi one of Figure 4(d) and the large reflector of Figure 4(c). We picture curves and histograms, that correspond to sequential or group of frequencies respectively. Numerical values for the attraction basis are prescribed Table 2.   Figure 3. The histograms represent the estimations for groups of ten frequencies and cover the respective interval of frequency. The lines represent the estimations using sequential frequency. The different directions u are either extracted from the Marmousi model (um, Figure 4(d)) or from one large reflector (u3, Figure 4(c)). Figures 11(a), 11(b) and Table 2 highlight that -the size of the attraction basin when using group of frequencies is dictated by the largest frequency in the group. Therefore, it is better to use the lowest frequency sequentially in order to have the largest size possible. For instance, using the single frequency 0.1 Hz provides a size of interval one order of magnitude larger than when including frequency up to 1 Hz.

The comparison provided in
-At a fixed frequency, taking into account lower frequencies does not affect the size of the interval.
This preference on sequential over multiple frequency is actually advocated in [15], motivated, as the frequency progression in [16], by the intuition that it reduces the cycle-skipping effect. Here Inria sequential frequency ∆ um m0 / m 0 frequency group ∆ um m0 / m 0 0.   we justify this choice by a quantitative estimation of the size of the attraction basin. Moreover, we show that once the lowest frequencies are taken individually, incorporating them in higher frequency content does not modify the size of the attraction basin.
Regarding the largest tolerable error on data, Figure 11(b) shows that adding frequency increases the robustness in the direction u m , except for low frequency. Then, for the direction of the large contrasting object, u 3 , we do not observe any improvement, and the allowed distance remains relatively stable, as it was observed in Figure 5.

Selection of complex frequencies
The use of complex frequencies (also referred to as the "Laplace-Fourier" domain) has shown advantageous behavior when used during reconstruction algorithm, as mentioned in the introduction. In this case, the angular frequency ω in (2) writes as where f is the 'Fourier' component of the frequency (in Hz) and σ a damping factor, representative of the Laplace component. Here, we are interested in deciding how to select the progression of those coefficients to obtain the best convergence properties. Applications have been carried out with a progression that appears mostly intuitive in literature. In the case of zero-Fourier frequency (f = 0), it is proposed from low to high damping coefficients, [53]. However, for the complex frequency progression (f = 0 and σ = 0), the evolution of the damping parameter at fixed f is proposed to be from high to low damping, see [56,45]. We compute the estimation of ∆ u m0 (using local Θ-estimate) and R u m0 with complex frequencies to see how it affects the basin of attraction and robustness to noise. We consider m 0 the smooth background velocity and the similar setup as employed in Subsection 4.1. The estimates are shown in Figures 12, 13 and 14, where we compare the behaviours on a logarithmic scale for the twodimensional complex frequency plane. Here, both f and σ varies from 0 to 15 (excluding the case where both f and σ are zero).
It is clear that the incorporation of a damping coefficient in the frequency is beneficial for the convergence, because it increases the size of the attraction basin of several orders of magnitude. Then, the process of frequency selection, for convergence, is the following: -for a fixed f , the interval size increases with increasing damping, with the exception of the case f = 0, cf. Figures 12(b) and 13(b).
-At fixed damping σ, the interval size decreases with increasing f .
-Following a large to small interval size gives, 1. initial iterations should use f = 0 (largest interval), and a damping coefficient progression from low to high (see Figures 12(b) and 13(b)).
2. Fix f = 0 as small as possible, the damping coefficient should evolve from high to low (see Figures 12(a) and 13(a)); then f increases progressively.
3. Finally, one can take σ = 0 and f evolves from low to high.
Estimates using fixed f .  Regarding the largest tolerable data error R u m0 , we see that -at fixed σ, increasing f allows an increase in the distance between the data and the attainable set.
-The pattern depends strongly on the type of the search direction geometry, as it has already been observed in the previous experiment (and contrary to ∆ u m0 ). The distance is allowed to increase with σ for the direction u m , but decreases in the direction of the strong reflection, u 3 . Also in the direction of the Marmousi reflectors, Figure 14(a), we observe a band where the allowed distance decreases, around σ = 1. On the other hand, for both directions, the case where σ = 0 appears the most robust (i.e. where the allowed distance between the data and the attainable set is maximal).  -Regarding the magnitude, we clearly see that the high contrast object has globally much less allowance, except when the damping is close to zero.
It is important to notice that the use of complex frequency, especially when f = 0, improves the convergence by increasing notably the size of the attraction basin. Therefore, it provides a fundamental benefit for the reconstruction, in particular when no information is initially known. Note that it basically acts as a very low frequency (note that the magnitude of the estimates for complex frequencies eventually matches the very low frequency ones; we can see a continuation in the magnitudes). On the other hand, such frequencies are more sensitive to noise, and require accuracy in the data. We refer to [53,56,45,31] for illustrations of reconstruction.

Remark 4 (Stability and frequency progression).
We have shown that the reconstruction procedure requires low frequencies to ensure a larger size for the basin of attraction, hence allowing the better chance of convergence despite the possible lack of prior information. However, higher frequencies are still required for reasons of stability, and justify the traditional progression in frequency along with the minimization iterations. Stability of inverse problem has been studied in particular with the Calderón problem, where the data are the Dirichlet-to-Neumann map. Assumptions are commonly introduced to obtain a conditional Lipschitz-type stability results, see, e.g., [2,40,3,12] and the references therein. For instance, the piecewise constant model representation introduced Remark 1 is a necessary ingredient, e.g., [3,11]. Lipschitz-type stability is obtained when, for two models m 0 and m † , it holds that where C represents the stability constant. The iterative procedure for the reconstruction minimizes the difference between the forward map and the stability result ensures the accuracy of the reconstruction (depending on the values of C).
In [11], numerical estimates of the stability constant are obtained in the context of the geophysical inverse problem. It illustrates, for example, the influence of the number of coefficients n (in the model representation (7)) for stability. When the stability constant C is small, it indicates that the correspondence in the data guarantees the accuracy of the reconstruction. Here, we estimate the stability constant (following the methodology of [11]) to see its dependency with complex frequency, in order to compare with the convergence estimates of Figures 12 and 13. We use m 0 the smooth background of Figure 3 and select m † to be the Marmousi model of Figure 8(a). The resulting stability constant estimates are pictured Figure 15. We observe that -the stability constant estimates decrease (i.e. the expected accuracy of the reconstruction is improved) with increasing frequency f .
-Incorporating damping with frequency deteriorates the stability constant (the numerical estimates increases). More precisely, low frequencies and damping (f and σ respectively) give the worst result in terms of stability. Therefore, the estimates for convergence (Figures 12 and 13) and stability ( Figure 15) behave oppositely. The convergence improves at low frequency and with damping meanwhile the stability deteriorates. Higher frequency reduces the convergence radius but ameliorates the stability (by reducing the wavelength, such frequencies allow to capture more details). Both aspects of the problem have to be taken into account during the reconstruction; it designs the well-known multi-frequency algorithm for the computation, starting from low frequency (to foster the convergence) towards high frequency (to foster stability) (see, e.g., Algorithm 1).

Optimizability: MBTT versus classical FWI
In this section, we apply the optimizability analysis tools of Section 3 to compare the MBTT formulation of FWI (18) (where the unknown model is parametrized by a smooth background p and a data space reflectivity s), with the original FWI problem (13) (where the unknown model is the squared slowness m). The MBTT approach was originally designed to overcome the phase shift problem which hampers classical FWI, so the optimizability study of this section is expected to quantify the gainif any -of the MBTT formulation over the classical FWI in this respect.

Choice of the model
For a fair comparison, the two approaches have to be applied to the same model, so we construct a nominal model m 0 whose MBTT decomposition, p 0 , s 0 , is known exactly, i.e. which satisfies m 0 = m(p 0 , s 0 ) according to (15). We have decided to construct a model m 0 inspired by the Marmousi model m m of Figure 8(a). This is achieved by defining p 0 and s 0 by: Inria where m m is the Marmousi model of Figure 8(a) and m 0 is the smooth background of Figure 3. Then, m 0 is defined by: where r 0 (ω) (respectively r 0 ) is the depth reflectivity associated to frequency ω 1 (respectively to the sum of all frequencies). Definition (52) guaranties that: We choose the weight W proportional to the square root of depth, as proposed in Section 2, with a proportionality coefficient such that the model reflectivity level β at frequency ω defined by takes the value 1% at all frequencies.
When β is small, the forward modeling part of the MBTT approach becomes close to the Born approximation.
In Figure 16, we illustrate the resulting models r 0 (ω) for three frequencies: 2, 4 and 7 Hz. We also show the model r 0 where the frequency sum contains frequencies between 0.5 to 15 Hz, with 0.5 Hz increment. We observe that the reflectivity, defined from the difference between observations and simulations using a smooth background, provides structures of size consistent with the selected frequency. For the global model, shown Figure 16(d), we see the contributions of all wavelengths, and we can distinguish some structures of the Marmousi medium given Figure 8      For simplicity, in the following, we restrict ourselves by studying only single frequency nominal models, which means that we only work with models resulting from a single, fixed ω: It allows us to study the behaviour of both approaches (global velocity model reconstruction or MBTT decomposition) with individual frequency, preventing artifacts that may appear due to the frequency summation in the reflectivity.

Choice of perturbation directions
Background perturbation We consider a background perturbation in the direction of the onedimensional ramp u of Figure 17, and illustrate the effect of such a perturbation applied either to the global model m, as in the case of FWI, or to the background unknown p in the case of MBTT.  Reflectivity perturbation The FWI objective function is known to be nearly quadratical with respect to reflectivity, i.e. to the high spatial frequency part of m, and the same property holds by construction for the dependance of the MBTT objective function with respect to the data space reflectivity s. Hence one expects large basins of attraction with respect to s in the MBTT formulation. We choose for u s a random vector of the data space.

Comparison of Local Θ-estimates
The formulas for local Θ-estimates have already been derived in Section 3.3 for classical FWI. Application of these formulas to F(p, s) instead of F(m) gives immediately the local Θ-estimates of the sizes ∆ u p0 and ∆ us s0 of attraction basins for p and s at p 0 , s 0 in directions u, u s for the MBTT formulation of FWI: We have omitted the frequency index for clarity in the notation. Note that this is not misleading because we have chosen to work with single frequency content here, i.e., the reflectivity r only contains one selected frequency. Figure 19 shows the evolution of the local Θ-estimates with frequency. The size of the corresponding attraction basins decreases with frequency, when the perturbation is applied on m, p and s. The left part of the figure shows a slightly larger attraction basin in the direction of the background perturbation u when it is applied to the propagator part p 0 of the MBTT parameterization rather than when applied directly to m 0 . But its not up to the expectations raised by the claim that the MBTT parameterization allows to overcome the phase shift problem [20,21]. Regarding s, the estimated size appears surprisingly small compared to the large attraction basin expected, see Section 5.2. But one has to remember that these are local Θ-estimate, which can be very pessimistic, as explained in Section 3.3, and postpone more definitive comments to the end of Subsection 5.4, where exact Θ-estimates are calculated.   (42) and (56). Here p0 is the smooth velocity background of Figure 3, the direction u for p is given Figure 17, and the direction us for s is a random vector.
In the MBTT representation, the reflectivity uses only the selected frequency.

Comparison of exact Θ-and R G -estimates
We apply the method described in for all −∆ ≤ t, t ≤ ∆.
We first compute the deflection and global radius of curvature maps for m and p, using values of t and t in an interval [−∆, ∆] which is chosen to represent in each case about ±20% of the norm of m 0 or p 0 defined in (52). Figures 20 and 21 show these maps at two selected frequencies: 4 and 7 Hz, and Table 3 summarizes the extracted exact estimates of ∆ u m0 , ∆ u p0 , together with the local estimates extracted from Figure 19(a).    -The first observation is that lower values of deflection are achieved when the background perturbation u is applied to p (MBTT) rather than to m (FWI): at 4 Hz, Figures 20(b), it never  reaches π/2, and at 7 Hz, Figure 21(b), only a few portions attain this value. On the contrary, for FWI, Figures 20(a) and 21(a), the deflection rapidly reaches π/2 at both frequencies. This indicates that the MBTT formulation produces larger Θ-attraction basins than the standard FWI formulation, roughly by a factor ten ( Table 3). Notice that the size of the Θ-attraction basin is divided by two for both FWI and MBTT when the frequency increases from 4 to 7 Hz.
-The second observation concerns the strict positivity of the global radius of curvature R G , bottom of Figures 20 and 21, which determines the R G -attraction basin characterized by a zero tolerable error (see Section 3.4). For the MBTT formulation, Figures 20(b) and 21(b), R G remains strictly positive all over the map, which shows that the R G -basin is larger than the investigated interval. Its size is of approximately 20% at both 4 and 7 Hz (Table 3). On the contrary, for the usual FWI formulation, Figures 20(a) and 21(a), R G decreases very rapidly to zero when one moves away of the diagonal, producing smaller R G -attraction basins, with size of 5% at 4 Hz and 2.5% at 7 Hz, smaller by a factor four to eight to the corresponding MBTT attraction basins.
-Concerning the magnitude of R G , whose minimum over the attraction basin gives the tolerable error level, one sees that it takes larger values for FWI near the main diagonal, i.e. for small attraction basins, than for MBTT over the whole map, i.e. for large attraction basins, which is confirmed by the values of R G in Table 3, which give the tolerable error level associated with the Θ-attraction basins (this level is zero by definition for the R G attraction basins).
To summarize, MBTT extends significantly the size of attraction basins with respect to background perturbations, at the price of a reduction in the admissible error level. This explains the success of MBTT's alternate minimization algorithm, as reported in [22]. We compare now the above exact Θ-estimates with the local Θ-estimates of Subsection 5.3.
-For the FWI approach, Figures 19(a), 20(a) and 21(a) and Table 3, it shows that both local and exact Θ-estimates of ∆ u m are of the same size. In sight of the upper bound estimate (32) on which the local Θ-estimate is based, one can think that the FWI formulation corresponds to the worst situation, where the image of a segment in the background space is a curve close to an arc of circle in the data space.
-The situation is completely different for the MBTT formulation: Figures 19(a), 20(b) and 21(b) and Table 3, we see that the exact Θ-estimate ∆ u p is about ten times larger than its local Θ-estimate.
We can also determine the exact attraction basins in the MBTT formulation for the data space reflectivity s at s 0 in the direction u s , which we expect to be large because the forward map F is nearly linear with respect to s. Figure 22 shows the corresponding deflection and global radius of curvatures maps for values of t and t in an interval [−∆, ∆] which is chosen to represent in each case about ±35 times the norm of s 0 defined in (52). As expected, the exact Θ-attraction basin is large (23 to 54 times the norm of s 0 depending on frequency), and is 10 5 times larger than its local estimate, which, together with the previous results on the estimation of ∆ u p , confirms the necessity of exact estimates for accuracy. two perturbed data-space reflectivities, for the perturbation direction us chosen in section 5.2 The black lines indicate when the deflection becomes higher than π/2, the white lines indicate when the global radius becomes 0.

Influence of reflectivity level
The previous comparisons at 4 Hz and 7 Hz have been performed on the nominal model which has a reflectivity level β of 1% (see (54)). We repeat the same computations, at 7 Hz, changing only the reflectivity of the model, which is fixed to a low level, 0.2% in Figure 23 (close to the Born approximation) and to a stronger value, 5% (close to Marmousi reflectivity) in Figure 24. Table 4 summarizes the sizes of and tolerable error for the corresponding attraction basins. Increasing the reflectivity increases the relative importance of multiple reflections -whose phase are not controlled by the migration/demigration included in the MBTT approach -with respect to primary reflections, whose phase are controlled. It is then expected that the performance of MBTT, when applied to a full wave model as it is the case here, will deteriorate when the reflectivity level increases. This is confirmed by Table 4: the MBTT attraction basin for a reflectivity level of 5 % is three times smaller than at 1 %, which itself is two times smaller than at 0,2 %. On the other hand, the FWI attraction basin remains essentially unaffected by the reflectivity level, and the MBTT basins remain always larger, with a factor four to fifteen (with respect to the Θ-estimates, eight with the R G -estimates).

Influence of background smoothness
Here we illustrate an important condition which has to be satisfied for the MBTT decomposition to work. When a background perturbation is applied to p 0 while s 0 is maintained fixed, the phase shifts of the reflections are controlled, provided these reflections are generated only by s 0 . Any backscattered energy generated by the updated background p 0 + tu -or worse, by p 0 itself -will lead to reflections whose phase is not controlled by the migration-demigration process included in MBTT,   Figure 21, the reflectivity β defined by (54) has been multiplied by five. The perturbation direction is the ramp of Figure 17, it is either applied to the global model m or to the background parameter p. The black lines indicate when the deflection becomes higher than π/2, the white lines indicate when the global radius becomes 0.
thus annihilating its benefits. This appears already in Figure 21(b), where one sees that the deflection reaches π/2 (resp. the global radius of curvature equates zero) much closer to the diagonal at the top right corner than at the bottom left one. Indeed, the large positive t, t correspond to strongly increasing backgrounds, which backscatter energy. At the opposite corner of the diagrams (bottom left), which corresponds to mildly increasing velocities, the amount of scattered energy is smaller, and the limiting values are further apart from the diagonal. We have considered for simplicity only attraction basins which are centered at the nominal model m 0 , but on most of the deflection and curvature maps, the attraction basin could, in fact, be extended on the side of negative t (where backgrounds are less steep). Then, if the unperturbed background itself scatters back some energy, as the one illustrated Figure 25, the situation is even worse, as illustrated in the corresponding deflection and curvature maps of Figure 26, at 4 Hz. These maps show no enlargement of the attraction basin in favor of MBTT, in fact, it is even smaller than for FWI. That is why it is fundamental to ensure the smoothness of the background in the MBTT decomposition by using an adapted parametrization, e.g., spline functions are employed in [22] to represent p.

Behavior of the data misfit
We illustrate in this section the behavior of the FWI and MBTT misfit functions in the model space direction u at the nominal model m 0 = m(p 0 , s 0 ) for some seismic data d. The cost functions are respectively J (m 0 + tu) and J(p 0 + tu, s 0 ) for FWI and MBTT, according to (13) and (18). This will allow to verify the theoretical results of Section 3: if the minimum of the misfit function is smaller than the tolerable error R u G associated to an attraction basin, the misfit should not have any local minimum over the basin. However, when the data exhibit an error larger than the tolerable error, local minima may appear, but they do not have to.   Figure 25 and is the only difference in the computation compared to the maps shown Figure 20. The perturbation direction is the ramp of Figure 17, and is either applied to the global model m or to the background p. The black lines indicate when the deflection becomes higher than π/2, the white lines indicate when the global radius becomes 0.
In order to define a seismic data d to build the misfit, we first denote by d 0 = F(m 0 ) = F(p 0 , s 0 ) the synthetic data associated to the nominal model. Note that, due to the choices made in the construction of the nominal model in Subsection 5.1, d 0 is nothing but a migrated-demigrated version of the Marmousi data deprived of the direct arrivals. We define a scaled Marmousi data d m by scaling the synthetic Marmousi data d M = F(m m ) (where m m is given Figure 8(a)) in such a way that d m = d 0 . Then, the data that enter in the FWI and MBTT misfit functionals are defined by: When η increases from 0 to 1, the distance of d to the attainable set increases from 0 to d m − d 0 , which is larger than the tolerable error associated to the attraction basin of interest (FWI or MBTT). The computations are done for the nominal model with 1% reflectivity level (the results for models with 0.2% and 5% reflectivity are similar). As in the previous sections, the nominal model follows Subsection 5.1, and the background perturbation direction u is that of Figure 17. Figure 27 shows the cost functions for FWI and MBTT at 7 Hz for different values of η. We also show by vertical lines the exact R G -attraction basin associated to zero tolerable error, and by horizontal lines the levels 1/2R 2 G corresponding to the tolerable error level associated to the exact Θ-attraction basin (cf. Table 4).
-The FWI and MBTT attraction basins are essentially independent of the error level η in the data. The size of the FWI R G -attraction basins is slightly larger than the one anticipated by ∆ u m0 . When η = 0, d = d 0 is attainable, with a zero error level, and one sees that the first local minimum is just after the right limit of the exact R G -attraction basin, hence consistent with the analysis of Section 3. The tolerable error level R u G,p0 on d for the Θ-attraction basin is attained at η = 0.1.
-The local minima appear much earlier with FWI than with MBTT and the sizes of intervals where there is no local minimum is well anticipated by the estimates ∆ u .
-In the numerical setting considered here, the MBTT approach appears to be robust with respect to data error levels much larger than the tolerable level, as we do not see local minimum appearing inside the attraction basin.   (57) for a nominal model with a reflectivity level β = 1%. We also indicate by vertical dotted lines the size of the exact RG-attraction basins, and by horizontal dotted lines the maximal tolerable errors R u G,m 0 associated to the exact Θ-attraction basins, taken from Table 4.

Strategy for MBTT waveform inversion
The numerical experiments made in this section lead to the following recommendations, for the implementation of the MBTT reformulation of FWI with a full wave equation, in order to obtain large attraction basins for the background reconstruction: Inria -the background space for p has to be chosen smooth enough and the reflectivity level β = r / p large enough so that the energy backscattered by p is negligible compared to that backscattered by r; -the reflectivity level β has to be chosen small enough for the importance of multiples to remain small compared to primary reflections in the synthetics.
This can be achieved by using an adapted parametrization for the background p, e.g. spline functions as employed in [22], and by tuning the reflectivity level β by adjusting the migration weight W in (52). We remind that the relevance of the chosen settings can be checked before performing any iterative optimization algorithm, by computing the size of the Θ and R G -attraction basins, and the associated tolerable error.
-The attraction basins tend to be larger in the direction of slowly increasing (with depth) backgrounds, which suggests it is better to underestimate the initial slope of the background velocity with respect to depth.
The dependency of the size of attraction basins with the reflectivity level suggests the following organization for MBTT waveform inversion:

Elastic medium reconstruction
In the case of isotropic elastic propagation, the medium is characterized by three physical models: the Lamé parameters λ and µ, and the density ρ. The propagation of waves in such media is expressed in terms of the vectorial displacement field u, see (1). More precisely, for elastic isotropic media, the equation of wave propagation writes as (where we omit the space dependency for clarity) In this context the unknown m is composed of the three parameters, m = {λ, µ, ρ} (which are space dependent). The forward problem is accordingly adapted and we write where we omit the source and frequency dependencies for clarity. Hence, we have considered that the forward problem is given by the measurements of displacement in all directions 2 .
We consider the models m 0 to be smooth backgrounds, similarly to the previous acoustic case. They are illustrated in Figure 28 and represent a subsurface of size 17 × 3.5 km.  The geophysical acquisition is composed of 19 sources and 168 receivers for each source, both being located at the surface. In this case, the derivatives have to be computed with respect to the three parameters. For the directions, we consider the elastic Marmousi2 medium and extract the different reflectors. The directions for the Lamé parameters and the density are presented in Figure 29 where the amplitude has been chosen such that u m = 1. Because of the difference of magnitude between the different quantities, we actually employ a scaling parameter to obtain dimensionless coefficients such that the parameters of interest are (λ/λ 0 , µ/µ 0 , ρ/ρ 0 ), where λ 0 , µ 0 and ρ 0 are scalar coefficients given by the maximal values of the respective quantities. In Figure 30, we show the evolution of the estimates ∆ u m0 and R u m0 with frequency. We see that the estimates for the size of the interval associated with the elastic wave equation, Figure 30(a) behave similarly to the acoustic case: the size of the attraction basin decreases with frequency. However, the maximal distance between the data and the attainable set, given Figure 30(b) appears now to be decreasing with frequency, contrary to the acoustic situation. It would be interesting to pursue the analysis in the elastic framework, in particular to identify how the different parameters are intertwined and how they should be taken.

Extension to different boundary conditions
The estimates we have derived can also be employed for other situations than the inverse wave problem affiliated with a geophysical setup. Let us briefly illustrate the consideration of alternative boundary conditions for the domain of interest. We acknowledge two situations, sketched in Figure Figure 28, the direction is given by the Marmousi structures of Figure 29. The computation of the directional derivatives uses scaled quantities: (λ/λ0, µ/µ0, ρ/ρ0).

Dirichlet boundary condition set to zeros for all boundaries, generating reflection from all sides.
This situation is particularly appropriate in the context of medical imaging and Electrical Impedance Tomography (EIT).
We compute the estimates according to the new situations and show the evolution of ∆ u m0 (obtained with local Θ-estimate) and R u m0 in Figures 32 and 33 respectively. The initial model is given by the smooth velocity background of Figure 3 and the directions are those introduced in Figure 4. Namely, we follow the setup of Subsection 4.1, and only the boundary conditions are changed.
The size of the attraction basin, Figure 32, decreases with increasing frequencies independently of the situation. However, in the case of Dirichlet boundaries on all sides, the behavior is much more distorted and reveals a much higher magnitude in the evolution. Namely it has four order of magnitude while only two for the other situation. Also in this case (Dirichlet boundary conditions), there does not seem to be any difference between the two directions, and the large obstacle direction, u 3 , behaves similarly as the Marmousi ones u m . Regarding the maximal distance to the attainable set     (largest tolerable eroor on data), Figure 33, we observe a decrease of the distance for low frequencies and then stabilization for the absorbing medium. This medium (fully absorbing) is strongly influenced by the geometry of the direction and the medium with Dirichlet boundary condition renders a totally unsettled evolution. We can intuit that the patterns obtained with the Dirichlet boundary conditions are due to the multiple reflections introduced at the boundaries, which strongly affect the problem in Inria terms of convergence and sensitivity to noise.

Conclusion
We have presented a theoretical and numerical toolbox for the apriori analysis of the optimizability of nonlinear least-squares minimization problems by local algorithms. Such problems arise in seismic Full Waveform Inversion (FWI), where the misfit function tends to exhibit local minima in the directions associated with low spatial frequencies perturbation of the background velocity. But the methodology is totally applicable for other least squares minimization problems.
We have defined attraction basins around a nominal model and the associated tolerable error level such that for any data below tolerable error, one is sure that the data misfit has a single local -and hence global -minimum over the basin. These quantities depend only on the forward map to be inverted, so one can obtain apriori information on the absence of local minima without having to experiment with the misfit function for different data. We have characterized attraction basins by the notion of global radius of curvature in the data space, and given formulas and algorithms for their numerical approximate and exact determination.
We have first applied this tool to the classical FWI problem for a wave equation in the timefrequency domain, and computed directional attraction basins which give quantitative insight into optimizability of FWI: 1. unsurprisingly, the size of attraction basins becomes smaller when frequency increases. In our tests, it is only of ±2% of the norm of the background at 4 Hz, and ±1% at 7 Hz, 2. the use of complex frequencies including damping enlarges the attraction basin, 3. the largest attraction basins obtained for low (or complex) frequency usually go with smaller tolerable error levels. Therefore, low and complex frequencies suffer more from noise.
4. solving the problem for a sequence of increasing single frequency data is better in terms of attraction basin than using an increasing sequence of frequency-bandwidth data, 5. salt domes produce smaller attraction basins, and are harder to recover.
These findings confirm the difficulty inherent to FWI, which requires unrealistically low frequency data for the determination of the full velocity model (Appendix C) In order to see if these limitations could be overcome, we have then applied our tool to the MBTT approach, which uses a background/data space reflectivity reparameterization of the velocity model. Our findings are: 6. in a direction of background perturbation, the largest attraction basin for MBTT shows an increase in size by a factor four to fifteen compared to FWI, and is only barely reduced when frequency increases.
7. Correlatively, the tolerable error for MBTT is divided approximately by 10 compared to FWI. Nonetheless, the comparison of misfit functions has shown that local minimum does not appear when the tolerable error is exceeded.
8. In the direction of data-space reflectivity perturbations, where the forward map is linear up to the multiple reflections, the attraction basin is very large, more than ±20 times the norm of the nominal data-space reflectivity.
9. Critical parameters for an efficient implementation of MBTT are the background smoothness and the reflectivity level.
This provides a strong incentive for the use of the MBTT decomposition to alleviate the low frequency requirement of FWI, despite a larger computational burden. For future work, we plan to implement the MBTT method with these settings to investigate further the efficiency in the context of iterative minimization with velocity model reconstruction.

A Adjoint state using complex variables
The Full Waveform Inversion (FWI) method aims the reconstruction of the subsurface medium parameters by an iterative minimization of the cost function defined as the difference between simulation and observations. We follow the standard least squares formulation of (13), and consider the Helmholtz equation (2) to write where the forward problem is written with the restriction operator R, and we use the index (s) for the sources. For the minimization, one needs to obtain the gradient of the cost function, which is usually obtained using adjoint state method, see [47] for a review of the method in geophysical application.
We specify now the computations for complex-valued fields, omitting the source and frequency sums for clarity, and write In the frequency domain, p is complex, which requires some precaution for the application of the adjoint state method. In particular, the functional is not analytic (holomorphic) with respect to the field p. A workaround is relatively standard, see for example [14,38,36], with elements of complex calculus based on Wirtinger calculus. We believe it is important to mention this aspect which is too often disregarded in seismic applications and hereby present the steps involved.

A.1 Complex derivation
The derivation of complex functional is conducted by taking independently the complex variable and its conjugate, respectively z and z, with z = x + iy.
Theorem 1. [14, Theorem 1] Let g : C × C → C be a function of a complex number z and its conjugate z and let g be analytic with respect to each variable (z and z) independently. Let h : R × R → C be the function of the real variables x and y such that g(z, z) = h(x, y) where z = x + iy. Then the partial derivative ∂ z g (treating z as a constant) gives the same result as (∂ x h − i∂ y h)/2. Similarly, ∂ z g is equivalent to (∂ x h + i∂ y h)/2. Corollary 1. Following the statement of Theorem 1, we have Proof. By direct application of Theorem 1, We straightforwardly apply the theorem to the misfit function where we consider p := z = x + iy.
where x, y and d can be assimilated with vectors in the discrete setting. Then by deriving independently with respect to x and y we obtain

Inria
We can further deduce the derivative of J with respect to p and p, where they are considered independent such that J = J(p, p), The following theorems give the framework of what can be seen as the chain rule for complex derivation.
Theorem 2. Consider the complex-valued function f of a real parameter m and the real-valued functions g 1 and g 2 such that f (m) = g 1 (z(m), z(m)) + ig 2 (z(m), z(m)). where We inject in (71) The alternative expression is obtained similarly but by replacing ∂ z g in (71), instead of ∂ z g.
Application of Theorem 3 gives the gradient of the cost function with respect to m, where T stands for the transposed.

A.2 Adjoint state method
In order to avoid the computation of the Jacobian, the gradient is computed with the first order adjoint state method. It has been introduced in the work of [39], and implemented by [18] for the computation of a functional gradient. The formulation for the elastic wave problem has been carried out by [62,63]. It is a relatively standard techniques nowadays, e.g. [34], see [47] for a review in geophysical situations. Yet, the complex variable specification is less common in seismic literature. In order to compute the derivative ∇J , we formulate the minimization problem (omitting the space dependency) where we introduce the wave operator A, which corresponds to the Helmholtz equation in the acoustic case, We consider one single source for now, to clarify the indexes, we shall later reintroduce the sum over the sources by linearity, cf. (85). The problem is recast into a formulation with Lagrangian to define where ·, · stands for the complex inner product in L 2 such that v, w = v * w, with v * the adjoint.
The adjoint state γ is now selected such that which gives, We now incorporate (67), and the adjoint state γ solves the problem Using this formulation for γ, the gradient reduces to We can reintroduce the sum over the sources, which gives Inria

B MBTT (Data Space) model parameterization
In this appendix, we explicit how the computations are conducted in the case of MBTT model decomposition.

B.1 Expression of the reflectivity with MBTT
The reflectivity part in the MBTT model representation is given by, cf. (15), The technique developed with the adjoint state method allows to compute of r without explicitly using DF 0 . Indeed, the adjoint state method provides (identification from (75) and (85) where DF (s) stands for the forward operator associated with source s. The fields p (s) and γ (s) solve respectively the forward and ajoint problems, see (2) and (85). Proceeding by analogy with the reasoning of Appendix A, it is straightforward to see that r can be express as where A 0 is the Helmholtz operator with zero reflectivity (i.e., m = p),

B.2 Directional derivative computation
For the estimation of the size of the basin of attraction and of the radius of curvature, we need the directional derivative of the forward operator. Using Neumann series and Taylor expansion, we have given in Subsection A.3 their expression using the traditional model formulation. When the model is decomposed using MBTT, one can process similarly, it simply requires a few more steps. For clarity we only focus on the parameter p, we start with the chain rule, We derive from (100), The terms (∂ p p 0 )(u p ) and (∂ p γ 0 )(u p ) are computed using the same method as presented in Subsection A.3 (thus, each requires the resolution of the wave equation with specific right-hand side). After we have computed u m = (∂ p m)(u p ), we again use Subsection A.3 to compute the final ∂ m F(u m ).
Eventually, one can process similarly for s, adapting the chain rule. Regarding the second order derivatives, it is analogous with one degree more of derivation in the chain rule. Computationally speaking, it only requires the resolution of additional forward problems.

C Influence of the geometry and frequency on FWI
We illustrate the influence of the subsurface geometry with an acoustic FWI experiment, where we demonstrate the effect we have depicted in Section 4, Figure 5: 1. a high contrast object complicates the process of reconstruction compared to a Marmousi-like subsurface, by reducing the size of the basin of attraction, 2. the low frequencies, by increasing the size of the basin of attraction, provide benefits on the reconstruction.
We design an experiment for a two-dimensional area of size 12.3 × 3 km, considering an acoustic medium with constant density set to 1000 kg m −2 . The wave propagation is governed by the Helmholtz equation (2), one parameter has to be recovered: the velocity c. The forward operator, see (10), provides pressure measurement at the receivers location. For the seismic acquisition, we take 135 sources and 306 receivers per source. We take synthetic data, avoiding noise, simply to illustrate our analysis.

C.1 Comparison between salt and Marmousi geometries
We compare the subsurface reconstruction using FWI depending on the geometry of the subsurface. We consider a medium involving a high contrasting object, and the Marmousi model. The iterative minimization is conducted following a nonlinear conjugate gradient algorithm (see [44]) and using frequencies from 1 to 10 Hz, with 1 Hz increment. We perform 20 iterations per frequencies, which are taken sequentially. The corresponding FWI algorithm is shown in Algorithm 1.
For the experiment related to the Marmousi medium, the 'true' model has been given in Figure 8(a) and the initial guess in Figure 3. Here we artificially modify the dimension so that the models are now of size 12.3 × 3 km. For the model encompassing a salt dome, the initial and target models are given in Figures 34(a) and 34(b) respectively.
In Figures 34(c) and 35, we compare the reconstructions from FWI regarding the contrasting object or the Marmousi model. Both situations use media of same size, with the same frequency progression (sequential from 1 to 10 Hz), and start from initial models where no information is known (one dimensional wave speed variation, with depth only). However, only the Marmousi case is able to accurately capture the subsurface structures. It demonstrates the results of Section 4 that, namely, high contrast objects, by reducing the size of the basin of attraction, are more complicated to reconstruct.
Initial parameters: initial model c 1 and the frequency list of ω i , for i = 1, . . . , n ω . The frequencies are ordered such that ω 1 < ω 2 < . . . < ω nω , see Remark 4. Frequency loop for i ∈ {1, . . . , n ω } do Optimization loop for j ∈ {1, . . . , n iter } do set k := (i − 1)n iter + j -solve the wave equation at frequency ω i with wave speed c k ; -compute the cost function by comparing the simulation and observation data; -compute the gradient of the cost function, using the adjoint-state method of Appendix A; -compute the search direction, s k with the nonlinear conjugate gradient method (see [44]); -compute the step length α with line search method (see [44]); -update the wave speed c k+1 = c k − αs k ; end end Algorithm 1: FWI iterative minimization algorithm for the wave speed reconstruction.

C.2 Low frequencies compensation
It has been shown in Figure 5 that low frequencies enlarge the size of the radius. To illustrate this effect, we reproduce the previous experiment but incorporating nine frequencies from 0.1 to 0.9 Hz (sequentially using 0.1 Hz increment). Then, we employ the set of frequencies from 1 to 10 Hz (sequentially using 1 Hz increment). In Figure 36, we show the reconstruction starting from these very low (unrealistic) frequencies.
The FWI algorithm benefits from this incorporation of low frequencies, and is now able to accurately retrieve the contrasting objects, as expected. Here we have used synthetic experiment to demonstrate the basin of attraction dependency on the geometry and frequency. Note that we could as well have used complex frequencies to increase the basin of attraction in early stages, cf. Subsection 4.5. However, for realistic application, low frequencies are usually unavailable (due to the noise) and it is then impossible to reduce arbitrarily the initial frequency. As an alternative, techniques based on shape reconstruction could be very efficient in this context but, because the nature of the target is unknown at the beginning, it is hard to justify apriori. Hz. The initial model used for the algorithm is given in Figure 34(a), and the true model Figure 34(b). The algorithm takes full benefits of the low frequency content in the data because we use synthetic measurements without noise.