Revisiting the Hodges-Lehmann estimator in a location mixture model: Is asymptotic normality good enough?

Abstract: Location mixture models, obtained by shifting a common distribution with some probability, have been widely used to account for the existence of clusters in the data. Assuming only symmetry of this common distribution allows for great flexibility, especially when the traditional normality assumption is violated. This semi-parametric model has been studied in several papers, where the mixture parameters are first estimated before constructing an estimator for the non-parametric component. The plug-in method suggested by Hunter et al. (2007) has the merit of being easy to implement and fast to compute. However, no result is available on the limit distribution of the obtained estimator, hindering, for instance, the construction of asymptotic confidence intervals. In this paper, we give sufficient conditions on the symmetric distribution for asymptotic normality to hold. In case the symmetric distribution admits a log-concave density, our assumptions are automatically satisfied. The obtained result has to be used with caution in case the mixture locations are too close or the mixing probability is close to 0 or 1. Three examples are considered where we show that the estimator is not to be advocated when the mixture components are not well separated.


A brief overview
Consider the two-component location mixture model (1.1), where π ∈ [0, 1], −∞ < μ_1 ≤ μ_2 < ∞ and G is a symmetric distribution around zero, that is, G(−x) = 1 − G(x−) for x ∈ R. This model is semi-parametric since the unknown parameters are the 3-dimensional vector (π, μ_1, μ_2) and the symmetric distribution G. It has been considered by several authors, e.g. Bordes et al. (2006), Hunter et al. (2007), Chee and Wang (2013), Butucea and Vandekerkhove (2014), and more recently Balabdaoui and Doss (2016). Whether the goal is to estimate the mixed distribution or to classify new members into each of the existing clusters, (1.1) offers more flexibility than the Gaussian mixture model when G is not the distribution function of a normal variable. The simulation study carried out by Hunter et al. (2007) shows evidence that the estimators obtained under (1.1) outperform the maximum likelihood estimator (MLE) under the Gaussian mixture model for the heavy-tailed distributions they considered. When testing for the presence of mixing, the numerical results obtained in Balabdaoui and Doss (2016) for the asymptotic power also show the higher performance of the symmetric log-concave MLE when compared to the Gaussian one. In Chang and Walther (2007), the authors considered a more general mixture model where the mixture components are assumed to have log-concave densities. See also the clustering example of Cule et al. (2010) in a two-dimensional setting using the Wisconsin breast cancer study. The aforementioned log-concave mixture models are of course more general than the location mixture model in (1.1), but suffer from being non-identifiable. See e.g. the counterexamples given in the Concluding Discussion section of Cule et al. (2010).
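The display defining the model did not survive extraction. Reconstructed from the definition of the class M given further below (an editorial reconstruction, not the original typesetting), equation (1.1) reads:

```latex
% Model (1.1), reconstructed from the definition of the class M
F(x) \;=\; \pi\, G(x - \mu_1) \;+\; (1 - \pi)\, G(x - \mu_2),
\qquad x \in \mathbb{R},
```

where G is a c.d.f. symmetric about zero and (π, μ_1, μ_2) are the Euclidean parameters.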
To estimate the distribution G, the focus in all the aforementioned papers on the mixture model in (1.1) is on first estimating the parameters of the mixture μ_1, μ_2 and π. In some way, these mixture parameters are viewed as nuisance parameters. In Butucea and Vandekerkhove (2014), an estimator of these parameters was constructed by exploiting the fact that symmetry of the density of G (assumed to exist) forces the imaginary part of its Fourier transform to be equal to 0. The authors showed that their estimator is consistent and established that it is asymptotically normal, under the assumption that the mixture model is identifiable. In fact, they used the identifiability conditions found by Hunter et al. (2007), which we now discuss. Hunter et al. (2007) considered the more flexible notion of 2-identifiability defined as follows. Consider the sets

Θ = {θ = (π, μ_1, μ_2) : π ∈ [0, 1], −∞ < μ_1 < μ_2 < ∞},   (1.2)
S = {G : G is a symmetric c.d.f.}, and
M = {F : F(x) = π G(x − μ_1) + (1 − π) G(x − μ_2), (π, μ_1, μ_2, G) ∈ Θ × S}.

The plug-in estimator
Replacing θ_0 by an element θ to obtain D^2(θ) and taking the empirical counterpart of the resulting expression, one can now construct the plug-in estimator of the mixture parameters by minimizing D_n^2(θ) over Θ*, where F_n is the empirical distribution. Let θ_n denote this estimator. To compute the symmetric log-concave MLE of the density of G_0, Balabdaoui and Doss (2016) first replace the vector of unknown mixture parameters by θ_n and then maximize the log-likelihood over the class of symmetric and log-concave densities on R. The choice of this estimator was motivated by the fact that it is very easy to implement and also fast to compute. Balabdaoui and Doss (2016) used the R-code of Hunter et al. (2007) made available on the web page of the first author. Moreover, Hunter et al. (2007) show, under some conditions, that the estimator converges at the √n-rate. Assuming that these conditions hold, the fast rate of convergence of θ_n guarantees convergence of the nonparametric log-concave MLE of Balabdaoui and Doss (2016) to the true symmetric density at the (usual) n^{2/5}-rate in the L_1 distance. See Theorem 4.1 in Balabdaoui and Doss (2016).
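The exact criterion D_n^2(θ) of Hunter et al. (2007) is not reproduced in this excerpt. The sketch below therefore substitutes a hypothetical minimum-distance stand-in, a Cramér–von Mises-type distance between the empirical c.d.f. and the mixture c.d.f. with G provisionally fixed to the standard normal, which shares the structural feature discussed later: it is quadratic in F_n and is minimized over θ = (π, μ_1, μ_2). The function names and the pilot choice of G are assumptions of this sketch, not the paper's method.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def dn2(theta, x_sorted, ecdf):
    """Hypothetical stand-in for D_n^2: mean squared distance between the
    empirical c.d.f. and the mixture c.d.f. with G fixed to N(0, 1)."""
    pi, mu1, mu2 = theta
    model = pi * norm.cdf(x_sorted - mu1) + (1 - pi) * norm.cdf(x_sorted - mu2)
    return np.mean((ecdf - model) ** 2)

def plugin_estimate(x, theta_init):
    """Minimize the stand-in criterion over theta = (pi, mu1, mu2)."""
    x_sorted = np.sort(x)
    n = len(x_sorted)
    ecdf = (np.arange(1, n + 1) - 0.5) / n  # empirical c.d.f. at the data points
    res = minimize(dn2, theta_init, args=(x_sorted, ecdf), method="L-BFGS-B",
                   bounds=[(0.01, 0.99), (-10.0, 10.0), (-10.0, 10.0)])
    return res.x

# Simulate from a well-separated two-component normal location mixture.
rng = np.random.default_rng(0)
n = 5000
pi0, mu1_0, mu2_0 = 0.3, 0.0, 3.0
labels = rng.random(n) < pi0
x = rng.normal(0.0, 1.0, n) + np.where(labels, mu1_0, mu2_0)
pi_hat, mu1_hat, mu2_hat = plugin_estimate(x, theta_init=(0.4, 0.5, 2.5))
```

With well-separated components and a starting point on the correct side of the labeling ambiguity, the minimizer lands close to the truth; the stand-in is only meant to illustrate the plug-in idea, not to reproduce the published estimator.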

Problem, motivation and limitations
In their Theorem 3, Hunter et al. (2007) show almost sure convergence of θ_n to the truth, but no explicit proof of asymptotic normality was provided. Instead, Hunter et al. (2007) give in their Theorem 4 conditions under which θ_n converges at the √n-rate to a 3-dimensional centered Gaussian distribution with dispersion matrix J^{−1} Σ J^{−1}. The matrix J is the Hessian matrix of θ ↦ D^2(θ), assuming that it exists and is positive definite. The matrix Σ is the covariance matrix of a 3-dimensional vector playing the role of a gradient of E[f_θ(x, Y)] at θ_0, where f_θ is given below in (4.9); see also condition (iii) in Theorem 4 of Hunter et al. (2007). The three conditions given by Hunter et al. (2007) for Theorem 4 are connected to the theory of V-processes, since D_n^2(θ) can be identified as such.
According to a personal communication with David Hunter, the validity of the conditions of Theorem 4 has never been checked. Thus, the rate of convergence of θ_n and its limit are, up to now, open questions. Without this knowledge, it is not possible to construct asymptotic tests or confidence intervals for any of the mixture parameters. In this paper, we show that θ_n is indeed asymptotically normal under some sufficient conditions. The approach we have taken is based on the theory of empirical processes, which is well suited for M-estimators. Although certain results from this theory can be used off the shelf, a substantial effort has been made to re-adapt them to the current setting. The fact is that the estimator θ_n minimizes a functional that is quadratic in F_n, whereas most of the results in van der Vaart and Wellner (1996) are tailored for empirical processes that are linear in F_n. Thus, we believe that some of the techniques developed in this article are of interest in their own right, and may be applicable in other M-estimation problems where the dependence on F_n is non-linear. Investigating the weak convergence of θ_n gave us clearer insight into when this estimator can or cannot be used. The expression of the variance-covariance matrix obtained in Section 2 can be used to compute Monte-Carlo approximations of the asymptotic variances. The examples of Section 3 clearly indicate that the estimator is not to be advocated in case the mixture locations are too close to each other, or if the mixing probability is close to 0 or 1.

Organization of the paper
While thinking about the asymptotic behavior of θ_n, we realized that the existence of θ_n was yet to be confirmed. Although the estimator was defined in a very intuitive way, a formal proof of existence was provided neither in Hunter et al. (2007) nor in a separate note (personal communication with David Hunter). A formal proof that θ_n exists is given in Section 4.1. The structure of the remaining results is as follows: Section 2 is devoted to deriving the asymptotic distribution under some sufficient conditions on the density g_0 of the unknown symmetric distribution G_0. In Section 3 we give some examples to illustrate the theory. As mentioned above, the examples show that the estimator can be very inefficient when the mixture components are not well separated. In Section 4.2 we gather the proofs of the results yielding the √n-rate of convergence of θ_n. The proof of the asymptotic distribution of the estimator, along with some useful formulae, is deferred to Supplement A (Balabdaoui, 2017).

Asymptotics of the (Hodges-Lehmann) estimator
This section is devoted to the main subject of this paper: establishing the asymptotic distribution of the estimator θ_n. The approach we chose to follow might look like a detour from V-processes, which may have been a more natural way to go, as the functional to be minimized is a quadratic function of F_n. Our attempts to check the conditions of Theorem 4 of Hunter et al. (2007) were not very successful. In the meantime, we realized that the theory of M-estimators can be applied in the current context. However, some effort was needed to cast the problem into the theory of empirical processes. The proof is divided into four steps so that it is possible to apply the argmax continuous mapping theorem (or rather the argmin continuous mapping theorem here) for processes that converge weakly to a tight limit in C([−K, K]^3), the space of continuous functions defined on the 3-dimensional compact [−K, K]^3 for some K > 0. Our main reference is Theorem 3.2.2 of van der Vaart and Wellner (1996). In the sequel, the symmetric distribution G_0 is assumed to have a density g_0 with respect to Lebesgue measure. In the next section, we will show that θ_n converges to θ_0 at the parametric rate √n.

Deriving the √n-rate of convergence
The first step towards establishing the rate of convergence is showing consistency. The latter follows from Theorem 3 of Hunter et al. (2007).
To be able to refine this result, we will appeal to the theory of empirical processes. As we will show below, the centered and rescaled process can be decomposed into processes whose maximal expectation over a neighborhood of θ_0 can be controlled using the theory of VC-classes. To pave the way for the main result of this section, we adopt the notation below for a given t ∈ R, and consider the collection of functions m_θt defined for θ = (π, μ_1, μ_2) ∈ Θ. The symbols P_n and P will be used for the empirical and the true measure, respectively. The starting point is to rewrite D_n^2(θ) in the displayed form, where the last term on the left side is equal to D_n(θ_0), and to develop D(θ) in a similar way, using an identity that holds for all t ∈ R. Recall that G_0 is assumed to have a density g_0 with respect to Lebesgue measure. To derive the √n-rate of convergence of θ_n, we will show the following intermediate theorem.
The proof of the above bound will involve two main steps. In the first, we shall control the maximal expectation of the process G_n(m_θt − m_θ0t) for a fixed δ > 0. We will achieve this by showing that M_t,δ can be embedded in a sum of VC-subgraph classes. This will enable us to control its entropy as if it were itself a VC-subgraph class. The bound we will obtain for the supremum of the expectation of the process G_n(m_θt − m_θ0t) will depend on t ∈ R, the variable of integration. In the second step, and under some suitable assumptions on the true symmetric density g_0, we will show that this control is strong enough to allow for integrating the resulting supremum, and hence obtain the desired bound given in Theorem 2.2. For a fixed t ∈ R, let F_i,t, i = 1, …, 6, be the functions defined in (2.7). The following theorem gives the necessary ingredients for completing the first step of the proof as described above.
Theorem 2.3. The following holds true:

Let F be a class of functions such that each f ∈ F can be written as f = f_1 + · · · + f_m, with f_i ∈ F_i, where m > 0 is an integer and the F_i are VC-classes with envelopes F_i, i = 1, …, m, respectively. Then, F is P-Donsker and there exists a constant M > 0 such that the stated bound holds. In the following proposition we give sufficient conditions for the first requirement for deriving the rate of convergence of an M-estimator using Theorem 3.2.5 of van der Vaart and Wellner (1996).
Proposition 2.4. Assume that g_0 satisfies the required condition. Then, there exists a small neighborhood of θ_0 = (π_0, μ_1^0, μ_2^0) and a constant κ > 0 such that the stated bound holds for all θ = (π, μ_1, μ_2) in this neighborhood. Theorem 2.2 and Proposition 2.4 now yield the rate of convergence of the estimator of Hunter et al. (2007).

Deriving the asymptotic distribution
In the following, we will use the notation and process introduced above, and assume that:
• g_0 changes direction of monotonicity only a finite number of times;
• g_0 admits a derivative g_0′ almost everywhere such that g_0′ is bounded on R and changes direction of monotonicity only a finite number of times.
Then, under the above assumptions, the stated convergence holds, with A a 3 × 3 matrix whose entries are given in the display and V a random vector in R^3 whose components are distributed as stated. We can now state the main theorem of the paper.
Theorem 2.7. Suppose that the conditions of Theorem 2.2 and Theorem 2.6 hold true. Also, suppose that the symmetric density g_0 satisfies ∫_R |t|^5 g_0(t) dt < ∞. Then, the matrix A defined above is positive definite and the process Q_0 admits a unique minimizer h_0 given by the solution of the stated linear equation. Remark 2.8. The assumptions made above are satisfied by a variety of distributions. In particular, this is the case if we assume that the unknown symmetric density g_0 is log-concave on its support, as done in ?. Then, the density g_0 is bounded and changes direction of monotonicity only once, since log-concavity implies unimodality. By Theorem 2.3 of Finner and Roters (1993), we know that G_0 and 1 − G_0 are both log-concave on R. Hence, Lemma 2 of Schoenberg (1951) can be applied to g_0, G_0 and 1 − G_0, and we can find the required constants for any γ > 0, and in particular for γ ∈ (0, 1/2).
Using the section on useful formulae in Supplement A, this implies that h A hᵀ = 0 for h = (−1, 1, 0). Remark 2.10. One may wonder whether the Hodges-Lehmann estimator is efficient. The question is obviously related to finding the efficiency bound in this semi-parametric model. The tangent space with respect to the symmetric component g_0 is given by the set of directions k such that k is even, together with the corresponding orthogonal space. To obtain the latter, one can follow the lines of the proof given by Ma and Yao (2015).

Some comments on the asymptotic variance
It follows from the previous section that, under the assumptions made on g_0, the stated limit holds, where Γ is the 3 × 3 dispersion matrix of V. The components of Γ can be related to the covariance of U; i.e., Cov(U(x), U(y)) = x ∧ y − xy for x, y ∈ [0, 1]. However, the expressions obtained cannot be used to get explicit values. To give an example, we can write Γ_{1,1}, the variance of V_1, as in the display. From these expressions, we see that the terms involving F_0 evaluated at the mixture locations are the main obstacle when trying to get a general formula for the integral defining Γ_{1,1}. In Theorem 4 of Hunter et al. (2007), sufficient conditions were given to ensure that the estimator θ_n has an asymptotic Gaussian distribution. The asymptotic variance given in that theorem is different from the one that we have found. When comparing the formula in (2.8) with the one in Hunter et al. (2007), we see that the matrix A should play the role of J appearing in Condition (i) of Hunter et al. (2007). The latter should be equal to the second derivative of the function θ ↦ D^2(θ) at the true value θ_0, under the additional assumption that it is positive definite. From our calculations, we can see that the matrix A is not linked directly to the second derivative of θ ↦ D^2(θ), but rather to the map h ↦ n D^2(θ_0 + h n^{−1/2}). Condition (iii) of Hunter et al. (2007) assumes the existence of a function Δ satisfying an expansion with remainder r_θ subject to a uniform integrability condition in the neighborhood of θ_0. Here, f_θ is the same function defined in (4.9). Using the definition of π_1, we have π_1 f_θ(x) = E[f_θ(x, X)] − D^2(θ). We know that D^2(θ) takes the value 0 at θ_0 and has its gradient equal to 0 at the same value, and hence one should focus on E[f_θ(x, X)] to get Δ(X). Despite our efforts, it seems hard to connect the vector V given in Theorem 2.6 and the vector Δ(X) we obtain.
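The Brownian bridge covariance identity Cov(U(x), U(y)) = x ∧ y − xy can be checked numerically against the empirical-process approximation U_N(t) = √N (G_N(t) − t) used later in Section 3; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
N, B = 200, 20000  # uniforms per bridge, number of replications
x, y = 0.3, 0.6

# U_N(t) = sqrt(N) * (G_N(t) - t), with G_N the empirical c.d.f. of N uniforms.
u = rng.random((B, N))
Ux = np.sqrt(N) * ((u <= x).mean(axis=1) - x)
Uy = np.sqrt(N) * ((u <= y).mean(axis=1) - y)

est = float(np.mean(Ux * Uy))      # Monte-Carlo estimate of Cov(U(x), U(y))
expected = min(x, y) - x * y       # = 0.12 here
```

Since E[G_N(t)] = t, the product U_N(x) U_N(y) has exact mean x ∧ y − xy for every N, so the Monte-Carlo average should match the formula up to sampling noise.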
In the next section, we investigate numerically how the variances of the estimators π_n, μ_{1,n}, μ_{2,n} behave as functions of the true span μ_2^0 − μ_1^0 and the mixing probability π_0.

Limitations of the asymptotic normality: Some examples
We consider the following three symmetric distributions, for which we compute the asymptotic variance of the estimators π_n, μ_{1,n}, μ_{2,n} for different ranges of μ_1^0, μ_2^0 and π_0. To be more precise, we compute Monte-Carlo estimates for these variances based on the formula in (2.8) and a Brownian bridge approximation. For the latter, we sampled N = 10^4 independent uniform random variables and computed U_N(t) = √N (G_N(t) − t), t ∈ [0, 1], where G_N is the empirical distribution of the obtained uniform sample. We estimated the variance matrix Γ of the vector V using B replications of U_N and approximating the integrals defining V_j, j = 1, 2, 3, by a finite sum over an equally spaced grid with chosen lower and upper endpoints and a given mesh. The matrix A can be explicitly computed for any θ_0 for U[−1, 1] and L(1). For the distribution N(0, 1), closed-form formulas can be found for the entries A_{11}, A_{12} and A_{22}, whereas the remaining ones can be approximated using the same discretization described above. The explicit formulas can be found in Supplement A.
In Table 1 we give the values of n × the empirical variances of the Hodges-Lehmann estimators of the mixture parameters π_0, μ_1^0 and μ_2^0 for the three symmetric (and log-concave) distributions used to illustrate the theory, together with the corresponding theoretical asymptotic variances, i.e., the diagonal entries of the covariance matrix Σ given in (2.8). The main goal here is to obtain a numerical confirmation of our asymptotic result. The mixture parameters are chosen in such a way that the mixture components are well separated. The empirical variances are based on samples of size n = 10000 and 500 Monte Carlo replications, whereas an approximation of the asymptotic variances is obtained by taking N = 10^5 independent uniform random variables, a grid step of 0.01, −60 and 60 as the lower and upper integration bounds, and B = 800 replications of the vector V. The obtained values are clearly close, and we expect them to be much closer for larger sample sizes and a finer approximation of the asymptotic covariance matrix.
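The qualitative message of this section, namely that the variance explodes when the components are close, can be reproduced with a small simulation. The sketch below does not use the actual criterion of Hunter et al. (2007), which is not reproduced in this excerpt; it uses a hypothetical minimum-distance stand-in in which G is fixed to the standard normal c.d.f., which is enough to exhibit the separation effect.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import ndtr  # fast standard normal c.d.f.

def fit(x, init):
    """Minimum-distance fit of (pi, mu1, mu2); hypothetical stand-in
    criterion with G fixed to N(0, 1), NOT the criterion of Hunter et al."""
    xs = np.sort(x)
    n = len(xs)
    ecdf = (np.arange(1, n + 1) - 0.5) / n
    def crit(th):
        p, m1, m2 = th
        return np.mean((ecdf - p * ndtr(xs - m1) - (1 - p) * ndtr(xs - m2)) ** 2)
    return minimize(crit, init, method="L-BFGS-B",
                    bounds=[(0.01, 0.99), (-8.0, 8.0), (-8.0, 8.0)]).x

def var_pi_hat(sep, pi0=0.3, n=300, reps=100, seed=2):
    """Monte-Carlo variance of the estimate of pi0 when the true
    mixture locations are 0 and sep."""
    rng = np.random.default_rng(seed)
    est = []
    for _ in range(reps):
        lab = rng.random(n) < pi0
        x = rng.normal(size=n) + np.where(lab, 0.0, sep)
        est.append(fit(x, init=(0.5, -0.5, sep + 0.5))[0])
    return float(np.var(est))

v_far = var_pi_hat(4.0)    # well-separated components
v_close = var_pi_hat(0.5)  # poorly separated components
```

With well-separated components the estimate of π_0 concentrates, whereas for a small span the criterion is nearly flat in π and the estimates scatter, so v_close should markedly exceed v_far.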

[Table 1. Values of n × the empirical variance of the Hodges-Lehmann estimators of π_0, μ_1^0 and μ_2^0 (Empirical) and the corresponding asymptotic variances (Asymptotic), for the three laws and for (π_0, μ_2^0) = (0.3, 3), (0.2, 4) and a third configuration truncated in extraction. The empirical variances use n = 10000 with 500 Monte Carlo replications; the asymptotic variances use a Brownian bridge approximated with N = 10^5 independent uniform random variables and a discretized integral defining V. Numerical entries not recovered.]

Let us now turn to another aspect of our asymptotic result. The plots in Figure 1 show the variance of the estimators of μ_1^0, μ_2^0 on the left and of π_0 on the right, as a function of π_0 ∈ [0.01, 0.49] ∪ [0.51, 0.99], for the fixed value (μ_1^0, μ_2^0) = (0, 1) and in the case where the true symmetric density is Gaussian. The variances obtained are extremely large for small and large values of π_0, especially for the estimates of the mixture locations, reaching a maximal value of 1.85 × 10^6 for the estimate of μ_1^0 when the true mixing probability is π_0 = 0.01! The variances are also very large for the estimate of π_0, with a maximal value of 1600. A similar phenomenon can be seen in Figures 4 and 7, where the true density is U[−1, 1] and L(1) respectively, for the same range of π_0 and the fixed value (μ_1^0, μ_2^0) = (0, 0.5). The variances seem to take somewhat smaller values, although still large, when the true span is increased; see Figures 2, 5 and 8. Although the results are disappointing, they are somewhat expected, as the estimation procedure fails to be efficient when the mixture components are not well separated. This is confirmed by the reasonable values obtained in Figures 3 and 6, where the variances are plotted as a function of the span μ_2^0 − μ_1^0 ∈ [2, 10] for N(0, 1) and μ_2^0 − μ_1^0 ∈ [2.5, 10] for U[−1, 1], when the true mixing probability is π_0 = 0.3.
Although our numerical results are only shown for the Gaussian, uniform and double exponential distributions, they give a clear warning that the Hodges-Lehmann estimator is not to be used when it is believed that the mixture proportion is very small or very large, or that the components are not well separated. The difficulty lies of course in the fact that such information is contained in the values of the parameters that we would like to estimate, and hence cannot be known in advance.

Some algebra yields an expression
which can be shown to be in (0, 1) for |μ_1|, μ_2 large enough. More algebra yields the claimed limit. To see this, we use the facts that μ_1 = −|μ_1| and the identity above, so that the claimed limit follows after dividing both the numerator and the denominator by 4|μ_1| μ_2 − |μ_1| − T_n. We conclude that the minimization problem can be performed on the compact set Θ*. Combining this with the continuity of D_n (since f_θ is continuous) gives the existence of θ_n. Almost sure convergence of θ_n to the true parameter (see Theorem 2.1 below) will ensure that θ_n ∈ Θ* with probability one for n large enough.

Proofs of the √n-rate of convergence of θ_n
We start with showing Theorem 2.3.
Proof of Theorem 2.3. The proof of (1) goes along the lines of the hint given in van der Vaart and Wellner (1996) for Problem 20, page 153. Take three points (x_1, t_1), (x_2, t_2) and (x_3, t_3) in R^2 such that x_1 ≤ x_2 ≤ x_3. Then, the class of all subgraphs of the functions c 1_(a,b], a, b, c ∈ R, cannot pick out the point (x_2, t_2) unless t_2 > max(t_1, t_3). Now, take four points (x_1, t_1), (x_2, t_2), (x_3, t_3) and (x_4, t_4) in R^2 such that x_1 ≤ x_2 ≤ x_3 ≤ x_4. We want to show that these points are not shattered by the considered class of subgraphs. Suppose they are shattered. Then, every subset of three points with nondecreasing x_i is also shattered. Hence, we should have t_2 > max(t_1, t_3) and t_3 > max(t_2, t_4) (and also t_2 > max(t_1, t_4) and t_3 > max(t_1, t_4)), which is impossible. We conclude that the considered class is a VC-subgraph class of index 4. To show (2), let us fix ε > 0. Note that F = F_1 + · · · + F_m is an envelope for each of the classes F_j, j = 1, …, m. Consider the covering numbers N_j = N(ε ‖F‖_{Q,2}, F_j, L_2(Q)) for some probability measure Q. An element f ∈ F can be written as a sum f_1 + · · · + f_m with f_j ∈ F_j. Hence, N(m ε ‖F‖_{Q,2}, F, L_2(Q)) ≤ ∏_{j=1}^m N_j. Using Theorem 2.6.7 of van der Vaart and Wellner (1996), it follows that there exist universal constants K_j > 0, j = 1, …, m, such that the covering bound holds for any probability measure Q with ‖F_j‖_{Q,2} > 0 for j = 1, …, m, where V_j ≥ 1 is an upper bound for the VC-index of the class F_j. Now, note that F is also an envelope for F. Using the notation of van der Vaart and Wellner (1996), set J(δ, F) accordingly, where the supremum is taken over all probability measures Q as above. Then, by the calculations above, the entropy bound holds for any η > 0, in particular for η = 1. By Theorem 2.14.1 of van der Vaart and Wellner (1996), we conclude that there exists a universal constant M̃ > 0 such that the stated bound holds, and the result follows by taking M = max(M̃^2 J(1, F)^2, 1). Now, we show (3). Let m_θt − m_θ0t be an element of M_t,δ. We can write it as a sum of terms g_{j,t}, j = 1, …, 6, where each g_{j,t} belongs to the class I defined in the first statement of the theorem. On the other hand, we have that ‖g_{j,t}‖_∞ ≤ F_{j,t}, j = 1, …, 6, with F_{j,t} defined in (2.7). The claim follows from (2).
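The combinatorial step, that four points with alternating heights cannot be shattered by subgraphs of the functions c 1_(a,b], can be checked by brute force. The sketch below enumerates a hypothetical grid of (a, b, c) values; for points with positive heights, (x, t) lies in the subgraph of c 1_(a,b] exactly when a < x ≤ b and t < c.

```python
from itertools import product

# Four points with x increasing and heights zig-zagging (t2 > t1, t3 < t2,
# t4 > t3), the pattern the proof of Theorem 2.3 shows cannot be shattered.
pts = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0), (4.0, 2.0)]

cuts = [0.5, 1.5, 2.5, 3.5, 4.5]  # candidate interval endpoints a, b
levels = [0.5, 1.5, 2.5]          # candidate heights c

patterns = set()
for a, b, c in product(cuts, cuts, levels):
    # subset of points picked out by the subgraph of c * 1_(a, b]
    picked = frozenset(i for i, (x, t) in enumerate(pts) if a < x <= b and t < c)
    patterns.add(picked)

shattered = len(patterns) == 16  # all 2^4 subsets would be needed
```

In particular, the subset {(x_2, t_2), (x_4, t_4)} is unattainable: any interval containing x_2 and x_4 contains x_3, and any level c above t_2 also exceeds t_3. The grid is coarse but exhausts the equivalence classes of (a, b, c) for these four points.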
Proof of Theorem 2.2. Fix δ ∈ (0, 1]. Before returning to the inequality in (2.6), we will find upper bounds for F_{t,j} for j = 1, …, 6. We have the first bound using symmetry of g_0 and max(π_0, 1 − π_0) ≤ 1, together with the stated inequality. By assumption, g_0 changes direction of monotonicity only a finite number of times. This implies that the displayed bound holds for δ > 0 small enough. To see this, use the fact that the interval [t − δ, t + δ] has length 2δ and hence should be included in a bigger interval on which g_0 is either increasing or decreasing. In the first case, g_0(x) ≤ g_0(t + δ), whereas in the second g_0(x) ≤ g_0(t − δ). Hence, we use again the inequality above.

Similarly, we get a matching bound. The calculations above imply a bound in which D > 0 is a constant depending only on g_0. In a very similar manner, it can be shown that the same bounds apply to ∫_R ‖F_{t,i}‖_{P,2}^j dt for i = 2, 3, 4 and j = 1, 2. Now, we turn to F_{t,5} and F_{t,6}. The assumption that g_0 changes direction of monotonicity only a finite number of times implies that we can find −∞ < A < B < ∞ such that g_0 is increasing on (−∞, A) and decreasing on (B, ∞). Furthermore, we know also that there exists M > 0 such that g_0 ≤ M on [A, B]. Hence, we obtain a bound with a constant D_1 > 0 depending only on g_0 and μ_2^0 − μ_1^0. As the second term in (4.12) can be handled similarly, it follows that there exists a constant D̄ > 0, depending on g_0 and μ_2^0 − μ_1^0 only, such that ∫_R ‖F_{t,5}‖_{P,2}^j dt ≤ D̄^j δ^j for j = 1, 2. A similar bound applies to ∫_R ‖F_{t,6}‖_{P,2}^j dt. Now, we turn to the inequality in (2.6). It follows from the calculations above that, for any t ∈ R, the class M_{t,δ} is VC of index at most 4 and with a finite envelope. It follows from Theorem 2.3, the c_r-inequality and the bounds obtained above that there exists a constant C_1 > 0 depending only on g_0 such that, for all δ ∈ (0, 1], the bound (4.13) on the maximal expectation of the first term on the right side of (2.6) holds. To tackle the second term, note that boundedness of g_0 implies that there exists a constant C > 0, depending only on g_0(0) and μ_2^0 − μ_1^0, such that P|m_θt − m_θ0t| ≤ Cδ for all δ ∈ (0, 1] and t ∈ R. Indeed, we have that P|m_θt − m_θ0t| ≤ ‖F_{t,1}‖_{P,1} + · · · + ‖F_{t,6}‖_{P,1}, where calculations similar to those developed above bound ‖F_{t,1}‖_{P,1}, and similar bounds apply to ‖F_{t,2}‖_{P,1}, ‖F_{t,3}‖_{P,1} and ‖F_{t,4}‖_{P,1}, as well as to ‖F_{t,5}‖_{P,1} and ‖F_{t,6}‖_{P,1}.
It follows that there exists C > 0, depending only on g_0 and μ_2^0 − μ_1^0, such that the stated bound holds. To bound the maximal expectation of the third term in (2.6), we use the Cauchy-Schwarz inequality, together with the fact that there exists a constant C̃ > 0, depending only on g_0 and μ_2^0 − μ_1^0, such that ∫_R ‖F_{t,j}‖_{P,1} dt ≤ C̃ δ.
If we focus on the function t ↦ g(μ_1 − μ_1^0 − t), we see that −t − δ ≤ μ_1 − μ_1^0 − t ≤ δ − t. Hence, using the assumption that g changes monotonicity only a finite number of times, we obtain for δ small enough a bound by a function which is integrable. We can bound the remaining functions in a similar manner.
We conclude that f(μ_1 − t) + f(μ_1 + t) is bounded above by a nonnegative and integrable function. It follows that D^2 admits a continuous first partial derivative with respect to μ_1 in (μ_1^0 − δ, μ_1^0 + δ). Furthermore, differentiating once more, we used the fact that f is bounded. The function on the right side is integrable since, for any real a, we have ∫_0^∞ (1 − F(a + |t|)) dt < ∞ and ∫_R F(a − |t|) dt < ∞, a consequence of the integrability of X ∼ F (implied by the integrability of Y ∼ G). Also, we have an expression which we have already shown to be bounded above by an integrable function. We conclude that D^2 admits a continuous second partial derivative with respect to μ_1 in (μ_1^0 − δ, μ_1^0 + δ). Similar arguments can be used to show that D^2 admits a continuous second partial derivative with respect to μ_2 in (μ_2^0 − δ, μ_2^0 + δ). Finally, we compute the crossed partial derivative of ψ with respect to π and μ_1.