A New Estimator: Median of the Distribution of the Mean in Robustness

In some statistical methods, the statistical information is provided in terms of the values taken by classical estimators, such as the sample mean and sample variance. These estimations are then combined in a second stage, usually in a classical manner, into a single value, such as a weighted mean. Moreover, in many applied studies, the results are given in these terms, i.e., as summary data. In all of these cases, the individual observations are unknown; therefore, it is not possible to compute the usual robust estimators with them in order to replace the classical non-robust estimations with robust ones. In this paper, the use of the median of the distribution F_x̄ of the sample mean is proposed, assuming a location-scale contaminated normal model, where the parameters of F_x̄ are estimated with the classical estimations provided in the first stage. The estimator so defined is called the median of the distribution of the mean, MdM. This new estimator is applied in Mendelian randomization, defining the new robust inverse-weighted estimator, RIVW.


Introduction
In the application of some statistical methods, such as clinical trials, the results are usually described in terms of the values taken by classical estimators, such as the sample mean and sample variance. These results are combined, in a second stage, as a weighted mean in a meta-analysis. The same occurs in its alternative, Mendelian randomization, one of the main topics in causal inference.
Moreover, in many applied studies, the results have been described in these terms, i.e., as summary data: the individual observations are unknown, so robust estimators cannot be computed from them to replace the classical non-robust estimations with robust ones.
In this paper, a solution to this problem is proposed, correcting, if necessary, the given classical estimations: although the individual observations are unknown, the mechanism that generates the data is known, because it is the model.
Focusing on the mean estimation problem, when no outliers exist in the sample and the normal distribution N(µ, σ²) is assumed as the model, with µ and σ² being the usual parameters of the normal distribution (population mean and variance), the optimal estimator (uniformly minimum variance unbiased estimator) is the sample mean. Assume now that a proportion ε of outliers exists in the sample, i.e., a contaminated normal model (see [1], p. 2)

(1 − ε) N(µ, σ²) + ε N(g₁µ, g₂²σ²),

where most of the data are from a N(µ, σ²) and a small part of them, ε, are from a normal model with a different location and more dispersion, N(g₁µ, g₂²σ²), where g₁ is a contamination parameter that affects the location and g₂ is a contamination parameter that affects the scale. Under contamination, the optimality of the sample mean is lost because the optimal procedure and its properties heavily depend on the assumed probability model ([2], p. 2). This is the reason why classical statistics rests, basically, on the normal model and on the sample mean.
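As an illustration of this sensitivity, the following sketch (with illustrative parameters, not taken from the paper) draws observations from the contaminated normal model and compares the sample mean of a clean sample with that of a contaminated one:

```python
import random

def sample_contaminated_normal(n, mu, sigma, eps, g1, g2, rng):
    """Draw n observations from (1 - eps)*N(mu, sigma^2) + eps*N(g1*mu, (g2*sigma)^2)."""
    data = []
    for _ in range(n):
        if rng.random() < eps:
            data.append(rng.gauss(g1 * mu, g2 * sigma))  # contaminating component
        else:
            data.append(rng.gauss(mu, sigma))            # central component
    return data

rng = random.Random(0)
clean = sample_contaminated_normal(10_000, mu=5.0, sigma=1.0, eps=0.0, g1=1.0, g2=1.0, rng=rng)
contam = sample_contaminated_normal(10_000, mu=5.0, sigma=1.0, eps=0.1, g1=3.0, g2=3.0, rng=rng)

mean = lambda xs: sum(xs) / len(xs)
print(mean(clean))   # close to mu = 5
print(mean(contam))  # pulled toward g1*mu = 15 by the 10% contamination
```

Even a 10% contaminating component with a shifted location visibly biases the sample mean, which motivates the estimator proposed below.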
Additionally, under a contaminated normal model, the robustness of the sample mean is lost [1,3]. Under this model, the sample mean is not the maximum likelihood estimator [4], and even the normality of the sample mean is not guaranteed [5].
In this paper, a new estimator for a location-scale contaminated normal model is proposed, avoiding the extreme sensitivity of the sample mean but coinciding with it when no outliers are present in the data. The median of the distribution F_x̄ of the sample mean is proposed as a new estimator, where the parameters of F_x̄ are estimated with the classical estimations described in previous studies. This estimator is called the median of the distribution of the mean, MdM.
This new estimator relies on the distribution of the sample mean for two reasons: first, the classical estimations are given in terms of the classical mean (and classical variance); second, this new procedure extends the classical one in the sense that, if no outliers are present, the new estimator is the classical sample mean, i.e., with this method, the classical estimation is extended to the case in which outliers are present.
Another estimator somewhat related to the MdM is the median of means estimator, MoM. However, this estimator is, in the end, one of the sample means and, hence, is not robust (see [6]).
With the MdM, robustness is obtained, as well as optimality when there are no outliers. Hence, this approach provides a new vision of the dilemma between optimality and robustness.
Because the exact sampling distribution of x̄ under a mixture distribution is not known, here it is estimated in a closed form with the von Mises (VOM) plus saddlepoint (SAD) method, a technique used by the author in several studies (see, for instance, [7,8]), although in another context. With this approximation, the estimator introduced in this paper can also be extended to models more general than the normal mixture considered here.
The rest of the paper is structured as follows. In Section 2, the VOM+SAD approximation for the distribution of the sample mean is obtained under a location-scale contaminated normal model. The definition and some properties of this new location estimator are considered in Section 3. A scale estimator based on these ideas is defined in Section 4, and an example of the application of this new estimator is presented in Section 5. These ideas are applied to Mendelian randomization in Section 6. Some conclusions are outlined in Section 7.

VOM+SAD Approximation of Sample Mean Distribution
Because the new estimator depends on the distribution of the sample mean, this distribution must be approximated very precisely, especially when the considered sample sizes are very small. In this situation, using a von Mises expansion ([9], p. 215, or [10], p. 578) that depends on Hampel's influence function [11] is highly recommended.
Although, in the end, the obtained results will be applied to the mixture-of-normals model considered previously, they refer to more general models, F, G, and H, which suggests future extensions of this method.
The final approximation is called VOM+SAD and was previously obtained by the author in the context of spatial data (see [7,8]). Following the ideas developed in those two papers, and considering the tail probability functional, the approximation (1) is obtained, which allows the distribution of T_n when the observations follow model F to be approximated by the distribution of T_n when the variables of T_n follow model G (the pivotal distribution).
This approximation depends on the tail area influence function, TAIF, defined in [12].

VOM+SAD Approximation for the Distribution of the Sample Mean
In the particular case of the sample mean, the score function is ψ(x, t) = x − t. Recall that, in the VOM+SAD approximation, the saddlepoint is computed under G = N(µ, σ²). Under this pivotal distribution, the moment generating function and the saddlepoint equation are obtained in a closed form, and so is the quotient in the last term on the right side of (1); hence, the VOM+SAD approximation (1) takes an explicit closed form. If the distributions F and G are not close enough, intermediate distributions can be considered, as in [16–18], to obtain a more accurate approximation.

Estimator Median of the Distribution of the Mean
Given the previous approximation of the distribution F_x̄ of the mean, the median of this distribution, i.e., F_x̄⁻¹(1/2), is called the median of the distribution of the mean, MdM; i.e., this estimator is the solution of F_x̄(MdM) = 1/2. The parameters of F_x̄ are estimated with the classical estimations, the sample mean x̄ and the sample variance s². Figures 1–3 show that, as the contamination parameters ε, g₁, or g₂ increase, the difference between MdM and x̄, i.e., z₀, increases.
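The paper obtains F_x̄ in closed form via the VOM+SAD approximation; purely as an illustration of the definition, the following sketch replaces that closed form with a naive Monte Carlo estimate of the median of the distribution of the sample mean under the plug-in contaminated model (the function name and parameter values are illustrative, not from the paper):

```python
import random
import statistics

def mdm_monte_carlo(xbar, s2, n, eps, g1, g2, reps=20_000, seed=0):
    """Monte Carlo stand-in for MdM: the median of simulated sample means under
    the plug-in contaminated model (1-eps)*N(xbar, s2) + eps*N(g1*xbar, g2^2*s2).
    (The paper uses a closed-form VOM+SAD approximation of F_xbar instead.)"""
    rng = random.Random(seed)
    sigma = s2 ** 0.5
    means = []
    for _ in range(reps):
        total = 0.0
        for _ in range(n):
            if rng.random() < eps:
                total += rng.gauss(g1 * xbar, g2 * sigma)  # contaminating component
            else:
                total += rng.gauss(xbar, sigma)            # central component
        means.append(total / n)
    return statistics.median(means)

# With no contamination (eps = 0) the estimator reduces to the sample mean:
print(mdm_monte_carlo(xbar=16.6, s2=3.61**2, n=28, eps=0.0, g1=1.0, g2=1.0))
```

The eps = 0 call illustrates the extension property: with no contamination, the median of the distribution of the mean is (up to simulation noise) the classical sample mean itself.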
The main reason for the definition of the MdM is that the median is more robust than the sample mean; hence, the influence of possible outliers (whose individual values are unknown, as assumed here) should be lower with the median of the distribution of the mean than with the sample mean used in this distribution as an estimator of the location parameter. Furthermore, in the case without outliers, this estimator is equal to the classical sample mean. As a limitation, observe that the MdM is also sensitive if outliers have already affected the sample mean or sample variance used in the estimation of the location or scale parameter, µ or σ². Nevertheless, with the MdM, this sensitivity is lower.
Finally, in future research, other robust estimators could be considered, such as the trimmed mean of the distribution of the sample mean.

Dispersion Estimator
With the ideas developed in this paper, a dispersion estimator can be defined in the same way.

Example
In most application papers, only the final values of the estimators used in them are given. Additionally, these estimators are usually the classical sample mean and sample variance, and the papers do not include the individual observations from which these estimators were obtained, therefore not providing the opportunity to robustify these values using robust techniques.
For this reason, a large number of examples could serve as an illustration of the estimator defined in this paper. Next, let us consider just one.

Example 1.
One of these studies is [19], where the vertebral columns and thoraxes of some Neanderthal fossils were re-evaluated using their vertebrae because, as stated by the author, errors probably occurred in the reconstruction and the samples were wrongly classified. He mentions ([19], p. 23) a misclassification of 7/33, which can be considered as the value of the contamination parameter ε.
Because modern humans and Neanderthals have very similar vertebrae, no difference in the mean is assumed, hence using a distortion factor g₁ = 1. On the other hand, Neanderthals are slightly stockier than modern humans, with a larger dispersion, so it is assumed that g₂ = 1.5.
In Table 2 in [19], classical acceptance confidence intervals are provided for several vertebrae of 28 modern humans. They are based on the classical mean and variance, as the author states in this table. From the table, with respect to vertebra T1, the remains of Kebara 2 and La Ferrassie can be considered modern humans instead of Neanderthals because they fall inside the confidence interval. The same happens with vertebra T7 but not with vertebra T5.
From these classical intervals, for vertebra T1, the classical sample mean and standard deviation are x̄ = 16.6 and s = 3.61, respectively. In this case, the estimator median of the distribution of the mean takes the value MdM = 15.034, and the new robust acceptance confidence interval is [13.63, 16.43], which does not contain the remains; it can then be concluded that these remains are Neanderthals and not modern humans, as they were wrongly classified with the classical estimators.
With respect to vertebra T5, MdM = 17.54, and the new robust acceptance confidence interval is [15.84, 19.24], with neither the classical nor the robust interval including the remains, confirming that they are Neanderthals.
Finally, for vertebra T7, MdM = 19.43, and the new robust acceptance confidence interval is [17.73, 21.13]; both this robust interval and the previous classical one include the remains of La Ferrassie, which hence correspond to a modern human and not a Neanderthal.

Robust Inverse-Weighted Estimator RIVW in Mendelian Randomization
Another field for the class of problems considered in this paper is randomized clinical trials (CTs). In each of these CTs, the sample mean and sample variance are the usual final result. These are usually combined, in a classical way, as a weighted mean in a meta-analysis. In CTs, the relationship of a variable X (called the cause) with another variable Y (called the effect) is analyzed, but reverse causality may exist, complete randomization may be lacking, or, more importantly, confounders may be present.
Moreover, CTs are expensive and take a long time. With Mendelian randomization (MR), a method that has received renewed interest in recent years, CTs are imitated because, in any person, all genetic material is randomly allocated from their parents, including DNA markers. Randomly, some people receive more DNA markers related to variable X and others fewer. MR uses genetic variants (usually single-nucleotide polymorphisms, SNPs) as instrumental variables Z.
Mathematically, MR is used to avoid possible biases in the regression of Y on X due to the three causes just mentioned. Formally, MR leads to a two-step linear regression process: first, for every genetic variant Z_j, j = 1, ..., L, a linear regression of X on Z_j is performed for individuals i = 1, ..., n_j, from which the fitted values X̂_i are obtained and used in a second regression of Y on these X̂, finally obtaining [20] the parameters β_Xj and β_Yj, which represent the association of Z_j with the exposure and the outcome (only through X), respectively. The product β · β_Xj represents the effect of Z_j on Y through X, where β is the causal effect of X on Y that is being estimated. Moreover, α_j represents the association between Z_j and Y not through the exposure of interest. Finally, the error terms e_Xij and e_Yij are assumed to be independent because independent samples are assumed to be used to fit the two regression models.
In MR, the standard estimator of the parameter of interest β, the slope in the linear regression of Y on X, is the classical two-stage least squares estimator

β̂_Rj = β̂_Yj / β̂_Xj,

the quotient of the slope of the regression of Y on Z_j, β̂_Yj, and the slope estimator of the regression of X on Z_j, β̂_Xj. These classical estimations, one for each value of the instrumental variable Z_j, are combined with the classical inverse-variance weighted (IVW) estimator, the weighted mean

IVW = (∑_{j=1}^{L} ω_j β̂_Rj) / (∑_{j=1}^{L} ω_j),

where the weights ω_j = 1/var(β̂_Rj) are used to weight the β̂_Rj estimators, assuming that the L genetic variants are mutually independent. In this way, a single causal effect estimate is obtained from the L genetic instruments.
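A minimal numerical sketch of these two steps, with made-up summary statistics (not from the paper), combines per-variant ratio estimates by inverse-variance weighting and also illustrates how fragile the weighted mean is to a single corrupted summary value:

```python
# Illustrative summary statistics (made up, not from the paper).
beta_X = [0.12, 0.08, 0.15]      # slope estimates of X on Z_j (first stage)
beta_Y = [0.060, 0.041, 0.074]   # slope estimates of Y on Z_j
se_R   = [0.10, 0.15, 0.12]      # standard errors of the per-variant ratios

# Per-variant two-stage ratio estimates of the causal effect beta.
beta_R = [by / bx for by, bx in zip(beta_Y, beta_X)]

# Classical IVW estimator: weighted mean with omega_j = 1 / var(beta_R_j).
w = [1.0 / se ** 2 for se in se_R]
ivw = sum(wj * bj for wj, bj in zip(w, beta_R)) / sum(w)
print(ivw)  # close to 0.5 for these numbers

# A single corrupted summary value drags the weighted mean arbitrarily far,
# illustrating the 0% breakdown point of a weighted mean.
beta_R[0] = 50.0
ivw_bad = sum(wj * bj for wj, bj in zip(w, beta_R)) / sum(w)
print(ivw_bad)
```

Because the weights are fixed and positive, a single outlying β̂_Rj can move the IVW estimate without bound, which is precisely the lack of robustness the RIVW is designed to address.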
This classic and widely used estimator is not robust: being a weighted mean, it has a 0% breakdown point (see, for instance, [21]).
In this section, the robustification of the classical IVW estimator is obtained, first, by replacing the estimators β̂_Rj with the median of the distribution of the mean estimators, MdM_j, and, second, by replacing the weights ω_j with v_j, the inverse of the new dispersion estimator.

Distribution of the β̂_Rj Estimator
In this section, an approximation for the distribution of β̂_Rj is obtained for each genetic variant Z_j, j = 1, ..., L; i.e., j is fixed. Moreover, because of the usual regression assumptions, Z_ij is not random in the two previous linear regressions, i.e., in the estimator β̂_Rj.
Hence, with µ_Xi denoting the corresponding constant (avoiding the j in the notation of µ_Xi to simplify it), with µ_Yi being the analogous constant, and assuming no outliers in the sample, the variable X_i | Z_ij follows a normal distribution; considering standardized data, i.e., computing β̂_Rj as a quotient of correlations, the estimator can be written as a sample mean. Letting W_i (removing the j if there is no risk of confusion) denote this random variable, i = 1, ..., n_j, where a and Z_ij are not random, the aim is to compute the distribution of the sample mean W̄ of the variables W_i at 0, where the W_i are independent but not identically distributed. The values of these parameters are given from previous studies, following the median of the distribution of the mean method.
If the data contain no outliers, it will be W_i | Z_ij ≡ N(µ_i, σ_i²) but, as usual, a proportion ε of outliers in the data is assumed, i.e., the following model for the observations W_i:

F_i = (1 − ε) N(µ_i, σ_i²) + ε N(g_i1 µ_i, g_i2² σ_i²),

where the contamination constants g_i1 and g_i2 are assumed to depend on i = 1, ..., n_j.
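To make this model concrete, the following Monte Carlo sketch (with hypothetical per-observation constants, not those used in the paper) simulates independent, non-identically distributed W_i from such a mixture and estimates the tail probability at 0 that the VOM approximation below targets:

```python
import random

def simulate_wbar_tail(mus, sigmas, eps, g1s, g2s, reps=20_000, seed=1):
    """Estimate P(Wbar <= 0) when, independently but not identically,
    W_i ~ (1-eps)*N(mu_i, sigma_i^2) + eps*N(g1_i*mu_i, (g2_i*sigma_i)^2)."""
    rng = random.Random(seed)
    n = len(mus)
    hits = 0
    for _ in range(reps):
        total = 0.0
        for mu, sg, g1, g2 in zip(mus, sigmas, g1s, g2s):
            if rng.random() < eps:
                total += rng.gauss(g1 * mu, g2 * sg)  # contaminated observation
            else:
                total += rng.gauss(mu, sg)            # central observation
        if total / n <= 0.0:
            hits += 1
    return hits / reps

# Hypothetical constants for n_j = 10 observations (illustrative only):
n = 10
p = simulate_wbar_tail(mus=[0.05] * n, sigmas=[0.3] * n,
                       eps=0.05, g1s=[1.0] * n, g2s=[1.5] * n)
print(p)
```

Simulation of this kind only serves as a sanity check; the point of the VOM approximation is to obtain this tail probability in a closed form, which is what the weighting scheme of the RIVW requires.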
To compute the distribution of W̄ under the models F_i, assuming that the sample sizes n_j are small, a von Mises approximation (VOM), based on a von Mises expansion, is used to obtain an accurate approximation with small sample sizes.

VOM Approximation of the Distribution
In general, to approximate the tail probability of a statistic T_n under a vector of model distributions F = (F_1, ..., F_n), knowing its tail distribution under the vector of model distributions G = (G_1, ..., G_n) (called pivotal distributions), the von Mises expansion of the tail probability of T_n(X_1, X_2, ..., X_n) at F is used ([10], Section 2, or [22], Theorem 2.1, or [17], Corollary 2):

P_F{T_n(X_1, X_2, ..., X_n) > t} = P_G{T_n(X_1, X_2, ..., X_n) > t} + ∑_{i=1}^{n} ∫ TAIF_i(x; t; T_n, G) dF_i(x) + Rem,    (2)

where the remainder involves the second derivative of the tail probability functional at the mixture distribution G_F = (1 − λ)G + λF, for some λ ∈ [0, 1]; and TAIF_i is the ith (multivariate) partial tail area influence function of T_n at G = (G_1, ..., G_n) in relation to G_i, i = 1, ..., n, introduced in [17], Definition 1, defined in those x ∈ X where the right-hand side exists. In the computation of TAIF_i, only G_i is contaminated; the other distributions remain fixed, i = 1, ..., n.
In the particular case that T_n(X_1, X_2, ..., X_n) = W̄, the VOM approximation for the tail of W̄ follows from (2), with n = n_j, j = 1, ..., L, and t = 0 now, or, equivalently, from (4). Assuming the previous contaminated model for the observations W_i, the generic component of this last equation can be written in terms of Φ, the cumulative distribution function of a standard normal distribution, because F_i is a normal mixture. Moreover, because of property (3) for the partial influence functions mentioned before, and after a change of variable, the VOM approximation to the distribution of β̂_Rj is obtained in a closed form.

Example 2. In a study [24], whether low-density lipoprotein cholesterol (LDL-C) is a cause of coronary artery disease (CAD) was analyzed considering 28 DNA markers. With the method proposed in this paper, considering sample sizes of n = 37, n_1 = 17, n_2 = 10, and n_3 = 10, and contamination parameters ε = 0.05, g_i1 = 1, and g_i2 = 1.5, for the first DNA marker, µ_s = µ_1 + ... + µ_{n_j} = 30 × (0.0677 − a × 0.0260) is obtained.

Conclusions
In this paper, a new method for estimating the parameters of a location-scale contamination model is introduced for the case in which individual observations are not available and, therefore, applying the usual robust methods is not possible, i.e., for summary data problems.
For the location problem, a new estimator was defined that is equal to the usual sample mean when no outliers exist and that corrects the classical estimations when outliers exist.
This new estimator was applied to one of the most used estimators in Mendelian randomization, the inverse-variance weighted estimator (IVW), defining the new robust inverse-weighted estimator (RIVW).
Funding: This study was partially supported by grant PID2021-124933NB-I00 from the Ministerio de Ciencia e Innovación (Spain).

Data Availability Statement: Not applicable.