The signed-rank estimator for nonlinear regression with responses missing at random

Abstract: This paper is concerned with the study of the signed-rank estimator of the regression coefficients under the assumption that some responses are missing at random in the regression model. Strong consistency and asymptotic normality of the proposed estimator are established under mild conditions. To demonstrate the performance of the signed-rank estimator, a simulation study is conducted under different settings of the model error's distribution, and shows that the proposed estimator is more efficient than the least squares estimator whenever the error distribution is heavy-tailed or contaminated. When the model error follows a normal distribution, the simulation experiment shows that the signed-rank estimator is more efficient than its least squares counterpart whenever a large proportion of the responses are missing.


Introduction
Nowadays the missing data problem is a pervasive one in almost all statistical studies; it occurs for several reasons in a large array of applications, such as equipment malfunction, unfavorable weather conditions, incorrect entry of the data, or non-participation of individuals due to illness. Examples of missing data problems include: users of a recommender system rate only an extremely small number of the available books, movies, or songs, leading to massive amounts of missing data; in a clinical trial, participants may drop out during the course of the study, leading to missing observations at subsequent time points; while diagnosing a patient, a doctor may not order all applicable tests; a sensor in a remote sensor network may be damaged and cease to transmit data; certain regions of a gene microarray may fail to yield measurements of the underlying gene expressions due to scratches, fingerprints, or manufacturing defects; etc.
The missingness scenario to be considered in this paper is that of the missing response in the context of regression analysis, which often arises in various experimental settings such as opinion polls, socioeconomic investigations, medical studies and market research surveys.
This problem has been intensively investigated in the literature and appears to be a very challenging task. The challenge comes from the fact that, in most cases, the data contain little or no information about the missing data mechanism (MDM). The most common assumption used for the MDM, which is also employed in this paper, is the missing at random (MAR) assumption, whose details can be found in [35].
When the regression function f is taken to be a linear function of θ, regression models in the framework of missing data have been studied by many authors, such as [6,21,46,45,30,52,50,51,42,31], among others. Considerable work has been done using different techniques, but most of it relies on the least squares approach for estimating the regression parameters.
In this paper, we consider the study of a general signed-rank estimator of the regression coefficients, obtained as a minimizer of an objective function D_{jn}(θ) defined in the next section, under the assumption that some responses are missing at random. The motivation for using the signed-rank approach comes from the fact that it provides robust estimators compared to the more common LS approach in situations where there are outlying observations in the response space, and/or where the error distribution is heavy-tailed. A particular case of the signed-rank approach, the LAD estimator, has been studied by [28] and [14]. Although robust in many situations, the LAD has been shown to be less efficient than the estimator commonly referred to as the signed-rank estimator; moreover, the general form of the signed-rank estimator, defined as a minimizer of D_{jn}(θ), has been extensively studied in the literature by authors such as [16,3] and [1] for models with i.i.d. errors. Applications of this approach to factorial designs under dependent observations are given in [4]. As the signed-rank procedure is based on rank scores, an important approach using such rank scores in estimating the true nonlinear regression parameter is developed in [17] and [18] for i.i.d. errors and in [25] for time-series regression models. [25] provided a comprehensive review of different estimation techniques in the literature, including M-estimation, R-estimation and L-estimation. In the context of missing data under the MAR assumption and with f taken to be a nonlinear function of θ, [8] considered the LAD estimator of the regression coefficients and established its asymptotic properties. Another important piece of work in the framework of missing data is given in [26], which provided an estimator of a function of the mean response with responses missing at random and also established its asymptotic properties.
In the framework of general nonlinear regression with responses missing at random, to the best of our knowledge, no asymptotic results have been provided for the general signed-rank estimator. The purpose of this paper is to provide conditions needed for the strong consistency and the √n-asymptotic normality of the resulting estimator, toward statistical inference when some responses in the regression model are missing at random. To measure the accuracy of the proposed estimator, we run a Monte Carlo simulation to obtain mean squared errors (MSEs) of the signed-rank (SR) estimator and compare them to those obtained via least squares (LS) estimation. The simulation study shows that the signed-rank approach outperforms the least squares approach under different settings. This remains true even when the error distribution is normal, provided there is a high rate of missing information.
To this end, the rest of the paper is organized as follows: in Section 2, we define the model, discuss the missing mechanism, motivate the imputation procedure and describe the estimation procedure for the regression coefficients. Sections 3 and 4 are concerned with the asymptotic properties of the signed-rank estimator. The settings of the Monte Carlo simulation are presented in Section 5, with Section 6 providing a discussion of the simulation results. A conclusion summarizing the findings of the paper is given in Section 7. Finally, Appendix A contains complete proofs and some sketches of proofs of the theorems presented in the paper.

Model definition and estimation
Consider the nonlinear regression model

y_i = f(x_i, θ_0) + ε_i,  i = 1, 2, ..., n,   (2.1)

where θ_0 ∈ Θ ⊂ R^p is the true vector of parameters, the x_i's are i.i.d. p-variable random covariate vectors, and the function f is twice continuously differentiable as a function of θ and measurable as a function of x. Θ is assumed to be compact, E(ε_i | x_i) = 0 and var(ε_i | x_i) < ∞. In this paper, we are interested in inference on the true parameter θ_0 when there are missing responses in the nonlinear model (2.1). Specifically, we consider the case where some values of y in the sample of size n may be missing, but x is fully observed. That is, we obtain the incomplete observations {y_i, δ_i, x_i}, i = 1, 2, ..., n, from (2.1), where the x_i's are observed and

δ_i = 0 if y_i is missing, and δ_i = 1 otherwise.
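To fix ideas, the following minimal R sketch generates data of the form produced by model (2.1) with responses missing at random. The mean function f, the true parameter theta0, and the logistic form of the propensity Delta(x) below are illustrative choices made for this sketch only, not the settings used in the paper.

```r
## Minimal sketch: data from a nonlinear model y_i = f(x_i, theta0) + eps_i
## with responses missing at random (MAR).  All specific choices below
## (f, theta0, Delta) are hypothetical.
set.seed(1)

n      <- 200
theta0 <- c(5, 0.5)                         # hypothetical true parameter
f      <- function(x, theta) theta[1] * exp(theta[2] * x)

x   <- rnorm(n, mean = 1, sd = 1)           # fully observed covariate
eps <- rt(n, df = 3)                        # heavy-tailed model error
y   <- f(x, theta0) + eps

## MAR propensity: P(delta = 1 | x) depends on x only (illustrative logistic form)
Delta <- function(x) plogis(0.5 + 0.8 * x)
delta <- rbinom(n, size = 1, prob = Delta(x))
y_obs <- ifelse(delta == 1, y, NA)          # responses with missing values

head(data.frame(x, y_obs, delta))
```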
One major problem in dealing with missing data is that it is usually unknown how nonresponse for each variable is generated; that is, the distribution P(δ = 1 | y, x), referred to as the nonresponse mechanism, is unknown. It is often necessary to make assumptions about this distribution, which are almost impossible to verify [19,27,22,24]. To this end, we assume that y is missing at random (MAR). The MAR assumption implies that δ and y are conditionally independent given x, that is, P(δ = 1 | y, x) = P(δ = 1 | x). Although usually very difficult to test in practice, this is a common assumption for statistical analysis with missing data and is reasonable in many practical situations; see Chapter 1 of [23]. Under the MAR assumption, [45] developed inference tools for the mean of y in the missing response case based on the least squares estimation approach, and [47] studied the least squares estimator of the regression coefficient θ in the semi-parametric linear model. One method for constructing a confidence interval for the true mean of y is the empirical likelihood method introduced by [29]. This approach has been used to investigate a variety of statistical problems by [13,6,21,30,52,50,51], just to mention a few. These works demonstrate that the method of empirical likelihood has a number of advantages over methods such as those based on normal approximations or the bootstrap. Studies that used this approach to study (2.1) under the MAR assumption include [46,45,42] and [49]. Other missing mechanisms include missing completely at random (MCAR) and missing not at random (MNAR), which are discussed in [36].

Imputation in general makes use of a number of auxiliary variables that are statistically related, by means of an imputation model, to the variable in which the item-nonresponse occurs [22,37]. The main reason for carrying out imputation is to reduce bias, which occurs because the distribution of the missing values, were it known, generally differs from the distribution of the observed items. When imputation is used properly, it is possible to recreate a balanced design, so that procedures designed for complete data can be applied in many situations. Rather than deleting cases that are subject to item-nonresponse, the sample size is maintained, resulting in potentially higher efficiency than a case-deletion analysis. Imputation usually makes use of observed auxiliary information for cases with nonresponse, thereby maintaining high precision [38]. However, imputation may have a negative impact if the imputed values are not properly used. Since variance estimation is affected by the imputation procedure, adequate estimation requires adjustment methods such as those proposed by [26] to correct for the increase in variability due to nonresponse and imputation. In most regression problems, the commonly used approaches include linear regression imputation [15], nonparametric kernel regression imputation [7,46], and semi-parametric regression imputation [45,47].

Another approach for handling missing data is inverse probability weighting, which has gained considerable attention as a way of dealing with missing data problems. For a discussion of this approach, see [34,54,44,45] and the references therein. As pointed out by [47], for missing data problems, the inverse probability weighting approach usually depends on high-dimensional smoothing for estimating the completely unknown propensity score function. This suffers from the curse of dimensionality, which may restrict the use of the resulting estimator. One way to avoid such a problem is to use the inverse marginal probability weighted method proposed by [45].
To this end, let us introduce the following notation: σ²(x) = E(ε² | x) and Δ(x) = P(δ = 1 | x = x). Taking advantage of the imputation procedures in [45] and [47], that is, simple imputation (j = 1) and inverse marginal probability weighting (j = 2), define

y_{i1} = δ_i y_i + (1 − δ_i) f(x_i, θ),
y_{i2} = (δ_i / Δ(x_i)) y_i + (1 − δ_i / Δ(x_i)) f(x_i, θ),

and set z_{ij}(θ) = y_{ij} − f(x_i, θ), j = 1, 2. Under the MAR assumption and the fact that E(ε | x_i) = 0, it can easily be shown that E(y_{ij} | x_i) = f(x_i, θ_0) when θ = θ_0. Consider the signed-rank objective function

D_{jn}(θ) = (1/n) Σ_{i=1}^{n} ϕ(R_{ij}/(n + 1)) |z_{ij}(θ)|,   (2.2)

where R_{ij} is the rank of |z_{ij}(θ)| among |z_{1j}(θ)|, ..., |z_{nj}(θ)| and ϕ is a differentiable positive function with bounded derivative. It can also be shown that E(z_{ij}(θ) | x_i) = 0. The signed-rank estimator θ_{jn} of θ_0 is the minimizer of D_{jn}(θ), that is,

θ_{jn} = argmin_{θ ∈ Θ} D_{jn}(θ).

From the definition of y_{ij}, θ_{jn} is obtained based on complete observations, that is, ignoring observations with missing responses. Note that D_{jn}(θ) is a norm of the residuals (see [16]) and hence convex as a function of the residuals. The existence of θ_{jn} is ensured by the continuity of D_{jn}(θ) on the compact set Θ. Also, setting ϕ(t) = 1 for all t ∈ (0, 1) in (2.2), θ_{jn} becomes the LAD estimator.
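As an illustration of the estimation procedure, the following R sketch implements the objective function under simple imputation (j = 1) and minimizes it with a general-purpose optimizer. It reuses x, y_obs, delta and f from the previous sketch, and the Wilcoxon-type score ϕ(t) = t is only one admissible choice of score function, not the one prescribed by the paper.

```r
## Minimal sketch of D_1n(theta) under simple imputation, with phi(t) = t.
phi <- function(t) t

D_1n <- function(theta, x, y, delta) {
  # simple imputation: a missing y_i is replaced by f(x_i, theta), so its
  # residual z_i1(theta) = delta_i * (y_i - f(x_i, theta)) is exactly zero
  y_imp <- ifelse(delta == 1, y, f(x, theta))
  z     <- y_imp - f(x, theta)
  r     <- rank(abs(z))                      # R_i1: rank of |z_i1(theta)|
  mean(phi(r / (length(z) + 1)) * abs(z))
}

## Signed-rank estimate: minimize D_1n over theta (the objective is continuous
## in theta but not smooth, so a derivative-free optimizer is used)
fit_sr <- optim(par = c(1, 1), fn = D_1n, x = x, y = y_obs, delta = delta)
fit_sr$par
```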
Remark 2.1. For the complete case, which considers observations without missing responses, [3] gives a brief discussion of the relative efficiency of the SR with respect to the LS and the least absolute deviation (LAD) estimators. It shows that the SR is robust and more efficient than the LS and the LAD when dealing with heavy-tailed model error distributions and/or outliers in the response space. The same paper provides a weighted version of the signed-rank estimator that is robust and more efficient than the LS and the LAD when dealing with high-leverage points (outliers in the design space). It is also well known that when dealing with heavy-tailed error distributions and/or outliers in the response space, the SR is robust and more efficient compared to the LS, the LAD and the MLE; see [16].
In the expression of y_{ij} (j = 2), the function Δ(x) = P(δ = 1 | x) might be unknown and then needs to be estimated. Being a probability and under the assumption that inf_x Δ(x) > 0, one can estimate Δ(x) non-parametrically. To this end, define

Γ_n(x) = [Σ_{i=1}^{n} δ_i K((x − x_i)/h_n)] / [Σ_{i=1}^{n} K((x − x_i)/h_n)],

where K is a kernel function and h_n a bandwidth sequence satisfying h_n → 0 as n → ∞. Γ_n(x) defined above is known as a kernel smoother and is used in the literature for different purposes, including in the machine learning literature under the banner of semi-supervised learning; see [9]. Under assumptions (I_1) and (I_3) given in Theorem 2.1 below, Γ_n(x) converges uniformly almost surely to Δ(x). Replacing Δ(x_i) by Γ_n(x_i) in y_{ij} yields y_{ijn}, and the minimizer of the resulting objective function (2.4) is denoted θ̃_{jn}. Note that the objective function given in Equation (2.4) is defined on dependent residuals, as the imputation procedures introduce a dependence structure. Therefore, the asymptotic properties (consistency and asymptotic normality) of the minimizer of D_{jn}(θ) cannot be established following [3]. [2] established such results, but strong mixing conditions were imposed on the model error. One way to avoid such strong mixing conditions is to establish the following theorem, whose proof is provided in the Appendix.
Theorem 2.1. Under assumptions (I_1)-(I_3), where (I_1) requires K to be a regular kernel function of order r > 2 and the bandwidth h_n to satisfy nh_n → ∞ and nh_n^{4r} → 0 as n → ∞, the estimators θ_{jn} and θ̃_{jn} are asymptotically equivalent.

Remark 2.2. Assumptions (I_1) and (I_3) ensure the almost sure uniform convergence of Γ_n(x) to Δ(x). For practical purposes, the optimal bandwidth can be chosen to lie in the interval [an^{−1/5}, bn^{−1/5}], for 0 < a < b < ∞; more discussion of this fact can be found in [12]. Theorem 2.1 thus stands for the asymptotic equivalence between θ_{jn} and θ̃_{jn}; that is, in a neighborhood of θ_0, they share the same asymptotic properties (strong consistency, √n-consistency and asymptotic normality).
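The kernel smoother Γ_n(x) can be computed along the following lines. This R sketch uses a Gaussian kernel and the bandwidth n^{−1/5} mentioned in Remark 2.2; the truncation of small estimated propensities at the end is an ad hoc safeguard added here for numerical stability, not part of the paper's conditions, and the data objects come from the earlier sketches.

```r
## Minimal sketch of a Nadaraya-Watson-type estimator of Delta(x) = P(delta = 1 | x)
Gamma_n <- function(x0, x, delta, h) {
  w <- dnorm((x0 - x) / h)                   # Gaussian kernel weights
  sum(w * delta) / sum(w)
}

h_n  <- length(x)^(-1/5)                     # bandwidth proportional to n^(-1/5)
Dhat <- sapply(x, Gamma_n, x = x, delta = delta, h = h_n)
summary(Dhat)                                # estimated propensities, between 0 and 1

## residuals under inverse marginal probability weighting (j = 2) at a trial theta,
## with small estimated propensities truncated away from zero
z2 <- ifelse(delta == 1, (y_obs - f(x, theta0)) / pmax(Dhat, 0.05), 0)
```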

Consistency
Let (Ω, F, P) be a probability space. For i = 1, ..., n, assume that ε_i and x_i are independent. Also, let x_i ∼ H, ε_i ∼ G and |z_{ij}(θ)| ∼ G_θ^j. Consider the following assumptions: the conditional density of ε_i given x_i is symmetric about 0 and strictly decreasing on R^+.
These assumptions and (I_4) ensure the uniform convergence in Lemma 3.1. Also, if we restrict ourselves to the consistency result alone, continuity of f is enough.
The following lemma is essential for the proof of Theorem 3.1. Its proof can be constructed along the lines given in [3]; for the sake of brevity, it is not included here.
where Θ* is a closed subset of Θ not containing θ_0.

Remark 3.2. θ_{jn} being also a strongly consistent estimator of θ_0 [1], and Γ_n(x) converging almost surely to Δ(x), it follows that y_{ijn} → y_{ij} as n → ∞.

Asymptotic normality
Setting S_{jn}(θ) = −∇D_{jn}(θ), it is clear that θ_{jn} is a zero of S_{jn}(θ). The following notation is used for the gradient and the Hessian matrix of D_{jn}(θ), respectively. The proof of this theorem is provided in the Appendix and is just a slight modification of that given in [2], taking into account the strong consistency of Γ_n(x). Now assume that the true parameter θ_0 satisfies the condition below. Theorem 4.2. Let τ_{jn} represent the minimum eigenvalue of Σ_{jn}. The proof of this theorem can be constructed along the lines of that of Theorem 5.2 of [4]; hence, for the sake of brevity, it is not included here.
From (4.1), we will further assume condition (I_8), in which W is a positive definite matrix. Hence, we have the following asymptotic normality theorem.

Theorem 4.3. Under the assumptions stated above, θ_{jn} is asymptotically normal, with asymptotic covariance determined by Σ_{jn}, defined by equation (4.2), and by W, obtained from assumption (I_8).

Remark 4.1.
This result allows us to make statistical inferences (confidence intervals and hypothesis tests) about the true regression parameters. To achieve this in practice, the covariance matrix in Theorem 4.3 above needs to be estimated. This turns out to be a complicated task in the nonlinear regression setting, as it depends on the true parameters and on the imputation procedure. One way to address this problem is to use a plug-in or sandwich estimator, as discussed in [3] for the complete case and in [26] for the missing-response case under the MAR assumption. In this regard, it needs to be stressed that standard variance estimation for a point estimator, valid for complete data, may in many cases lead to severe underestimation of the true variance if applied to observed and imputed data [33]. Standard variance estimation techniques are therefore not adequate in the presence of imputation. In general, all imputation methods require adjustments to the variance estimation formula to correctly reflect the additional variability due to nonresponse and imputation. Various methods exist for estimating the variance of an estimator under imputation, including multiple imputation [36,42,20], two-phase approaches [32,40], model-assisted approaches [11,5] and replication methods such as jackknife variance estimators [33,39,53,41]; see also the references therein.
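For illustration only, the sketch below shows one way to obtain standard errors and Wald-type confidence limits by a nonparametric bootstrap that repeats the whole estimation, including the handling of missing responses, in each resample. It reflects the extra variability due to nonresponse in the spirit of the discussion above, but it is not the variance estimator recommended or studied in the paper; the data objects and fit come from the earlier sketches.

```r
## Illustrative bootstrap variance estimate for the signed-rank fit.
B   <- 200
est <- matrix(NA_real_, nrow = B, ncol = 2)
for (b in seq_len(B)) {
  idx      <- sample.int(n, n, replace = TRUE)      # resample (x_i, y_i, delta_i) triples
  fit_b    <- optim(par = fit_sr$par, fn = D_1n,
                    x = x[idx], y = y_obs[idx], delta = delta[idx])
  est[b, ] <- fit_b$par
}
se_boot <- apply(est, 2, sd)                        # bootstrap standard errors
cbind(estimate = fit_sr$par,
      lower = fit_sr$par - 1.96 * se_boot,          # Wald-type 95% limits
      upper = fit_sr$par + 1.96 * se_boot)
```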

Simulation
To illustrate the performance of the signed-rank estimator, a simulation study is conducted using R. In model (2.1), two scenarios for the regression function are considered. In the first scenario, x is generated from the normal distribution with mean 1 and standard deviation 1, and θ = (β_0, β_1)^τ with the true parameter set at θ_0 = (5, 0)^τ. The random error is generated from a set of two different distributions: a contaminated normal distribution CN(ϵ) = (1 − ϵ)N(0, 1) + ϵN(0, σ²), with σ = 3 and ϵ taken to be the proportion of contamination, and the t-distribution with various degrees of freedom (df). In this scenario the sample size is set at n = 30. As pointed out in [20], under the MAR assumption and when the distribution of the model error is fully parametric, it is enough to consider the complete case when estimating the regression parameters, as it results in consistent estimators. The same remark is valid for the approach considered here, as shown in Theorem 3.1 and discussed in Remark 3.1 and Remark 3.2. In practice, unfortunately, with high rates of missingness, it is almost impossible to fully specify the distribution of the model error. Also, when making inference (hypothesis testing, confidence intervals) about the regression parameters, it is often of interest to estimate the asymptotic variance of the corresponding estimators. Such an estimation of the variance turns out to be dependent upon the imputation procedure; see [45,47,42] and [26] for a discussion of this fact. To this end, we show through the simulation below, based on both imputation procedures, that the SR provides more efficient estimators compared to the LS.
To carry out the imputation for missing responses, δ was generated from the Bernoulli(Δ(x)) distribution, where three cases were considered. In Case 1, Δ(x) = η is taken to be a sequence of proportions of no missingness ranging from 50% to 100% in steps of 10%; in Cases 2 and 3, Δ(x) is taken to depend on x. The first case is used to investigate the effect of missing completely at random, as δ is independent of x, while the last two are used for the effect of missing at random, as δ and x are dependent via Δ(x). Also, while all three cases are considered for both imputation procedures, for the inverse marginal probability weighting the propensity is handled in two ways: first, Δ(x) is estimated non-parametrically with a Gaussian kernel (Ker) and, secondly, it is taken to be the logistic propensity score (Log), defined as in [26]. The kernel estimation involves the choice of the bandwidth. This can be done either by a joint minimization of D_{jn}(h, θ) as a function of the bandwidth h and θ, or by a two-step minimization of D_{jn}(h, θ): first choosing h ∈ {h : an^{−1/5} ≤ h ≤ bn^{−1/5}} for a, b > 0 and obtaining the estimator of θ, and then obtaining the estimator of h when θ is replaced by its estimator. As shown in [10], the two minimization procedures are equivalent. Also, as pointed out in [12] and [10], the estimated bandwidth is proportional to n^{−1/5}; so, for simplicity, the bandwidth is set at n^{−1/5} in our computations. From 5000 simulation replicates, MSEs of the signed-rank (SR) estimator of θ_0 are reported and compared to those of the least squares (LS) estimator under different settings. Tables 1-8 summarize the results of the simulation study, which are discussed in the next section.
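The following R sketch reproduces the structure of one cell of such a Monte Carlo experiment, comparing the SR estimator (via simple imputation) with a complete-case LS fit under contaminated normal errors and a MAR response indicator. The mean function f, the objective D_1n and the score from the earlier sketches are reused, and the specific propensity, contamination level, true parameter and starting values are illustrative rather than the paper's exact settings.

```r
## One illustrative cell of a Monte Carlo comparison of SR and LS.
mse_cell <- function(nsim = 500, n = 30, eps_prop = 0.1, theta0 = c(5, 0.5)) {
  err_sr <- err_ls <- matrix(NA_real_, nsim, 2)
  for (s in seq_len(nsim)) {
    x     <- rnorm(n, 1, 1)
    cont  <- rbinom(n, 1, eps_prop)                               # contamination indicator
    e     <- (1 - cont) * rnorm(n, 0, 1) + cont * rnorm(n, 0, 3)  # CN(eps_prop) error
    y     <- f(x, theta0) + e
    delta <- rbinom(n, 1, plogis(0.5 + 0.8 * x))                  # MAR response indicator
    y_obs <- ifelse(delta == 1, y, NA)

    ## both fits are started at the truth only to keep the sketch short
    sr <- optim(theta0, D_1n, x = x, y = y_obs, delta = delta)$par
    ls <- optim(theta0, function(th)                              # complete-case LS
                  sum(delta * (ifelse(delta == 1, y_obs, 0) - f(x, th))^2))$par

    err_sr[s, ] <- (sr - theta0)^2
    err_ls[s, ] <- (ls - theta0)^2
  }
  rbind(SR = colMeans(err_sr), LS = colMeans(err_ls))              # MSE per coefficient
}
mse_cell()
```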

Discussion
Scenario 1: In Case 1, based on the simple imputation and inverse marginal probability procedures with either the Gaussian kernel or the logistic propensity score, under the contaminated normal model error, the SR performs better than the LS as the rate of contamination increases (Table 3). Although their MSEs are comparable for the normal model error (CN(0)), the SR is still more efficient than the LS for the considered proportions of missingness. A similar observation is made under the t-distribution model error, with the SR being superior to the LS for the considered degrees of freedom (from heavy tails toward normality as the degrees of freedom increase); see Table 4. In Case 2, once again under the contaminated normal model error, the SR outperforms the LS as the proportion of contamination increases (Table 1), whereas under the t-distribution model error, based on SI and IP(Ker, Log), the SR is more efficient than the LS for the degrees of freedom used. As expected, these MSEs become comparable as the degrees of freedom increase. See Table 2.
Finally, in Case 3, either under the contaminated normal or the t-distribution model error, we make a similar observation as in Case 2. See Tables 5 and 6.
In this entire scenario, as expected, the MSEs increase as the rate of contamination increases for the contaminated normal model error, and decrease as the degrees of freedom increase for the t-distribution model error. Also, for the inverse marginal probability imputation, we note that the MSEs obtained based on the Gaussian kernel are generally smaller than those obtained based on the logistic propensity score. We suspect that this is due to the fact that the given logistic propensity score is not a density. It is worth pointing out that the higher the proportion of no missingness, the smaller the MSEs are for both SR and LS.

Scenario 2: Based on SI and IP(Ker, Log), it is observed that the SR outperforms its LS counterpart by providing smaller MSEs. Also, as expected, these MSEs decrease with an increase of the sample size. See Table 7. A similar observation is made when considering these same two cases with high rates of missingness (Table 8). Generally, while the MSEs of the SR obtained via SI and IP(Ker) for the two scenarios considered are comparable, those obtained via IP(Log) are larger. This suggests that, for practical purposes, the SR obtained via SI (j = 1) can be preferred to that obtained via IP (j = 2) because of the simplicity with which its imputation is carried out. However, when it comes to estimating the variance, the SR via IP is more flexible, as the nonparametric estimator of Δ(x) can be further adjusted to result in a more efficient estimator of the variance.
Overall, from this simulation study, we see that the SR performs better under heavy-tailed error distributions and in the presence of contamination. Although more efficient than the LS even for the normal model error when dealing with high rates of missingness, it is generally comparable to the LS under normality (CN(0)) as the proportion of missingness decreases, or under the t-distribution model error as the degrees of freedom increase. This makes SR estimation extremely appealing for situations where high rates of missing data are encountered.

Conclusion
This paper provides asymptotic properties (strong consistency and asymptotic normality) of the signed-rank estimator under the assumption that some responses in the regression model are missing at random. Simulation studies demonstrate the good performance of the SR when dealing with heavy-tailed or contaminated model error distributions, and even for the normal model error distribution when dealing with a high rate of missing responses.

Appendix A: Proofs
This appendix provides sketches of proofs of some results stated in this paper.
Proof of Theorem 2.1. In this proof, L is taken to be an arbitrary positive constant. Set b_{nij} = R_{ij}/(n + 1) and a_{nij} = R̃_{ij}/(n + 1), where R̃_{ij} denotes the rank obtained when Δ(x) is replaced by Γ_n(x) in the residuals, and, without loss of generality, assume that a_{nij} < b_{nij}. By the mean value theorem, there exists c_{nij} ∈ (a_{nij}, b_{nij}) such that ϕ(b_{nij}) − ϕ(a_{nij}) = ϕ′(c_{nij})(b_{nij} − a_{nij}). Also, by uniform continuity of ϕ (ϕ being bounded with bound L > 0) and for fixed n, one can choose ϵ(n) = ξ(n)/(L + 1) > 0, with ξ(n) → 0 as n → ∞. From the boundedness of ϕ, there exists a constant L > 0 such that |ϕ(b_{nij})| ≤ L. Then,

This implies the stated bound. On the other hand, note that a direct application of the Dominated Convergence Theorem gives convergence to 0 a.s. as n → ∞.
Hence, again by the Strong Law of Large Numbers, the corresponding average converges almost surely. Now, without loss of generality, assume that b_{nij} > b*_{nij}. Then, by the mean value theorem, ϕ being differentiable with bounded derivative, we obtain an expansion with α_{nij} ∈ (b*_{nij}, b_{nij}). On the other hand, f being twice continuously differentiable, again by the mean value theorem, we obtain a similar expansion. To this end, substituting (A.2) and (A.3) into (A.1) gives the desired bound. By assumption, E(ε | x) = E(e | x) = 0 and var(ε | x) = var(e | x), so that ε and e have the same distribution. Now, taking into consideration that ∞ × 0 = 0, we obtain the stated bound, where |ϕ′(α_{nij})| ≤ M for some positive constant M, J is an integrable function of x such that ‖∇_θ f(x_i, θ)‖_1 ≤ J(x_i) for any θ ∈ B, and ‖·‖_1 represents the L_1-norm; that is, ∫ J(x) dH(x) < ∞. Again, applying the strong law of large numbers for functions of order statistics [43] and combining the result with Theorem 4.2, we complete the proof.