Simulation Studies of a Hölder Perturbation in a New Estimator for Proportion Considering Extra-Binomial Variability

This work proposes an estimator for the probability of success of a binomial model that incorporates the extra-binomial variation generated by zero-inflated samples. The estimator was constructed on a theoretical basis given by a Hölder function, and its performance was evaluated through Monte Carlo simulation considering different sample sizes, parametric values (π), and proportions of excess zeros (γ). It was concluded that, for γ = 0.20 and γ = 0.50, the proposed estimator presents promising results based on the specified margin of error.


Introduction
Inference on the proportion parameter of a binomial population is, in general, carried out assuming that the sampling units are independent and drawn from a single population.
However, in certain data sets the sampling variance may exceed the variability expected under the binomial model. Many factors may cause overdispersion; among them we can mention correlation among the individual responses, data clustering, and outliers.
Individuals belonging to the same population are more likely to provide correlated responses, meaning that an individual response depends on the previous response, and consequently an excess of zeros may occur in a sample. In these cases, an alternative is the correlated binomial model proposed by Kupper & Haseman (1998), which adjusts for the extra-binomial variance caused by overdispersion or underdispersion (Achcar & Junqueira 2002).
In the presence of covariates, Hinde & Demetrio (1978) studied binary responses with overdispersion, assuming that the random variables $Y_i$ represent the number of successes in samples of size $m_i$ ($i = 1, \ldots, n$), where $n$ is the number of samples. Thus, writing $E[Y_i] = \mu_i = m_i \pi_i$ in a generalized linear model, the proportion $\pi_i$ is modeled through the explanatory variables $X_i$ with a suitable link function.
Regarding robust inferential methods, which in general attenuate the effects of outliers in the samples, we can mention some kinds of estimators, such as M-estimators (Huber 1964) and minimum disparity estimators (Lindsay 1994). Specifically in the case of discrete data, we can refer to M-estimators (Simpson 1987), minimum disparity estimators (Simpson, Carroll & Ruppert 1987) and E-estimators (Ruckstuhl & Welsh 2001).
As is well known, zero-inflated binomial samples generally exhibit an asymmetric form, which motivates the use of the E-estimator to estimate the proportion π of a binomial population, whose model is described by

$$p_\pi(y) = \binom{m}{y} \pi^{y} (1 - \pi)^{m - y}, \quad y = 0, \ldots, m. \tag{1}$$

According to Ruckstuhl & Welsh (2001), the E-estimator is derived from a modification of the likelihood in order to reduce the effect of observations in the tails of the distribution. A brief presentation of the construction of this estimator is given below.
Assume the disparity function $H(\pi, f_n)$ defined by

$$H(\pi, f_n) = \sum_{y=0}^{m} \rho\!\left(\frac{f_n(y)}{p_\pi(y)}\right) p_\pi(y), \tag{2}$$

where $f_n(y) = \frac{1}{n}\sum_{i=1}^{n} I(Y_i = y)$, $y = 0, \ldots, m$, corresponds to the proportion of observations equal to $y$ in a sample of size $n$, $p_\pi(y)$ is the probability of $y$ successes under the binomial model with success probability π, and

$$\rho(x) = \begin{cases} (\ln(c_1) + 1)\,x - c_1, & x < c_1, \\ x \ln(x), & c_1 \le x \le c_2, \\ (\ln(c_2) + 1)\,x - c_2, & x > c_2, \end{cases} \tag{3}$$

where $c_1$ and $c_2$ are tuning constants on which the estimator depends. Based on these specifications, the estimator $\hat{\pi}$ that minimizes $H$ is given by

$$\hat{\pi} = \arg\min_{\pi \in [0,1]} H(\pi, f_n). \tag{4}$$

The choice of tuning constants acts directly on the robustness of the estimators, providing them with full asymptotic efficiency and good robustness properties (Basu, Shioya & Park 2011). The accuracy of this estimator is therefore determined by the choice of the tuning constants $c_1$ and $c_2$. In the case of a binomial mixture, Silva & Cirillo (2010) concluded that the appropriate value for the tuning constant $c_1$ depends on the degree of contamination of the sample and, therefore, it is desirable that the researcher have some prior information about the mixture probability.

Some recommendations made by Ruckstuhl & Welsh (2001) are worth mentioning. When $c_1 = 0$ and $c_2 \to \infty$, $\hat{\pi}$ corresponds to the minimum relative entropy estimator, which is identical to the maximum likelihood estimator (MLE) of the binomial model. The authors also point out that the estimates become more robust assuming $c_1 < c_2 = 1$. Determining the values that guarantee better accuracy and precision is still a matter of study. Ruckstuhl & Welsh (2001) mention that when $f_n$ is a finite-sample realization of a binomial distribution and $c_1 < c_2 = 1$, the E-estimator may be biased, and that the substitution of $c_2$ by another value would be discussed in future work.
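To make the minimization concrete, the following is a minimal Python sketch of an E-type estimate, not the authors' implementation. It assumes that ρ equals x ln(x) on [c1, c2] and is continued linearly outside that interval, and it swaps in a plain grid search over π for whatever optimizer was actually used. The names `binom_pmf`, `rho`, `disparity`, `e_estimate` and the parameter `grid_size` are hypothetical.

```python
import math


def binom_pmf(y, m, pi):
    """Binomial probability of y successes in m trials."""
    return math.comb(m, y) * pi**y * (1.0 - pi) ** (m - y)


def rho(x, c1, c2):
    """Assumed form: x*ln(x) on [c1, c2], continued linearly outside (needs c1 > 0)."""
    if x < c1:
        return (math.log(c1) + 1.0) * x - c1
    if x > c2:
        return (math.log(c2) + 1.0) * x - c2
    return x * math.log(x)


def disparity(pi, sample, m, c1=0.1, c2=1.0):
    """Sum over y = 0..m of rho(f_n(y) / p_pi(y)) * p_pi(y)."""
    n = len(sample)
    counts = [0] * (m + 1)
    for y in sample:
        counts[y] += 1
    total = 0.0
    for y in range(m + 1):
        p = binom_pmf(y, m, pi)
        if p == 0.0:  # skip cells whose probability underflows to zero
            continue
        total += rho((counts[y] / n) / p, c1, c2) * p
    return total


def e_estimate(sample, m, c1=0.1, c2=1.0, grid_size=999):
    """Grid-search minimiser of the disparity over pi in (0, 1)."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return min(grid, key=lambda pi: disparity(pi, sample, m, c1, c2))
```

Calling `e_estimate(sample, 100)` on a list of counts returns the grid point with the smallest disparity. The grid makes explicit that a continuous problem is being solved in a discretized way, so the result can only be as fine as `grid_size` allows.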
Motivated by the above, the present work constructs a new estimator for the probability of success of a binomial distribution given a zero-inflated sample.
The main advantage of this estimator is computational, since the estimator (4) that minimizes (2) is obtained by searching over the infinitely many values in the interval [0,1]. A problem of a continuous nature that is treated in a discretized way may therefore, depending on the algorithm used or on the real data at hand, occasionally produce a poorly fitted estimate.
For this reason, the estimator proposed in this work, presented in Section 2, is based on a modification of (3) such that the researcher may fix a single value for the constants $c_1$ and $c_2$, based on a single point given by the maximum likelihood estimator and not on every point of the parameter space, as suggested in (4).

Methodology
The binomial samples with different proportions of zeros (γ) were generated via the Monte Carlo method according to the zero-inflated binomial model (ZIB):

$$P(Y = y) = \begin{cases} \gamma + (1 - \gamma)(1 - \pi)^{m}, & y = 0, \\ (1 - \gamma)\binom{m}{y} \pi^{y} (1 - \pi)^{m - y}, & y = 1, \ldots, m. \end{cases}$$

Because this is an empirical study, the parametric values were chosen deliberately to represent different situations: sample sizes (n = 20, 30, 50, 70, 90) drawn from a population of m = 100 elements, proportions of zeros (γ = 0.2, 0.5 and 0.7), and "small", "medium" and "large" proportions (π = 0.3, 0.5, 0.8). The combination of these factors provided different configurations, in each of which the estimator $P_{zib}$ was evaluated by 10000 Monte Carlo simulations using the R software (R Development Core Team 2013).
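As an illustration of the sampling scheme, here is a minimal Python sketch that draws one zero-inflated binomial sample and computes the usual maximum likelihood estimate of π under the plain binomial model. The study itself was run in R; the function names `sample_zib` and `mle`, the seed, and the chosen configuration are assumptions made only for this example.

```python
import random


def sample_zib(n, m, pi, gamma, rng):
    """Draw n values: a structural zero with probability gamma, else Binomial(m, pi)."""
    out = []
    for _ in range(n):
        if rng.random() < gamma:
            out.append(0)
        else:
            # Binomial(m, pi) as the number of successes in m Bernoulli trials
            out.append(sum(1 for _ in range(m) if rng.random() < pi))
    return out


def mle(sample, m):
    """Binomial MLE of pi that ignores zero inflation: total successes / total trials."""
    return sum(sample) / (len(sample) * m)


rng = random.Random(2015)  # arbitrary seed, for reproducibility only
sample = sample_zib(n=20, m=100, pi=0.3, gamma=0.2, rng=rng)
pi_mle = mle(sample, m=100)
```

Repeating the last two lines over many replicates and over the grid of (n, π, γ) values would reproduce the structure of the simulation. Because structural zeros pull the total down, `pi_mle` tends to fall below π by a factor close to 1 − γ.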
The disparity, as defined by Lindsay (1994), measures the discrepancy between the data and the model density. Based on the function $G(\cdot)$ and considering the sample space $\Omega = \{0, 1, 2, \ldots\}$, it is given by

$$\rho_G(d, f_\theta) = \sum_{x \in \Omega} G(\delta(x))\, f_\theta(x),$$

where $G(\cdot)$ is a thrice differentiable convex function on $[-1, \infty)$ with $G(0) = 0$, and $\delta(x)$ is the Pearson residual at $x$, given by

$$\delta(x) = \frac{d(x) - f_\theta(x)}{f_\theta(x)},$$

with $f_\theta(x)$ representing a density function and $d(x)$ the empirical density at $x$.
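The Pearson residual and the resulting disparity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the support is truncated to a finite set, the toy model and the quadratic choice of G (convex, with G(0) = 0) are hypothetical examples, and the names `pearson_residuals`, `disparity`, `toy_pmf` and `G_chi2` do not come from the paper.

```python
from collections import Counter


def pearson_residuals(sample, model_pmf, support):
    """delta(x) = d(x) / f(x) - 1, with d the empirical density of the sample."""
    n = len(sample)
    counts = Counter(sample)
    return {x: (counts[x] / n) / model_pmf(x) - 1.0 for x in support}


def disparity(sample, model_pmf, support, G):
    """Sum over the support of G(delta(x)) * f(x)."""
    delta = pearson_residuals(sample, model_pmf, support)
    return sum(G(delta[x]) * model_pmf(x) for x in support)


# Hypothetical toy model: Binomial(2, 0.5) on the support {0, 1, 2}.
toy_pmf = {0: 0.25, 1: 0.5, 2: 0.25}.get


def G_chi2(d):
    """One admissible G: a Pearson chi-square type choice, d**2 / 2."""
    return 0.5 * d * d
```

A value never observed in the sample has residual −1, the lower end of the range, while a sample whose empirical density matches the model exactly has residual 0 everywhere and therefore zero disparity.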
According to Park, Basu & Lindsay (2002), the range of the Pearson residual is [−1, ∞) and under certain regularity conditions, all minimum disparity estimators are first order efficient; in addition many of them have attractive robustness properties.
Due to these considerations, based on a function belonging to the Hölder class, a new estimator was constructed, modifying the function ρ(x), given in (3).
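The construction of the estimator, denoted $P_{zib}$, was based on a modification of the disparity function $H(\pi, f_n)$ given in (2), in order to reduce the speed at which $\rho(x) = x\ln(x)$ tends to infinity. This modification was made considering the following definitions.

Revista Colombiana de Estadística 38 (2015) 93-105

Definition 2. We say that a function $f : X \subset \mathbb{R} \to \mathbb{R}$ is Hölder continuous with exponent $0 < \alpha < 1$ if there is $H > 0$ such that

$$|f(x) - f(y)| \le H\,|x - y|^{\alpha} \quad \text{for all } x, y \in X.$$

Example 1. Consider the function

$$f(x) = \sqrt{x}, \quad x \ge 0. \tag{6}$$

The function described in (6) is Hölder continuous with exponent $\alpha = \frac{1}{2}$. In fact, for $x_0 > x_1$ we have

$$\left|\sqrt{x_0} - \sqrt{x_1}\right|^{2} \le \left(\sqrt{x_0} - \sqrt{x_1}\right)\left(\sqrt{x_0} + \sqrt{x_1}\right) = x_0 - x_1 = |x_0 - x_1|. \tag{*}$$

For $x_1 > x_0$ we have

$$\left|\sqrt{x_0} - \sqrt{x_1}\right|^{2} \le \left(\sqrt{x_1} - \sqrt{x_0}\right)\left(\sqrt{x_1} + \sqrt{x_0}\right) = x_1 - x_0 = |x_0 - x_1|. \tag{**}$$

From (*) and (**) we conclude that

$$|f(x_0) - f(x_1)| \le |x_0 - x_1|^{1/2}.$$

In general, given $0 < \alpha < 1$, the function $f(x) = x^{\alpha}$, $x \ge 0$, is Hölder continuous with exponent α (Begehr 1994).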
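The Hölder inequality for the square root can also be checked numerically. The Python sketch below tests the inequality with H = 1 on a finite grid, which is an illustration and not a proof; the helper name `holder_holds`, the grid, and the tolerance are assumptions of this example. For contrast it also tests exponent 1 with the same constant, which fails near zero.

```python
import math


def holder_holds(f, alpha, points, H=1.0, tol=1e-12):
    """Check |f(x) - f(y)| <= H * |x - y|**alpha for every pair of grid points."""
    return all(
        abs(f(x) - f(y)) <= H * abs(x - y) ** alpha + tol
        for x in points
        for y in points
    )


points = [k / 10 for k in range(0, 101)]  # grid on [0, 10]

# Exponent 1/2 with H = 1: the inequality from Example 1
sqrt_is_holder_half = holder_holds(math.sqrt, 0.5, points)

# Exponent 1 with H = 1: violated near zero, e.g. sqrt(0.1) - sqrt(0) > 0.1
sqrt_is_lipschitz_one = holder_holds(math.sqrt, 1.0, points)
```

The small tolerance only absorbs floating-point rounding in the boundary case where one of the two points is zero and the inequality holds with equality.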
The following property will be important in what follows.

Property 1: Given $0 < \alpha < \beta \le 1$, $x^{\alpha}$ tends to infinity as $x$ tends to infinity less rapidly than $x^{\beta}$. Indeed, $\beta - \alpha > 0$ and therefore

$$\lim_{x \to \infty} \frac{x^{\alpha}}{x^{\beta}} = \lim_{x \to \infty} \frac{1}{x^{\beta - \alpha}} = 0.$$

In order to reduce the speed at which the function $g(x) = x\ln(x)$ tends to infinity as $x$ tends to infinity, the proposal of the $P_{zib}$ estimator modifies the function $\rho(x)$ given in (3) so as to reduce its growth as $x$ tends to infinity. The modified function is denoted $\rho_1(x)$ and is referred to as (9). We point out that $\rho_1(x)$ has the same kind of differentiability as $\rho(x)$, and that $\rho, \rho_1 \in C^{1}(\mathbb{R}_+)$. Based on the foregoing, the estimator $P_{zib}$ is constructed with $\rho_1(x)$ in place of $\rho(x)$. Note that this estimator depends only on the parameter α, since the constants $c_1$ and $c_2$ were fixed at 0.1 and 1, respectively. Its accuracy is assessed through a tolerable margin of error defined by

$$|P_{zib} - \hat{\pi}_{mle}| < k,$$

where $k$ indicates the tolerable value for this difference, interpreted as a deviation resulting from the incorporation of the extra-binomial variability in the estimation of π when compared with the maximum likelihood estimator.

Results and Discussion
As is known, the usual maximum likelihood estimator applied to zero-inflated samples may present gross errors. Alongside the $P_{zib}$ estimate, the researcher may use it as a reference for consulting the tables, since the deviation makes it possible to verify the magnitude of the error.
Starting from a situation where the fit of the $P_{zib}$ estimator is verified only for a single value of $c_1$, keeping $c_2 = 1$, together with the study of the function $\rho_1(x)$ with respect to its rate of growth as $x \to \infty$, the results in Tables 1-3 correspond to the maximum likelihood estimates, the $P_{zib}$ estimates, and the deviations.
The results shown in Table 1 were obtained from a Monte Carlo study whose goal was to verify whether the choice of α could be made correctly with knowledge of $\hat{\pi}_{mle}$. Fixing k = 0.20 for this purpose, we noted that the deviations satisfied the tolerable margin of error. Naturally, $\hat{\pi}_{mle}$ may be obtained given a zero-inflated sample. Therefore, taking this estimate as a reference, a $P_{zib}$ estimate within the tolerable margin of error provides information to verify whether the assumed value of α is in fact the value to be used in the $P_{zib}$ estimator to yield accurate estimates. In this context, the results in Table 1 confirm that, given a low concentration of zeros (γ = 0.20) and different sample sizes (n), the $P_{zib}$ estimates were reasonable. Notice that the deviations obtained were below k = 0.20 and that the $\hat{\pi}_{mle}$ estimates were less accurate than the $P_{zib}$ estimates. Increasing the concentration of zeros to γ = 0.50, for the values of α found, the results shown in Table 2 provided well-fitted estimates. However, in general the $\hat{\pi}_{mle}$ estimates resulted in inappropriate values, leading to an increase of the value of k in the tolerable margin of error. For high concentrations of zeros (Table 3), the same behavior of the $\hat{\pi}_{mle}$ estimates in relation to the $P_{zib}$ estimator was observed.
According to Ruckstuhl & Welsh (2001), the choice $c_2 = 1$ gives improved first order robustness against gross error contamination and the choice $c_1 = 1$ gives improved robustness against truncation. Under the binomial model, the asymptotic distribution of the E-estimator is Gaussian for $c_1 < 1 < c_2$ and non-Gaussian for $c_1 < c_2 = 1$.
Choosing $c_1 < c_2$, we cannot treat both types of contamination simultaneously. The study of the properties for other values of $c_2$ will be covered in future work. Note that other values of $c_1$ and $c_2$ may be used with the function described in the Appendix, written in the R software (R Development Core Team 2013).

An Illustrative Example
For didactic purposes, the application of the $P_{zib}$ estimator with $\rho_1(x)$ is shown for a sample of size $n$ in which each sampling unit was considered independent and identically distributed as binomial$(m, \pi)$. Given this specification, and for comparison, a sample was simulated from a zero-inflated binomial model with m = 100 and π = 0.3 (Table 4).

Table 4 (n = 20): 25, 30, 31, 25, 24, 27, 0, 26, 31, 32, 26, 28, 21, 27, 29, 30, 0, 0, 28, 32

Obtaining the maximum likelihood estimate:

$$\hat{\pi}_{mle} = \frac{\sum_{i=1}^{n} y_i}{n\,m} = \frac{472}{20 \times 100} = 0.236.$$

Based on this estimate, Table 1 can be used to look for a value of $\hat{\pi}_{mle}$ near 0.236. Following the rule that deviations must be less than 0.20 gives an idea of the parameter estimate, which could be 0.30 with α equal to 0.26.
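The arithmetic of this example can be reproduced with a short Python sketch. The sample values, m = 100, the candidate estimate 0.30 and the tolerance k = 0.20 are taken from the text; the variable names are hypothetical, and the sketch does not compute $P_{zib}$ itself, it only checks the stated margin criterion.

```python
# Simulated zero-inflated sample from the illustrative example (n = 20, m = 100)
sample = [25, 30, 31, 25, 24, 27, 0, 26, 31, 32,
          26, 28, 21, 27, 29, 30, 0, 0, 28, 32]
m = 100
n = len(sample)

# Binomial MLE: total successes over total trials
pi_mle = sum(sample) / (n * m)

# Observed share of zeros in the sample
zero_fraction = sample.count(0) / n

# Margin criterion from the text, with the table value 0.30 as candidate estimate
p_zib = 0.30
k = 0.20
within_margin = abs(p_zib - pi_mle) < k
```

Here the three zeros pull the MLE down to 0.236, below the simulated value π = 0.3, while the deviation from the candidate estimate, 0.064, stays inside the tolerance.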
The probabilities are then calculated considering the MLE (Table 5). Comparing the estimated $P_{zib}$ with the parametric value (π = 0.30), note that the value of α used resulted in an accurate estimate. The criterion $|P_{zib} - \hat{\pi}_{mle}| < 0.20$ is satisfied, which shows that this value of α is suitable for performing the inference.

Considerations Regarding the Use of RAF (Residual Adjustment Function)
Park et al. (2002) developed another graphical representation to summarize the behavior of the minimum disparity estimators in relation to maximum likelihood.
For this purpose, the authors developed a function called RAF (residual adjustment function), which considers as input the Pearson residual, represented by δ(x).
Referring to this methodology, take the $\rho_1$ function (9) as a function of δ, redefined by $\delta = f_n(y)/p_\pi(y) - 1$ instead of $x = f_n(y)/p_\pi(y)$. The RAF graphical procedure used to evaluate the disparity is then not suitable: Park et al. (2002) mention a graphical interpretation of the robustness of the estimators, but this representation is not completely satisfactory since the domain of the RAF is infinite. With emphasis on the domain of the $\rho_1$ function (9), we do not recommend the use of this procedure for the following reasons:
• The $\rho_1$ function (9) proposed for obtaining the estimator allows researchers to search, by simulation, for the value of α that defines it. That is, each value of α corresponds to one $\rho_1$ function (9). Therefore, there is a limitation in assuming x < 0, since $x^{\alpha}$ cannot be calculated. For instance, α = 1/2 implies that $x^{\alpha}$ can only be calculated for values greater than or equal to zero.
• Another point to be emphasized is that the $\rho_1$ function (9) includes ln(x) in its construction, whose domain is $\mathbb{R}_+$, so negative or zero values cannot be allowed in the evaluation.
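The two domain restrictions above can be illustrated with a small Python sketch. It does not implement the $\rho_1$ function (9); it only checks whether the two ingredients named in the text, a fractional power and a logarithm, are real-valued at a given argument. The helper name `terms_defined` and the example values are assumptions.

```python
import math


def terms_defined(value, alpha):
    """True when value**alpha and ln(value) are both real numbers."""
    try:
        power = value ** alpha  # a negative float base gives a complex result
        math.log(value)         # raises ValueError for value <= 0
    except ValueError:
        return False
    return not isinstance(power, complex)


ratio = 0.5           # x = f_n(y) / p_pi(y), which is never negative
residual = ratio - 1  # Pearson residual delta = x - 1, negative whenever x < 1

ok_on_ratio = terms_defined(ratio, 0.5)
ok_on_residual = terms_defined(residual, 0.5)
ok_at_zero = terms_defined(0.0, 0.5)
```

Any cell with fewer observations than the model expects has x < 1 and hence a negative residual, so evaluating these terms on the residual scale breaks down on a large part of the RAF domain [−1, ∞).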

Conclusions
The $P_{zib}$ estimator proposed in this work fits the situations where the samples presented low (γ = 0.20) and medium (γ = 0.50) concentrations of zeros. The accuracy and precision of the $P_{zib}$ estimates are amenable to computational improvement, adopting the criterion $|P_{zib} - \hat{\pi}_{mle}| < k$, where $k$ corresponds to a tolerable value subjectively specified by the researcher.

Table 1: Values of α for $P_{zib}$ estimates with the $\rho_1(x)$ approach, minimizing the difference in relation to the reference parameter π, with proportion of zeros γ = 0.2, m = 100, $c_1$ fixed at 0.1 and $c_2$ fixed at 1.

Table 2: Values of α for $P_{zib}$ estimates with the $\rho_1(x)$ approach, minimizing the difference in relation to the reference parameter π, with proportion of zeros γ = 0.5, m = 100, $c_1$ fixed at 0.1 and $c_2$ fixed at 1.

Table 3: Values of α for $P_{zib}$ estimates with the $\rho_1(x)$ approach, minimizing the difference in relation to the reference parameter π, with proportion of zeros γ = 0.7, m = 100, $c_1$ fixed at 0.1 and $c_2$ fixed at 1.

Table 4: Values assumed to illustrate the estimation of π using the estimator $P_{zib}$.

Table 5: Values for $P_{zib}$.