Estimating Population Proportions by Means of Calibration Estimators

This paper considers the problem of estimating the population proportion of a categorical variable using the calibration framework. Different situations are explored according to the level of auxiliary information available and the theoretical properties are investigated. A new class of estimator based upon the proposed calibration estimators is also defined, and the optimal estimator in the class, in the sense of minimal variance, is derived. Finally, an estimator of the population proportion, under new calibration conditions, is defined. Simulation studies are considered to evaluate the performance of the proposed calibration estimators via the empirical relative bias and the empirical relative efficiency, and favourable results are achieved.


Introduction
In the presence of auxiliary information, various approaches may be used to improve the precision of estimators at the estimation stage. The book of Singh (2003) contains several examples, including ratio, difference and calibration estimators, the latter following the methodology proposed by Deville & Särndal (1992) and Särndal (2007), as well as regression estimators, as shown in the papers of Arnab, Shangodoyin & Singh (2010) and Singh, Singh & Kozak (2008). These techniques are generally more efficient than methods that do not use auxiliary information. Social surveys usually focus on categorical variables such as sex, race, potential voters, etc.
Efficient use of the available auxiliary information would improve the precision of the estimates of the proportion of a categorical variable of interest. Conceptually, it is difficult to justify using a regression estimator for estimating proportions. Duchesne (2003) considered estimators of a proportion under different sampling schemes and presented an estimator based on logistic regression. The model calibration technique proposed by Wu & Sitter (2001) can also be used to estimate a proportion by means of a logistic regression model. Being based on logistic models, these estimators model survey data well, provided that unit-specific auxiliary data are available for the population U. In this case it is assumed that the values of the auxiliary variables are known for the entire finite population (referred to as complete auxiliary information), whereas the values of the main variable are known only for the units selected in the sample.
It is very common for population data associated with auxiliary variables to be obtained from census results, administrative files, etc., and these sources often provide different parameters for these auxiliary variables. For example, position measures (mean, median and other moments) are normally provided, but there is no access to data for each individual. In the present study, it is assumed that the only datum known is the proportion of individuals presenting one or more characteristics related to the study variable.
Under this assumption, Rueda, Muñoz, Arcos, Álvarez & Martínez (2011) defined an estimator and various confidence intervals for a proportion using the ratio method. The results of their simulation studies show that ratio estimators are more efficient than traditional estimators, and that the associated confidence intervals outperform alternative methods, especially in terms of interval width.
Calibration techniques were first employed by Deville & Särndal (1992) to estimate the population total, but this approach is also applicable to the estimation of parameters more complex than the population total. Relevant papers on the estimation of population variances are Singh (2001), Singh, Horn, Chowdhury & Yu (1999) and Farrell & Singh (2005). The estimation of finite population distribution functions is studied in Harms & Duchesne (2006), and the estimation of quantiles in Rueda, Martínez-Puertas, Martínez-Puertas & Arcos (2007). In Section 2 we review a proportion estimator using the calibration technique. Section 3 describes alternative methods for deriving the calibration estimator for the proposed parameter. In Section 4 we extend these methods to the multiple case. A simulation study is performed in Section 5 and our conclusions are reported in Section 7.

Definition of the Calibration Estimator
Assume a sample $s$ of size $n$ drawn from a finite population $U = \{1, 2, \ldots, N\}$ of size $N$ by a specific sampling design $d$, with inclusion probabilities $\pi_k$ and $\pi_{kl}$ assumed to be strictly positive. Let $A$ be an attribute of study in the population $U$, with $A_k = 1$ when unit $k$ of the population has the attribute $A$ and $A_k = 0$ otherwise. The population proportion of attribute $A$ in the population $U$ is given by
$$P_A = \frac{1}{N}\sum_{k\in U} A_k. \tag{1}$$
To estimate (1), the usual design-weighted Horvitz-Thompson estimator is
$$\hat P_{AH} = \frac{1}{N}\sum_{k\in s} d_k A_k, \tag{2}$$
where $d_k = 1/\pi_k$.
If we consider an auxiliary attribute $B$ for which the value $B_k$ is known for every unit $k$ in the sample $s$ and $P_B$ is also known, the above estimator cannot incorporate the information provided by $B$ in estimating the population proportion of $A$. One way of incorporating auxiliary information in the parameter estimation is to replace the weights $d_k$ by new weights $\omega_k$, using calibration techniques.
Calibration is a highly desirable property for survey weights, as Särndal (2007) argues, for the following reasons:
• it provides a systematic way of taking auxiliary information into account;
• it is a means of obtaining consistent estimates, with known aggregates;
• it is used by statistical agencies for estimating different finite population parameters.
Several national statistical agencies have developed software designed to compute calibration weights based on auxiliary information available in population registers and other sources. Examples of such software include CLAN (Statistics Sweden) and BASCULA (Central Bureau of Statistics, The Netherlands).
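Following Deville & Särndal (1992), to obtain a calibration estimator for the attribute $A$ based on the attribute $B$, we calculate the weights $\omega_k$ minimizing the chi-square distance
$$\Phi_s = \sum_{k\in s}\frac{(\omega_k - d_k)^2}{d_k q_k} \tag{3}$$
subject to the condition
$$\frac{1}{N}\sum_{k\in s}\omega_k B_k = P_B, \tag{4}$$
where $q_k$ are known positive constants unrelated to $d_k$ and $0 < P_B < 1$.
By minimizing (3) under (4), the new weights $\omega_k$ are given by
$$\omega_k = d_k + \lambda\, d_k q_k B_k, \tag{5}$$
where $\lambda$ is the Lagrange multiplier
$$\lambda = \frac{N\,(P_B - \hat P_{BH})}{\sum_{k\in s} d_k q_k B_k}$$
and $\hat P_{BH} = \frac{1}{N}\sum_{k\in s} d_k B_k$ is the usual Horvitz-Thompson estimator for the attribute $B$.
With the calibration weights (5), assuming $\sum_{k\in s} d_k q_k B_k \neq 0$, the resulting estimator is
$$\hat P_{AW} = \frac{1}{N}\sum_{k\in s}\omega_k A_k = \hat P_{AH} + (P_B - \hat P_{BH})\,\frac{\sum_{k\in s} d_k q_k A_k B_k}{\sum_{k\in s} d_k q_k B_k}.$$
By (4), when the estimator is applied to estimate the population proportion of $B$, it coincides with $P_B$.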
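As an illustration, the calibration step can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function names `calibration_weights` and `weighted_proportion` and the toy sample are hypothetical, and the weights are assumed to take the form $\omega_k = d_k + \lambda d_k q_k B_k$ that results from minimizing the chi-square distance under a single constraint on $B$.

```python
def calibration_weights(d, B, P_B, N, q=None):
    """Chi-square calibration weights for one binary auxiliary attribute B.

    d: design weights 1/pi_k of the sampled units; B: 0/1 values of B in the
    sample; P_B: known population proportion of B; N: population size.
    """
    q = [1.0] * len(d) if q is None else q
    denom = sum(dk * qk * bk for dk, qk, bk in zip(d, q, B))
    if denom == 0:
        raise ValueError("no sampled unit has attribute B; weights undefined")
    P_BH = sum(dk * bk for dk, bk in zip(d, B)) / N  # Horvitz-Thompson estimate of P_B
    lam = N * (P_B - P_BH) / denom                   # Lagrange multiplier
    return [dk + lam * dk * qk * bk for dk, qk, bk in zip(d, q, B)]


def weighted_proportion(w, X, N):
    """Weighted estimate (1/N) * sum_k w_k X_k of a population proportion."""
    return sum(wk * xk for wk, xk in zip(w, X)) / N


# Toy SRSWOR sample (hypothetical data): n = 5 units from N = 20, so d_k = 4.
N, n = 20, 5
d = [N / n] * n
A = [1, 0, 1, 1, 0]
B = [1, 0, 1, 0, 0]
P_B = 0.5

w = calibration_weights(d, B, P_B, N)
P_AW = weighted_proportion(w, A, N)
P_B_check = weighted_proportion(w, B, N)
print(P_AW, P_B_check)  # approximately 0.7 and 0.5
```

The second printed value illustrates the calibration property: applied to $B$ itself, the calibrated weights reproduce the known proportion $P_B$.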

Properties of the Calibration Estimator
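Following Deville & Särndal (1992), it can be shown that the estimator $\hat P_{AW}$ is an asymptotically unbiased estimator for $P_A$ and its asymptotic variance is given by
$$AV(\hat P_{AW}) = \frac{1}{N^2}\sum_{k\in U}\sum_{l\in U}\Delta_{kl}\,(d_k E_k)(d_l E_l),$$
where $\Delta_{kl} = \pi_{kl} - \pi_k\pi_l$, $E_k = A_k - D\,B_k$ and $D = \sum_{k\in U} q_k A_k B_k \big/ \sum_{k\in U} q_k B_k$. An estimator for this variance is
$$\hat V(\hat P_{AW}) = \frac{1}{N^2}\sum_{k\in s}\sum_{l\in s}\frac{\Delta_{kl}}{\pi_{kl}}\,(\omega_k e_k)(\omega_l e_l),$$
where $e_k = A_k - \hat D\,B_k$ and $\hat D = \sum_{k\in s} d_k q_k A_k B_k \big/ \sum_{k\in s} d_k q_k B_k$.
Example 1. Under SRSWOR and $q_k = 1$ for all $k \in U$, the estimator $\hat P_{AW}$ is
$$\hat P_{AW} = p_A + (P_B - p_B)\,\frac{p_{AB}}{p_B},$$
where $p_A$, $p_B$ and $p_{AB}$ are the sample proportions of units with attribute $A$, with attribute $B$ and with both attributes, and the asymptotic variance is
$$AV(\hat P_{AW}) = \frac{1-f}{n}\,\frac{N}{N-1}\left(P_A Q_A + \frac{P_{AB}^2}{P_B}\,Q_B - 2\,\frac{P_{AB}}{P_B}\,(P_{AB} - P_A P_B)\right), \tag{9}$$
where $Q_A = 1 - P_A$, $Q_B = 1 - P_B$, $P_{AB} = \frac{1}{N}\sum_{k\in U} A_k B_k$ and $f = n/N$. An estimator of this variance is given by (10).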
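The SRSWOR case lends itself to a short numerical sketch. The code below assumes the closed form $p_A + (P_B - p_B)\,p_{AB}/p_B$ that follows from the calibration weights with $q_k = 1$, and an asymptotic variance obtained by applying the usual SRSWOR variance formula to the linearized residuals $A_k - (P_{AB}/P_B)B_k$; the finite-population factor $N/(N-1)$ and the function names are assumptions of this sketch, not expressions quoted from the paper.

```python
def p_aw_srswor(A_s, B_s, P_B):
    """Calibration estimator of P_A under SRSWOR with q_k = 1."""
    n = len(A_s)
    p_A = sum(A_s) / n
    p_B = sum(B_s) / n
    p_AB = sum(a * b for a, b in zip(A_s, B_s)) / n
    if p_B == 0:
        raise ValueError("p_B = 0: the estimator cannot be computed")
    return p_A + (P_B - p_B) * p_AB / p_B


def av_p_aw_srswor(P_A, P_B, P_AB, n, N):
    """Linearization variance of the estimator above (sketch)."""
    f = n / N
    R = P_AB / P_B
    core = P_A * (1 - P_A) + R ** 2 * P_B * (1 - P_B) - 2 * R * (P_AB - P_A * P_B)
    return (1 - f) / n * N / (N - 1) * core


def var_ht_srswor(P_A, n, N):
    """Variance of the sample proportion under SRSWOR, for comparison."""
    return (1 - n / N) / n * N / (N - 1) * P_A * (1 - P_A)


estimate = p_aw_srswor([1, 0, 1, 1, 0], [1, 0, 1, 0, 0], 0.5)
av_perfect = av_p_aw_srswor(0.3, 0.3, 0.3, 50, 1000)       # B identical to A
av_independent = av_p_aw_srswor(0.5, 0.5, 0.25, 50, 1000)  # A and B independent
print(estimate, av_perfect, av_independent, var_ht_srswor(0.5, 50, 1000))
```

Under these assumptions the variance vanishes when $B$ coincides with $A$, and exceeds that of the sample proportion when $A$ and $B$ are independent, which is consistent with the role that the association between the attributes plays in the simulation results.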

Alternative Calibration Estimators
The usual estimator under SRSWOR, $p_A$, has the following shift invariance property:
$$p_A = 1 - q_A. \tag{11}$$
Hence $p_A$ performs the same in the estimation of $P_A$ as $q_A$ does in the estimation of $Q_A$. In general, this property is not satisfied by $\hat P_{AW}$. It is easy to see that the property is fulfilled if
$$\frac{1}{N}\sum_{k\in s}\omega_k = 1.$$
Thus, we have two ways of obtaining an estimator with the above property:
(i) By considering a calibration estimator $\hat Q_{AW}$ for $Q_A$ based on $Q_B$, and determining when the estimator $\hat P_{AW}$ has a smaller variance than the estimator $\hat Q_{AW}$, in order to define a new estimator based on these two.
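(ii) By considering a calibration estimator for $P_A$ based on both $P_B$ and $Q_B$, because if we derive a calibration estimator that provides perfect estimates for $P_B$ and $Q_B$, then
$$\frac{1}{N}\sum_{k\in s}\omega_k = \frac{1}{N}\sum_{k\in s}\omega_k B_k + \frac{1}{N}\sum_{k\in s}\omega_k (1 - B_k) = P_B + Q_B = 1.$$

An Estimator Based on the Complement: The P AT Estimator
The first alternative is developed only under SRSWOR, minimizing (3) subject to
$$\frac{1}{N}\sum_{k\in s}\omega_k (1 - B_k) = Q_B.$$
The resulting estimator, assuming $q_B \neq 0$, can be expressed by
$$\hat Q_{AW} = q_A + (Q_B - q_B)\,\frac{q_{\bar A\bar B}}{q_B},$$
with $q_A = 1 - p_A$, $q_B = 1 - p_B$ and $q_{\bar A\bar B} = \frac{1}{n}\sum_{k\in s}(1 - A_k)(1 - B_k)$. In the same way as with the estimator $\hat P_{AW}$ in Example 1, the asymptotic variance of $\hat Q_{AW}$ is given by
$$AV(\hat Q_{AW}) = \frac{1-f}{n}\,\frac{N}{N-1}\left(P_A Q_A + \frac{P_{\bar A\bar B}^2}{Q_B}\,P_B - 2\,\frac{P_{\bar A\bar B}}{Q_B}\,(P_{\bar A\bar B} - Q_A Q_B)\right), \tag{15}$$
where $P_{\bar A\bar B} = \frac{1}{N}\sum_{k\in U}(1 - A_k)(1 - B_k)$. An estimator $\hat V(\hat Q_{AW})$ for (15) is given by (16).
Let us now compare the asymptotic variance of $\hat P_{AW}$ with that of the following estimator of $P_A$: $\hat P_{AQ} = 1 - \hat Q_{AW}$. Appendix A gives the condition under which $AV(\hat P_{AW}) < AV(\hat P_{AQ})$. Hence, asymptotically, a more efficient estimator for the population proportion $P_A$ is $\hat P_{AT}$, which equals $\hat P_{AW}$ when this condition holds and $\hat P_{AQ}$ otherwise. Note that the asymptotic variance of $\hat P_{AT}$ is
$$AV(\hat P_{AT}) = \min\{AV(\hat P_{AW}),\, AV(\hat P_{AQ})\}. \tag{18}$$
To estimate the asymptotic variance of $\hat P_{AT}$, the estimator (10) can be used when $\hat P_{AT} = \hat P_{AW}$ and the estimator (16) when $\hat P_{AT} = \hat P_{AQ}$. When equality holds in condition (20), we have $\hat P_{AW} = \hat P_{AQ}$, which implies that $AV(\hat P_{AW}) = AV(\hat P_{AQ})$ and $\hat V(\hat P_{AW}) = \hat V(\hat P_{AQ})$.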
The reason for defining $\hat P_{AT}$ is to obtain an estimator with the property $\hat P_{AT} = 1 - \hat Q_{AT}$, where the estimator $\hat Q_{AT}$ of $Q_A$ is defined accordingly, by choosing between $\hat Q_{AW}$ and $1 - \hat P_{AW}$. A further advantage concerns low proportions: estimators such as $\hat P_{AW}$ cannot be calculated when $p_B = 0$. The estimator $\hat P_{AT}$ does not present this problem, because if $p_B = 0$ then $\hat P_{AT} = 1 - \hat Q_{AW}$, and since $q_B = 1$ the estimator $\hat Q_{AW}$ is well defined. Hence the estimator $\hat P_{AT}$ can be obtained for low proportions.
Finally, the estimator $\hat P_{AT}$ has the following drawback: by (9), (15) and (18), its asymptotic behaviour is worse than that of the usual estimator when condition (21) holds. In that case, we propose to use the complementary attribute $B^c$ in place of $B$, since with this attribute the problem that arises under (21) is solved.
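The lack of shift invariance of the single-constraint calibration estimator, and the behaviour of the complement-based estimator when $p_B = 0$, can be checked numerically. The sketch below works under SRSWOR with $q_k = 1$ and assumes that the complement estimator is obtained by applying the same formula to the attributes $1 - A_k$ and $1 - B_k$; the function names and the toy sample are hypothetical.

```python
def p_aw(A_s, B_s, P_B):
    """Calibration estimator of P_A based on P_B (SRSWOR, q_k = 1)."""
    n = len(A_s)
    p_A = sum(A_s) / n
    p_B = sum(B_s) / n
    p_AB = sum(a * b for a, b in zip(A_s, B_s)) / n
    if p_B == 0:
        raise ValueError("p_B = 0: the estimator cannot be computed")
    return p_A + (P_B - p_B) * p_AB / p_B


def q_aw(A_s, B_s, P_B):
    """Calibration estimator of Q_A based on Q_B: same formula on the complements."""
    A_c = [1 - a for a in A_s]
    B_c = [1 - b for b in B_s]
    return p_aw(A_c, B_c, 1 - P_B)


A_s = [1, 0, 1, 1, 0]
B_s = [1, 0, 1, 0, 0]
P_B = 0.5

P_AW = p_aw(A_s, B_s, P_B)
Q_AW = q_aw(A_s, B_s, P_B)
print(P_AW + Q_AW)  # differs from 1: shift invariance fails

# When no sampled unit has attribute B, P_AW is undefined but 1 - Q_AW is not.
B_none = [0, 0, 0, 0, 0]
P_AQ_none = 1 - q_aw(A_s, B_none, P_B)
print(P_AQ_none)
```

In this toy sample the two estimators of $P_A$, namely $\hat P_{AW}$ and $1 - \hat Q_{AW}$, take different values, which is what motivates choosing between them or combining them.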

An Optimal Estimator: The P Aα Estimator
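Another way to improve the asymptotic behaviour is as follows: let us define
$$\hat P_{A\alpha} = \alpha\,\hat P_{AW} + (1 - \alpha)\,\hat P_{AQ}, \qquad \hat P_{AQ} = 1 - \hat Q_{AW},$$
and take the value of $\alpha$ that minimizes the variance.
The minimum variance of $\hat P_{A\alpha}$ is derived in Appendix B, and it is achieved at the value of $\alpha$ given by (44). The estimator $\hat P_{A\alpha}$ has the desirable property $\hat P_{A\alpha} = 1 - \hat Q_{A\alpha}$: defining the analogous combination for $Q_A$ with a coefficient $\beta$, we find that minimizing its variance with respect to $\beta$ is equivalent to minimizing the variance of $1 - \hat Q_{A\alpha}$ with respect to $\alpha$, and consequently $\alpha$ is again given by (44) and $\hat P_{A\alpha} = 1 - \hat Q_{A\alpha}$.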
The $\hat P_{A\alpha}$ estimator presents the following disadvantages:
• It cannot be calculated if $p_B = 0$ or $q_B = 0$.
• The optimum value of $\alpha$ given by (44) depends on theoretical variances and covariances, which are generally unknown.
With respect to the second issue, the value of $\alpha$ can be easily estimated from the sample. Substituting this estimate for $\alpha$ yields an estimator of $P_A$ that also has the desired property (11).
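The optimal weight can be illustrated numerically. The sketch below assumes that the class has the form $\alpha \hat P_{AW} + (1 - \alpha)(1 - \hat Q_{AW})$, so that its variance is $\alpha^2 V_1 + (1-\alpha)^2 V_2 - 2\alpha(1-\alpha)C$ in the notation of Appendix B; setting the derivative with respect to $\alpha$ to zero then gives $\alpha = (V_2 + C)/(V_1 + V_2 + 2C)$. The numerical values of $V_1$, $V_2$ and $C$ are hypothetical.

```python
def combination_variance(alpha, V1, V2, C):
    """Variance of alpha*P_AW + (1 - alpha)*(1 - Q_AW).

    V1 = Var(P_AW), V2 = Var(Q_AW), C = Cov(P_AW, Q_AW).
    """
    return alpha ** 2 * V1 + (1 - alpha) ** 2 * V2 - 2 * alpha * (1 - alpha) * C


def optimal_alpha(V1, V2, C):
    """Minimizer of combination_variance and the minimum it attains."""
    denom = V1 + V2 + 2 * C
    if denom == 0:
        raise ValueError("V1 + V2 + 2C = 0: the optimum is not defined")
    alpha = (V2 + C) / denom
    min_var = (V1 * V2 - C * C) / denom
    return alpha, min_var


# Hypothetical variances and covariance, for illustration only.
V1, V2, C = 2.0, 3.0, -1.0
alpha, min_var = optimal_alpha(V1, V2, C)
print(alpha, min_var, combination_variance(alpha, V1, V2, C))
```

In practice $V_1$, $V_2$ and $C$ are unknown, so they would have to be replaced by sample estimates, which is the second disadvantage noted above.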

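An Estimator that Calibrates on P B and Q B at the Same Time: The P AR Estimator
When using the population size and the population proportion of $B$, the auxiliary information is the same as in the case of a post-stratified estimator. The second way (ii) of obtaining property (11), based on an estimator for $P_A$ calibrated on both $P_B$ and $Q_B$, can be developed under any sampling design. To do so, we must minimize the distance (3) under the conditions
$$\frac{1}{N}\sum_{k\in s}\omega_k B_k = P_B, \qquad \frac{1}{N}\sum_{k\in s}\omega_k (1 - B_k) = Q_B.$$
The calibration weights with this new calibration process are
$$\omega_k = d_k + d_k q_k\left[\lambda_1 B_k + \lambda_2 (1 - B_k)\right], \qquad \lambda_1 = \frac{N (P_B - \hat P_{BH})}{\sum_{k\in s} d_k q_k B_k}, \quad \lambda_2 = \frac{N (Q_B - \hat Q_{BH})}{\sum_{k\in s} d_k q_k (1 - B_k)},$$
and the resulting estimator is
$$\hat P_{AR} = \hat P_{AH} + (P_B - \hat P_{BH})\,\frac{\sum_{k\in s} d_k q_k A_k B_k}{\sum_{k\in s} d_k q_k B_k} + (Q_B - \hat Q_{BH})\,\frac{\sum_{k\in s} d_k q_k A_k (1 - B_k)}{\sum_{k\in s} d_k q_k (1 - B_k)}, \tag{26}$$
where $\hat Q_{BH} = \frac{1}{N}\sum_{k\in s} d_k (1 - B_k)$. The estimator $\hat P_{AR}$ can be developed using only the attribute $B$. Thus, the estimator $\hat P_{AR}$ is well defined.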
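To prevent this article from becoming excessively long, the derivation is omitted: in the same way as with the estimator $\hat P_{AW}$, the asymptotic variance of (26) is given by (28), which can be estimated from the sample. Example 2. Under SRSWOR and $q_k = 1$ for all $k \in U$, the estimator $\hat P_{AR}$ can be expressed by
$$\hat P_{AR} = P_B\,\frac{p_{AB}}{p_B} + Q_B\,\frac{p_A - p_{AB}}{q_B},$$
where $p_A$, $p_B$ and $p_{AB}$ are the sample proportions of units with attribute $A$, with attribute $B$ and with both attributes, and $q_B = 1 - p_B$. In the same way as before with the estimator $\hat P_{AW}$, the asymptotic variance of $\hat P_{AR}$ is given by (45) (see Appendix C).
A corresponding estimator is defined to estimate (45). The estimator $\hat P_{AR}$ has the same asymptotic variance as the estimator $\hat P_{A\alpha}$. Then, by (45), under SRSWOR the estimator $\hat P_{AR}$ is always more efficient than the estimators $p_A$ and $\hat P_{AT}$.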
Note that the proposed estimator is essentially a post-stratified estimator and, in this sense, is not new. Here, we look at it from a different point of view. In a practical situation, the estimator $\hat P_{AR}$ is preferred to the estimator $\hat P_{A\alpha}$, since $\hat P_{AR}$ does not require the estimation of any unknown population quantity, whereas $\hat P_{A\alpha}$ requires estimating the value of $\alpha$, which is generally unknown.
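The equivalence with post-stratification can be verified directly. The following sketch calibrates on $P_B$ and $Q_B$ simultaneously with $q_k = 1$, where the two constraints separate because $B_k(1 - B_k) = 0$, and compares the result with a post-stratified estimator that uses $B$ and its complement as strata. Function names and the toy data are hypothetical.

```python
def p_ar_calibrated(d, A_s, B_s, P_B, N):
    """Estimator calibrated on P_B and Q_B at the same time (q_k = 1)."""
    s_B = sum(dk * b for dk, b in zip(d, B_s))
    s_Bc = sum(dk * (1 - b) for dk, b in zip(d, B_s))
    if s_B == 0 or s_Bc == 0:
        raise ValueError("both B and its complement must appear in the sample")
    lam1 = (N * P_B - s_B) / s_B             # multiplier for the constraint on P_B
    lam2 = (N * (1 - P_B) - s_Bc) / s_Bc     # multiplier for the constraint on Q_B
    w = [dk * (1 + lam1 * b + lam2 * (1 - b)) for dk, b in zip(d, B_s)]
    return sum(wk * a for wk, a in zip(w, A_s)) / N, w


def p_poststratified(A_s, B_s, P_B):
    """Post-stratified estimator under SRSWOR with strata B = 1 and B = 0."""
    in_B = [a for a, b in zip(A_s, B_s) if b == 1]
    out_B = [a for a, b in zip(A_s, B_s) if b == 0]
    return P_B * sum(in_B) / len(in_B) + (1 - P_B) * sum(out_B) / len(out_B)


N, n = 20, 5
d = [N / n] * n
A_s = [1, 0, 1, 1, 0]
B_s = [1, 0, 1, 0, 0]
P_B = 0.5

P_AR, w = p_ar_calibrated(d, A_s, B_s, P_B, N)
P_PS = p_poststratified(A_s, B_s, P_B)
Q_AR = sum(wk * (1 - a) for wk, a in zip(w, A_s)) / N
print(P_AR, P_PS, P_AR + Q_AR, sum(w))
```

Because the calibrated weights add up to $N$, the estimates of $P_A$ and $Q_A$ obtained with the same weights sum to one, which is the shift invariance property sought in this section.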

Extension to Multivariate Auxiliary Information
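The previous section considered only a single auxiliary attribute $B$; let us now assume that the study attribute $A$ is related to $J$ auxiliary attributes $B_1, \ldots, B_J$.
To incorporate the information provided by the $J$ attributes in the estimation of $P_A$ with calibration techniques, we consider new weights $\omega_k$ subject to the following conditions:
$$\frac{1}{N}\sum_{k\in s}\omega_k B_{jk} = P_{B_j}, \qquad j = 1, \ldots, J. \tag{33}$$
Next, we denote $\mathbf{B}_k = (B_{1k}, \ldots, B_{Jk})'$, $\mathbf{P}_B = (P_{B_1}, \ldots, P_{B_J})'$ and $\hat{\mathbf{P}}_{BH} = (\hat P_{B_1 H}, \ldots, \hat P_{B_J H})'$, where $\hat P_{B_j H} = \frac{1}{N}\sum_{k\in s} d_k B_{jk}$ is the Horvitz-Thompson estimator of $P_{B_j}$. By $T$ we denote the following matrix:
$$T = \sum_{k\in s} d_k q_k\, \mathbf{B}_k \mathbf{B}_k'.$$
With the minimization of (3) under the $J$ conditions given by (33), the new weights obtained are
$$\omega_k = d_k + N\, d_k q_k\, (\mathbf{P}_B - \hat{\mathbf{P}}_{BH})'\, T^{-1} \mathbf{B}_k. \tag{34}$$
The calibration estimator based on (34) is given by
$$\hat P_{AWM} = \frac{1}{N}\sum_{k\in s}\omega_k A_k = \hat P_{AH} + (\mathbf{P}_B - \hat{\mathbf{P}}_{BH})'\, T^{-1} \sum_{k\in s} d_k q_k\, \mathbf{B}_k A_k.$$
Note that the weights (34) and the estimator $\hat P_{AWM}$ cannot be obtained if the matrix $T$ is singular.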
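Following Rueda et al. (2007a), the asymptotic variance of the estimator $\hat P_{AWM}$ is given by (36), which can be estimated from the sample. Example 3. Under SRSWOR and $q_k = 1$, when only the two auxiliary attributes $B$ and $C$ are considered, the matrix $T$ can be expressed by
$$T = N\begin{pmatrix} p_B & p_{BC} \\ p_{BC} & p_C \end{pmatrix},$$
where $p_B$, $p_C$ and $p_{BC}$ are the sample proportions of units with attribute $B$, with attribute $C$ and with both. Thus, the estimator $\hat P_{AWM}$, under SRSWOR with two auxiliary attributes, is
$$\hat P_{AWM} = p_A + (P_B - p_B,\; P_C - p_C)\begin{pmatrix} p_B & p_{BC} \\ p_{BC} & p_C \end{pmatrix}^{-1}\begin{pmatrix} p_{AB} \\ p_{AC} \end{pmatrix}.$$
The asymptotic variance of this estimator is given by (39), which is in turn estimated from the sample.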
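The multivariate calibration step can be sketched as follows. The code assumes weights of the form $\omega_k = d_k + d_k q_k \boldsymbol{\lambda}'\mathbf{B}_k$ with $T\boldsymbol{\lambda} = N(\mathbf{P}_B - \hat{\mathbf{P}}_{BH})$ and $T = \sum_{k\in s} d_k q_k \mathbf{B}_k\mathbf{B}_k'$, and solves the linear system by Gaussian elimination so that only the standard library is needed. The function names, the singularity tolerance and the data are hypothetical.

```python
def solve_linear(T, r, tol=1e-12):
    """Solve T x = r by Gauss-Jordan elimination with partial pivoting."""
    m = len(r)
    M = [list(T[i]) + [r[i]] for i in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda i: abs(M[i][c]))
        if abs(M[piv][c]) < tol:
            raise ValueError("matrix T is singular: weights cannot be obtained")
        M[c], M[piv] = M[piv], M[c]
        for i in range(m):
            if i != c:
                factor = M[i][c] / M[c][c]
                M[i] = [mi - factor * mc for mi, mc in zip(M[i], M[c])]
    return [M[i][m] / M[i][i] for i in range(m)]


def multi_calibration_weights(d, B_rows, P_vec, N, q=None):
    """Calibration weights for J binary auxiliary attributes.

    B_rows[k] holds the J attribute values of sampled unit k.
    """
    J = len(P_vec)
    q = [1.0] * len(d) if q is None else q
    T = [[sum(dk * qk * row[i] * row[j] for dk, qk, row in zip(d, q, B_rows))
          for j in range(J)] for i in range(J)]
    rhs = [N * P_vec[j] - sum(dk * row[j] for dk, row in zip(d, B_rows))
           for j in range(J)]
    lam = solve_linear(T, rhs)
    return [dk + dk * qk * sum(l * b for l, b in zip(lam, row))
            for dk, qk, row in zip(d, q, B_rows)]


N, n = 24, 6
d = [N / n] * n
A_s = [1, 1, 0, 0, 1, 0]
B_rows = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 0), (0, 0)]  # (B_k, C_k)
P_vec = [0.4, 0.3]                                         # known P_B and P_C

w = multi_calibration_weights(d, B_rows, P_vec, N)
P_AWM = sum(wk * a for wk, a in zip(w, A_s)) / N
checks = [sum(wk * row[j] for wk, row in zip(w, B_rows)) / N for j in range(2)]
print(P_AWM, checks)
```

The list `checks` reproduces the known proportions of the auxiliary attributes. When two attributes coincide in the sample, $T$ is singular and the sketch raises an error, in line with the remark on the singularity of $T$.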

Simulation study
A limited study was carried out to investigate the design-based finite sample performance of the proposed estimators in comparison with that of conventional estimators.

Simulated data
The estimators were evaluated using 15 simulated populations of size $N = 1000$. These populations were generated as random samples of 1000 units from a Bernoulli distribution with parameter $P_A \in \{0.5, 0.75, 0.9\}$, so that the attributes of interest had approximately the aforementioned population proportions. Auxiliary attributes were also generated using the same distribution, but a given proportion of values were randomly changed so that Cramér's V coefficient between the attribute of interest and the auxiliary attribute (denoted $\phi$ below) took the values 0.5, 0.6, 0.7, 0.8 and 0.9.
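A population of this kind can be generated along the following lines. This is a simplified sketch, not the procedure used in the study: the auxiliary attribute is obtained here by copying the attribute of interest and flipping each value with a fixed probability, and the association is measured with the phi coefficient, whose absolute value coincides with Cramér's V for a 2 × 2 table. The function names, the flip probability and the seed are hypothetical.

```python
import math
import random


def phi_coefficient(A, B):
    """Phi coefficient between two binary attributes."""
    n = len(A)
    p_A = sum(A) / n
    p_B = sum(B) / n
    p_AB = sum(a * b for a, b in zip(A, B)) / n
    return (p_AB - p_A * p_B) / math.sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))


def make_population(N, P_A, flip_prob, seed):
    """Bernoulli(P_A) attribute A and an auxiliary attribute B that is a
    copy of A with each value flipped with probability flip_prob."""
    rng = random.Random(seed)
    A = [1 if rng.random() < P_A else 0 for _ in range(N)]
    B = [1 - a if rng.random() < flip_prob else a for a in A]
    return A, B


A, B = make_population(N=1000, P_A=0.5, flip_prob=0.1, seed=1)
print(sum(A) / len(A), sum(B) / len(B), phi_coefficient(A, B))
```

Reaching a prescribed value of $\phi$ would require tuning the flip probability for each value of $P_A$, which this sketch does not attempt.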
For each simulation, 1000 samples of sizes $n$ = 50, 75, 100 and 125 were selected under SRSWOR to compare the proposed calibration estimators with (1) the Horvitz-Thompson estimator $\hat P_{AH}$ and (2) the ratio estimator $\hat P_{Aratio}$ (see Rueda et al., 2011), in terms of relative bias (RB) and relative efficiency with respect to the ratio estimator (RE),
$$RB = \frac{E[\hat p_A] - P_A}{P_A}, \qquad RE = \frac{MSE[\hat p_A]}{MSE[\hat P_{Aratio}]},$$
where $\hat p_A$ is a given estimator and $E[\cdot]$ and $MSE[\cdot]$ denote, respectively, the empirical mean and the empirical mean square error. Values of RE less than 1 indicate that $\hat p_A$ is more efficient than $\hat P_{Aratio}$.
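The Monte Carlo measures can be computed as in the sketch below. To keep the example self-contained it departs from the study in two respects, both of which are assumptions of the sketch: the baseline for the relative efficiency is the Horvitz-Thompson estimator instead of the ratio estimator, and the population is generated by flipping a fixed fraction of the values of the attribute of interest. Names, seed and parameter values are hypothetical.

```python
import random


def p_ht(A_s):
    """Sample proportion (Horvitz-Thompson estimator under SRSWOR)."""
    return sum(A_s) / len(A_s)


def p_aw(A_s, B_s, P_B):
    """Calibration estimator based on P_B (SRSWOR, q_k = 1)."""
    n = len(A_s)
    p_B = sum(B_s) / n
    if p_B == 0:
        raise ValueError("p_B = 0: the estimator cannot be computed")
    p_AB = sum(a * b for a, b in zip(A_s, B_s)) / n
    return sum(A_s) / n + (P_B - p_B) * p_AB / p_B


def rb_and_mse(estimates, truth):
    """Empirical relative bias and empirical mean square error."""
    mean = sum(estimates) / len(estimates)
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return (mean - truth) / truth, mse


rng = random.Random(2024)
N, n, reps = 1000, 50, 1000
A = [1 if rng.random() < 0.5 else 0 for _ in range(N)]
B = [1 - a if rng.random() < 0.1 else a for a in A]  # strongly associated with A
P_A, P_B = sum(A) / N, sum(B) / N

ht_estimates, aw_estimates = [], []
for _ in range(reps):
    idx = rng.sample(range(N), n)                    # SRSWOR of size n
    A_s = [A[i] for i in idx]
    B_s = [B[i] for i in idx]
    ht_estimates.append(p_ht(A_s))
    aw_estimates.append(p_aw(A_s, B_s, P_B))

rb_ht, mse_ht = rb_and_mse(ht_estimates, P_A)
rb_aw, mse_aw = rb_and_mse(aw_estimates, P_A)
re_aw = mse_aw / mse_ht  # below 1: calibration beats the baseline
print(rb_ht, rb_aw, re_aw)
```

Replacing the baseline with a ratio estimator would give the RE measure used in the study.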
The results derived from this simulation study gave values of RB within a reasonable range. All the calibration estimators produced absolute relative bias values of less than 1%, except in the case $P_A = 0.9$ and $\phi = 0.9$. The univariate ratio estimator produced the highest bias values, especially for small sample sizes.
• The ratio estimator performs poorly when there is little association between the variables. When $\phi = 0.5$ this estimator is worse than the Horvitz-Thompson estimator. Even when $\phi = 0.6$, in some cases ($P_A = 0.75$ and $P_A = 0.9$) the ratio estimator has a large MSE. In populations with a large $\phi$ this problem does not arise.
• With large $\phi$ values, all the estimators that use auxiliary information produce good results: for $\phi \geq 0.7$ all calibration and ratio estimators are better than the Horvitz-Thompson estimator. It is also seen that as $\phi$ increases, all the estimators achieve greater precision, which is particularly marked for very high proportions.
• Of all the calibration estimators, the first one proposed, $\hat P_{AW}$, has the lowest degree of efficiency. Although it performs better than the Horvitz-Thompson estimator on most occasions (except when $P_A = 0.9$ with $\phi = 0.5$ and $0.6$), the others produce a smaller MSE.
• The $\hat P_{AT}$, $\hat P_{A\alpha}$ and $\hat P_{AR}$ estimators perform very well in all cases. For high proportions ($P_A = 0.75$ and $0.9$) the efficiency of the estimators is fairly similar; only in the case of $P_A = 0.5$ and small values of $\phi$ is there a noticeable difference between them in terms of efficiency. In these cases, the best results are achieved by the $\hat P_{AR}$ estimator, which calibrates on $P_B$ and $Q_B$ at the same time.
• The sample size does not produce a clear effect on the behaviour of the estimators; in some cases, as the sample size increases, the efficiency of the estimators increases, while in others it decreases (as when $\phi = 0.9$ and $P_A = 0.9$).
• Ratio and calibration estimators using two auxiliary variables always have a lower RE than those using a single auxiliary variable. For $P_A = 0.5$ and $0.5 \leq \phi \leq 0.7$ the multiple calibration estimator is slightly more efficient than the multiple ratio estimator. For $P_A = 0.75$ and $0.9$ both estimators have similar levels of efficiency.
Ratio estimation is usually known to work well when the auxiliary variable and the variable of interest are positively correlated. In this case it is applied to 0-1 variables, so that a positive association is needed for the method to work (higher frequencies for the cases A=1; B=1 and A=0; B=0 than for the cases 0;1 and 1;0). From this study we can conclude that the association between the variables is the most important factor influencing the behaviour of ratio and also of calibration estimators. As expected, as $\phi$ increases the MSE of the calibrated estimators decreases. Even for moderate values of $\phi$ the calibration estimators improve considerably, in terms of efficiency, on the Horvitz-Thompson estimator. The behaviour of calibration estimators is similar for small proportions, whereas when the proportion approaches 1 there are larger differences among the proposed calibration estimators. However, it is not an easy task to quantify how much association is needed for a good improvement in terms of efficiency, or when the association is so small that it becomes harmful to introduce extra variables in the calibration constraints.

Real Data
In this section we apply some proposed estimators to data obtained in a survey on perceptions of immigration in a certain region in Spain. A sample of size n = 1919 was selected from a population with size N = 4982920, using stratified random sampling.
Among topics of interest in the survey was estimating the percentage of citizens who believe that the authorities should make immigration more difficult by imposing stricter conditions. The auxiliary variable available is the respondent's gender. This variable was observed in the sample and the totals are known for each province (stratum).
Three main variables are included in this study, related to "goodness of immigration" and "amount of immigration". The main variables are the answers to the following questions:
• In general, do you think that for Andalusia, immigration is . . . ? c1-Very bad, c2-Bad, c3-Neither good nor bad, c4-Good, c5-Very good.
• In relation to the number of immigrants currently living in Andalusia, do you think there are . . . ? c1-Too many, c2-A reasonable number, c3-Too few.
In this simulation study, we use the sample as the population and we draw stratified random samples of size n = 240 with proportional allocation (eight strata). Relative efficiency with respect to the ratio estimator is computed, as in the previous case, for the compared estimators over 1000 simulation runs. We computed this relative efficiency for each category of the main variable (5 categories in the first case, and 3 in the second case), and the average over categories is also computed. At the same time, confidence intervals based on a normal distribution and using the proposed estimated variances are computed for each proportion. Table 1 shows the average length over the 1000 simulation runs for each category and the average over categories. In a similar way, the empirical coverage of the confidence intervals is computed.
Tables 1 and 2 show that, from an efficiency standpoint, $\hat P_{A\alpha}$ is best. Looking at the average length of the confidence intervals for the proportion in each category, and at the average over categories, the best estimator is $\hat P_{AR}$, but the optimal $\hat P_{A\alpha}$ has very similar results. However, the empirical coverage of the confidence intervals is closer to the nominal level when the optimal estimator is used.
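The confidence-interval measures can be illustrated with a short sketch. It assumes intervals of the form estimate ± z·sqrt(estimated variance) with z = 1.96 for a nominal 95% level; the confidence level, the function names and the numerical inputs are assumptions of this sketch, since they are not stated here.

```python
import math


def normal_ci(estimate, var_hat, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(var_hat)
    return estimate - half_width, estimate + half_width


def average_length(intervals):
    """Average length of a collection of (lower, upper) intervals."""
    return sum(upper - lower for lower, upper in intervals) / len(intervals)


def empirical_coverage(intervals, truth):
    """Fraction of intervals that contain the true proportion."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)


# Hypothetical estimates and estimated variances from three simulation runs.
runs = [(0.50, 0.0025), (0.62, 0.0016), (0.30, 0.0009)]
intervals = [normal_ci(est, var) for est, var in runs]
print(average_length(intervals), empirical_coverage(intervals, truth=0.55))
```

Averaging these two quantities over all simulation runs gives the length and coverage measures reported for each category.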

Application
IESA, the Institute for Advanced Social Studies, conducted a survey between January 14th and February 13th, 2011 on the perception of culture in the Spanish region of Andalusia (Barometer of Culture of Andalusia, BACU). It is based on a sample drawn from a landline phone frame (N = 5,064,304).
Among the topics of interest in the survey is estimating respondents' perception of their culture in relation to European citizens. An available auxiliary variable is gender, whose totals are known for each stratum. From Table 3 we observe that the $\hat P_{A\alpha}$ estimator produces the best confidence intervals.

Conclusions
In practice, it is important to make the best possible use of available auxiliary information so as to obtain the most efficient estimator possible.
When a proportion is to be estimated in the case of complete auxiliary information (i.e., when auxiliary information is available at the population level for each unit), it is possible to consider estimators that use the logistic regression model (Duchesne, 2003; Wu and Sitter, 2001) as an improvement on the simple estimator. When only the population proportion of an attribute related to the study variable is available, traditional indirect methods such as the ratio estimator of Rueda et al. (2011) or the calibration estimators studied in this work can be applied.
We have studied four calibration estimators for the proportion which are simple to calculate from standard calibration packages and can give rise to considerable increases in the precision achieved, as illustrated by the theoretical results reported here and by the simulation performed. The proposed estimators P AW and P AR can be obtained from any arbitrary sampling design, whereas P Aα and P AT estimators are defined under SRSWOR. However, the extension to a stratified random sampling is straightforward.
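Confidence intervals based on the estimated variances of the studied calibration estimators are also investigated through a limited simulation study, under a more realistic survey design (stratified random sampling) using real data. $\hat P_{A\alpha}$ and $\hat P_{AR}$ have good properties in confidence estimation, and in some sense (a balance between length and coverage) the optimal estimator $\hat P_{A\alpha}$ provides the best results.

Appendix A. Comparison between AV ( P AW ) and AV ( P AQ )
Because $AV(\hat P_{AQ}) = AV(\hat Q_{AW})$, we have $AV(\hat P_{AW}) < AV(\hat P_{AQ})$ when $AV(\hat P_{AW}) < AV(\hat Q_{AW})$. Comparing (9) and (15), we deduce that $AV(\hat P_{AW}) < AV(\hat P_{AQ})$ when $k_1 < 0$.

Appendix B. Obtaining the minimum variance of P Aα
If we denote $V_1 = AV(\hat P_{AW})$, $V_2 = AV(\hat Q_{AW})$ and $C = Cov(\hat P_{AW}, \hat Q_{AW})$, the variance of $\hat P_{A\alpha} = \alpha\,\hat P_{AW} + (1 - \alpha)(1 - \hat Q_{AW})$ is $\alpha^2 V_1 + (1 - \alpha)^2 V_2 - 2\alpha(1 - \alpha)\,C$. The minimum variance of $\hat P_{A\alpha}$ is
$$\frac{V_1 V_2 - C^2}{V_1 + V_2 + 2C} \tag{42}$$
and this is achieved when
$$\alpha = \frac{V_2 + C}{V_1 + V_2 + 2C}. \tag{43}$$
It is easy to see that $V_1$, $V_2$ and $C$ can be written in terms of population proportions. By substituting the resulting values of $V_2 + C$ and $V_1 + V_2 + 2C$ in (43), the value of $\alpha$ given in (44) is found. On the other hand, by substituting the values of $C^2$, $V_1 \times V_2$ and $V_1 + V_2 + 2C$ in (42), and taking into account that
$$P_{A\bar B} - P_A Q_B = (P_A - P_{AB}) - P_A + P_A P_B = P_A P_B - P_{AB},$$
where $P_{A\bar B} = P_A - P_{AB}$ is the proportion of units that have attribute $A$ but not attribute $B$, the asymptotic variance of $\hat P_{A\alpha}$ is obtained.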