Localised estimates and spatial mapping of poverty incidence in the state of Bihar in India—An application of small area estimation techniques

Poverty affects many people, but the ramifications and impacts affect all aspects of society. Information about the incidence of poverty is therefore an important parameter of the population for policy analysis and decision making. In order to provide specific, targeted solutions when addressing poverty disadvantage small area statistics are needed. Surveys are typically designed and planned to produce reliable estimates of population characteristics of interest mainly at higher geographic area such as national and state level. Sample sizes are usually not large enough to provide reliable estimates for disaggregated analysis. In many instances estimates are required for areas of the population for which the survey providing the data was unplanned. Then, for areas with small sample sizes, direct survey estimation of population characteristics based only on the data available from the particular area tends to be unreliable. This paper describes an application of small area estimation (SAE) approach to improve the precision of estimates of poverty incidence at district level in the State of Bihar in India by linking data from the Household Consumer Expenditure Survey 2011–12 of NSSO and the Population Census 2011. The results show that the district level estimates generated by SAE method are more precise and representative. In contrast, the direct survey estimates based on survey data alone are less stable.


Introduction
Bihar is third-most populous state in India. According to the 2011 Population Census, the population of state is 103 million, which is about 8.58 percent of the total population of the country. Poverty is a very complex issue in Bihar and there is an exigent need to devise a focused strategy for poverty eradication. Reliable, qualitative and timely disaggregate level data is essential for effective planning, implementation and monitoring of various Government schemes in Bihar. Spatially disaggregated level data is inevitable for identifying the areas more in need and for developing focused and target oriented intervention programs. The geographic distribution of poverty and wealth is used to make decisions about resource allocation and provides a foundation for the study of inequality and the determinants of economic growth a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 [1][2]. In developing countries, however, the scarcity of reliable quantitative data represents a major challenge to policy-makers and researchers. In India, National Sample Survey Office (NSSO) surveys are the main source of official statistics. A range of invaluable data at state and national level are generated through these surveys. The state level estimates generated by these surveys often masked the local level heterogeneity. More importantly, state level estimates do not adequately capture the extent of geographical inequalities which restricts the scope for evaluating progress locally within and between administrative units. But, the NSSO survey data cannot directly be used to produce reliable disaggregate level (e.g. district or further disaggregate level) estimates due to small sample sizes. In the survey literature, an area (or domain) is regarded as small if the area-specific (or domain-specific) sample is not large enough to support a direct survey estimator of adequate precision with unacceptably large coefficient of variation [3][4]. At the same time it is also true that conducting district specific survey is going to be very trivial and costly as well as time consuming job. An alternative solution to this problem is to use small area estimation (SAE) techniques. The SAE approach produces reliable estimates for such small areas with small sample sizes by borrowing strength from data of other areas. The SAE techniques are based on model-based survey estimation methods. The idea is to use statistical models to link the variable of interest with auxiliary information, e.g. Census and Administrative data, for the small areas to define model-based estimators for these areas. In other words, the SAE method uses indirect small area estimators that make use of the sample data from related areas or domains through linking models, and hence increases the effective sample size in the small areas. Such estimators can provide significantly smaller coefficient of variation than direct estimators, provided the linking models are valid, see [5]. Recently, some researchers have also used satellite imagery and mobile phone networks data to predict the poverty. Existing high resolution daytime satellite imagery is used to predict the spatial distribution of economic well-being across five African countries namely Nigeria, Tanzania, Uganda, Malawi, and Rwanda [6]. Anonymized data from mobile phone networks, combined with survey data, are also used to predict the poverty and wealth of individual subscribers, as well as to create high-resolution maps of the geographic distribution of wealth [7].
Based on the level of auxiliary information available from secondary data sources, SAE methods are categorized as based on area level and unit level small area models. Area level small area models are used when auxiliary information is available only at area (or aggregated) level. They relate area-specific direct survey estimates to area-specific covariates [8]. Unit level small area models, proposed originally by [9], relate the unit values of a study variable to unitspecific covariates. In this paper, we consider the area level version of small area model since auxiliary variables (covariates) are available only at the area (or aggregated) level. We apply SAE techniques to produce reliable small area estimates of the poverty incidence at district level in the State of Bihar in India by linking data from the existing Household Consumer Expenditure Survey data and the Population Census. Small areas are defined as the different districts of State of Bihar in India. In addition, poverty map is also produce to show spatial inequality in distribution of poverty incidence in the state. This paper, in particular, illustrates and provides a guidelines on how the existing large scale survey and Census data can be linked to generate reliable small area estimates for various policy relevant parameters.
line. Two types of variables are required for SAE analysis, the variable of interest and the auxiliary variables. In this study, the variable of interest for which small area estimates are required is drawn from the Household Consumer Expenditure Survey 2011-12 of NSSO for rural areas of the State of Bihar in India. The NSSO survey data is not freely downloadable but it can be obtained from the NSSO, Ministry of Statistics and Programme Implementation, Government of India (http://mospi.nic.in/). The sampling design used in the NSSO data is stratified multi-stage random sampling with districts as strata, villages as first stage units and households as the second stage units. A total of 3312 households were surveyed from the 38 districts of the Bihar. The district-wise sample size varied from minimum 64 to maximum 128 with average of 87 (Table 1). From Table 1, it is evident that district level sample sizes are very small with very low values of average sampling fraction of 0.00025. Therefore, it is difficult to produce reliable estimates of the poverty incidence and their standard errors at district level. Hence, the application of SAE technique is an obvious choice for obtaining the district level estimates of poverty incidence. The SAE technique is expected to provide reliable estimates for the districts having small sample data [3][4][5]. The target variable used for the study is poor households. The poverty line has been used to identify whether given household is poor or not. A household having monthly per capita consumer expenditure below the state's poverty line (Rs 778) is categorised as poor household. The poverty line used in this study is same as those of year 2011-12, given by the planning commission, Government of India (see http://planningcommission.nic.in/news/press_pov2307.pdf).
The auxiliary (covariates) variables used in this analysis are drawn from the Population Census 2011. The Population Census 2011 data can be accessed freely from the Census of India website: http://www.censusindia.gov.in/2011census/population_enumeration.html. These auxiliary variables are only available as counts at district level, and there are approximately 50 such covariates that are available for use in SAE analysis. We therefore carried out a preliminary data analysis in order to define appropriate covariates for SAE modelling, using Principal Component Analysis (PCA) to derive composite scores for selected groups of variables. The reader is referred to [10][11] for a more detailed discussion on PCA. The PCA variables (i.e. composite scores derived for selected groups of variables using PCA) instead of Census variables are often used in model-based small area estimation method because maximum variability is explained with reduced dimension of auxiliary variables, see for example [12][13]. We carried out PCA separately on three groups of variables, all measured at district level and identified as X 1 , X 2 and X 3 respectively. The PCA was done in SPSS software. The first group (X 1 ) consisted of literacy rates by gender and proportions of worker population by gender. The first principal component for this group (X 11 ) explained 52% of the variability in the X 1 group, while adding the second principal component (X 12 ) increased this to 100%. The second group (X 2 ) consisted of the proportions of main worker by gender, proportions of main cultivator by gender and proportions of main agricultural labourer by gender. The first principal component (X 21 ) for this second group explained 67% of the variability in the X 2 group, while adding the second component (X 22 ) increased this to 94%. Finally, the third group (X 3 ) consisted of proportions of marginal cultivator by gender and proportions of marginal agriculture labourers by gender. The first principal component (X 31 ) for this third group explained 52% of the variability in the X 3 group, while adding the second component (X 32 ) increased this to 77%. We then fitted a generalised linear model using direct survey estimates of proportions of poor households as the response variable and the six principal component scores X 11 ,X 12 ,X 21 ,X 22 ,X 31 , and X 32 as potential covariates. The final selected model included the three covariates X 11 , X 21 and X 31 . This final model was then used to produce district wise estimates of poverty incidence, i.e. estimates of the head count ratio (HCR) used in poverty mapping. This model was fitted using the glm() function in R and specifying the family as "binomial" and the district specific sample sizes as the weight.

Methodology
In this Section we illustrate the theoretical framework used to produce small area estimates of the poverty incidence and their measure of precision. The details presented here are followed from [12][13]. Let us assume a finite population U of size N and a sample s of size n is drawn from this population with a given survey design. We assume that this population consists of D small areas or small domains (or simply areas or domains) Throughout, we use a subscript d to index the quantities belonging to small area d (d = 1,. . .,D), where D is the number of small areas (or areas) in the population. The subscript s and r are used for denoting the quantities related to the sample and nonsample parts of the population. So that n d and N d represent the sample and population (i.e., number of households in sample and population) sizes in district d, respectively. Let s d denotes the part of sample from area d such that s ¼ [ D d¼1 s d and n ¼ X D d¼1 n d : Let y di denotes the value of target variable of interest y for unit i in small area d. Let assume that the variable of interest y is binary and the target is the estimation of population counts y d ¼ see for example [14]. Let us denote by y sd and y rd the sample and non-sample counts of poor households in area (or district) d. The sample count y sd has a Binomial distribution with parameters n d and p d , denoted by y sd~B in(n d ,p d ), where p d is the probability of a poor household in area d, often termed as the probability of a 'success'. Similarly, y rd~B in(N d − n d ,p d ). Further, y sd and y rd are assumed to be independent Binomial variables with p d being a common success probability. Here we assume that only aggregated level data is available for the small area modelling. For example, from survey data y sd and from secondary data sources (i.e. Census and administrative records etc) x d , the p-vector of the covariates, are available for area d. Following [12][13], the model linking the probabilities of success p d with the covariates x d is the logistic linear mixed model given by with Here β is the p-vector of regression coefficients, often known as fixed effect parameters, and u d is the area-specific random effect that capture the between area heterogeneity. We assume that u d 's are independent and normally distributed with mean zero and variance ϕ. Here, we observe that equation number (1) relates the area (or district) level proportions (direct estimates) from the survey data to the area (or district) level covariates. This type of model is often referred to as 'area-level' model in SAE terminology, see for example [4,8]. Area level model was originally proposed by Fay and Herriot [8] for the prediction of mean per-capita income (PCI) in small geographical areas (less than 500 persons) within counties in the United States. Fay-Herriot model [8] is widely used area level model for the estimation of small area quantities. In many small area applications, when data are non-linear on original scale, Fay-Herriot model is fitted on transformed scale. For example, some function of small area direct survey estimates is linearly related to the area aggregates of auxiliary variables. In small area income and poverty estimation project of the US Census Bureau, namely SAIPE, Fay-Herriot model is fitted using logarithm of direct poverty rate estimates [15]. Similarly, in Chilean poverty estimation methodology, Fay-Herriot model is fitted with transformed poverty rate estimates using the arcsine transformation [16]. In such cases, model parameters are estimated under Fay-Herriot model fitted on transformed scale. This is followed by back transformation to obtain the estimate for small area quantities on original scale. However, back transformation leads to biased estimates of small area quantities on original scale [15,17]. This approach of poverty estimation based on Fay-Herriot method using the transformed direct estimates is often criticised. The Fay-Herriot method for SAE is based on area level linear mixed model and their approach is applicable to a continuous variable. This model is not applicable for non-normal data. Equation number (1) on the other hand, a special case of a generalized linear mixed model (GLMM) with logit link function, is suitable for modelling discrete data, particularly the binary variables. Here, This leads to Eðy sd Collecting the area level models given by equation number (1), we can write population level version of model of form Here p = (p 1 ,. . .,p D ) T , X ¼ ðx T 1 ; . . . :; x T D Þ T is a D×p matrix, Z is a D×D diagonal matrix and u = (u 1 ,. . .,u D ) T is a vector of D×1 of area random effects, which is normally distributed with mean zero and variance O = ϕI D . Here, I D is a D×D diagonal matrix. Note that estimation of fixed effect parameters β and area specific random effects u d 's uses the data from all small areas. We used an iterative procedure that combines the Penalized Quasi-Likelihood (PQL) estimation of β and u with restricted maximum likelihood (REML) estimation of ϕ to estimate these unknown parameters. Detailed description of the approach can be followed from [18][19][20]. Let us write the total counts, i.e. the total number of poor households in district d as y d = y sd + y rd , where y sd (sample count) is known and y rd (non-sample count) is unknown. Therefore, a plug-in empirical best predictor (EBP) estimate of the total count in area (or district) d, obtained by replacing y rd by its predicted value, is given bŷ where Z T d ¼ ð0; ::; 1; :; 0Þ is 1×D vector with 1 in position d-th. An estimate of proportion in area d is then obtained asp EBP For area with zero sample sizes (i.e. non-sampled areas), the conventional approach for estimating area proportions or counts is synthetic estimation, based on a suitable GLMM fitted to the data from the sampled areas [12]. From equation number (1), for non-sampled areas, the synthetic type predictor of total count for area d iŝ where x d,out denote the vector of covariates associated with non-sampled area d. An alternative to predictor (3) has been proposed by [21]. Unfortunately, this predictor does not have a closed form and can only be computed via numerical approximation. This is generally not straightforward, and so many users tend to favour computation of a plug-in empirical predictors like (3). There are several alternative approaches for estimating the small area counts. For example, Bayesian approaches for modelling the counts, using a negative binomial distribution or via a hierarchical Poisson-gamma model, are popular in the disease mapping and ecological regression literature, see for example, [22][23][24][25][26] and references therein.
The mean squared error (MSE) estimates are computed to assess the reliability of estimates and also to construct the confidence interval for the estimates. Following [12,13,19,20], the MSE estimate of small area predictor (3) is given by In equation number (4), the first two components m 1 and m 2 constitute the largest part of the overall MSE estimate. These are the MSE of the best linear unbiased predictor type estimator when variance component parameter ϕ is known [20]. The third component m 3 is the variability due to the estimate of ϕ. For simplicity and ease of implementation, we define few notations to express different components of the MSE estimate given in equation number (4). We denote byV sd ¼ diag fn dp EBP d ð1 Àp EBP d Þg andV rd ¼ diag fðN d À n d Þp EBP d ð1 Àp EBP d Þg, the diagonal matrices defined by the corresponding variances of the sample and non-sample part respectively. We then define A ¼ fdiagðN À 1 d ÞgV rd , B ¼ fdiagðN À 1 d ÞgfV rd X À AΣV sd Xg and Σ ¼ ð0 À 1 I D þV sd Þ À 1 , where I D is an identity matrix of order D. We further writeV ð1Þ ¼ fX TV sd X À X TV sdΣVsd Xg À 1 andV ð2Þ ¼Σ þΣV sd XV ð1Þ X TVT sdΣ . Using these notations, the various components of MSE estimate are: Here vð0Þ is the asymptotic covariance matrix of the estimate of variance component0, which can be evaluated as the inverse of the appropriate Fisher information matrix for0. Further, this also depends upon whether we are use ML or REML estimate for0. We use REML estimate for0, then vð0Þ ¼ 2ð0 À 2 ðD À 2a 1 Þ þ0 À 4 a 11 Þ À 1 with a 1 ¼0 À 1 traceðV ð2Þ Þ and The empirical results reported in the next Sections are obtained using R software.

Results and discussion
We now discuss the results (i.e. estimates of the proportion of poor households at district level in the State of Bihar) generated by the model-based small area method (3). In this analysis, we use survey data from the Household Consumer Expenditure Survey 2011-12 of NSSO and the Population Census 2011, and assume a binomial specification for the observed district level sample counts. Model specification for this application was discussed in previous Section, and resulted in the identification of three PCA-based covariates, labelled X 11 ,X 21 and X 31 , there. In SAE applications, generally two types of diagnostics measures are suggested and employed, the model diagnostics and the diagnostics for the small area estimates, see [12,27]. The model diagnostics are applied to verify the assumptions of the underlying model. In equation number (1), the random district (or area) specific effects u d are assumed to have an independent and identical normal distribution with mean zero and constant variance ϕ. Fig 1  shows normal q-q plot of the district level residual, which provides evidence in support of the normality assumption for the district-level residuals.
Besides, this visual method for checking normality, we also perform Shapiro-Wilk (SW) test of normality (i.e., test based on uncertainty measurement in terms of p-value). The p-value from SW test indicates the chance that the sample comes from a normal distribution. Typically, a value of 0.05 is used as cutoff, i.e. if p-value is less than 0.05 we can conclude that the sample deviates from normality. We use shapiro.test() function in R software to implement this test. The SW test (W = 0.968 and p-value = 0.330) result with large p-value confirming the normality of district-level residuals. These results clearly indicate that the normality assumption is satisfied reasonably well for the data.
Other diagnostics are used to examine reliability (and validity) of the model-based small area estimates. Such diagnostics are suggested in [27]. The model-based small area estimates should be consistent with the unbiased direct survey estimates, be more precise than the direct survey estimates, and provide reasonable results to users. The values for the model-based small area estimates derived from the fitted model should be consistent with the unbiased direct survey estimates, wherever these are available, i.e. they should provide an approximation to the direct survey estimates that is consistent with these values being "close" to the expected values of the direct estimates. The model-based small area estimates should have mean squared errors significantly lower than the variances of corresponding direct survey estimates. For this purpose, we consider three commonly used measures, a bias diagnostic, a percent coefficient of variation (CV) diagnostic, and a 95 percent confidence interval-diagnostic. See for example, [12,13,27], for more information.
The bias diagnostic is used to examine if the model-based small area estimates are less extreme when compared to the direct survey estimates, when it is available. If direct survey estimates are unbiased, their regression on the true values should be linear and correspond to the identity line. Further, if model-based small area estimates are close to the true values the regression of the direct survey estimates on these model-based estimates should be similar. We therefore plot direct survey estimates on the y-axis and corresponding model-based small area estimates on x-axis and we look for divergence of the fitted least squares regression line from the y = x and test for intercept = 0 and slope = 1. In particular, the aim of the diagnostic is a simple test that the straight line found by regressing the direct estimate against the modelbased estimate provides an adequate fit of the small area estimates. The bias scatter plot of the district level direct survey estimates against the corresponding model-based small area estimates is given in Fig 2, with fitted least squares regression line (dotted line) and line of equality (solid line) superimposed. The bias diagnostic plot in Fig 2 indicates that the district level model-based estimates generated by EBP are less extreme when compared to the direct survey estimates, demonstrating the typical SAE outcome of shrinking more extreme values towards the average. The estimates of poverty incidence generated by EBP method lies close to the line y = x for most of the districts, which indicates that they are approximately design unbiased. This is expected, since the EBP estimates are realisation of random variables and so the regression of the direct estimates on the EBP estimates is unbiased for a test of common expected values. Such a test is provided by the Goodness of Fit (GoF) diagnostic.
This diagnostic tests whether the direct estimates and the model-based estimates generated by EBP are statistically different. The null hypothesis is that the direct estimates and the model-based estimates are statistically equivalent. The alternative is that the direct estimates and the model-based estimates are statistically different. The GoF diagnostic is computed using the following Wald statistic for EBP estimate: The value from the test statistic is compared against the value from a chi square distribution with D degrees of freedom. The value of chi square statistic with D = 38 degrees of freedom is 24.88 at 5% level of significance. In this analysis, the value of Wald statistic is W = 11.81. A smaller value (less than 24.88 in this case) indicates no statistically significant difference between the direct estimates and the model-based estimates generated by EBP. The diagnostic results clearly show that the model based small area estimates are consistent with the direct survey estimates.
In small area applications, aggregation or benchmarking of small area estimates at higher level is always desirable by the users. National statistical offices involved in generating the small area estimates always expect that the small area estimates are aggregated/ benchmarked to higher level estimate. At higher level of aggregation, the direct estimates are considered to be reliable and therefore the model-based small area estimates are expected to be near to the direct estimates when they are aggregated. We checked the aggregation of model-based small area estimates at state level. We computed state level incidence of poverty by aggregating the direct estimates and the model-based small area estimates (i.e. EBP), as ∑ d (N d ×Direct esti- The state level estimate of incidence of poverty computed by aggregation of direct and EBP methods are 0.200 and 0.202 respectively. As one expects, the model-based estimates aggregate well to state level direct estimate. We use the percent CV to assess the improved precision of the model-based small area estimates (EBP) and the direct survey estimates. The CVs show the sampling variability as a percentage of the estimate. Estimates with large CVs are considered unreliable (i.e. smaller is better). In general, there are no internationally accepted tables available that allow us to judge what is "too large". Different organization uses different cut off for CV to release their estimate for the public use. For example, some country uses cut off CV value of 20% for acceptable estimates [13]. However, the CV value of 20% is not standard in all the countries. The % CV of the direct and the EBP estimates are given in Table 1. Fig 3 presents the district-wise distribution of % CV of the model-based estimates generated by EBP and the direct estimates. The results in Table 1 and district-wise values in Fig 3 clearly show that the direct estimates of the proportion of poor households within each district are unstable, with CVs varying from 13.33% to 64% with average of 24.69%. The CVs of the EBP estimates are ranging from 12.96% to 37.27% with average of 21.19%. In Fig 3 and Table 1 we further notice that the CVs of the direct estimates are greater than 20% (30%) in 22 (9) out of the 38 districts. However, the model based estimates are greater than 20% (30%) in 20 (3) out of the 38 districts. The estimate of poverty incidence of Madhepura district is zero because the sample count is zero. As a result, the standard error of direct estimate is zero and hence CV cannot be computed for this district. This is one of the drawback of direct estimation. Except in two districts (i.e. Samastipur and Jamui), the model-based estimates are less variable (i.e. smaller CV), and hence relatively more precise than the direct estimates. In these two districts (Samastipur and Jamui) the value of CVs are at par for both direct and model-based estimates. Overall, using SAE improves the precision of the small area estimates.
The districts-wise 95 percent confidence interval (95% CI) of the EBP and the direct estimates are given in Table 1. It is important to note that the 95% CI for the direct estimates are calculated assuming a simple random sample generated the simple proportions. This ignores the effects of differential weighting and clustering within districts that would further inflate the true standard errors of the direct estimates. The 95% CIs for the model-based estimates are more precise and contain both direct and model-based estimates of the poverty incidence.
The spatial mapping of district-wise poverty incidence generated EBP method for the State of Bihar is shown in Fig 4. This map provides the spatial inequality in distribution of poverty incidence, i.e. the degree of inequality with respect to distribution of poor households in different districts. This map is very useful in identifying the districts and regions with low and high level of poverty incidence in the state. The district-wise poverty incidence generated by the EBP method in rural areas of Bihar ranges from 6 to 36% with average of 21%. From Fig 4 and Table 1, it can be seen that Madhepura (6%) has lowest poverty in the state. Supaul, Begusarai, Araria, Saharsa, Madhubani, Vaishali, Kishanganj, Khagaria and Purba Champaran have Small area estimation of poverty incidence smallest poverty rate (9-14%) whereas Buxar, Rohtas, Pashchim Champaran, Jamui, Bhojpur and Sitamarhi have highest rate of poverty (31-36%). This map clearly shows that districts bordering with eastern Uttar Pradesh have higher poverty incidence. The district level estimates as well as the spatial map of poverty rates are expected to provide invaluable information to policy-analysts and decision-makers for identifying the regions and districts requiring more attention in the state. This is an example of a "poverty map" showing reliable estimates of poverty incidence across a region of interest.

Conclusions
Theory of SAE method for estimation of proportions is well developed, however, its application in the field of agricultural or social sciences are not so popular. In developed countries like USA, UK, Australia etc., SAE has been initiated and included as a part of their objectives in the national statistical offices. Although need of small area statistics has been felt in different agencies and organization in India, but, not much initiative has been taken place. In India, the Census is usually limited in its scope in collection of data; it focuses mainly on basic social and demographic information and that too at decennial interval. On the other hand, NSSO conducts regular surveys on a number of socioeconomic indicators, but their utility is restricted to generate national and state level estimates, but not administrative units below state because of small sample sizes for such units. This paper demonstrates that the SAE can be used as cost effective and efficient approach for generating fairly accurate disaggregate level estimates of the poverty incidence from existing survey data and using auxiliary information from different published data sources. The results clearly indicates the advantage of using SAE technique to cope up the small sample size problem in producing reliable estimates. Notably, the modelbased SAE method brings gain in efficiency in district level estimates of the poverty.
Disaggregate level estimates of poverty incidence and poverty map are useful information for identifying the districts/regions with higher level of poverty rate. In particular, the poverty map shows how household poverty incidence varies by district across the State of Bihar in India. This type of map is a useful aid for policy planners and administrators charged with taking effective financial and administrative decisions that can impact differentially across the region. We conclude by observing that the estimates and spatial distribution of poverty incidence generated from this research should be useful for meeting the data requirements for policy research and strategic planning by different international organizations and by Departments and Ministries in the Government of India. These information can be used by state government of Bihar in allocation of budget in various government schemes. The methodology including MSE estimation and application to real data presented in this paper can also be used for producing reliable, timely and cost effective estimates using survey data from different sectors.