Poverty Mapping in the Age of Machine Learning

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


Introduction
Poverty maps provide granular estimates of poverty at the sub-national level in order to deepen the understanding of poverty in a given country, better inform the targeting of resources, and support the design of interventions tailored to local needs (Bedi et al., 2007; Elbers et al., 2007). Household surveys provide estimates that are sufficiently reliable for large areas in a given country but often lack the desired precision and coverage to properly inform targeting interventions at a granular geographic level. Thus, it is necessary to rely on small area estimation.
Small area estimation is a branch of statistics focused on obtaining estimates of higher quality than those obtained directly from the household survey. These techniques often combine data from household surveys with auxiliary information from censuses, registers, or other sources to produce estimates of higher quality than what is possible from survey data alone for areas or groups with small sample sizes. Since there is no such thing as a free lunch, achieving these gains in quality requires model assumptions that must be thoroughly checked.
The literature on small area estimation is rich, and several variations on this basic procedure have been proposed. Unit-level models conduct estimation and prediction at the household level, assuming a linear relationship between the welfare measure and covariates (Hentschel et al., 1998; Elbers et al., 2003; Molina and Rao, 2010). Unit-level models are not well-suited to situations where the survey and census data correspond to different years, which is often the case in developing countries where censuses are conducted infrequently. Area-level models represent a feasible alternative that similarly relies on linear functional forms, but they conduct estimation and prediction using only aggregate data for the geographical entities of interest (Fay and Herriot, 1979; Torabi and Rao, 2014). Unit-context models represent another alternative and are characterized by an estimation stage wherein household-level measures are modeled exclusively as a linear function of area-level characteristics (Nguyen, 2012; Lange et al., 2018; Masaki et al., 2020).1
Recent developments in poverty mapping have focused on the application of machine learning to remotely-sensed data, largely in response to the issue of out-of-date census information. These approaches also use survey-derived welfare measures but are distinguished by a first stage where a machine-learning model is fit to remotely-sensed covariates (e.g., data derived from satellite imagery or call detail records) rather than census-based covariates. For example, Chi et al. (2022) matched data from several remotely-sensed data sources to survey-based measures of "village" wealth for 56 countries, and then fit a prediction model using gradient boosting machines.2 They then applied the model to the populated surface of all 135 low- and middle-income countries to develop granular poverty maps for each country. Similar approaches have been used for individual countries, including Rwanda (Blumenstock et al., 2015), Senegal (Pokhriyal and Jacques, 2017), Bangladesh (Steele et al., 2017), and Belize (Hersh et al., 2021), among others.3
While often subsumed under the term "machine learning," this modern approach to poverty mapping actually consists of several practices that differ from more traditional approaches. Most obviously, the modern approach substitutes non-parametric statistical methods for the parametric approaches traditionally used for poverty mapping. In addition, the modern approach relies heavily on remotely-sensed covariates rather than covariates derived from census data. Finally, poverty maps produced with machine-learning methods generally share a common validation procedure.
The key feature of this procedure is that model performance is assessed by calculating the R² from a regression of the observed poverty measures on the estimated poverty measures for the same geographical units.4 The poverty measures used in this procedure are most often direct, sample-based estimates rather than the true poverty measures for the regions of interest. While direct estimates are unbiased, they can be imprecise estimates of true poverty measures, meaning that it is unclear whether the validation procedure is informative of actual model performance (Corral et al., 2021b).
In this paper, we provide a more rigorous assessment of the performance of the modern approach to poverty mapping. Our approach consists of constructing a pseudo-census (hereafter "census") that we use to conduct design-based simulation experiments. The census provides the true values we wish to estimate. Our experiments entail repeatedly drawing survey samples from the census, where each sample produces direct poverty estimates for the sampled geographic areas. For each direct estimate, the presence of the census means that we observe the true poverty measure corresponding to the direct estimate, and we use these true poverty measures to gain insight not only into the credibility of model validation based on direct estimates, but also into the performance of machine-learning methods when validated against the true poverty measures. Given that population censuses rarely collect detailed data on income or expenditures, we construct our census from a large-scale household survey conducted in Mexico: the Mexican Intercensal Survey of 2015.
Prior to presenting our simulation results, we discuss three ways in which the R² can be misleading in terms of model assessment. Specifically, we show analytically that the R² based on direct estimates is biased downward, that it is insensitive to differences in location and scale between the poverty estimates and the reference measures, and that it is context-dependent in the sense that it is influenced by the variance of the true poverty rates. We argue that the mean squared error is a better measure for understanding the strength of the association between the poverty estimates and reference measures. In addition, we argue that the ultimate concern is to understand how different methodological choices affect poverty by influencing the efficiency of targeting. We therefore propose a targeting experiment through which we examine the relative ability of alternative mapping methods to alleviate poverty in the context of the Mexican data.
The results of our simulations can be summarized as follows. First, we find that the magnitude of the downward bias of the R² based on direct estimates is large, with the true R² being around 35 to 50 percent higher depending on the model considered. We further find that this bias has important implications for model selection in that the direct R² incorrectly identifies the appropriate level of estimation in the vast majority of our simulations. Second, when assessing model performance based on the true poverty rates using the mean squared error, we find that our closest approximation to the standard machine-learning implementation performs poorly relative to several benchmark models. We find that these performance issues are largely due to the limited predictive power of remotely-sensed covariates relative to census-based covariates. Finally, our targeting simulations show that none of our machine-learning implementations outperform more traditional poverty mapping methods that are feasible with the same data.
Our paper builds on the work of Corral et al. (2021a) and Corral et al. (2022), who similarly used the Mexican Intercensal Survey to examine the performance of several poverty mapping methods. Corral et al. (2021a) focused exclusively on traditional poverty mapping approaches and did not consider the performance of machine-learning methods. While Corral et al. (2022) consider the performance of machine-learning approaches, their implementation only uses census-based covariates and thus does not examine the performance of the standard implementation that relies on remotely-sensed covariates. Importantly, we are not the first paper to evaluate the performance of machine-learning methods relative to census data. While Yeh et al. (2020) and Chi et al. (2022) validate their models relative to censuses, both papers use the R² as a performance metric and focus on wealth estimates rather than poverty. Finally, van der Weide et al. (2022) examine how well machine-learning methods using remotely-sensed data can reproduce poverty maps estimated using census data. They do not, however, compare their estimates to a ground truth. This paper therefore represents the first attempt to rigorously assess the performance of modern poverty mapping methods relative to true poverty rates. In what follows, Section 2 discusses the data we use for our experiments and Section 3 discusses the various methods we examine, including both machine-learning methods and some traditional poverty mapping methods that we use as performance benchmarks. Section 4 then considers the issue of model validation, where we critically assess the R² as a validation metric and then propose some alternative metrics. Finally, Section 5 presents the results of our experiments and Section 6 provides some concluding remarks.

Data
The Mexican Intercensal Survey was carried out by the Mexican National Institute of Statistics and Geography (INEGI). The sample consists of 5.9 million households and is representative at the national, state (32 states), and municipality (2,457 municipalities) levels. It is also representative for localities with populations of 50,000 or more inhabitants. The administered questionnaire gathered information on household income, geographic location, household demographics, dwelling characteristics, and economic activities, among other topics. The size of the dataset is especially important, as it allows us to draw repeated samples that are sufficiently large and diverse. In addition, the fact that the survey gathered detailed income information on all households allows us to reliably calculate the poverty measures that serve as the basis for our simulation experiments.5
Prior to sampling from the Intercensal Survey, we make three modifications to the data. First, as a large number of households reported earning no income, we randomly remove 90 percent of these households.6 Second, to ensure that all municipalities are sufficiently large, we remove all municipalities with fewer than 500 households. Finally, to ensure that all primary sampling units (PSUs) are also sufficiently large for sampling purposes, we combine several neighboring PSUs such that all include at least 300 households. Our final census dataset consists of 3.9 million households in 1,865 municipalities and 16,297 PSUs. The resulting census is used to draw 500 survey samples that serve as the basis for our simulation experiments (Tzavidis et al., 2018). In what follows, we describe our approach to constructing these samples.
Our sampling procedure is intended to reflect standard design elements used in many face-to-face household surveys conducted in the developing world (Grosh and Muñoz, 1996). Mexico's 32 states comprise the intended domains of the sample, and the indicator of interest is household income per capita. For each sample, we seek to achieve a relative standard error (RSE) of 7.5 percent for average household income in each state.7 We use a two-stage clustered design wherein PSUs or clusters are selected within each domain and then a sample of 10 households is selected within each cluster. Our decision to select 10 households within each cluster is consistent with the design of other large-scale household surveys conducted in developing countries. Under simple random sampling, the minimum sample size required for each state is as follows:

$$ n_s = \left( \frac{\sigma_s}{0.075\, \bar{w}_s} \right)^2 $$

where w̄_s and σ_s represent average household income per capita and the standard deviation of per capita income for state s, respectively.
The minimum sample size under simple random sampling must be adjusted to account for the clustered design. The design effect due to clustering is accounted for by estimating the intra-cluster correlation of per capita income within each state. The correlation estimates can then be used to adjust the above sample sizes for the clustered design as follows:

$$ n_s^{clust} = n_s \left[ 1 + (n_{sc} - 1)\, \rho_s \right] $$

where n_sc denotes the number of households selected in cluster c within state s (10 in this case) and ρ_s is the intra-cluster correlation of per capita income. Given the above, we can calculate the number of clusters needed to achieve the minimum sample size for each state and then multiply this number by 10 to obtain the final household sample size. Taking these sample size requirements as given, each of our samples is then drawn in accordance with the two-stage design.
Specifically, the clusters within each state are selected with probability proportional to size (without replacement), where the measure of size is the number of households within each cluster. The households within each cluster are then selected via simple random sampling. According to this design, the inclusion probability for a given household is approximately:

$$ \pi_{scv} \approx \left( N_s\, \frac{\tilde{n}_{sc}}{\tilde{n}_s} \right) \left( \frac{n_{sc}}{\tilde{n}_{sc}} \right) $$

where ñ_sc is the total number of households in cluster c within state s, N_s is the number of clusters selected in state s, and ñ_s is the total number of households in state s.8 The sample size across the 500 samples is roughly 23,500 households. Under this sampling scenario, not all municipalities are included, and the number of municipalities included varies from sample to sample, ranging from 951 to 1,020. The median municipality included in a given sample is represented by a single cluster, and thus its sample size is 10 households.
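As a rough, self-contained illustration of these design calculations (not the authors' code), the sketch below computes the SRS sample size, the design effect, and the number of clusters for one hypothetical state; the income draws, target RSE, and intra-cluster correlation are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per capita incomes for one state (made-up log-normal draws).
income = rng.lognormal(mean=8.0, sigma=0.9, size=50_000)

target_rse = 0.075            # target relative standard error for mean income
households_per_cluster = 10
rho = 0.15                    # assumed intra-cluster correlation of per capita income

w_bar = income.mean()
sigma = income.std(ddof=1)

# Minimum sample size under simple random sampling: RSE = sigma / (sqrt(n) * w_bar).
n_srs = (sigma / (target_rse * w_bar)) ** 2

# Inflate by the design effect due to clustering: 1 + (m - 1) * rho.
deff = 1 + (households_per_cluster - 1) * rho
n_clustered = n_srs * deff

# Number of clusters (rounded up) and the final household sample size.
n_clusters = int(np.ceil(n_clustered / households_per_cluster))
n_households = n_clusters * households_per_cluster

print(f"SRS sample size: {n_srs:.0f}")
print(f"Design effect:   {deff:.2f}")
print(f"Clusters needed: {n_clusters}, households: {n_households}")
```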
In addition to the Intercensal Survey, INEGI has also released geo-referenced data that is similar to the remotely-sensed data often used in machine-learning applications. These geographic variables were calculated by INEGI for the year 2015 (the same year as the Intercensal Survey) and are only available at the municipality level. The geo-referenced data include 105 different indicators.
Specifically, the dataset includes information on 21 different dimensions, and each dimension is associated with five different summary measures. For example, we summarize the atmospherically resistant vegetation index for each municipality using the minimum, maximum, mean, sum, and standard deviation of the index. The same procedure is applied to all dimensions.9 We then have two alternative sets of covariates that we use in our simulations: the geo-referenced data and the standard set of sociodemographic characteristics from the census.
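To make the construction of the geo-referenced indicators concrete, the snippet below sketches the five-summary-measure aggregation for a single hypothetical dimension; the column names and values are illustrative and are not INEGI's actual variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy pixel-level values of one dimension (e.g., a vegetation index) tagged by municipality.
pixels = pd.DataFrame({
    "municipality": rng.integers(0, 5, size=1_000),
    "arvi": rng.normal(0.3, 0.1, size=1_000),   # hypothetical index values
})

# Five summary measures per municipality, as described in the text.
summaries = pixels.groupby("municipality")["arvi"].agg(["min", "max", "mean", "sum", "std"])
summaries.columns = [f"arvi_{stat}" for stat in summaries.columns]
print(summaries.head())
```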

Methods
This section describes the methods we apply to the Mexican Intercensal Survey. The first subsection discusses machine learning, with a special focus on gradient boosting, as it is among the most popular machine-learning methods for poverty mapping. The second subsection then discusses more traditional approaches to poverty mapping, which we use as benchmarks in our performance assessment of the machine-learning methods.

Machine learning
The goal of machine learning is to develop high-performance algorithms for prediction, classification, and clustering/grouping tasks (Varian, 2014; Athey, 2018; Athey and Imbens, 2019). Such tasks can be divided into supervised and unsupervised learning problems. While unsupervised learning focuses on identifying clusters of observations that are similar in terms of their features, supervised learning uses a set of features to predict some outcome of interest. Supervised learning can be further divided into regression and classification tasks, where regression is concerned with predicting continuous outcomes and classification focuses on categorical outcomes. While there are many machine-learning methods used for regression and classification tasks (e.g., lasso, random forests, or support vector machines), here we fix ideas by focusing on a method that has been particularly popular for poverty mapping: gradient boosting.
Originally developed by Friedman (2001), gradient boosting machines are a family of machine-learning techniques that combine a large number of weak learners to form a stronger ensemble prediction. Unlike some common ensemble techniques (e.g., random forests) that simply average the models in the ensemble, gradient boosting adds models to the ensemble sequentially by fitting new weak learners to the negative gradient of some chosen loss function. With the classic squared-error loss function, this amounts to sequentially fitting new models to the current residuals of the ensemble.10 Extreme gradient boosting (XGBoost) is a particularly popular implementation of gradient boosting that uses classification and regression trees as the base models or weak learners.
The popularity of XGBoost is due to the fact that it is fast, scalable, and has been shown to outperform competing methods across a wide range of problems (Chen and Guestrin, 2016).
Classification and regression trees are based on a sequential binary splitting of the covariate space that serves to partition the data into subsamples that are increasingly homogeneous in terms of the outcome variable. A tree begins with a single root node that is split into two child nodes according to some decision rule. A decision rule pertains to a single explanatory variable, and observations are assigned to the child nodes depending on whether they meet some condition associated with that variable.11 The child nodes can be further split into new child nodes based on different decision rules that involve other explanatory variables and cutoff points. Any child node that is not further split is referred to as a terminal node or leaf, and each leaf is associated with a parameter that provides a prediction for the subsample associated with that leaf. For any given observation, the ensemble prediction used by XGBoost is the sum of that observation's predictions across all trees in the ensemble.
In what follows, we discuss the key features of the XGBoost algorithm, drawing on the presentation provided in Chen and Guestrin (2016). Let y^d_i denote the outcome of interest (e.g., direct estimates of poverty rates) for observation or entity i, and let x_i denote a vector of explanatory variables. In line with the above, the predicted value of the outcome y^p_i is given by the sum of predictions across the individual trees in the ensemble:

$$ y^p_i = \sum_{t=1}^{T} f_t(x_i) $$

where the trees are indexed by t and f_t(x_i) gives the prediction of a given tree as a function of x_i. To build the trees used in the model, XGBoost minimizes the following regularized objective function:

$$ L = \sum_{i} l\big(y^d_i, y^p_i\big) + \sum_{t} r(f_t) $$

where l(y^d_i, y^p_i) is the loss associated with the i-th observation and r(f_t) is a regularization term that serves to mitigate overfitting. In our implementation, we use the classic squared-error loss such that l(y^d_i, y^p_i) = (y^d_i − y^p_i)². XGBoost uses the following specification of the regularization term:

$$ r(f_t) = \gamma M_t + \frac{\lambda}{2} \sum_{j=1}^{M_t} m_{tj}^2 $$

where M_t is the total number of leaves in tree t, m_tj is the prediction associated with the j-th leaf in a given tree, and γ and λ are hyperparameters that must be chosen by the user (we discuss our selection of hyperparameters below). The regularization term serves to mitigate overfitting by penalizing complicated trees and smoothing the predictions associated with each tree. That is, the method relies on a large number of weak learners, and the regularization term helps control the complexity of the learners. It is important to note that the above specification of the regularization term is only one of many possible specifications, though Chen and Guestrin (2016) argue that it is simpler than the leading alternatives and easier to parallelize. Further note that when the hyperparameters are set to zero, the objective function becomes equivalent to that used in traditional gradient boosting.
To minimize the objective function, the model is trained in a sequential manner, building one tree at a time. The objective function at iteration t can be written as follows:

$$ L^{(t)} = \sum_{i} l\big(y^d_i,\; y^p_{i,t-1} + f_t(x_i)\big) + r(f_t) $$

where y^p_{i,t} = y^p_{i,t−1} + f_t(x_i). Each iteration then adds the tree f_t(x_i) that greedily minimizes the objective function at that iteration. More specifically, XGBoost uses a second-order Taylor series expansion to approximate the loss function at any given iteration:

$$ L^{(t)} \approx \sum_{i} \left[ g\big(y^d_i, y^p_{i,t-1}\big)\, f_t(x_i) + \tfrac{1}{2}\, h\big(y^d_i, y^p_{i,t-1}\big)\, f_t(x_i)^2 \right] + r(f_t) \tag{8} $$

where g(·, ·) and h(·, ·) denote the first- and second-order derivatives of the loss function l(y^d_i, y^p_{i,t−1}) with respect to y^p_{i,t−1}. That is, the point y^p_{i,t−1} is taken as the basis for the expansion. The above conveniently omits the first term of the series, l(y^d_i, y^p_{i,t−1}), because it is a constant.
The XGBoost algorithm is based on a re-expression of Eq. (8). Let z_j denote the set of observations that belong to the j-th leaf of the tree being built at iteration t. Substituting in the explicit form of the regularization term and noting that all observations belonging to the same leaf receive the same prediction m_tj, we can write

$$ L^{(t)} \approx \sum_{j=1}^{M_t} \left[ \Big( \sum_{i \in z_j} g\big(y^d_i, y^p_{i,t-1}\big) \Big) m_{tj} + \tfrac{1}{2} \Big( \sum_{i \in z_j} h\big(y^d_i, y^p_{i,t-1}\big) + \lambda \Big) m_{tj}^2 \right] + \gamma M_t $$

or, equivalently,

$$ L^{(t)} \approx \sum_{j=1}^{M_t} \left[ G_j\, m_{tj} + \tfrac{1}{2} \big( H_j + \lambda \big) m_{tj}^2 \right] + \gamma M_t $$

where G_j and H_j represent the sums over g(·, ·) and h(·, ·) for all observations in z_j. Note that the summation in the above is over j, which indexes the leaves in the tree.
For a given tree structure, we can find the optimal prediction values by minimizing the objective function with respect to m_tj, which yields m*_tj = −G_j/(H_j + λ). With a squared-error loss function, where g(y^d_i, y^p_{i,t−1}) = 2(y^p_{i,t−1} − y^d_i) and h(y^d_i, y^p_{i,t−1}) = 2, the optimal prediction values are effectively the (penalized) average of the residuals associated with a given leaf. We can then substitute the optimal prediction values into the objective function to obtain the following:

$$ L^{(t)*} = -\frac{1}{2} \sum_{j=1}^{M_t} \frac{G_j^2}{H_j + \lambda} + \gamma M_t $$

which gives us the minimal objective function for a given tree structure. Ideally, one would enumerate all possible tree structures and then use the tree associated with the smallest value of the (minimized) objective function, but this is often intractable. XGBoost instead implements an alternative approach that seeks to minimize the objective at a given iteration by greedily optimizing one level of the tree at a time.
To this end, consider splitting a node in a given tree into "left" and "right" child nodes based on some candidate splitting rule. Let G_L and G_R represent G_j for the left and right child nodes, respectively. Similarly, let H_L and H_R represent H_j for the child nodes. We can then find the reduction in the objective function as follows:

$$ \mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{12} $$

where the first two terms in brackets capture the loss associated with the child nodes and the third term captures the loss associated with the parent node being split. When splitting a node in any given tree, XGBoost evaluates Eq. (12) for every possible split rule for every available explanatory variable and chooses the split associated with the maximal reduction in the objective function.12 XGBoost then continues to split nodes in a given tree until a stopping criterion is met (e.g., a user-specified maximum tree depth). Once a given tree is built, XGBoost moves to building the next tree in the ensemble and continues to do so until reaching the user-specified number of trees.
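To fix ideas, the toy example below evaluates the optimal leaf prediction and the split gain implied by the expressions above for a squared-error loss; it is a didactic sketch rather than XGBoost's internal implementation, and the data are invented.

```python
import numpy as np

def leaf_gradients(y_true, y_pred_prev):
    """First- and second-order derivatives of the squared-error loss l = (y - yhat)^2."""
    g = 2.0 * (y_pred_prev - y_true)   # dl/dyhat
    h = np.full_like(y_true, 2.0)      # d^2 l / dyhat^2
    return g, h

def optimal_leaf_value(g, h, lam):
    """m* = -G / (H + lambda) for the observations falling in one leaf."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam, gamma):
    """Reduction in the regularized objective from splitting a node into left/right children."""
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

# Toy data: observed poverty rates and the current ensemble predictions for one node.
y_true = np.array([0.10, 0.15, 0.40, 0.55])
y_prev = np.array([0.30, 0.30, 0.30, 0.30])
g, h = leaf_gradients(y_true, y_prev)

# Candidate split: the first two observations go left, the last two go right.
left = np.array([True, True, False, False])
print("optimal leaf value (no split):", optimal_leaf_value(g, h, lam=1.0))
print("gain from candidate split:    ", split_gain(g, h, left, lam=1.0, gamma=0.0))
```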
The above references several hyperparameters that must be chosen by the user, including the hyperparameters in the regularization term (i.e., γ and λ), the maximum tree depth, and the number of trees in the ensemble. There are two additional hyperparameters that can be used to further prevent overfitting. First, one can use column or feature subsampling, where each tree in the ensemble is built based on a (uniformly) random subset of all explanatory variables. This procedure can also speed up the algorithm because it reduces the number of covariates that must be searched across when creating new splits. Second, one can use shrinkage by scaling the predictions associated with each tree by a factor η, often called the learning rate. The predicted value of the outcome at iteration t is then given by y^p_{i,t} = y^p_{i,t−1} + η f_t(x_i). Shrinkage reduces the influence of each tree so that no individual tree has too much influence on the ensemble prediction.
One possible approach to hyperparameter selection is to use a grid search where the user (1) specifies the domain of the hyperparameters, (2) applies the model to every combination of hyperparameters, and (3) selects the hyperparameters that perform best in terms of some metric (e.g., mean squared prediction error in the validation dataset). Grid search, however, is computationally inefficient because hyperparameter proposals are uninformed by previous evaluations. We instead use a Bayesian hyperparameter optimization procedure, which sequentially updates a probability model that relates the hyperparameter space to the performance metric and then uses this model to intelligently explore the hyperparameter space (Bergstra et al., 2011). Specifically, our implementation of gradient boosting relies on the open-source library XGBoost, which we combine with the Hyperopt library that conducts Bayesian hyperparameter optimization using the tree-structured Parzen estimator.
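The snippet below sketches how the XGBoost and Hyperopt libraries can be combined for Bayesian hyperparameter optimization with the tree-structured Parzen estimator. The search space, evaluation budget, and toy data are illustrative assumptions and should not be read as the exact settings used in our application.

```python
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)

# Toy stand-in for area-level data: covariates X and direct poverty estimates y in (0, 1).
X = rng.normal(size=(500, 10))
y = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500))))

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Hypothetical search space; the actual space used in the paper may differ.
space = {
    "max_depth": hp.quniform("max_depth", 2, 8, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "reg_lambda": hp.loguniform("reg_lambda", np.log(0.1), np.log(10.0)),
    "gamma": hp.uniform("gamma", 0.0, 1.0),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
}

def objective(params):
    # Hyperopt proposes floats; cast the integer-valued hyperparameters.
    model = XGBRegressor(
        max_depth=int(params["max_depth"]),
        n_estimators=int(params["n_estimators"]),
        learning_rate=params["learning_rate"],
        reg_lambda=params["reg_lambda"],
        gamma=params["gamma"],
        colsample_bytree=params["colsample_bytree"],
        objective="reg:squarederror",
    )
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    return {"loss": mse, "status": STATUS_OK}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25, trials=Trials())
print("Best hyperparameters found:", best)
```

In practice, one would typically replace the single train/validation split with cross-validation when the number of areas is small.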
Finally, regarding our application to the Mexican data, recall that mapping procedures entail an estimation stage and a prediction stage. For the machine-learning methods, we perform this procedure at two different levels of analysis, namely the PSU and municipality levels. Our focus is nevertheless on obtaining estimates of poverty rates at the municipality level, meaning that when we conduct estimation and prediction at the PSU level, we further aggregate the predictions to the municipality level using population-weighted averaging. Further, recall that we have two alternative sets of covariates available: geo-referenced and census-based variables. Given that the geo-referenced data is only available at the municipality level, we conduct estimation and prediction at the municipality level for any implementation that uses these covariates. When using census-based covariates, however, we consider implementations at both the PSU and municipality levels.13

Traditional small-area methods
We consider three alternative traditional methods for obtaining small area estimates of poverty as a way to benchmark the performance of the non-parametric machine-learning approach.14 Specifically, we consider area-level models, unit-level models, and a method that attempts to combine unit- and area-level models. First proposed by Fay and Herriot (1979), area-level models conduct estimation and prediction using data aggregated to the geographic area of interest (e.g., the PSU or municipality level).
The area-level model consists of two basic parts. The first part of the model links the actual or true poverty rates y^a_i for the i-th entity to a vector of area-level covariates x_i as follows:

$$ y^a_i = x_i'\beta + \varepsilon_i $$

where β is a vector of parameters to be estimated and ε_i represents random area-level effects that are assumed to be mean zero with a constant variance σ²_ε. Since we lack information on the true poverty rates, this model cannot be fit directly.
Instead, direct estimates of the poverty rates from the survey data, ŷ^d_i, are used as the outcome variable. These estimates are nevertheless subject to sampling error because they are based on the area-specific sample data. The second part of the model thus assumes that the direct estimates are centered around the true poverty rates as follows:

$$ \hat{y}^d_i = y^a_i + \omega_i $$

This part is referred to as the sampling model and is assumed to have heteroscedastic error variance var(ω_i | y^a_i) = σ²_{ω,i}. These variances are assumed to be known, although in practice they are estimated from the survey data. The best linear unbiased predictor (BLUP) of y^a_i under the model is ỹ_i = x_i'β̂ + ε̂_i, where β̂ is the weighted least squares estimator of β under the Fay-Herriot model and ε̂_i = ψ_i(ŷ^d_i − x_i'β̂) is the BLUP of the area effect ε_i, with ψ_i = σ²_ε/(σ²_ε + σ²_{ω,i}). In essence, the final estimate for an area contained in the survey is a weighted average of the direct estimate ŷ^d_i and the synthetic estimator x_i'β̂, where the weights ψ_i and (1 − ψ_i) are determined by the quality of the model and the quality of the direct estimator. For non-sampled entities, the area-level model simply uses the synthetic estimates as predictions. As Molina et al. (2022) note, normality assumptions are not required for the estimates to be BLUP, but they are needed for estimation of the mean squared error (MSE). Finally, similar to the machine-learning approach, the area-level model can be implemented at the PSU or municipality level.15
Regarding unit-level models, the well-known methods of Elbers et al. (2003) and Molina and Rao (2010) are based on the nested error model originally proposed by Battese et al. (1988). This model assumes that (transformed) household income relates linearly to a vector of household-level covariates according to the following model:

$$ w_{iv} = x_{iv}'\delta + \mu_i + \xi_{iv} \tag{15} $$

where w_iv denotes the transformed equivalized income of household v in location i, x_iv represents a vector of household-specific characteristics, δ is a vector of coefficients on the household characteristics, μ_i is a location-specific random effect such that μ_i ~ N(0, σ²_μ), and ξ_iv is a household-level error term such that ξ_iv ~ N(0, σ²_ξ). The two errors are assumed to be independent of each other.16 We use a particular version known as the "census empirical best" method (CensusEB), which is a variant of the "empirical best" (EB) method of Molina and Rao (2010).17 It is evident that the dependent variable of the model is not the poverty status of a given household.
Instead, the method attempts to replicate the welfare distribution based on an assumed data generating process. The parameter estimates obtained when fitting model (15) to the survey data (δ̂, σ̂²_μ, and σ̂²_ξ) are applied to the census household-level data to generate a simulated value of w_iv for each household in the census. From the simulated census welfare vector it is possible to obtain estimates of any welfare indicator, not just poverty headcount rates.
The parameter estimates are obtained via a suitable method such as REML or Henderson's Method III (Henderson, 1953).18 The welfare w_iv for each household is simulated M times using Monte Carlo simulation, where each simulated welfare value m follows:

$$ w^*_{ivm} = x_{iv}'\hat{\delta} + \mu^*_{im} + \xi^*_{ivm} $$

where μ*_im and ξ*_ivm are drawn from their assumed distributions.19 Here, for sampled municipalities, μ*_im is generated as μ*_im ~ N(μ̂_i, σ̂²_μ(1 − κ̂_i)), where μ̂_i = κ̂_i(w̄_i − x̄_i'δ̂), κ̂_i = σ̂²_μ/(σ̂²_μ + σ̂²_ξ/n_i), and n_i is the sample size for municipality i. For municipalities that are not part of the sample, μ*_im is generated as μ*_im ~ N(0, σ̂²_μ), which corresponds to the method used by ELL (Elbers et al., 2003) for all municipalities regardless of whether they are in the sample. Thus, EB makes greater use of the survey data for predictions corresponding to sampled areas. Note that EB methods only ensure that the area-level means of the dependent variable are empirical best linear unbiased predictors (EBLUPs); the same is not guaranteed for poverty, which is a non-linear parameter. Hence the importance of replicating the welfare distribution.

15 Estimates at the PSU level may be aggregated to a higher level; see Rao and Molina (2015) for more details.
16 Elbers et al. (2003) offer the option of accommodating non-normally distributed errors; however, despite this flexibility, Corral et al. (2021b) illustrate that even under non-normality the method lags behind Molina and Rao's (2010) EB approach.
17 As opposed to empirical best estimates, census empirical best estimates do not include survey observations when calculating area-level poverty rates. Corral et al. (2021b) show that when the sample size is small relative to the population, the difference between the two methods is negligible. For additional discussion, see Molina (2019).
One major aspect of concern for small area estimation is the estimation of noise. A statistical agency in a given country may undertake a small area estimation application because the estimates derived from survey data are not sufficiently precise, or because estimates are not even possible since the area was not sampled. Hence, a central concern surrounding small area estimation is the estimation of the MSE of each area's estimate. Estimates of noise are obtained via bootstrap replications in which the model's assumptions are exploited to obtain an estimated MSE (González-Manteiga et al., 2008).20 This aspect is one of the biggest differences between poverty maps based on small area estimation and those based on machine learning. Statistical agencies across the globe will assess an estimate's noise before deciding whether to publish it, hence the importance of such estimates.
The key assumption of the unit-level model is that the distribution of the explanatory variables is similar between the survey used for estimation and the census used for prediction. This assumption is generally violated if the census is outdated, which is often the case in developing countries, and in this event the application of the unit-level model will lead to similarly outdated poverty maps (Lange et al., 2018). Importantly, machine-learning approaches to poverty mapping are commonly motivated by the fact that censuses are outdated and are thus viewed as a way to produce timely poverty maps in a data-constrained environment. Stated differently, given the differing data requirements of the unit-level model and the machine-learning approach, the two methods are not generally considered direct competitors. The unit-level model nevertheless serves as a useful performance benchmark, as it is often viewed as the "gold standard" of poverty mapping.
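To make the CensusEB prediction step concrete, the sketch below simulates welfare for the census households of a single sampled municipality and computes a headcount estimate. It is a deliberately simplified, illustrative version of the procedure described above: the parameter values and data are invented, income is left untransformed, and survey-weighting details are ignored.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed (made-up) parameter estimates from the fitted nested error model.
delta_hat = np.array([0.8, -0.5])    # coefficients on household covariates
sigma2_mu, sigma2_xi = 0.04, 0.25    # variances of the area effect and household error

# Survey data for one sampled municipality (n_i households) and its census counterpart.
n_i = 12
x_survey = rng.normal(size=(n_i, 2))
w_survey = x_survey @ delta_hat + rng.normal(scale=0.3, size=n_i)
x_census = rng.normal(size=(2_000, 2))           # all census households in the municipality

# Conditional distribution of the area effect for a sampled municipality.
kappa = sigma2_mu / (sigma2_mu + sigma2_xi / n_i)
mu_hat = kappa * (w_survey.mean() - x_survey.mean(axis=0) @ delta_hat)

poverty_line = 0.0                               # arbitrary line on the (transformed) welfare scale
M = 100                                          # Monte Carlo replicates
headcounts = np.empty(M)
for m in range(M):
    mu_star = rng.normal(mu_hat, np.sqrt(sigma2_mu * (1 - kappa)))
    xi_star = rng.normal(0.0, np.sqrt(sigma2_xi), size=len(x_census))
    w_star = x_census @ delta_hat + mu_star + xi_star
    headcounts[m] = np.mean(w_star < poverty_line)

print(f"CensusEB-style headcount estimate: {headcounts.mean():.3f}")
```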
Unlike unit-level models, unit-context models were specifically developed for use in off-census years. Initially developed by Nguyen (2012) and further refined by Masaki et al. (2020), these models are based on a simple modification of the unit-level approach: rather than modeling household-level income as a function of household-level characteristics, unit-context models use only area-level covariates from the census. Specifically, the unit-context model replaces the survey-based covariates x_iv in Eq. (15) with census aggregates at the municipality and PSU levels, x̄_i and x̄_ic, and thus relaxes the assumption that the survey-based covariates match the moments of those from the census. Like the area-level model, the unit-context model is a direct competitor to the machine-learning approach because it is feasible to implement with the same data. For this reason, we use the unit-context model as a reference model when evaluating the performance of machine-learning methods. However, it is worth noting that the unit-context model has been shown to produce biased and noisy poverty estimates, largely due to its inability to explain between-household variation in income (Corral et al., 2022).
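Before turning to validation, the following small numeric illustration shows the Fay-Herriot combination of direct and synthetic estimates described earlier in this subsection; the estimates and variances are invented for illustration.

```python
import numpy as np

# Hypothetical direct estimates, synthetic (model) estimates, and variances for four areas.
y_direct = np.array([0.42, 0.18, 0.55, 0.30])          # survey-based direct poverty estimates
y_synthetic = np.array([0.35, 0.22, 0.48, 0.31])        # x_i' beta_hat from the area-level model
var_sampling = np.array([0.010, 0.002, 0.020, 0.004])   # sigma^2_{omega,i}, treated as known
var_area_effect = 0.005                                  # estimate of sigma^2_epsilon

# Shrinkage weight: more weight on the direct estimate when it is precise.
psi = var_area_effect / (var_area_effect + var_sampling)
y_fh = psi * y_direct + (1 - psi) * y_synthetic

for d, s, w, f in zip(y_direct, y_synthetic, psi, y_fh):
    print(f"direct={d:.2f}  synthetic={s:.2f}  weight on direct={w:.2f}  FH estimate={f:.3f}")
```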

Model Validation
The coefficient of determination, or R², is the most commonly used metric for validating poverty maps generated using machine-learning methods (see, e.g., Jean et al., 2016). In this section, we first critically assess this validation approach by discussing three ways in which the R² can be misleading in terms of model assessment. We then argue that a better alternative is to validate poverty estimates using the mean squared error, which is not subject to many of the limitations of the R². Finally, we outline another validation procedure that focuses more directly on what is of ultimate concern with poverty mapping, namely informing the targeting of resources for the purposes of poverty alleviation.
Let y^a_i denote the actual or true poverty indicator for the i-th geographical entity. Further, let y^d_i and y^p_i denote the direct estimate and the model-based prediction of the poverty indicator, respectively. The standard method for assessing the predictive performance of model-based poverty estimates calculates the R² from a regression like the following:

$$ y^d_i = \alpha + \delta\, y^p_i + \zeta_i $$

where α and δ are the regression coefficients and ζ_i represents the error term.21 The R² for the above regression can be written as follows:

$$ R^2_d = 1 - \frac{\sigma^2_\zeta}{\sigma^2_d} \tag{17} $$

where σ²_d is the variance of the direct estimate and σ²_ζ is the variance of ζ. R²_d thus conveys information about how much of the variance of the direct estimate is explained by the model-based prediction.
While unbiased, direct estimates can be imprecise estimates of true poverty indicators (Corral et al., 2021b). Ideally, one would assess the performance of model-based predictions on the basis of the true poverty indicators themselves. That is, model-based predictions are more appropriately assessed using the R² from the following regression:

$$ y^a_i = \alpha + \delta\, y^p_i + \upsilon_i $$

where υ_i represents the error term. The R² for this regression can be written as

$$ R^2_a = 1 - \frac{\sigma^2_\upsilon}{\sigma^2_a} \tag{19} $$

where σ²_a is the variance of the true poverty indicator and σ²_υ is the variance of υ. We are specifically interested in characterizing the bias of R²_d when it is used to assess the true quantity of interest, R²_a. Let

$$ y^d_i = y^a_i + \omega_i $$

where ω_i is a mean-zero error term with variance σ²_ω. That is, we view this as a classical measurement error problem where the direct estimate of the poverty indicator is a noisy estimate of the true poverty indicator. Note that the preceding equation implies that σ²_d = σ²_a + σ²_ω and that ζ_i = υ_i + ω_i, so that σ²_ζ = σ²_υ + σ²_ω. Following Majeske et al. (2010), we define the bias as the difference between the two coefficients of determination, R²_d − R²_a. Substituting Eqs. (17) and (19) into this expression for the bias and rearranging gives

$$ R^2_d = R^2_a \left( \frac{\sigma^2_a}{\sigma^2_a + \sigma^2_\omega} \right) \tag{22} $$

which establishes a basic relationship between R²_d and R²_a. See Majeske et al. (2010) for additional mathematical details.
Eq. (22) shows that R²_d = R²_a when σ²_ω = 0 or, equivalently, when y^d_i = y^a_i for all i. In the presence of measurement error, that is when σ²_ω > 0, R²_d will be biased downward, and the magnitude of the bias will be determined by the size of σ²_ω relative to σ²_a. Validation exercises based on R²_d will thus tend to mischaracterize the performance of model-based approaches to poverty mapping, with the resulting estimates of R² being biased downward. One implication of the above is that R²_d is context-dependent and will generally be influenced by the design of the survey data (e.g., larger sample sizes will tend to reduce measurement error, all else equal). In particular, a poorly performing model applied to a low-bias setting may yield an estimate of R²_d that is higher than the estimate of R²_d from a strong model applied to a high-bias setting. The presence of bias in the measures of R² can thus reverse the rank ordering of models when comparisons are made across contexts.22
There is an additional form of bias in R² that can lead to overoptimistic conclusions about model validity. To illustrate this, note that R²_a in Eq. (19) is mathematically equivalent to the squared (Pearson) correlation coefficient between y^a_i and y^p_i. A well-known property of the correlation coefficient is that it is invariant to any positive affine transformation of its arguments, implying that the squared correlation coefficient is invariant to any affine transformation (Gujarati, 2003).23 We can thus write

$$ R^2\!\left(y^a_i,\; y^p_i\right) = R^2\!\left(y^a_i,\; q_1 + q_2\, y^p_i\right) $$

where q_1 and q_2 are arbitrary constants. The implication is that the R² is insensitive to systematic bias in the model-based poverty estimates. For example, consider some unbiased poverty estimate y^u_i, meaning that E(y^u_i − y^a_i) = 0. Now consider another estimate y^b_i that is identical to y^u_i up to a locational shift, implying that y^b_i = y^u_i + q and E(y^b_i − y^a_i) = q. While y^b_i is biased (i.e., it will systematically over- or under-estimate poverty), it will achieve an R² identical to that of y^u_i.
A final issue with R² as a performance metric is that it is affected by the variance of the true poverty measures. To see this, consider the following re-expression of R²_a:

$$ R^2_a = 1 - \frac{\tfrac{1}{N}\sum_{i=1}^{N}\big(y^a_i - \hat{y}^a_i\big)^2}{\sigma^2_a} \tag{24} $$

where ŷ^a_i is the fitted value from the regression of y^a_i on y^p_i and N is the number of geographical entities. The above shows that R² can be viewed as a normalized version of the mean squared error, with the variance of the true poverty indicator used as the normalization factor. This normalization produces an additional source of context-dependence in R². Specifically, consider two models that perform equally well in that they achieve the same mean squared error. However, assume that one model is applied to a context where the variance of y^a_i is large and the other is applied to a low-variance context. The above then implies that the model applied to the high-variance context will achieve a higher R², even though the performance of the two models is arguably identical.
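The attenuation relationship in Eq. (22) is easy to verify numerically. The sketch below simulates true poverty rates, model predictions, and noisy direct estimates, and compares the R² computed against the truth with the R² computed against the direct estimates; all distributional choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 2_000

# True poverty rates, model predictions correlated with the truth, and noisy direct estimates.
y_true = rng.beta(2, 5, size=N)                       # "actual" poverty rates
y_pred = y_true + rng.normal(scale=0.05, size=N)      # model-based predictions
y_direct = y_true + rng.normal(scale=0.08, size=N)    # direct estimates = truth + sampling error

def r2(reference, prediction):
    """R^2 from regressing the reference measure on the prediction (squared correlation)."""
    return np.corrcoef(reference, prediction)[0, 1] ** 2

r2_true = r2(y_true, y_pred)
r2_direct = r2(y_direct, y_pred)

# Predicted attenuation: R2_direct ~= R2_true * var(y_true) / (var(y_true) + var(noise)).
attenuation = y_true.var() / (y_true.var() + 0.08 ** 2)
print(f"R2 vs. truth:            {r2_true:.3f}")
print(f"R2 vs. direct estimates: {r2_direct:.3f}")
print(f"Predicted R2 vs. direct: {r2_true * attenuation:.3f}")
```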
Rather than relying on R² as a performance metric, one can avoid regression-based model assessment altogether by calculating the mean squared error directly:

$$ \mathrm{MSE}\big(y^p, y^a\big) = \frac{1}{N} \sum_{i=1}^{N} \big( y^p_i - y^a_i \big)^2 \tag{25} $$

which avoids the normalization issue and provides a more direct understanding of the expected squared error of the estimates. The mean squared error in Eq. (25) is sensitive to systematic biases in the predictions. Reconsidering the biased and unbiased poverty estimates described above (i.e., y^u_i and y^b_i), we can write MSE(y^b, y^a) = MSE(y^u + q, y^a) = MSE(y^u, y^a) + q². Systematic biases in the estimates will thus be penalized by the mean squared error. Further, with multiple estimates for each entity, one can calculate entity-specific mean squared errors, which can be decomposed into parts associated with the bias and variance of the estimates. Letting y^p_ik denote estimate k = 1, 2, ..., K for the i-th entity, we have

$$ \mathrm{MSE}_i = \frac{1}{K} \sum_{k=1}^{K} \big( y^p_{ik} - y^a_i \big)^2 = \frac{1}{K} \sum_{k=1}^{K} \big( y^p_{ik} - \bar{y}^p_i \big)^2 + \big( \bar{y}^p_i - y^a_i \big)^2 \tag{26} $$

where ȳ^p_i is the average of the K estimates for entity i, the first term on the right-hand side captures the variance of the estimates for entity i, and the second term captures the (squared) bias.24
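The decomposition in Eq. (26) can be checked with a few lines of code. The snippet below computes the entity-specific MSE for a toy set of repeated estimates and splits it into its variance and squared-bias components; the numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(3)

y_true_i = 0.25                                   # true poverty rate for one municipality
K = 500                                           # number of simulated samples / estimates
y_hat_ik = 0.30 + rng.normal(scale=0.04, size=K)  # repeated estimates (biased upward by 0.05)

mse_i = np.mean((y_hat_ik - y_true_i) ** 2)
variance_i = np.mean((y_hat_ik - y_hat_ik.mean()) ** 2)
bias_sq_i = (y_hat_ik.mean() - y_true_i) ** 2

print(f"MSE:             {mse_i:.5f}")
print(f"variance part:   {variance_i:.5f}")
print(f"bias^2 part:     {bias_sq_i:.5f}")
print(f"variance+bias^2: {variance_i + bias_sq_i:.5f}")   # equals the MSE
```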
The primary reason for obtaining granular poverty estimates is to improve the targeting of resources. While understanding the error of the poverty estimates is useful in this regard, the ultimate concern is to understand how different methodological choices affect poverty alleviation by influencing the efficiency of targeting. Consequently, we also consider how machine-learning methods perform in terms of reducing poverty in the context of the census created from the Mexican Intercensal Survey.
Similar to Elbers et al. (2007), we propose simulating a hypothetical poverty-alleviation program that distributes resources to households according to a given set of small area estimates.25 In addition to the estimates, the procedure requires true household-level incomes from a census-type dataset.
The procedure we use in our simulations is as follows:
1. Using the true household-level incomes, calculate the poverty headcount and poverty gap for the country as a whole for some chosen poverty line. Using the poverty gap measure, determine the budget required to completely eradicate poverty.26 Calculate an individual-level transfer amount by dividing the total budget by the number of people living in poverty.
2. Using a given set of small area estimates, order the municipalities according to their estimated poverty headcounts, from most to least poor. For the poorest municipality, augment the true per capita income of all households by an amount equal to the transfer (i.e., each household receives an amount equal to the transfer times the number of household members). Continue this distribution procedure for the second poorest municipality, the third poorest municipality, and so on until there are not enough resources to cover all households in a given municipality.
For the marginal municipality, distribute the remaining budget equally among individuals in that municipality.
3. Using the post-transfer household incomes, calculate an updated poverty headcount to assess the effects of the distribution scheme on poverty alleviation.
4. Repeat steps 2 and 3 for each estimation method for all 500 sets of small area estimates corresponding to each sample.
The benchmark level of poverty alleviation from this scheme corresponds to the result of applying the above procedure to the true municipal poverty rankings; the results for each method are compared against this benchmark. In our implementation, we fix the poverty line at the 25th percentile of per capita income prior to transfers. The benchmark poverty rate, after delivering transfers to people in municipalities ranked by their true poverty rates, is 19.9 percent.
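The sketch below illustrates the mechanics of steps 1 through 3 for a toy census; the household incomes and the noisy headcount estimates used for ranking are invented, and the code is intended only to convey the allocation logic, not to reproduce our implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Toy census: households with per capita incomes, sizes, and municipality codes.
n_hh = 20_000
census = pd.DataFrame({
    "municipality": rng.integers(0, 50, size=n_hh),
    "hh_size": rng.integers(1, 7, size=n_hh),
    "pc_income": rng.lognormal(mean=7.5, sigma=0.8, size=n_hh),
})

# Step 1: poverty line, budget implied by the poverty gap, and per person transfer amount.
poverty_line = census["pc_income"].quantile(0.25)
poor = census["pc_income"] < poverty_line
gap = (poverty_line - census.loc[poor, "pc_income"]) * census.loc[poor, "hh_size"]
budget = gap.sum()
transfer = budget / census.loc[poor, "hh_size"].sum()

# Step 2: rank municipalities by (here, noisy made-up) estimated headcounts and distribute.
true_headcount = census.groupby("municipality")["pc_income"].apply(
    lambda x: (x < poverty_line).mean())
estimated_headcount = true_headcount + rng.normal(scale=0.05, size=len(true_headcount))
ranking = estimated_headcount.sort_values(ascending=False).index

census["post_income"] = census["pc_income"]
remaining = budget
for mun in ranking:
    people = census.loc[census["municipality"] == mun, "hh_size"].sum()
    cost = transfer * people
    amount = transfer if cost <= remaining else remaining / people  # marginal municipality
    census.loc[census["municipality"] == mun, "post_income"] += amount
    remaining -= min(cost, remaining)
    if remaining <= 0:
        break

# Step 3: post-transfer headcount.
print(f"Headcount before: {poor.mean():.3f}")
print(f"Headcount after:  {(census['post_income'] < poverty_line).mean():.3f}")
```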

Results
Recall from Section 4 that the R² measure often used to assess the performance of machine-learning models is biased downward.27 To understand the extent of this bias in practice, we calculate the R² associated with gradient boosting using both the direct estimates and the true poverty rates as the basis for performance assessment. We conduct this comparison at two different levels of estimation, namely the PSU and municipality levels. All R² calculations are nevertheless made at the municipality level, meaning that the PSU-level model is estimated at that level and the poverty estimates are then aggregated to the municipality level. For each of our 500 samples, we thus calculate four R² measures: (1) PSU-level estimates evaluated against the direct poverty estimates, (2) PSU-level estimates evaluated against the true poverty rates, (3) municipality-level estimates evaluated against the direct poverty estimates, and (4) municipality-level estimates evaluated against the true poverty rates. All models use only census-based covariates.28
Panel (a) of Figure 1 presents these results, and we find evidence of considerable downward bias for both the PSU- and municipality-level models. For example, for the PSU-level model, we find that the median "true" R² is 0.88 whereas the median "direct" R² is 0.59. Interestingly, the true R² tends to be higher for the PSU-level model than for the municipality-level model, but the opposite is the case for the direct R². This suggests that model selection based on the direct R² can be misleading in that the direct R² incorrectly suggests that the municipality is the preferred level of estimation. More specifically, we find that the true R² identifies the PSU as the preferred level of estimation for 437 of the 500 samples, but the direct R² selects the PSU level for only four of the 500 samples. The direct R² thus identifies the incorrect level of estimation for 433 of the 500 samples.
In Section 4, we discussed two other ways in which the R² can be misleading for assessing model performance: it is invariant to locational shifts and it is influenced by the variance of the true poverty rates. Using the Mexican data, we can assess the relevance of the former issue by comparing the true R² to a constrained version of the true R² that fixes the intercept and coefficient to zero and one, respectively. By constraining the R² in this manner, we remove its ability to adjust for systematic differences between the poverty estimates and the true poverty rates. Panel (b) of Figure 1 presents the results, and we see that, as expected, the unconstrained R² overstates the model's performance, albeit only to a small degree. We nevertheless find that even these small differences can lead to model selection issues. In particular, the constrained R² identifies the PSU level as the appropriate level of estimation for 355 of the 500 samples, whereas the unconstrained version, as mentioned, selects the PSU level for 437 of the 500 samples. The commonly used unconstrained R² thus selects the incorrect model for 82 of the 500 samples.
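For concreteness, one way to compute the constrained R² is as one minus the ratio of the mean squared error to the variance of the reference measure, which is what fixing the intercept and slope at zero and one amounts to. The snippet below contrasts it with the usual unconstrained R² for illustrative data containing a small level shift.

```python
import numpy as np

rng = np.random.default_rng(5)
y_true = rng.beta(2, 5, size=1_000)
y_pred = y_true + 0.05 + rng.normal(scale=0.03, size=1_000)   # predictions with a level shift

# Unconstrained R^2: squared correlation, insensitive to the 0.05 shift.
r2_unconstrained = np.corrcoef(y_true, y_pred)[0, 1] ** 2

# Constrained R^2: intercept fixed at 0 and slope at 1, so the shift is penalized.
r2_constrained = 1 - np.mean((y_true - y_pred) ** 2) / y_true.var()

print(f"unconstrained R2: {r2_unconstrained:.3f}")
print(f"constrained R2:   {r2_constrained:.3f}")
```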
Given the shortcomings of the R², we have argued that the empirical MSE of the poverty estimates at the area level represents a better metric for validating the performance of a particular set of estimates. Panel (a) of Figure 2 thus presents our empirical MSE results for several alternative models. The first three models apply gradient boosting at the municipality level using three alternative covariate sets: geo-referenced covariates ("GIS-MUN"), census-based covariates ("CEN-MUN"), and all available covariates ("ALL-MUN"). The fourth model applies gradient boosting at the PSU level and only uses census-based covariates ("CEN-PSU").29 The final three models are traditional methods that we use to benchmark the performance of gradient boosting: unit-level (CensusEB), unit-context, and area-level models. We consider three alternative implementations of the area-level model that, much like for gradient boosting, vary the level of analysis and the covariates used. In particular, the empirical best (CensusEB) model is often considered the "gold standard" of small-area estimation and corresponds to an ideal scenario where up-to-date, micro-level census data are available. For these traditional models, all available covariates enter the models linearly (i.e., we do not include interaction or quadratic terms).
One advantage of the MSE is that it can be calculated for each geographical entity. As such, each box plot in panel (a) of Figure 2 summarizes the distribution of municipality-level MSE estimates rather than the aggregate MSE for each sample (i.e., the results are based on Eq. (26)). It is useful to first consider how gradient boosting performs relative to the traditional models when using the same census-based covariates. As is evident in panels (a) and (b) of Figure 2, the gradient boosting methods using census aggregate data, Gradient boosting (CEN-PSU) and Gradient boosting (CEN-MUN), perform better than the unit-context and Fay-Herriot models, despite using similar data, and approximate the performance of the unit-level (CensusEB) estimates.
The median MSE for the CensusEB model is 0.003 while that of the gradient boosting methods is 0.002, and both have similar interquartile ranges. The results suggest that gradient boosting methods can approximate CensusEB unit-level methods and will likely outperform unit-context and Fay-Herriot methods by a considerable margin, which is quite encouraging for cases where estimates are needed but the survey and census data are not contemporaneous.30 We also examine covariate importance after running gradient boosting with all covariates on each sample and averaging each covariate's importance across the 500 samples. We find that all of the top 10 covariates are census-based, with only five geo-referenced covariates entering the top 20.
Figure 4 presents MSE and bias estimates for various models across different "poverty quantiles." That is, we order all municipalities by their true poverty rates, divide them into 50 quantiles, and calculate the average MSE and bias for each quantile. For the sake of simplicity, we only present results for two machine-learning models, the municipality-level models using only geo-referenced and only census-based covariates, together with select traditional methods. Panel (a) presents the MSE results and panel (b) presents the bias results. In both plots, gradient boosting with census-based covariates performs quite similarly to the "gold standard" unit-level model across all quantiles. However, gradient boosting with geo-referenced covariates again departs considerably from this benchmark, particularly at the lowest and highest quantiles. The bias results are particularly notable: the model with geo-referenced covariates systematically underestimates poverty at the lowest quantiles and overestimates poverty at the highest quantiles. The standard implementation thus understates the spatial distribution of poverty.
An additional consideration is how the methods perform for out-of-sample predictions.
Because the experiment consists of 500 samples taken from the census data, a given municipality may be present in some samples and absent in others. Consequently, to evaluate out-of-sample properties, we rank municipalities by their likelihood of selection across the 500 samples.
The likelihood of being present ranges from 0.09 to 1, with a median of 0.48 and a mean of 0.53.
Among the 20 percent of municipalities least likely to be sampled, the gradient boosting model with GIS-based covariates is the worst performing in terms of MSE, followed closely by the unit-context models (Figure 5). In terms of bias among these least likely to be sampled municipalities, the gradient boosting model with geo-referenced covariates shows very wide variability, while unit-context models show downward bias. The other gradient boosting models perform similarly to the CensusEB model in terms of bias and MSE. The performance of the different methods, in terms of MSE, appears to improve with the likelihood of being selected. Among the municipalities most likely to be selected, the gradient boosting model with geo-referenced covariates is still the worst performing (Figure 6). In terms of bias among the most likely to be selected municipalities, however, the Fay-Herriot model at the municipality level performs best, illustrating the importance of the method's EBLUP property.
While we have focused on the performance of gradient boosting up to this point, there are several other machine-learning models that can be used for poverty mapping. Leading alternatives include the least absolute shrinkage and selection operator (lasso) (Tibshirani, 1996), the random forest (Breiman, 2001), and Bayesian additive regression trees (BART) (Chipman et al., 2010). While the lasso is a simple way to conduct variable selection and regularization in the context of linear regression, the random forest and BART models are similar to gradient boosting in that they all rely on regression tree frameworks. We ask whether these alternative models can perform better than gradient boosting within the confines of the standard implementation, which relies exclusively on remotely-sensed covariates. Panels (a) and (b) of Figure 7 present the MSE and bias results for all models. With the possible exception of the lasso, we find that all models perform comparably, meaning that the performance of gradient boosting when using geo-referenced covariates cannot be remedied by simply adopting a different machine-learning model. Finally, recall from Section 4 that we argued that the ultimate concern with poverty mapping is to understand how different methodological choices affect poverty alleviation. Figure 8 thus compares the machine-learning and traditional approaches in terms of their ability to reduce poverty in the context of our targeting simulations. In these simulations, a hypothetical poverty-alleviation program targeted on the basis of the true municipality-level poverty rates is able to reduce the overall poverty headcount to just under 20 percent. While the achieved poverty rates of all models fall within the 20 percent range, we find that gradient boosting with geo-referenced covariates is one of the worst performing models. With the exception of the area-level model using geo-referenced covariates, the standard implementation is thus outperformed by all traditional methods, including those that are direct competitors. However, when incorporating census-based covariates, we find that gradient boosting achieves poverty rates that are comparable to the traditional methods, yet again showing that an exclusive reliance on remotely-sensed data can limit the performance of machine learning.
The results of the targeting simulation are most relevant to cases where targeting is based solely on the ranking of municipalities; in such cases, all methods appear to do a solid job. Nevertheless, there are many instances where the targeting mechanism may rely on the methods' poverty estimates themselves. For example, Senegal's Unique National Registry (RNU, by its French acronym) of vulnerable households relies on the actual estimates from poverty maps to determine the quotas for sampling by municipality (Ferre, 2018).
It is in those cases that a method yielding unbiased and precise estimates should be employed.

Conclusion
This paper set out to validate traditional small area estimation and machine-learning approaches to poverty mapping. Machine-learning approaches have become popular across the globe for producing highly disaggregated estimates of poverty that often feed into policy making. The methods are particularly popular in the most data-deprived settings, where census data are not frequently collected and, even when collected, may not be contemporaneous with a household survey.
Hence, the methods are also a popular alternative for cases where the census data are too old for use in a traditional unit-level small area estimation exercise. To the best of our knowledge, a rigorous validation of machine-learning methods like the one conducted here, where machine-learning methods are compared to traditional small area estimates and to a true welfare measure, has not been done before.
We first showed analytically that the R² can be a misleading measure of model performance: it is biased downward when direct estimates are used as the reference measure, it is insensitive to differences in location and scale between the reference measure and the model predictions, and it is influenced by the variance of the reference measure. One practical issue this raises is that it can be misleading to compare the performance of poverty mapping methods across different datasets, as is done in Yeh et al. (2020), Chi et al. (2022), and Aiken et al. (2022), among others. Most notably, survey design elements can affect measurement error in the direct estimates (e.g., through different sample sizes), which in turn influences the magnitude of the bias in the R². All else equal, models applied to low-bias settings will appear to perform better than those applied to high-bias settings.
Our simulations built on these analytical results. First, we showed empirically that the direct R² exhibits considerable downward bias, with the true R² being roughly 35 to 50 percent higher than the direct R². We also showed that this bias has implications for model selection, as the direct R² commonly identified the incorrect level of estimation in our data. Second, we assessed the performance of machine-learning methods and found that they may outperform, or achieve bias and MSE similar to, the best-performing traditional small area estimation approaches, particularly for out-of-sample predictions. Third, we compared the potential benefits of publicly available GIS covariates and showed that, with the data used here, GIS covariates do not outperform census aggregate data. This result suggests that there may be considerable gains from linking census and administrative data for poverty mapping in off-census years. Fourth, we expanded the validations of Corral et al. (2022) and illustrated that Fay-Herriot models remain the preferred approach for off-census years among traditional small area estimation methods. Finally, we presented evidence that, despite differing performance across models, all models are beneficial for targeting. While some models may outperform others in their potential poverty reduction when estimates are used for targeting, even the worst performing models achieve decent poverty reduction, though this may come at a considerable cost.
The results found here should be caveated by the fact that they depend on the Mexican Intercensal data used. Results and model performance may differ considerably in a different country or with different data. It is quite possible that in less-developed economies the geo-referenced covariates perform better than what is observed here, but there is no guarantee that this will be the case. Validating the usefulness of a particular type of data is not straightforward, as it may not work equally well everywhere; it is for this reason that the comparisons here are made in a more holistic manner.
When comparing the performance of geo-referenced covariates to other types of data, one should use the same method, as is done here. On the one hand, we show that using census aggregates at the municipality level with gradient boosting comes very close to the best performing method's estimates. On the other hand, using the same method with GIS covariates, we show that it falls short of competing with a model where only census aggregates are used.
The paper opens considerable avenues for future research. One avenue relates to methodology: specifically, how the noise of machine-learning estimates can be accurately quantified, since the noise measures presented here rely on knowing the true poverty rates, which is not possible in the vast majority of contexts. Second, the results suggest that most methods would do a solid job for spatial targeting exercises. However, a key question is whether this will hold elsewhere and under what circumstances one method will outperform others.

Figure 1: R² estimates at the PSU and municipality levels
Figure 2: Empirical design MSE and bias for select models
Figure 4: Empirical average MSE and bias by poverty quantiles for select models
Figure 5: Empirical design MSE and bias among the 20 percent of municipalities least likely to be sampled
Figure 6: Empirical design MSE and bias among the 20 percent of municipalities most likely to be sampled
Figure 7: Empirical design MSE and bias with alternative machine-learning models