Program Targeting with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan

Can mobile phone data improve program targeting? By combining rich survey data from a "big push" anti-poverty program in Afghanistan with detailed mobile phone logs from program beneficiaries, we study the extent to which machine learning methods can accurately differentiate ultra-poor households eligible for program benefits from ineligible households. We show that machine learning methods leveraging mobile phone data can identify ultra-poor households nearly as accurately as survey-based measures of consumption and wealth, and that combining survey-based measures with mobile phone data produces classifications more accurate than those based on a single data source.


Introduction
Each year, hundreds of billions of dollars are spent on targeted social protection programs.
The importance of these programs increased dramatically in the past year: In 2020, global extreme poverty increased for the first time in two decades, and most countries expanded their social protection programs, with more than 1.1 billion new recipients receiving government-led social assistance payments (Gentilini et al., 2020).
Determining who should be eligible for program benefits (targeting) is a central challenge in the design of these programs (Lindert et al., 2020).
In high-income countries, targeting frequently relies on tax records or other administrative data on income. In low- and middle-income countries (LMICs), where a large fraction of the workforce is informal, programs often require primary data collection. The difficulty and cost of collecting data, and the variable quality of what gets collected, can introduce significant errors in the targeting process (Deaton, 2016; Jerven, 2013; Grosh et al., in press).
These issues are exacerbated in fragile and conflict-affected countries, where two thirds of the world's poor are expected to reside by 2030 (Corral et al., 2020).
This paper evaluates the extent to which non-traditional administrative data, processed with machine learning, can be used for program targeting. Specifically, we match call detail records (CDR) from a large mobile phone operator in Afghanistan to household survey data from an impact evaluation of the Afghan government's Targeting the Ultra-Poor (TUP) anti-poverty program. Eligibility for the TUP program was determined through a combination of a community wealth ranking (CWR) and a short follow-up survey (we refer to this combination as the hybrid targeting method). We then assess the accuracy of three counterfactual targeting approaches at identifying the actual beneficiaries of the TUP program: (i) our CDR-based method, which applies machine learning to data from the mobile phone company; (ii) an asset-based wealth index, which uses asset ownership to approximate poverty in a spirit similar to a proxy-means test (PMT); and (iii) consumption, a common benchmark for measuring poverty in LMICs.
Our analysis produces three main results. First, by comparing errors of inclusion and exclusion using the program's hybrid method as a benchmark, we find that the CDR-based method is nearly as accurate as the asset and consumption-based methods for identifying the phone-owning ultra-poor households. Second, we find that methods combining CDR data with measures of assets and consumption are more accurate than methods using any single data source to identify the ultra-poor. Third, we find that when non-phone-owning households are included in the analysis, the CDR-based method remains accurate if non-phone-owning households are classified as ultra-poor and therefore program-eligible; however, targeting performance is quite poor if households without phones are ineligible for benefits.
These results connect two distinct strands of prior work. The first is a rich literature on program targeting, which studies the effectiveness of different mechanisms for identifying program beneficiaries. In LMICs, research has focused in particular on the performance of proxy means tests (PMTs) (Grosh & Baker, 1995; Filmer & Pritchett, 2001; Brown et al., 2018), community-based targeting strategies (Alatas et al., 2012; Fortin et al., 2018), and related approaches (Banerjee et al., 2007; Karlan & Thuysbaert, 2019; Premand & Schnitzer, 2020). A meta-analysis by Coady et al. (2004), which includes 8 PMTs and 14 community-based programs, finds little difference in targeting accuracy between the two methods, but notes that targeting is regressive in a quarter of programs reviewed. In addition to issues with targeting accuracy, the current methods available for poverty targeting in LMICs are time- and resource-intensive, and may be infeasible in fragile or conflict-affected areas or in contexts when mobility and social interaction are limited, such as during a pandemic.
The second body of work explores the extent to which non-traditional sources of data, in conjunction with machine learning, might help address data gaps in LMICs (e.g. Blumenstock, 2016; Burke et al., 2021). Much of this work focuses on estimating the geographic distribution of wealth and poverty at fine spatial granularity, using data from satellites (Jean et al., 2016; Engstrom et al., 2017), mobile phones (Blumenstock et al., 2015; Hernandez et al., 2017), social media (Fatehkia et al., 2020; Sheehan et al., 2019), or some combination of these data sources (Steele et al., 2017; Pokhriyal & Jacques, 2017; Chi et al., 2020). Most relevant to our current analysis, two prior papers investigate whether the wealth of individual mobile subscribers can be accurately estimated using mobile phone data. Blumenstock et al. (2015) show that CDR data are predictive of an individual-level asset-based wealth index among a nationally representative sample of 856 Rwandan mobile phone owners (cross-validated r = 0.68). Blumenstock (2018b) finds similar results with a sample of 1,234 male heads of households in the Kabul and Parwan districts of Afghanistan.
Our paper connects these two distinct literatures by rigorously assessing the extent to which phone-based estimates of poverty can help with program targeting (Blumenstock, 2020; Aiken et al., 2021). 1 The context of our empirical analysis, identifying ultra-poor households in Afghanistan, is a particularly challenging environment for data collection and program targeting, as 62% of the households classified as not ultra-poor still fall below the national poverty line. The fact that these methods show promise in this context suggests that they may be relevant to a broad class of targeting applications. We therefore conclude by discussing the important ethical and logistical considerations that may influence how CDR methods are used to support targeting efforts in practice.
1 The anti-poverty program implemented and described by Aiken et al. (2021) in Togo was based on the methods developed and evaluated in this paper. Due to the more time-sensitive nature of the COVID-19 response described in Aiken et al. (2021), the two academic articles are in circulation concurrently.

Data and Methods
Our main analysis evaluates the extent to which machine learning and mobile phone data can accurately differentiate between ultra-poor and non-ultra-poor households in rural Afghanistan. This section describes the study population, the key datasets, and methods used to perform the evaluation.

Household Survey Data
The ground-truth data that we use to evaluate this new approach to program targeting were collected as part of the Targeting the Ultra-Poor (TUP) program implemented by the government of Afghanistan with support from the World Bank. The TUP program included an impact evaluation of a "big push" anti-poverty program that provided multifaceted benefits to ultra-poor households (Bedoya et al., 2019). Our analysis is centered on a baseline survey that was collected for the TUP program, which contains well-being measures for 2,852 households in 80 of the poorest villages in Afghanistan's Balkh province, surveyed prior to the TUP launch (between November 2015 and April 2016). 2 These data include surveys of nearly all of the 1,173 ultra-poor households in the villages deemed eligible for the program, and a random sample of 1,679 non-ultra-poor households. 3 Baseline surveys were conducted in two in-person interviews, one with the primary woman of each household, and one with the primary man.
Ultra-Poor Designation Eligibility for the TUP program was determined based on geographic criteria, 4 followed by a two-step process including a community wealth ranking (CWR) and a follow-up in-person survey. CWRs were conducted separately in each village, coordinated by a local NGO and village leaders, in collaboration with the government's Microfinance Investment Support Facility for Afghanistan (MISFA). CWRs divided households into four categories: well-off (6%), better-off (18%), poor (33%), and extreme-poor (43%).
The CWR was followed by an in-person survey to determine whether nominated households met a set of qualifying criteria, coordinated by the NGO and MISFA representatives, and based on a measure of multiple deprivation.
2 Our analysis restricts to 2,814 households for whom consumption and all asset data are non-missing.
3 The response rate for ultra-poor households was 96%. Approximately 20 households in each of the study villages were randomly drawn (excluding TUP-eligible households), to provide a representative benchmark for the TUP sample.
4 The poorest villages in the province were identified subject to having availability of veterinary services, financial institutions, and social services, and being relatively accessible.
For a household to be designated as ultra-poor, and therefore eligible for program benefits, it had to be considered extreme-poor in the CWR, and also meet at least three of six criteria: 5
1. Household is financially dependent on women's domestic work or begging.
2. Household owns less than 800 square meters of land or is living in a cave.
3. Targeted woman is younger than 50 years of age.
4. There are no active adult men income earners.
5. Children of school age are working for pay.
6. Household does not own any productive assets.
Ultimately, 11% of the households classified as extreme-poor in the community wealth ranking step (6% of the total population in the study villages) were classified as ultra-poor and eligible for TUP benefits. Of the 2,852 households surveyed for the TUP project, 1,173 (41%) were designated as ultra-poor, and 1,679 (59%) were non-ultra-poor.
Consumption The consumption module of the TUP survey contains information on food consumption for the week prior to the interview and non-food expenditures for the year prior to the interview. These are used to construct monthly per capita consumption values, as detailed in Bedoya et al. (2019). While consumption data are reported for the household as a whole, the survey questions were asked of the primary woman of the household. Based on these data, we construct as an outcome measure the logarithm of per capita monthly consumption, consistent with the approach used by the Afghanistan government to determine the national poverty line.

Asset Index
We construct an asset-based wealth index to assess the relative socioeconomic status of surveyed households. The asset questions, which describe the household as a whole, were asked of the primary woman of the household. The asset index is calculated as the first principal component of variation in household asset ownership for sixteen items detailed in Table S1. The principal component analysis (PCA) is calculated over the dataset of 2,814 households not missing any asset data, after standardizing each asset variable to zero mean and unit variance. This wealth index explains 25.3% of the variation in asset ownership. Figure S1 shows the distribution of the underlying asset index components and Table S1 shows the direction of the first principal component.
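The index construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the six-household toy matrix, and the use of power iteration (in place of a library PCA routine) are all assumptions made here for a self-contained example. The sign of a principal component is arbitrary; with the positively correlated toy assets used below, higher scores correspond to owning more assets.

```python
import math

def standardize(rows):
    """Scale each asset column to zero mean and unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Assumes every column varies; a constant column would divide by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def first_principal_component(z, iters=500):
    """Leading eigenvector of the correlation matrix, found by power iteration."""
    n, p = len(z), len(z[0])
    corr = [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p)) for i in range(p))
    # The trace of a correlation matrix is p, so this is the share of variance explained.
    return v, eigenvalue / p

def asset_index(rows):
    """Return household scores, loadings, and the share of variance explained."""
    z = standardize(rows)
    loadings, explained = first_principal_component(z)
    scores = [sum(a * b for a, b in zip(r, loadings)) for r in z]
    return scores, loadings, explained

# Hypothetical ownership indicators (1 = owns) for three assets and six households.
rows = [[1, 1, 1], [1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
scores, loadings, explained = asset_index(rows)
```

In the paper the same steps are applied to sixteen asset variables and 2,814 households; the toy matrix here only serves to make the sketch runnable.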

Other Variables
The TUP surveys collected several other covariates that we use in subsequent analysis. These include a food security index (composed of variables relating to the skipping and downsizing of meals, separately for adults and children), a financial inclusion index (composed of access to banking and credit, knowledge of banking and credit, and savings), and a psychological well-being index for the primary woman (standardized weighted average scores on the 7-item Center for Epidemiological Studies Depression scale, the World Values Survey happiness and satisfaction questions, and Cohen's 4-item stress scale). The construction of each index is documented in Bedoya et al. (2019). Crucially, the survey also collected data from each household on mobile phone ownership. Nearly all (99%) households with a cell phone provided their phone numbers and consented to the use of their call detail records for this study.
Sample Representativity Portions of our analysis are restricted to the 535 households from the TUP survey with phone numbers that match to our CDR (see Section 2.2). While the 2,852 households in the TUP survey are representative of the 80 study villages, they are not nationally representative of Afghanistan as a whole, and the 535-household subsample is not representative of the overall sample from the TUP survey. Table 1 compares characteristics of households included and excluded from the 535-household subsample; Figure S2 compares the distributions of these characteristics. There are some systematic differences: the 535-household sample we analyze is richer on average than households surveyed in the TUP study, which is consistent with households in the subsample being required to own at least one phone. For instance, while 88% of non-ultra-poor households in the TUP survey own at least one phone, only 72% of ultra-poor households own at least one phone.

Summary Statistics As shown in Table 1 and Figure S3, the three measures of well-being in our dataset are only weakly correlated with one another: for example, the correlation between the asset index and consumption measure is 0.37 in the full survey and 0.34 in the matched subsample. It is particularly important to note the characteristics of the ultra-poor: while the ultra-poor population makes up 27% of the overall sub-sample, less than half of the ultra-poor fall into the bottom 27% of the sample by wealth index or consumption.
Sample Weights Since the TUP survey oversampled the ultra-poor (by a factor of roughly 12), portions of our analysis use sample weights to adjust for population representativity.
When sample weights are applied, it is explicitly noted; if not mentioned, no weights are applied. The sample weights are derived from the population of the village, and the household's ultra-poor designation. 6 After sample weights are applied, the ultra-poor make up 5.98% of the overall population, and 4.63% of our matched subsample.

Mobile Phone Metadata
In a follow-up survey conducted in 2018, we requested informed consent from survey respondents to obtain their mobile phone CDR and match them to the survey data collected through the TUP project. CDR contain detailed information on:

Machine Learning Predictions
CDR-based Method Extending the approach described in Blumenstock et al. (2015), we test the extent to which ultra-poor status can be accurately predicted from CDR. This analysis uses the 535 TUP households who match to CDR to train a supervised machine learning algorithm to predict ultra-poverty status from the mobile phone features. The intuition (also highlighted in Figure S4) is that ultra-poor individuals use their phones very differently than non-ultra-poor individuals, and machine learning algorithms can use those differences to predict ultra-poor status.
Our main analysis uses a forest of gradient boosted decision trees (hereafter referred to as the "gradient boosting model"), which generally out-performs several other common machine learning algorithms for this task (including a standard logistic regression, a regularized logistic regression with L1 penalty, and a random forest). The feature importances for the trained model are shown in Table S2. For comparison, results using other machine learning algorithms are provided in Table S3.
Probabilistic predictions are generated via 10-fold cross-validation with each model, with folds stratified to preserve class balance. We tune hyperparameters using five-fold cross-validation for each prediction fold separately, optimized over a wide grid of hyperparameters for each model. For the linear models and random forest, features are standardized to zero mean and unit variance and missing values are mean-imputed, separately for each prediction fold. Additional details on the machine learning methods are provided in Appendix A.
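The stratified cross-validation scheme can be illustrated with a short sketch. Everything named here is hypothetical: `fit_toy_scorer` is a deliberately simple one-feature stand-in for the gradient boosting model, the data are simulated, and the inner five-fold hyperparameter search is omitted for brevity. The point is only to show how stratified folds are formed and how each household receives a prediction from a model that never saw it during training.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Assign each household index to one of k folds, keeping class shares similar."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            fold_of[i] = position % k
    return fold_of

def fit_toy_scorer(xs, ys):
    """Hypothetical stand-in model: scores rise as x moves toward the ultra-poor mean."""
    mean_poor = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    mean_rest = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    return lambda x: abs(x - mean_rest) - abs(x - mean_poor)

def out_of_fold_scores(xs, ys, k=10, seed=0):
    """Score every household with a model trained only on the other k-1 folds."""
    fold_of = stratified_folds(ys, k, seed)
    scores = [None] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if fold_of[i] != fold]
        model = fit_toy_scorer([xs[i] for i in train], [ys[i] for i in train])
        for i in range(len(xs)):
            if fold_of[i] == fold:
                scores[i] = model(xs[i])
    return scores, fold_of

# Simulated data: 20 ultra-poor (label 1) and 40 other households, one made-up feature.
rng = random.Random(1)
ys = [1] * 20 + [0] * 40
xs = [rng.gauss(1.0 if y else 0.0, 1.0) for y in ys]
scores, fold_of = out_of_fold_scores(xs, ys, k=10)
```

Any preprocessing that learns from the data (standardization, mean imputation) would be fitted inside the loop on the training indices only, which is what "separately for each prediction fold" implies.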
Combined Methods We also evaluate several approaches that use data from multiple sources to predict ultra-poor status. Our main combined method trains a logistic regression to classify the ultra-poor and non-ultra-poor households using the predicted probability from the CDR-based method (i.e., the output of the gradient boosting algorithm described above), as well as asset and consumption data collected in the TUP survey. For comparison, we similarly evaluate the performance of methods that combine only two of the available data sources (i.e., assets plus consumption, assets plus CDR, and consumption plus CDR).
Predictions for each of the combined methods are pooled over 10-fold cross-validation.
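As a rough illustration of the combined method, the sketch below fits a logistic regression on three inputs per household: a CDR-based predicted probability, an asset index, and log consumption. The feature values are invented, the last two columns are assumed to be already standardized, and the model is fitted in-sample with plain gradient descent and no regularization; the paper instead pools out-of-sample predictions over 10 folds. Function names are hypothetical.

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b

def predict_proba(model, xi):
    """Predicted probability that a household is ultra-poor."""
    w, b = model
    z = b + sum(wj * xj for wj, xj in zip(w, xi))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up rows: [CDR-based probability, asset index (z-score), log consumption (z-score)].
X = [
    [0.8, -1.0, -1.2], [0.7, -0.5, -0.6], [0.6, 0.2, -0.9], [0.4, -0.8, 0.1],
    [0.3, 0.5, 0.4], [0.2, 1.0, 0.8], [0.5, 0.3, 0.2], [0.1, 0.8, 1.3],
]
y = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = designated ultra-poor
model = fit_logistic(X, y)
combined_scores = [predict_proba(model, xi) for xi in X]
```

The two-source variants mentioned above correspond to dropping one column from `X` before fitting.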

Evaluation
Evaluation on Matched Subsample Our main analysis focuses on the 535 households for which we observe both CDR and survey data, and evaluates whether machine learning methods leveraging CDR data can accurately identify households designated as ultra-poor by the TUP program (using the two-step hybrid approach described in Section 2.1). We compare the performance of the CDR-based method to the performance of methods based on the wealth index, consumption data, and combinations of these different data sources. 7 Each targeting method is evaluated based on classification accuracy, errors of exclusion (ultra-poor households misclassified as non-ultra-poor) and errors of inclusion (non-ultra-poor households misclassified as ultra-poor). We focus on the ultra-poor designation as the 'ground truth' status of the household, against which other methods are evaluated, since it is the most carefully vetted measure of well-being for this population, and the proxy that the government decided to use in targeting TUP benefits.
To evaluate the performance of the CDR-based and combined methods, we pool out-of-sample predictions across the ten cross-validation folds, so that every household in our dataset is associated with a CDR-based predicted probability of ultra-poor status that is produced out-of-sample. To account for class imbalance, we evaluate model accuracy using a "quota method", by selecting a cut-off threshold for ultra-poor qualification (a maximum wealth index, maximum consumption, and minimum CDR-based predicted probability of being ultra-poor) such that each method identifies the proportion of ultra-poor households in our subsample; this cut-off also balances inclusion and exclusion errors. In our 535-household matched dataset this threshold is 27%, as 27% of households are ultra-poor; in other samples (see following subsection), the percentage is different. We evaluate each model at this threshold for precision (positive predictive value) and recall (sensitivity). To capture the trade-off between inclusion and exclusion errors for varying values of this threshold, we also construct receiver operating characteristic (ROC) curves for each method and consider the area under the curve (AUC) as a measure of targeting quality. For each evaluation metric (precision, recall, and AUC), we bootstrap 1,000 samples from the original dataset to calculate the standard deviation of the mean of the accuracy metric. Each bootstrapped sample is of the same size as the original dataset, drawn with replacement.
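The quota method and the accompanying metrics can be sketched as follows. The scores and labels are toy values and the function names are hypothetical. Scores are oriented so that higher means poorer; for the wealth index or consumption one would pass the negated value. Because the number of targeted households equals the number of ultra-poor households, precision and recall coincide at the quota, which is the sense in which the cut-off balances inclusion and exclusion errors.

```python
import random

def quota_targets(scores, quota_share):
    """Indices of the households selected when the top quota_share by score is targeted."""
    n_target = round(quota_share * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:n_target])

def precision_recall(targeted, is_poor):
    """Precision and recall of a targeted set against the ultra-poor labels."""
    true_pos = sum(1 for i in targeted if is_poor[i])
    return true_pos / len(targeted), true_pos / sum(is_poor)

def auc(scores, is_poor):
    """Chance that a random ultra-poor household outscores a random other one (ties count half)."""
    pos = [s for s, y in zip(scores, is_poor) if y]
    neg = [s for s, y in zip(scores, is_poor) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_sd(scores, is_poor, metric, n_boot=1000, seed=0):
    """Standard deviation of a metric over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [is_poor[i] for i in idx]
        if not any(ys) or all(ys):
            continue  # assumption: skip resamples that contain only one class
        values.append(metric([scores[i] for i in idx], ys))
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5

# Toy example: 10 households, 3 of them ultra-poor, so the quota is 30%.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_poor = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
targeted = quota_targets(scores, sum(is_poor) / len(is_poor))
precision, recall = precision_recall(targeted, is_poor)
sd = bootstrap_sd(scores, is_poor, auc)
```

In this toy case two of the three targeted households are ultra-poor, so precision and recall are both 2/3.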
Accounting for Households Without Phones Our main results assess the performance of different targeting methods on the sample of 535 households for whom we have both survey data and mobile phone data. We also present results that show how performance is affected when the analysis includes TUP households for whom we do not have mobile phone data (typically because they do not have a phone or because they use a different phone network than the one that provided CDR). For such households, it is straightforward to assess the performance of asset-based and consumption-based targeting. To evaluate households without CDR, we assume the CDR-based targeting would target such households (1) before households with CDR, or (2) after households with CDR. More details on this procedure are provided in Section 3.4.
We present results based on three different samples:
1. Matched Sample: The 535 households for whom we were able to match survey responses to CDR.
2. Balanced Sample: This sample includes the 535 matched households as well as the 472 households in the TUP survey who report not owning any phone. It excludes households that own a phone on a different phone network than the one that provided CDR. The motivation for this sample is to provide an indication of targeting performance in a regime in which CDR can be used to target all phone-owning households. In addition to applying sample weights from the survey, households that do not own a phone are downweighted so that the balance of phone owners to non-phone-owners (with sample weights applied) is the same as in the baseline survey as a whole (with sample weights applied, 84% phone owners).
3. Full Sample: All 2,814 households in the TUP baseline survey for which asset and consumption data are available, with sample weights applied.
Note that the quota used to evaluate targeting changes for each sample, based on the number of households that are ultra-poor in the sample. For the matched sample, the targeting quota is 27.29%; for the balanced sample and full sample the quotas are 5.47% and 6.02%, respectively.
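The paper does not give a formula for the downweighting used in the balanced sample. One plausible reading, shown here purely as an assumption, is to multiply the weights of non-phone-owning households by a common factor chosen so that phone owners make up the target weighted share; the weighted share of ultra-poor households then gives that sample's quota. Function names and weights are hypothetical.

```python
def rescale_non_owner_weights(weights, owns_phone, target_owner_share):
    """Scale non-owner weights so that owners hold target_owner_share of total weight."""
    w_own = sum(w for w, o in zip(weights, owns_phone) if o)
    w_non = sum(w for w, o in zip(weights, owns_phone) if not o)
    # Solve w_own / (w_own + factor * w_non) = target_owner_share for factor.
    factor = w_own * (1 - target_owner_share) / (target_owner_share * w_non)
    return [w if o else w * factor for w, o in zip(weights, owns_phone)]

def weighted_share(weights, flags):
    """Share of total weight held by households whose flag is true."""
    return sum(w for w, f in zip(weights, flags) if f) / sum(weights)

# Made-up survey weights for four households, two of which own a phone.
weights = [10.0, 10.0, 10.0, 10.0]
owns_phone = [True, True, False, False]
is_ultra_poor = [False, False, False, True]

balanced = rescale_non_owner_weights(weights, owns_phone, target_owner_share=0.84)
owner_share = weighted_share(balanced, owns_phone)
quota = weighted_share(balanced, is_ultra_poor)
```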

Performance of Targeting Methods
Our first set of results evaluates the extent to which different targeting methods can correctly identify ultra-poor households. This analysis compares the performance of CDR-based targeting methods to asset-based and consumption-based targeting, using the sample of 535 households for which survey data and CDR data are both available.
An overview of these results is provided in Figure 1. Figure 1a shows the distribution of assets and consumption, as well as the distribution of predicted probabilities of being non-ultra-poor generated by the CDR-based and combined methods, separately for the ultra-poor (pink) and non-ultra-poor (blue). The dashed vertical line indicates the threshold at which point 27% of households are classified as ultra-poor; we use this quota because 27% of households in this sample were designated as ultra-poor by TUP. Figure 1b provides confusion matrices that compare the true status (rows) against the classification made by each method (columns). These confusion matrices are also used to calculate the measures of precision and recall reported in Table 2 Panel A.
We find that the CDR-based method (precision and recall of 42%) is close in accuracy to methods relying on assets (precision and recall of 49%) or consumption (precision and recall of 45%). To evaluate the trade-off between inclusion errors and exclusion errors resulting from selecting alternative cut-off thresholds, Figure 1c shows the ROC curve associated with each classification method, and the associated Area Under the Curve (AUC). AUC scores are comparable among methods, with assets (AUC=0.73) slightly superior to consumption (AUC=0.71) and the CDR-based method (AUC=0.68).

Comparison of Errors Across Methods
To better understand where the targeting methods are making mistakes, Panel A of Table 3 indicates how the ultra-poor misclassified as non-ultra-poor (errors of exclusion, or false negatives) compare to the correctly classified ultra-poor (true positives). Panel B shows how the non-ultra-poor misclassified as ultra-poor (errors of inclusion, or false positives) compare to the correctly classified non-ultra-poor (true negatives).
We find that, broadly speaking, the classification errors made by all three methods tend to be sensible: when these methods make mistakes, they are generally not egregious. Across methods, false negatives tend to score higher on food security, financial inclusion, and psychological well-being than true positives; that is, all three targeting methods misclassify ultra-poor households as non-ultra-poor when those ultra-poor households are better-off, according to other observable characteristics not used in the targeting per se. Likewise, false positives (non-ultra-poor misclassified as ultra-poor) tend to score lower than true negatives across these same measures. The CDR-based method in particular tends to prioritize households that score low on these alternative measures of well-being.
To test for systematic misclassification of certain types of households, Table 4 displays the overlap in errors of exclusion and inclusion between methods. Our results suggest that the three classifiers misidentify the same households at a rate only slightly above random. 8

Combining Targeting Methods
Since the different targeting methods are identifying different populations as ultra-poor, there may be complementarities between asset, consumption, and CDR data. We therefore test a set of methods that integrate multiple data sources into a single classification. As shown in Panel A of Table 2, we find that this combined method, which takes as input the wealth index, total consumption, and the output of the CDR-based method, performs better (AUC = 0.78) than methods using any one data source (AUC = 0.68-0.73). As shown in Table S5, the full method also outperforms methods based on any two data sources (AUC = 0.75-0.76). However, it is worth noting the strong performance of a method that combines CDR data and the asset index (AUC = 0.76); this two-component method may be more practical than the combined method, since consumption data can be difficult to collect for large populations.
8 The rates of overlap should be interpreted relative to the expected overlap in errors for random classifiers with the same cut-off threshold for ultra-poor classification. Based on our selection of thresholds such that 27% of the sample is identified as ultra-poor, our three classifiers misidentify 15-27% of the non-ultra-poor and 51-65% of the ultra-poor. If these classifiers were random, we would expect approximately 20% overlap in errors of inclusion and 55% overlap in errors of exclusion.

Targeting Households Without Phones
An important limitation with CDR-based targeting is that households without phones do not generate CDR. This is a conceptual issue that we revisit in Section 4; for now, we present results that show how predictive performance is impacted by the inclusion of these households in the analysis.
This analysis uses two additional samples of TUP households to evaluate targeting performance: (i) the balanced sample, which adds all of the 472 households without phones to the sample of 535 for whom we have matched CDR; the balanced sample is intended to illustrate the performance of CDR-based targeting if CDR were available from all operators in Afghanistan, though it relies on the assumption that phone-owners observed on our mobile network are representative of all phone owners in Afghanistan (an assumption that is not fully satisfied, as shown in Table 1); and (ii) the full sample, which includes all 2,814 households surveyed in the TUP baseline; this sample includes an additional 1,807 households who report owning a phone, but whose number does not match to any number in the CDR provided to us by the single mobile operator. 9
Results in Panels B and C of Table 2 show the performance of each targeting approach on the balanced and full sample, respectively. Note that as described in Section 2.4, different targeting quotas are applied for each panel based on the proportion of each sample that is ultra-poor. In the CDR-based and combined approaches, we report performance when the households without CDR are targeted first (i.e., households without CDR are targeted in a random order and then the households predicted to be poorest are targeted until the quota is reached) as well as when households without CDR are targeted last (i.e., after the 535 households with phones are targeted, households without phones are included in a random order until the quota is reached).
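The two orderings described above (households without CDR targeted first or last) can be sketched as a ranking rule. This is a simplified illustration with hypothetical names and toy scores; it counts households rather than applying sample weights, which the paper's balanced and full samples do use.

```python
import random

def rank_for_targeting(cdr_scores, missing="first", seed=0):
    """Order household indices for targeting.

    cdr_scores[i] is the predicted probability of being ultra-poor, or None
    when no CDR is available. Households with CDR are ranked from highest to
    lowest score; households without CDR are placed in random order either
    before ("first") or after ("last") them.
    """
    rng = random.Random(seed)
    without_cdr = [i for i, s in enumerate(cdr_scores) if s is None]
    with_cdr = [i for i, s in enumerate(cdr_scores) if s is not None]
    rng.shuffle(without_cdr)
    with_cdr.sort(key=lambda i: cdr_scores[i], reverse=True)
    return without_cdr + with_cdr if missing == "first" else with_cdr + without_cdr

def target(cdr_scores, n_quota, missing="first", seed=0):
    """Select the first n_quota households in the chosen ordering."""
    return rank_for_targeting(cdr_scores, missing, seed)[:n_quota]

# Toy example: households 1 and 3 have no CDR.
cdr_scores = [0.9, None, 0.2, None, 0.6]
first = target(cdr_scores, n_quota=3, missing="first")
last = target(cdr_scores, n_quota=3, missing="last")
```

With a quota of three, the "first" rule selects both households without CDR plus the highest-scoring household with CDR, while the "last" rule selects only households with CDR.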
Unsurprisingly, these results suggest that CDR-based targeting is not particularly effective when a large portion of the target population does not own a phone. This is particularly true in Panel C of Table 2, where only 16% of the sample (with sample weights applied) has matching CDR. However, when we simulate more realistic levels of phone ownership in Panel B (84% of the households, based on our survey data), we note that CDR-based targeting is once again comparable to asset- or expenditure-based targeting, particularly when households without phones are targeted first (AUC = 0.72, 0.70, 0.68 for assets, consumption, and CDR, respectively). On the other hand, if the CDR-based method is used and households without phones are targeted last (for example, if program administrators base targeting wholly on CDR and provide no benefits to any household without a phone), the CDR-based method only improves marginally on random targeting.

Additional tests and simulations
Our main analysis considers the household head to be the unit of analysis. As described in Section 2.2, this analysis is based on matching survey-based measures of well-being to phone data from the household head, to the best of our ability. This approach is most consistent with the design of the TUP program and the TUP sample frame. An alternative approach that we explore matches survey data reported by the household head to all phone numbers associated with the household. As shown in Table S6, the predictive accuracy of these models is slightly attenuated relative to the benchmark results (Table S3), particularly for the more flexible machine learning models.
We also explore the extent to which CDR can be used to predict other measures of socioeconomic status. The preceding analysis focuses on the household's TUP ultra-poor designation as the ground truth measure of poverty, since this was a carefully curated label and the actual criteria used to determine TUP eligibility. In Table S7, we report the accuracy with which CDR (obtained from the household head, who is typically male) can predict consumption and asset-based wealth (elicited from the primary woman of each household). 10 In general, these machine learning models trained to directly predict consumption or asset-based wealth from CDR do not perform well. This contrasts with prior work documenting the predictive ability of CDR for measuring asset-based wealth (e.g. Blumenstock et al., 2015). We suspect a key difference in our setting (aside from the fact that we are matching CDR to socioeconomic status at the household rather than the individual level) is the homogeneity of the beneficiary population: whereas Blumenstock et al. (2015) use machine learning to predict the wealth of a nationally-representative sample of Rwandan phone owners, our sample consists of 535 individuals from the poorest villages of a single province in Afghanistan, where even the relatively wealthy households are quite poor.

Discussion
Our key finding is that, in a sample of 535 phone-owning households in a set of poor villages in one province of Afghanistan, machine learning methods leveraging behavioral indicators computed from CDR are nearly as accurate as standard asset-and consumption-based methods for identifying ultra-poor households. Further, we find that methods combining survey data with CDR perform better than any of the methods using a single data source. In contexts like Afghanistan where standard targeting benchmarks are unavailable or of questionable quality, methods that integrate CDR may create new options for program targeting.
However, as we demonstrate empirically, low rates of phone ownership, or the inability to access data from all operators, can quickly undermine the value of CDR-based targeting.
While mobile phone penetration rates continue to rise in LMICs (GSMA, 2020), we expect that, for the foreseeable future, CDR-based methods may be best deployed in conjunction with alternative approaches. In our specific setting, the CDR-based approach still works well if households without phones are targeted before the CDR-based algorithm then selects the poorest households with phones. However, this approach may not be appropriate in other contexts where phone ownership is less predictive of wealth, or where potential beneficiaries have the ability to strategically under-report phone ownership (Björkegren et al., 2020).
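The two-stage ordering described above (households without phones first, then phone-owning households ranked by the CDR-based prediction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names `has_phone` and `cdr_score`, the function names, and the choice to keep phoneless households in input order are all hypothetical.

```python
def rank_for_targeting(households):
    """Return household ids in targeting order.

    Households without a phone come first, kept in input order (an
    assumption: the text does not say how they are ordered among
    themselves). Phone-owning households follow, sorted by predicted
    ultra-poor probability, highest first.
    """
    no_phone = [h for h in households if not h["has_phone"]]
    with_phone = sorted(
        (h for h in households if h["has_phone"]),
        key=lambda h: h["cdr_score"],
        reverse=True,
    )
    return [h["id"] for h in no_phone + with_phone]


def select_beneficiaries(households, n_slots):
    """Pick the first n_slots households in targeting order."""
    return rank_for_targeting(households)[:n_slots]


# Hypothetical toy data: cdr_score stands in for a model's predicted
# probability that the household is ultra-poor.
households = [
    {"id": "a", "has_phone": True, "cdr_score": 0.20},
    {"id": "b", "has_phone": False, "cdr_score": None},
    {"id": "c", "has_phone": True, "cdr_score": 0.90},
    {"id": "d", "has_phone": True, "cdr_score": 0.55},
]
selected = select_beneficiaries(households, n_slots=3)
```

Reversing the concatenation order (`with_phone + no_phone`) would give the alternative rule in which households without phones are targeted last.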
Our analysis also highlights several broader considerations that we believe are worth deeper investigation in future work. These include:

Tradeoffs in data privacy and predictive accuracy

CDR contain sensitive and personally identifying information, including phone numbers, contact networks, and location traces (De Montjoye et al., 2013; Taylor, 2016). Informed consent can help ensure participant autonomy, but also creates significant logistical complications. Differential privacy and related methods can provide formal privacy guarantees on CDR and other data (Hu et al., 2015), but there is an inherent trade-off between privacy and data utility when such privacy guarantees are introduced.
Algorithmic transparency and strategic behavior

Using CDR to determine program eligibility may introduce incentives for people to manipulate their phone use. This consideration is not unique to CDR, as varying degrees of manipulation have been documented in social programs that use proxy means tests and other traditional targeting mechanisms (Camacho & Conover, 2011; Banerjee et al., 2018). Indeed, complex and non-linear machine learning algorithms, like the one presented in this paper, may obfuscate the logic behind targeting decisions and thereby reduce the scope for manipulation. However, society often demands transparency in algorithmic decision-making, as black-box decisions are difficult to audit or hold to account. There is therefore a tension between the goals of increasing transparency and reducing manipulation, though recent advances in machine learning explore mechanisms for pursuing both objectives at once (Björkegren et al., 2020).
Centralized vs. local knowledge

CDR-based methods enable a top-down, centralized and standardized approach to program targeting, rather than a bottom-up approach that prioritizes local knowledge that can be elicited, for example, through community wealth rankings. While the empirical results in this paper indicate that the efficiency gains from CDR-based targeting are significant, such a centralized approach may reinforce existing power structures (Taylor, 2016; Blumenstock, 2018a; Abebe et al., 2021). Efficiency gains should also be considered within the context of evidence suggesting that participating communities may prefer community-based approaches (Alatas et al., 2012), but also may perceive them as less legitimate (Premand & Schnitzer, 2020).
To summarize, our results suggest that there is potential for using CDR-based methods to determine eligibility for economic aid or interventions, significantly reducing program targeting overhead and costs. Our results also indicate that CDR-based methods may complement and enhance existing survey-based methods. We note, however, that the practical and ethical limitations to CDR-based targeting are significant. We emphasize the need to consider these limitations and the constraints of specific local contexts alongside the efficiency gains offered by CDR-based targeting.

Notes: [...] (2) Just those respondents who own a phone, where the phone number matches to the CDR obtained from the mobile phone operator; (3) Respondents who report owning a phone, but whose phone number does not match to the CDR obtained from the operator; (4) Respondents who report they do not own a phone.

Notes: Four different measures of performance (columns) reported for different targeting methods (rows), using different samples of survey respondents (panels). Standard deviations, calculated using 1,000 bootstrap samples, in parentheses. Panel A: The 535-household subsample that is matched to CDR. Panel B: The 535-household matched sample, plus the 472 households that do not have a phone; this is meant to approximate targeting performance if CDR from all mobile networks were available. Sample weights are applied as described in Section 2.4. Panel C: All 2,814 observations from the TUP survey, including households matched to CDR, households that own phones not matched to CDR, and households without phones, with sample weights applied. For Panels B and C, we simulate two types of CDR-based targeting: targeting households without phones first and targeting households without phones last.

Notes: [...] as well as the difference in average characteristics between correctly and incorrectly classified households (Diff.). Panel A: Differences between ultra-poor households correctly classified as such and those misclassified as non-ultra-poor (errors of exclusion). Panel B: Differences between non-ultra-poor households correctly classified as such and those misclassified as ultra-poor (errors of inclusion).
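The bootstrap standard deviations mentioned in the table notes above could be computed along the following lines. This is a rough sketch under stated assumptions rather than the authors' procedure: it uses accuracy as a stand-in performance measure, resamples households with equal probability (the survey sample weights are not handled), and all names and the toy labels are hypothetical.

```python
import random
import statistics


def accuracy(y_true, y_pred):
    """Share of households whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_sd(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Standard deviation of a metric across bootstrap resamples.

    Each resample draws len(y_true) households with replacement and
    recomputes the metric on the resampled labels and predictions.
    """
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    return statistics.stdev(values)


# Hypothetical toy data: 50 households, 40 of them classified correctly.
y_true = [1] * 20 + [0] * 30
y_pred = [1] * 15 + [0] * 5 + [0] * 25 + [1] * 5
point_estimate = accuracy(y_true, y_pred)
sd = bootstrap_sd(y_true, y_pred, accuracy)
```

Any metric with the same `(y_true, y_pred)` signature can be passed in place of `accuracy`.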

A Machine learning methods and hyperparameters
Although our paper is focused on identifying the ultra-poor with CDR, we experiment with predicting four measures of ground-truth welfare with CDR features: ultra-poor status (binary), below the national poverty line (binary), asset index [...]. In each case, we produce predictions out-of-sample over 10-fold cross-validation. We use nested cross-validation to tune the hyperparameters of each model over 5-fold cross-validation within each of the outer folds to avoid any information leakage between folds. We report both the mean score across the 10 folds and the overall score when data from all folds are pooled together. For the linear models and random forest, missing data are mean-imputed and each feature is scaled to zero mean and unit variance before fitting models (these transformations are done separately for each fold, with parameters fitted only on the training data for each fold). For the gradient boosting model, missing values are left as-is and features are not scaled. We re-fit the model on the entire data, again tuning hyperparameters over 5-fold cross-validation, to report selected hyperparameters and feature importances. We also report the top 5 features for each model, determined by the magnitude of the coefficient for the linear models, and by maximum impurity reductions for the tree-based models.
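The nested cross-validation procedure described above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: a k-nearest-neighbors classifier, a small grid of k values, and accuracy as the score are assumptions chosen to keep the example dependency-free, and all function names and the synthetic data are hypothetical. What it does mirror from the text is the structure: 10 outer folds, 5 inner folds for tuning, imputation and scaling fitted on training data only, and both a mean-over-folds and a pooled score.

```python
import random
import statistics


def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_preprocessor(rows):
    """Per-feature mean (of observed values) and sd (after imputation)."""
    params = []
    for col in zip(*rows):
        mean = statistics.fmean(v for v in col if v is not None)
        filled = [mean if v is None else v for v in col]
        params.append((mean, statistics.pstdev(filled) or 1.0))
    return params


def transform(rows, params):
    """Mean-impute missing values (None), then standardize each feature."""
    return [[((m if v is None else v) - m) / s
             for v, (m, s) in zip(row, params)] for row in rows]


def knn_predict(train_X, train_y, test_X, k):
    """Fit preprocessing on training rows only, then majority-vote kNN."""
    params = fit_preprocessor(train_X)
    tr, te = transform(train_X, params), transform(test_X, params)
    preds = []
    for x in te:
        nearest = sorted((sum((a - b) ** 2 for a, b in zip(x, t)), label)
                         for t, label in zip(tr, train_y))[:k]
        preds.append(1 if 2 * sum(label for _, label in nearest) > k else 0)
    return preds


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def split(X, y, fold):
    """Return (train_X, train_y, test_X, test_y) for one held-out fold."""
    held = set(fold)
    tr = [i for i in range(len(X)) if i not in held]
    return ([X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in fold], [y[i] for i in fold])


def tune_k(X, y, grid, rng, n_inner=5):
    """Inner loop: pick the k with the best mean accuracy over inner folds."""
    folds = k_folds(len(X), n_inner, rng)
    def mean_score(k):
        scores = []
        for fold in folds:
            tr_X, tr_y, te_X, te_y = split(X, y, fold)
            scores.append(accuracy(te_y, knn_predict(tr_X, tr_y, te_X, k)))
        return statistics.fmean(scores)
    return max(grid, key=mean_score)


def nested_cv(X, y, grid=(1, 3, 5, 7), n_outer=10, seed=0):
    """Outer loop: out-of-sample predictions with k tuned inside each fold."""
    rng = random.Random(seed)
    fold_scores, pooled_true, pooled_pred = [], [], []
    for fold in k_folds(len(X), n_outer, rng):
        tr_X, tr_y, te_X, te_y = split(X, y, fold)
        best_k = tune_k(tr_X, tr_y, grid, rng)  # sees training data only
        pred = knn_predict(tr_X, tr_y, te_X, best_k)
        fold_scores.append(accuracy(te_y, pred))
        pooled_true += te_y
        pooled_pred += pred
    return statistics.fmean(fold_scores), accuracy(pooled_true, pooled_pred)


# Synthetic data: one informative feature, one noise feature with gaps.
data_rng = random.Random(1)
X, y = [], []
for i in range(100):
    label = i % 2
    x0 = data_rng.gauss(3.0 if label else -3.0, 1.0)
    x1 = None if i % 10 == 0 else data_rng.gauss(0.0, 1.0)
    X.append([x0, x1])
    y.append(label)
mean_acc, pooled_acc = nested_cv(X, y)
```

Because the hyperparameter is chosen using only the outer-fold training rows, the held-out rows never influence tuning or preprocessing, which is the leakage the nested design is meant to prevent.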
Hyperparameters are selected from the following grids for each model:

B Abbreviations in Feature Names

Figure S4 and Tables S7, S2, and S6 use a set of abbreviations in CDR feature names. This appendix lists the relevant abbreviations.
• BOC: Balance of contacts

Figure S2: Distributions of asset index and log-transformed consumption, for the entire survey sample, separately for ultra-poor and non-ultra-poor households, and again separately for households in the subsample matched to CDR, households outside of the matched subsample that report owning at least one mobile phone, and households outside of the matched subsample that report not owning a mobile phone.

Figure S3: Correlation between asset index and log-transformed consumption, separately for the entire survey sample and the matched subsample. We include the LOESS fit, along with a 95% confidence interval.

The asset index benchmark we used is constructed following standard procedures based on principal component analysis (see Table S1). However, it is possible that an alternative asset-based predictor, trained using machine learning to predict ultra-poor status directly from the 16 underlying components, could perform better. We test this hypothesis by adapting our machine learning pipeline for identifying the ultra-poor from CDR to the task of identifying the ultra-poor from asset possession. As with the CDR-based prediction, we evaluate the model over nested cross-validation: the model's predictions are evaluated out-of-sample over 10-fold cross-validation, and within each fold hyperparameters are tuned over 5-fold cross-validation. We retrain the model on the entire dataset to report hyperparameters and feature importances. Hyperparameters are chosen from the same grid as for the CDR-based models. We display the AUC score and top features for each model.

Notes: In our main analysis, for multi-phone households we use only the phone number belonging to the household head (or to a random household member, where no household head is specified), leaving 535 household-level observations. Here we consider instead using machine learning methods to predict individual-level ultra-poverty, with a dataset of 634 individual phone numbers matched to the ground-truth wealth measures for the associated households. We find that the individual-level models are slightly less accurate than the household-level models presented in the main paper, but we focus on the household-level models in the main paper since the household was the unit of targeting in the TUP program. See Appendix B for abbreviations in feature names.