Measuring consumption over the phone: Evidence from a survey experiment in urban Ethiopia

The paucity of reliable, timely household consumption data in many low- and middle-income countries has made it difficult to assess how global poverty has evolved during the COVID-19 pandemic. Standard poverty measurement requires collecting household consumption data, which is rarely collected by phone. To test the feasibility of collecting consumption data over the phone, we conducted a survey experiment in urban Ethiopia, randomly assigning households to either phone or in-person interviews. In the phone survey, average per capita consumption is 23 percent lower and the estimated poverty headcount is twice as high as in the in-person survey. We observe evidence of survey fatigue occurring early in phone interviews but not in in-person interviews; the bias is correlated with household characteristics. While the phone survey mode provides comparable estimates when measuring diet-based food security, it is not amenable to measuring consumption using the ‘best practice’ approach originally devised for in-person surveys.


Introduction
When it became clear in March 2020 that the spread of COVID-19 would become a pandemic, many surveys that had been taking place in-person could no longer be fielded due to the concern that they would contribute to virus spread. Yet in-person surveys are a key component of many research efforts and of monitoring outcomes such as progress towards the Sustainable Development Goals (SDGs). Without in-person surveys such as the Demographic and Health Surveys (DHS), Household Consumption Expenditure Surveys (HCES), Living Standards Measurement Surveys (LSMS), and other similar surveys conducted by national statistical offices, it is impossible to know what kind of progress is being made towards meeting the SDGs or reducing poverty in general.
The main pivot by many researchers during the early part of the pandemic was to begin conducting phone surveys. 1 There was a veritable explosion of efforts to collect some type of data to monitor situations over the phone, including major coordinated efforts by Innovations for Poverty Action (RECOVR) and the World Bank (Gourlay et al., 2021).
These efforts have played an important role in helping us to understand some of the socioeconomic consequences of the pandemic. In terms of living standards, these surveys generally asked about job loss and loss of income, and they tend to show substantial negative effects (Egger et al., 2021;Josephson et al., 2021;Miguel and Mobarak, 2021). Yet these findings are all based on crude measures, e.g., asking whether household income was lower, the same, or higher than it had been at the same time of the year 12 months ago.
Although these surveys provided valuable information about how living standards were qualitatively changing during the early part of the pandemic, there remain obvious ways that phone surveys cannot replace in-person surveys. Some variables require physical measurement; for example, it is impossible to study how stunting prevalence is evolving among children under 5 years of age without in-person data collection.
Similarly, collecting data on household consumption expenditures to estimate poverty incidence requires complex measurement. 2 The standard household consumption expenditure and poverty measurement involves administering detailed food and non-food consumption modules covering more than 100 items typically consumed in the country (Deaton and Grosh, 2000;Deaton and Zaidi, 2002). 3 Consequently, most phone surveys have not attempted to collect such data, in trying to minimize the time spent on the phone.
As researchers have shied away from collecting complex data over the phone, we lack data on specific trends through the pandemic. In reviewing impacts on incomes, Miguel and Mobarak (2021) do not even attempt to speak directly to trends in poverty incidence. Although modelers have predicted large increases in poverty incidence and rising food insecurity due to policies associated with the pandemic (e.g., Laborde et al., 2021; Lakner et al., 2021; Sánchez-Páramo et al., 2021; Sumner et al., 2020), the lack of data collected in-person means it is difficult to tell whether their predictions have come true.
The surveys that have tried to collect consumption data over the phone during the pandemic suggest the increases in poverty incidence are not as severe as either the crude income measures or models would suggest. Egger et al. (2021) report on phone surveys in Kenya and Sierra Leone that collected data on food consumption in both countries and non-food consumption in Kenya, and find that the value of food consumption increased in both countries, offset by a decline in non-food consumption in Kenya. 4 Janssens et al. (2020) study a sample of households in Kenya collecting financial diaries, and find that households sold assets to maintain food consumption levels. Hirvonen et al. (2021) also find no material change in the value of overall food consumption in a representative sample of Addis Ababa between an in-person survey conducted in 2019 and a phone survey conducted at the same time of year in 2020, though the composition of food consumption changed.
These surveys suggest it might be plausible to conduct phone surveys to measure consumption as it had been before and therefore poverty incidence, particularly if survey efforts first attempt to develop some rapport with households before the long consumption survey, as is true in all the surveys described above. But it is important to quantify differences between phone and in-person measures of consumption before making such conclusions. Therefore, here we test whether consumption data collected over the phone has a comparable distribution to data collected in-person, using a sample that has been asked about food consumption several times in the past. We randomly select half of the sample to be enumerated about consumption in-person, with the other half enumerated over the phone. We do not include other modules in the survey, so we cannot test other differences between phone and in-person surveys. However, note that we can generate other indicators that are often enumerated in phone surveys, such as the household diet diversity score (HDDS) and a food consumption score (FCS) providing alternative measures of the household's food security.
We can then compute poverty incidence using the consumption measures generated by both our phone sample and the in-person sample. Note that it is best to at least initially be agnostic about which sample provides a closer approximation of the "true" distribution of consumption, and therefore poverty incidence. Indeed, an important challenge in survey experiments such as ours is that we do not observe the "true value" against which to benchmark our estimates (De Weerdt et al., 2020). However, when we test for survey fatigue by randomly changing the order in which the food groups appear in the food consumption module, we observe evidence of survey fatigue occurring very early on in the phone interviews but not in the in-person interviews. It seems then that the in-person survey mode does perform better, resulting in less measurement error than the phone survey mode. Our assessment of data quality based on Benford's law also suggests that the consumption data from the in-person survey are of higher quality than the data from the phone survey. In heterogeneity analysis, we find that bias is attenuated among more educated household heads, and is positively related to household size. 5 This finding implies that the measurement error in phone survey mode is not classical and, as a result, cannot be easily corrected with standard methods used in the literature (Bound et al., 2001).
This paper contributes to the understanding of how variation in survey designs can shape data quality and ensuing analyses (De Weerdt et al., 2020;McKenzie and Rosenzweig, 2012;Zezza et al., 2017). Much of the previous work has focused on improving consumption measures used to measure poverty incidence (Abate et al., 2020;Ameye et al., 2021;Backiny-Yetna et al., 2017;Beaman and Dillon, 2012;Beegle et al., 2012;Caeyers et al., 2012;De Weerdt et al., 2016;Friedman et al., 2017;Gibson et al., 2015;Gibson and Kim, 2007;Jolliffe, 2001;Kilic and Sohnesen, 2019;Troubat and Grünberger, 2017). We add to this literature by systematically comparing consumption and poverty estimates generated from a phone survey to those from an in-person survey. Finally, many researchers have hypothesized that the phone survey mode is likely to be considerably more vulnerable to response fatigue than the in-person mode, leading to the widespread recommendation to keep phone-based interviews short, and to avoid complex questions (Dabalen et al., 2016;Gourlay et al., 2021). Our results on consumption measurement provide empirical support to this hypothesis. However, in our case, both survey modes result in similar estimates when measuring diet-based food security suggesting that the phone survey mode is appropriate for measuring simpler and cognitively less demanding indicators, as long as the interview time is kept relatively short (Abay et al., 2021a).

The survey experiment
We designed a survey experiment to understand the implications of using a phone survey mode for household consumption measurement by systematically contrasting responses from computer assisted personal interviews (CAPI, or in-person) and computer assisted telephone interviews (CATI, or phone). The survey instruments in the two survey modes were identical and had four sections. The interview began with a brief section containing only three questions needed to construct household size and its dependency ratio. In the first main section, respondents were asked to report on the household's food consumption for each item from a list of 118 food items, grouped into eight food groups. We first went through the list of 118 items asking whether the household consumed the item in the past seven days or not. The survey instrument was programmed to carry forward all items that were consumed in the past seven days to the next sub-section that asked about the consumption frequency ('on how many days was the item consumed') and quantity ('amount consumed') within the 7-day period. The second main section of the questionnaire included a short module asking about the household's food consumption outside of home within the same 7-day recall period. The final main section of the survey included a non-food consumption module, which asked respondents to recall household expenditures during the last month (e.g., toiletries or electricity expenditures) and during the last 12 months (e.g., school fees or health expenditures). The questionnaires administered to the two groups differed, then, only by the interview mode. In all other aspects, the questionnaire designs for the two groups were identical (Table 1). The full questionnaire is included in the Online Appendix.

Household sample
The household sample for this survey experiment originates from a randomized controlled trial (RCT) conducted to assess the impact of video-based behavioral change communication on fruit and vegetable consumption in Addis Ababa, Ethiopia. The baseline and endline surveys for the RCT took place in September 2019 and February 2020, respectively. 6 The sample of 930 households was randomly selected from six sub-cities, 20 woredas (districts), and 40 ketenas (neighborhoods; or clusters of households) within Addis Ababa. 7 Comparison of household characteristics against those reported in other surveys from Addis Ababa suggests that the sample is representative of the households residing in the city (Hirvonen et al., 2020).
The endline survey was administered just before the COVID-19 pandemic was declared in 2020, a timing that was well suited to launching COVID-19 phone surveys. Phone numbers were collected from 887 of the 895 households (99%) that took part in the February 2020 survey. To monitor the food security situation in Addis Ababa during the pandemic, we selected a random subsample of 600 households for monthly phone surveys. In total, four phone survey rounds were carried out between June and August 2020. In the August 2020 phone survey round, we administered the same food consumption module described above for all households selected for the phone surveys. Table A1 in the Appendix summarizes the various surveys with the sample of households used in this study.
The survey experiment contrasting consumption data collected via in-person and phone modes was administered over a 10-day period in September 2021 (i.e., one year after the last COVID-19 phone survey). 8 The sampling frame for this study was based on the 895 households that were interviewed during the in-person survey conducted in February 2020, the endline survey of the video RCT. Out of the 895 households, 448 were randomly selected for an in-person interview and 447 for a phone interview. 9 A total of 797 households were interviewed: 421 in the in-person group and 376 in the phone group. 10 Administering the consumption modules over the phone took 41 min on average (median), while the average (median) interview duration was 43 min for an in-person visit. The quality of the connection was generally good for the phone interviews and, based on the enumerators' assessment, rarely affected the interview quality. 11
The survey team tasked with the in-person surveys followed recommended COVID-19 preventive measures when visiting the households. First, both the enumerators and respondents were provided with facemasks that they were required to wear during the interview. Second, the enumerators were required to thoroughly wash their hands with soap for 20 s or use disinfectant (containing more than 70% alcohol) before entering and when leaving the respondent's premises. Third, the survey coordinator conducted daily check-ups with enumerators regarding any COVID-19 related symptoms. Finally, the interview was conducted outdoors with at least 2-m distance between the enumerator and the respondent.
Ethical approval for the survey experiment was obtained from the institutional review boards (IRB) of the International Food Policy Research Institute (IFPRI) and the College of Medicine and Health Sciences at Hawassa University in Ethiopia. Informed oral consent was obtained from all participants at the start of the interview. Enumerators provided respondents a brief overview of the study objectives and informed them that their participation in the study was entirely voluntary.

Data
Food consumed at home was reported in terms of quantities consumed, which we converted into local currency units (Ethiopian birr) using retail price data collected by the Central Statistical Agency (CSA) of Ethiopia. We used the retail price data for Addis Ababa from February 2020 (the latest month available to us) and then used a food-specific consumer price index for Addis Ababa to express our food consumption data in September 2021 prices. Food consumption outside the home as well as non-food expenditure were collected in birr terms, thus requiring no price adjustments.
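As a minimal sketch of the valuation step described above, the snippet below multiplies reported quantities by base-period retail prices and scales the result by the ratio of a food price index between the two periods. All item names, prices and index values are made up for illustration, the function name is hypothetical, and quantities are assumed to have already been converted to the unit in which prices are quoted.

```python
def value_food_at_home(quantities, base_prices, cpi_base, cpi_target):
    """Value quantities at base-period prices, then express them in target-period prices."""
    inflation_factor = cpi_target / cpi_base
    return {
        item: qty * base_prices[item] * inflation_factor
        for item, qty in quantities.items()
    }


# Illustrative inputs only; none of these numbers come from the survey or the CSA data.
quantities = {"teff": 5.0, "onion": 1.5}        # kg consumed in the past 7 days
base_prices = {"teff": 40.0, "onion": 20.0}     # birr per kg in the base month
values = value_food_at_home(quantities, base_prices, cpi_base=100.0, cpi_target=150.0)
weekly_food_at_home = sum(values.values())      # birr per week in target-period prices
print(values, weekly_food_at_home)
```

Food consumed outside the home and non-food expenditures would skip this step, since they were reported directly in birr.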
Each household's total consumption was calculated by first converting all consumption expenditure data to weekly terms and then adding up the three consumption components: food consumption at home; food consumption expenditures outside the home; and non-food expenditures. The official poverty data in Ethiopia come from the Household Consumption Expenditure Survey (HCES) collected every five years. The HCES survey is conducted throughout the Ethiopian calendar year to address consumption seasonality and covers nearly 400 food items and more than 850 non-food items. The latest HCES was administered in 2015/16, after which food prices and prices of non-food items have both been rising annually at a double-digit rate. Considering the high inflation rate and the considerable methodological differences between our survey and the HCES, we do not attempt to update the HCES poverty line for September 2021. Instead, we calibrate our poverty line for the in-person sample to match the 16.8 percent poverty headcount based on the national poverty line and reported for Addis Ababa using the 2015/16 HCES (FDRE, 2018).

Note: (*) 1 month for non-food expenditures such as toiletries and utilities and 12 months for expenditures such as school fees and health expenses.
6 The endline survey also included a survey experiment to quantify the degree of telescoping bias in recalled food consumption by experimentally varying the recall method, see Abate et al. (2020) for more details.
7 Melesse et al. (2019) provide a detailed description of the sampling strategy.
8 The exact dates were 31 August to 9 September 2021.
9 To ensure balance between the two groups, we block-randomized using the following variables: sex, age and education of the household head, household size, and an asset index. The data for these variables were collected in the previous in-person visits.
10 Out of the 70 households in the phone survey group that were not interviewed, 16 did not answer the call, 37 had their phone switched off or not working, 10 had wrong numbers, and 5 had no phone numbers. Only 2 households refused to take part in the phone survey.
11 At the end of each phone interview, we asked enumerators to rate the quality of the connection during the call. 74 percent of the phone interviews were rated as 'very good' ("we heard each other very well"), 19 percent as 'good', 5 percent as 'OK/average' and only 2 percent (5 interviews) as 'bad' or 'very bad'.

We also use our food consumption data to study how the phone survey mode affects household dietary diversity, an indicator of household food security (Hoddinott and Yohannes, 2002). First, we computed the HDDS of Swindale and Bilinsky (2006) by grouping the 118 food items in our consumption module into 12 food groups: cereals; roots and tubers; vegetables; fruits; meat, poultry and offal; eggs; fish and seafood; pulses, legumes and nuts; milk and milk products; oil and fats; sugar and honey; and miscellaneous foods. The HDDS is a sum of all food groups from which the household consumed food items during the 7-day recall period, with a minimum of one and maximum of 12. Second, we constructed the food consumption score (FCS) developed by the WFP (2008). The FCS combines dietary diversity and consumption frequency by grouping the consumed food items into nine groups and allocating more weight to protein rich foods. 12 The weighted FCS index ranges between zero and 112, with higher scores indicating a better food security situation.
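The two diet-based indicators can be sketched in a few lines, as an illustration rather than the authors' code. The group labels are shorthand chosen for this example, the mapping of the 118 items to groups is assumed to have been done beforehand, and the weights for main staples (2) and pulses (3) are taken from the standard WFP guidance rather than from the text above. Capping days at 7 is likewise an assumption of this sketch.

```python
# FCS group weights; labels are hypothetical shorthand for the nine WFP groups.
FCS_WEIGHTS = {
    "staples": 2.0, "pulses": 3.0, "vegetables": 1.0, "fruits": 1.0,
    "meat_eggs_fish": 4.0, "dairy": 4.0, "sugar": 0.5, "oil_butter": 0.5,
    "condiments": 0.0,
}


def hdds(consumed_groups):
    """Number of distinct HDDS food groups consumed during the recall period."""
    return len(set(consumed_groups))


def fcs(days_by_group):
    """Weighted sum of the number of days (capped at 7) each FCS group was consumed."""
    return sum(FCS_WEIGHTS[g] * min(days, 7) for g, days in days_by_group.items())


# Illustrative household, not taken from the survey data.
example_hdds = hdds(["cereals", "vegetables", "oil and fats", "cereals"])
example_fcs = fcs({"staples": 7, "pulses": 3, "vegetables": 5, "oil_butter": 7})
max_fcs = fcs({g: 7 for g in FCS_WEIGHTS})
print(example_hdds, example_fcs, max_fcs)
```

With every group consumed on all seven days the score reaches 112, which agrees with the upper bound stated in the text.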
After dropping two households with implausible consumption values, the final sample consists of 795 households, of which 421 are from the in-person group and 374 are from the phone group. Table 2 shows that the in-person and phone groups are similar in terms of basic household characteristics. Moreover, the households in the two subsamples are balanced in terms of the number of times they had been interviewed since September 2019. We also see no meaningful differences in the household per capita food consumption collected in September 2019, whether we examine means (Table 2) or full distributions (Figure A1 in the Appendix).

Estimation methods
We quantify the difference in reported household per capita consumption values across the two groups using ordinary least squares (OLS). In the most basic model, we regress both the per capita consumption value and its logarithm on a binary treatment variable valued one if the household was randomly selected into the phone group, and zero if into the in-person group. In subsequent models, we control for differences in basic household characteristics (household size, and household head's sex and level of education in years) as well as sub-city fixed effects. When we discuss percentage differences derived from the coefficients in semi-log regressions, they are based on the approximate unbiased variance estimator proposed by van Garderen and Shah (2002): $100 \times \left(e^{\hat{\beta} - 0.5\,\hat{V}(\hat{\beta})} - 1\right)$, where $\hat{\beta}$ refers to the estimated coefficient and $\hat{V}(\hat{\beta})$ to its estimated variance. Finally, the standard errors in all household level regressions are clustered at the enumeration area (ketena) level.
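To make the semi-log conversion concrete, the sketch below applies the formula above to a hypothetical coefficient and standard error. The function name and the numbers are illustrative only and are not the estimates from Table 3. The uncorrected transformation is computed alongside for comparison.

```python
import math


def semilog_percent_effect(beta, variance):
    """Percentage effect of a dummy variable: 100 * (exp(beta - 0.5 * V(beta)) - 1)."""
    return 100.0 * (math.exp(beta - 0.5 * variance) - 1.0)


# Hypothetical coefficient on the phone dummy and its standard error.
beta_hat, std_err = -0.25, 0.05
corrected = semilog_percent_effect(beta_hat, std_err ** 2)
uncorrected = 100.0 * (math.exp(beta_hat) - 1.0)
print(round(corrected, 2), round(uncorrected, 2))
```

Because the variance term is subtracted inside the exponent, the corrected figure is always slightly below the uncorrected one, and the gap shrinks as the coefficient is estimated more precisely.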

Household total per capita consumption
Fig. 1 contrasts the full distributions of (log) household weekly per capita consumption measured in birr between households that received an in-person visit and households that were interviewed over the phone. The estimated household consumption distribution for the phone group lies to the left of the distribution estimated for the in-person group, indicating that the phone survey yields lower total consumption values than the in-person survey across the whole distribution.
The regression estimates reported in Table 3 quantify the difference in household weekly food consumption when the data were collected over the phone relative to when the in-person survey mode was used. In columns 1 and 2, the dependent variable is the natural logarithm (ln) of household per capita consumption value in birr, whereas non-logged values are used in columns 3 and 4. Unadjusted estimates are reported in odd columns, whereas estimates in even columns are adjusted for differences in basic household characteristics as described above. Because the differences between the unadjusted and adjusted regressions are negligible, we focus our reporting and discussion on the adjusted regression results.
12 … (1); fruits (1); meat, eggs, fish (4); dairy products (4); sugar (0.5); oil/butter (0.5); and condiments (0).
Relative to the in-person survey, on average the phone survey mode decreases the reported household per capita consumption expenditures by 23 percent (Table 3, column 2). 13 The 95% confidence interval (CI) for this estimate ranges between −14.2 and −31.1. The estimates based on the non-logged per capita consumption variable are similar. Considering that the mean per capita consumption in the in-person group is 966 birr, the 201 birr difference reported in column 4 of Table 3 translates into 21 percent lower average per capita consumption in the phone survey group.

Components of consumption
Food consumed at home represents 50.3 percent of the total consumption among the in-person group and 55.8 percent among the phone survey group. 14 The regression estimates reported in Column 1 of Table 4 indicate that the reported per capita food consumption values are 13 percent lower on average when the phone survey mode is used (95-% CI: − 5.5; − 20.7). However, we do not find strong evidence to suggest that some food groups were more affected than others. We reestimated the main regression using the value of food consumption for each of seven categories of food as the dependent variable; in Figure A2 in the Appendix, we observe that all the coefficient estimates are negative and suggest 5 to 25 percent lower consumption, with overlapping confidence intervals.
About 60 percent of the households in our sample report having consumed food items outside of their home in the past 7 days. This reporting incidence varies by survey mode, with households in the phone survey group being 13 percentage points less likely to report having consumed foods outside their home (Table 4, column 2). A regression based on a non-logged outcome variable shows that the food expenditures outside of the home are 40.2 percent lower in the phone group relative to the in-person group (Table 4, column 3). 15
All the households in our sample report positive (non-zero) non-food consumption values. Column 4 in Table 4 shows the impact of the phone survey mode when the dependent variable is logged weekly per capita non-food consumption. On average, the phone survey mode lowers the reported non-food consumption by 30.1 percent (95-% CI: −15.5; −42.1).

Poverty estimates
Next, we estimate the impact of using the phone survey mode on poverty estimates. Since poverty is defined at the individual level, we need to convert our data from household to individual level. To do so, we use a weighted least squares regression method where the weights are frequency weights based on household size. Using our calibrated poverty line, in Table 5 we estimate that the poverty rate is 17 percentage points higher when the phone survey mode is used than when consumption data are collected through in-person visits (95-% CI: 9.99; 24.1). Since the poverty rate in the in-person sample is calibrated at 16.8 percent, using the phone survey mode effectively doubles the poverty rate in this context.
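The individual-level headcount and the calibration of the poverty line can be illustrated as follows. This is a sketch under stated assumptions, not the authors' procedure: the data are invented, the function names are hypothetical, and the calibration simply picks the lowest observed consumption value at which the person-weighted share below the line reaches the target. Without covariates, the coefficient on the phone dummy in a regression with household-size frequency weights equals the difference between two such weighted shares.

```python
def weighted_headcount(pc_consumption, hh_sizes, poverty_line):
    """Share of individuals living in households with consumption below the line."""
    poor = sum(n for c, n in zip(pc_consumption, hh_sizes) if c < poverty_line)
    return poor / sum(hh_sizes)


def calibrate_poverty_line(pc_consumption, hh_sizes, target_rate):
    """Lowest observed consumption value whose person-weighted headcount reaches the target."""
    total = sum(hh_sizes)
    people_below = 0
    for c, n in sorted(zip(pc_consumption, hh_sizes)):
        if people_below / total >= target_rate:
            return c
        people_below += n
    return float("inf")


# Invented weekly per capita consumption (birr) and household sizes.
in_person = [400, 600, 800, 1000, 1200]
phone = [300, 500, 550, 900, 1100]
sizes = [5, 4, 3, 2, 6]

line = calibrate_poverty_line(in_person, sizes, target_rate=0.25)
rate_in_person = weighted_headcount(in_person, sizes, line)
rate_phone = weighted_headcount(phone, sizes, line)
print(line, rate_in_person, rate_phone, rate_phone - rate_in_person)
```

The line is calibrated on the in-person sample only and then applied unchanged to the phone sample, mirroring the comparison described above.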

Table 5
Impact of phone survey mode on poverty rate.

Measures of food security
In Table 6, we report the impacts of using the phone survey mode on two widely used diet-based food security measures, HDDS and FCS. Both can be computed from the food consumption survey data. All four reported impact estimates are relatively small in magnitude and not statistically different from zero. The HDDS and FCS do not require respondents to estimate quantities consumed, only whether the food item was consumed in the past 7 days (HDDS) or the consumption frequency in terms of number of days in the past 7 days (FCS). In contrast, collecting data for food consumption measures is cognitively more demanding because it requires respondents to also estimate quantities consumed in the household during the recall period. Our results therefore indicate that the phone survey mode leads to estimates of diet-based food security that are similar to those from in-person surveys, but to much lower estimates of the value of household food or non-food consumption.

Survey fatigue
Our survey experiment shows that the phone survey mode leads households to underestimate their food and non-food consumption expenditures. As a result, if we trusted the phone survey mode and tried to use it in the same manner that we had used in-person surveys to measure poverty prior to the pandemic, we would conclude that the poverty headcount is twice as high using the phone survey data than the data collected in-person. Here, we study whether survey fatigue can help explain differences between results of the two survey modes.
The large difference in consumption and poverty incidence estimates between the two survey modes could result from respondent or enumerator fatigue. For example, fatigued respondents pay less attention when responding to cognitively demanding questions (e.g., amount or value of consumption), increasing the risk of measurement error. Survey experts have hypothesized that the risk of respondent fatigue is considerably higher in phone surveys than in in-person surveys (Dabalen et al., 2016;Gourlay et al., 2021). Consequently, it has been widely recommended to keep the phone survey duration short to minimize the risk of survey fatigue (Glazerman et al., 2020;Hoogeveen et al., 2014;Hughes and Velyvis, 2020;Jones and von Engelhardt, 2020;Kopper and Sautmann, 2020). While it is certainly intuitive that the risk of survey fatigue is higher in phone surveys, to the best of our knowledge, no studies have attempted to compare survey fatigue between phone and in-person modes using the same survey form.
Evidence from in-person surveys suggests that survey fatigue can lead to under reporting and overall deterioration of data quality in some settings (Ambler et al., 2021;Baird et al., 2008;Schündeln, 2018), but not always (Laajaj and Macours, 2021). 16 In a recent phone survey conducted in rural Ethiopia, Abay et al. (2021a) estimate that delaying the timing of a dietary diversity module by 15 min increased the likelihood that the respondents reported not to have consumed from certain food groups, resulting in an 8 percent decline in the mothers' dietary diversity score. 17 To explore the role of survey fatigue, we cross-randomized the order in which the food groups appeared in the first main section of the survey, the "food consumed at home" module. 18 Specifically, we implemented two versions of this food consumption module, ordering the food groups differently (see Appendix Table A2). For example, in version 1, mango appeared as the 5th item while in version 2, it appeared as the 73rd item. Similarly, in version 1, rice was the 52nd item on the list while in version 2, it was the 11th item on the list. Exploiting this variation, we use the food item level data to construct a variable that takes on the value of 1 when each food appears later in the questionnaire relative to the other version, and 0 otherwise. 19 Using the example above, this variable would be 1 when mangoes appear as the 73rd item, and when rice appears as the 52nd item. Using our food item level data, we then regressed the weekly household per capita consumption of the food item on this binary variable capturing the item's relative position in the questionnaire, and the indicator variable for the phone survey mode. To assess whether the impact of delaying when the item is asked in the module differs between phone and in-person survey modes, we interact the two variables and include the interaction term in the regression. 
In these regressions we control for food item fixed effects, meaning that our estimates are identified from variation in the survey mode or relative position in the questionnaire for the same food items. As additional controls, we include household size, an indicator variable for male-headed households, the head's years of education, and sub-city fixed effects. Table 7 provides the results. In column 1, we estimate the model without the interaction term. Moving the item later in the questionnaire results in a report that is, on average, 5.8 percent lower for the item than if it takes on its earlier position. 20 Using the phone survey mode, the average report suggests the value of consumption is 15.5 percent lower than found with the in-person survey mode. In column 2, we estimate the model with the interaction term. The basic variable now captures the effect of placing the item later in the questionnaire in the in-person survey; this coefficient is close to zero and not statistically significant. The CI is relatively tight around zero (95-% CI: −0.0167; 0.0016), indicating that survey fatigue does not play a role in the in-person survey mode, at least in this relatively early part of the questionnaire. In contrast, the coefficient on the interacted variable is negative, relatively large in magnitude, and statistically different from zero; it suggests that delaying an item in the phone survey mode leads to a report that is 11.9 percent lower on average than an item occurring later in the in-person survey. This finding is strongly suggestive that the in-person mode leads to less survey fatigue than the phone survey mode.
18 … also randomize the order in which questions are asked in their surveys to study survey fatigue.
19 As can be seen from Appendix Table A2, we administered two different versions of the food consumption module by simply changing the ordering of the food groups. As a result, we do not have sufficient variation in our data to test this with a 'distance variable' that captures the number of items between the version 1 and version 2.
20 The calculations in this paragraph are as follows: 5.8 percent lower is calculated as −0.230/3.97 and 15.5 percent lower is calculated as −0.615/3.97, using the estimates reported in Table 7, column 1, and 11.9 percent lower is calculated as [−0.014 + (−0.458)]/3.97, using the estimates reported in Table 7, column 2.
G.T. Abate et al.
In Appendix Table A3 we replicate this analysis, only considering the responses to the Yes/No questions regarding whether the household consumed the item or not during the 7-day period. Interestingly, all coefficients in the interacted model appear insignificant implying that only consumption quantity reports are affected, but not responses on whether the household consumed the item or not. This finding is in line with our earlier result according to which diet-based food security measures do not seem to be affected by variation in survey mode.
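The logic of the item-level fatigue regression can be illustrated with a stripped-down version that keeps only the two binary regressors and their interaction. In that saturated case the OLS coefficients are differences of cell means, so no regression library is needed. The sketch deliberately omits the food item fixed effects, household controls and clustered standard errors used in the paper; the toy data and function names are invented. Only the final percentage reuses the Table 7 figures quoted in the text.

```python
from statistics import mean


def appears_later(position_in_own_version, position_in_other_version):
    """1 if the item is asked later in the administered version than in the other one."""
    return int(position_in_own_version > position_in_other_version)


def saturated_ols(rows):
    """Coefficients of y on later, phone and later*phone, computed from the four cell means."""
    cell = {
        (a, b): mean(y for y, later, phone in rows if (later, phone) == (a, b))
        for a in (0, 1) for b in (0, 1)
    }
    b0 = cell[(0, 0)]
    b_later = cell[(1, 0)] - b0          # position effect in the in-person mode
    b_phone = cell[(0, 1)] - b0          # mode effect for items asked early
    b_inter = cell[(1, 1)] - cell[(1, 0)] - cell[(0, 1)] + b0
    return b0, b_later, b_phone, b_inter


# Mango is 5th in version 1 and 73rd in version 2 (from the text).
mango_later_in_v2 = appears_later(73, 5)

# Invented rows: (per capita value of the item, later, phone).
toy_rows = [
    (4.0, 0, 0), (4.2, 0, 0), (4.0, 1, 0), (4.1, 1, 0),
    (3.6, 0, 1), (3.8, 0, 1), (3.1, 1, 1), (3.3, 1, 1),
]
b0, b_later, b_phone, b_inter = saturated_ols(toy_rows)

# Percentage reported in the text: sum of two Table 7 coefficients over the mean of 3.97.
pct_phone_later = 100 * (-0.014 + -0.458) / 3.97
print(b_later, b_phone, b_inter, round(pct_phone_later, 1))
```

In this toy example the position effect is small for in-person interviews, while the interaction term captures the additional drop when an item is asked later over the phone, which is the pattern the paper interprets as mode-specific fatigue.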

Data quality
We next use Benford's law as a benchmark for assessing data quality. According to Benford (1938), the distribution of first digits in many numerical data sets approximately follows the probability (P):

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right),$$

where d ∈ {1, …, 9} refers to the first digit of the observation.
It is unlikely that survey data perfectly conform to the Benford's law distribution (Kaiser, 2019), but previous work (Abate et al., 2020; Garlick et al., 2020; Schündeln, 2018) has used the distance between the observed distribution and the predicted distribution under Benford's law as a measure of data quality. Here, we calculate this distance separately for the data collected by phone and for the data collected by in-person visits. Following Schündeln (2018), we compute normalized Euclidean distances between the observed first-digit distribution and the one predicted by Benford's law. 21 We use the digits of the quantities consumed as reported by the households in the food consumption module. The specific question asks for the quantity consumed and the unit (e.g., kg, litre, cup, or a locally used unit such as tassa). Of note is that Benford's law is scale-invariant; the law holds irrespective of the unit in which the consumed quantities were reported. Figure A3 in the Appendix reports the observed first-digit distributions in our data and compares them to the distribution predicted by Benford's law. 22 The null hypothesis that the observed distributions follow Benford's law is rejected for both groups (p < 0.001). However, relative to the in-person group, the phone group is much more likely to report the smallest possible value (i.e., value 1) as the first digit, possibly indicating limited cognitive engagement with the question.
Next, we calculate the Euclidean distances separately for each of the 33 consumption units reported by the households and for both survey mode groups. We then test whether the average Euclidean distances for each consumption unit differ statistically between the two groups by regressing the mean distance on our binary treatment variable. Table A4 in the Appendix shows that the coefficient on the treatment variable is positive and statistically different from zero, indicating that the data collected via the phone survey deviate more from Benford's law than data collected via the in-person survey. This finding suggests that the consumption data from the in-person survey are of higher quality than data from the phone survey.
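The distance measure described in this section can be sketched in a few lines of Python. This is an illustration only, not the Stata routine used for the paper: the function names and the example quantities are hypothetical, and the use of the sample (n − 1) standard deviation in the Z-score is an assumption, since the text only states that the pooled standard deviation is used.

```python
import math
from collections import Counter


def benford_probabilities():
    """Benford's law: P(d) = log10(1 + 1/d) for first digits d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the first significant digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. 0.25 -> '2.5...e-01'.
    return int(f"{abs(value):.15e}"[0])


def benford_distance(values):
    """Euclidean distance between observed and Benford first-digit percentages."""
    digits = [first_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    expected = benford_probabilities()
    squared = [
        (100 * counts.get(d, 0) / len(digits) - 100 * expected[d]) ** 2
        for d in range(1, 10)
    ]
    return math.sqrt(sum(squared))


def z_scores(distances):
    """Normalize distances with the pooled mean and (sample) standard deviation."""
    mean = sum(distances) / len(distances)
    sd = math.sqrt(sum((x - mean) ** 2 for x in distances) / (len(distances) - 1))
    return [(x - mean) / sd for x in distances]


# Hypothetical reported quantities for two groups (not data from this study).
group_a = [1, 1, 2, 1, 3, 1, 5, 1, 2, 1]
group_b = [1, 2, 3, 1, 4, 2, 6, 9, 1, 7]
print(benford_distance(group_a), benford_distance(group_b))
```

In the paper's application, one such distance would be computed for each consumption unit within each survey mode group before normalizing and comparing the groups.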

Heterogeneity
The results show that using the phone survey mode leads to substantial underestimation of household consumption expenditures. It is tempting to think that it could be possible to devise relatively simple adjustment factors to correct for this attenuation bias. Unfortunately, evidence from previous survey experiments suggests that because the measurement error is usually not independent of household characteristics (i.e., non-classical), such adjustment factors do not exist (De Weerdt et al., 2020). To explore the possibility that the effect of the phone survey mode varies by household type, we interacted the phone survey indicator variable with the household head's level of education and household size. Table 8 provides the results when household per capita food consumption (Columns 1-2) and non-food consumption (Columns 3-4) are used as the dependent variable. For household food consumption, we observe that the bias decreases with household head's education and increases with household size. 23 The former result suggests that respondents from more educated households better overcome survey fatigue in phone surveys. In contrast, the cognitive burden increases with household size as the number of consumption events is higher within the recall period (Fiedler and Mwangi, 2016; Gibson and Kim, 2007). Larger households are bound to have more consumption events than smaller households, making them more vulnerable to survey fatigue. For non-food consumption, the coefficients are of the same sign and magnitude but not statistically different from zero, possibly because of the larger variation in the data relative to the food consumption data. Overall, these heterogeneous impacts imply that adjustment factors to account for the bias caused by the phone survey mode cannot be easily developed.
Table 7 Impact of item's relative position in the questionnaire and phone survey mode on reported per capita food consumption value measured in birr.
Note: Ordinary least squares regression. Unit of observation is food item consumed (or not) in each household. Number of food items is 118 and number of households is 795 resulting in 93,810 observations. Dependent variable is household per capita consumption of the food item measured in birr. Standard errors are clustered at the food item level and reported in parentheses. Statistical significance denoted with * p < 0.10, **p < 0.05, ***p < 0.01.
21 The Euclidean distance is calculated as the square root of the sum of squared differences between the observed percentage and the percentage predicted by Benford's law. We further normalize the calculated distances by taking a Z-score: subtracting the mean distance and dividing this by the standard deviation calculated using the pooled data.
22 We calculated these distributions using a user-written Stata routine devised by Jann (2007).
23 Table 2 reports that the difference in household size between the two household groups is not statistically different from zero.

Enumerator effects
The 21 enumerators on the survey team were all trained together and supervised by the same survey coordinator. To simplify survey logistics, the enumerators were tasked with conducting either phone interviews or in-person interviews. This collinearity between enumerator assignment and survey mode raises a concern that the estimated survey mode effects could be completely driven by enumerator effects. 24 To address this concern, we conduct three robustness checks. First, we show that our main findings are robust to controlling for enumerator characteristics: age, level of education, and past survey experience (see Column 2 in Table A5 in the Appendix). Second, to explore whether one poorly performing enumerator in the phone survey group could explain our results, we assess the sensitivity of our result to omitting one enumerator at a time from the sample. Results are remarkably robust to running the main regression across these 21 sub-samples (see Figure A4 in the Appendix). Third, we show that our results are robust to controlling for enumerator random effects (Table A5, column 3 in the Appendix) as well as Mundlak (1978) correlated random effects (Table A5, column 4 in the Appendix). 25 Though we cannot use enumerator fixed effects, taken together this evidence suggests that enumerator effects could not have had much influence on the difference between in-person and phone survey results.

Cost considerations
Compared to in-person surveys, phone surveys are typically considerably less costly to administer (Gourlay et al., 2021). In this case, the cost per phone interview was approximately one third of the cost per in-person interview. The cost differences are mainly due to survey logistical costs (which are marginal for the phone survey but represent about a third of the total cost of the in-person survey) and survey personnel costs due to differences in the number of interviews per day. While there was not much difference in the time phone and in-person interviews took, phone enumerators were able to conduct about three times as many interviews in a day as in-person enumerators, because the phone mode allows them to make the next call as soon as they are ready, while the in-person mode requires enumerators to travel to the next household. However, there are a few ways that the in-person costs were minimized in this urban context. For instance, travel costs were relatively low, as enumerators could travel to the neighborhoods on their own, so vehicle rental was limited to supervisory vehicles. Had households been more spread out (e.g., in a rural survey), the cost difference would have been much larger.
The cost difference suggests that with the same resources, using phone surveys would allow for a sample size roughly three times larger than in-person surveys, in the same type of urban setting. Increasing the sample size that much implies a sizable gain in statistical power and thus improvement in the precision of consumption and poverty estimates. 26 However, as we have shown above, the phone survey mode comes with a systematic downward bias. Consequently, survey experts interested in measuring household consumption using the standard method face a trade-off between precision and accuracy when deciding between in-person and phone survey mode. In our view, the bias introduced by the phone survey mode in this context is too large to be ignored in exchange for potential gains in precision. If poverty incidence is to be measured with phone surveys, different methods that remain consistent with current approaches to poverty estimation are necessary.
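The trade-off between precision and accuracy can be illustrated with a stylized calculation. Under simple random sampling (an assumption of this sketch), tripling the sample shrinks the standard error of a mean by a factor of 1/√3, but it leaves a proportional bias untouched. In the Python sketch below, the mean, standard deviation and sample size are hypothetical placeholders rather than estimates from this study; only the 23 percent downward bias and the threefold sample size come from the text.

```python
import math


def rmse_of_mean(true_mean, sd, n, relative_bias=0.0):
    """Root mean squared error of a sample mean with a proportional bias.

    RMSE = sqrt(bias^2 + variance), with variance = sd^2 / n under
    simple random sampling (an assumption of this sketch).
    """
    bias = relative_bias * true_mean
    standard_error = sd / math.sqrt(n)
    return math.sqrt(bias ** 2 + standard_error ** 2)


# Hypothetical inputs, chosen only for illustration.
TRUE_MEAN = 100.0    # per capita consumption, arbitrary units
SD = 80.0            # standard deviation across households
N_IN_PERSON = 400    # in-person sample affordable with a given budget

in_person = rmse_of_mean(TRUE_MEAN, SD, N_IN_PERSON)
# The same budget buys roughly three times as many phone interviews,
# but reports are about 23 percent too low on average.
phone = rmse_of_mean(TRUE_MEAN, SD, 3 * N_IN_PERSON, relative_bias=-0.23)

print(f"in-person RMSE: {in_person:.2f}")
print(f"phone RMSE:     {phone:.2f}")
```

With these placeholder values the bias term dominates the total error of the phone estimate even though its standard error is smaller, which is the sense in which gains in precision do not offset a bias of this size.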

Conclusions
Pre-pandemic, development economists and practitioners were using phone surveys in only a few contexts. In research, they were used when projects required high-frequency data or in contexts that were difficult to reach (Dabalen et al., 2016; Dillon, 2012; Hoogeveen et al., 2014). Meanwhile, WFP (2017) was building up knowledge about how to use phone surveys to monitor food insecurity. As the pandemic began, phone surveys suddenly became the only option for many types of data collection, and research on living standards and food insecurity shifted rapidly to phone surveys to understand the socioeconomic implications of the pandemic.
The subsequent COVID-19 phone surveys have provided important information about the socioeconomic consequences of the pandemic in many low- and middle-income countries with limited infrastructure to provide real-time economic or employment data to inform policy decisions. However, the economic information collected at the household level has been largely restricted to subjective indicators measuring income or employment losses, offering limited information about the severity or depth of the crisis (De Weerdt, 2008; Hirvonen et al., 2021). 27 Indeed, there have been only a few attempts to measure household consumption to inform how progress toward meeting the first Sustainable Development Goal of 'No Poverty' has been affected by the pandemic. Finally, there remains considerable uncertainty about the implications of the phone survey mode for data quality, particularly in low- and middle-income country contexts where the pre-pandemic roll-out of phone survey technology and testing had been relatively slow (Gourlay et al., 2021).
Our research begins to address some of these important methodological knowledge gaps. To measure the extent of bias on household consumption measures in phone surveys, we conducted a survey experiment in Addis Ababa, Ethiopia, randomly assigning a balanced and representative sample to either a phone or an in-person interview mode. We find the phone survey mode leads to a statistically significant and large underestimation of household consumption. Relative to the in-person survey mode, the phone survey mode decreases the reported household per capita consumption expenditures by 23 percent, on average. Consequently, the estimated poverty rate is twice as high when the phone survey mode is used.
24 Previous work in this area has found that enumerator effects play a negligible role in shaping survey responses, unless the questions are sensitive in nature (Di Maio and Fiala, 2020).
25 The random effects estimator controls for enumerator heterogeneity by decomposing the unobserved heterogeneity into variance occurring between enumerators and within enumerators (i.e., across different interviews conducted by the same enumerator). The key assumption of the random effects estimator is that the correlation between the treatment status and the random effects is zero, or in the correlated random effects model, that it takes on a specific parameter. We acknowledge that, in our application, this assumption may not hold. However, simulation studies suggest that the 'heterogeneity bias' stemming from the violation of this assumption is relatively small (see Bell and Jones, 2015). Considering this point and the fact that the estimated coefficient based on the random effects estimator is very close to the coefficient reported in column 2 of Table 3, we believe that unobserved enumerator effects are not driving our results.
26 There is another channel through which phone surveys can be more efficient than in-person surveys. In-person surveys typically require cluster sampling to simplify logistics and reduce potentially sizable transportation costs (particularly in rural areas). As the same logistical concerns are absent in phone surveys, they permit applying simple random sampling through random direct dial techniques, which is more efficient than cluster sampling.
27 At the same time, with imperfect and non-random mobile phone access in rural areas, the data may not be representative as the poor and people in more remote areas may have less access to phones or be outside of coverage areas when phone surveys are fielded (Ambel et al., 2021; Brubaker et al., 2021).
We should therefore reinterpret the results in Hirvonen et al. (2021), which used the same household sample to show that the total value of food consumption expenditures had not changed much between August-September 2019 and August 2020. The former survey was conducted in person and the latter by phone. If we use the results here to reinterpret that paper, it seems that, if anything, the average value of food consumption rose by August 2020. Moreover, that paper shows that the value of relatively nutritious foods might have declined; that concern is much reduced given that those results likely underestimate all categories of food consumption.
The mechanism appears to be linked to survey fatigue, which leads phone survey respondents to greatly underestimate consumption quantities, but not whether they consumed the item during the recall period. Our heterogeneity analysis suggests the bias increases when more people eat within the household, possibly because of the increased cognitive burden of remembering a larger number of consumption events. In contrast, the bias is attenuated by education, suggesting that more educated individuals can overcome issues of attention.
Our study has some important limitations. First, our sample is not nationally representative and, importantly, does not cover rural households, which are typically poorer and consume fewer food and non-food items. Consumption surveys in rural areas could take less time to complete than in urban areas, making the phone survey mode more feasible. 28 Another external validity concern relates to the fact that the household sample used in this study had responded to two or three food consumption surveys prior to this survey experiment (see Table A1 in the Appendix). Consequently, the households in our sample may have become more attuned to recalling consumption events than a new, randomly selected sample of households. Finally, while we hypothesize that the documented survey fatigue is driven by respondents, the design of our experiment does not allow us to distinguish between fatigue among respondents and fatigue among enumerators.
These limitations aside, our findings suggest that while phone surveys can provide large cost savings, they cannot replace in-person surveys for standard household consumption and poverty measurement, as outlined in Deaton and Grosh (2000). However, the phone survey mode does appear to be useful for monitoring diet-based food security indicators that do not require information about the quantities consumed, as used by the WFP (2017) in their Vulnerability Analysis and Mapping surveys.
Given the prevalence of cell phone ownership, figuring out how to use phone survey data to best contribute to accurate consumption and poverty measurement in low- and middle-income countries forms an important future research agenda. One option is to substantially shorten the consumption modules to accommodate the greater risk of survey fatigue in phone surveys. However, the available evidence from low- and middle-income country contexts suggests that shorter modules systematically underestimate consumption levels and thus overestimate poverty headcounts (Beegle et al., 2012; Jolliffe, 2001; Pradhan, 2009). Therefore, when adjusting the consumption module length, survey practitioners need to balance accuracy against survey fatigue. Finding a balance in which accuracy is maximized and the risk of survey fatigue minimized in phone surveys constitutes an important task for future survey methodology research. 29 Another option is to rely on cross-survey imputation methods. In recent years, these methods have become popular among poverty economists to estimate poverty in contexts and periods lacking consumption survey data (e.g., Dang et al., 2021; Douidich et al., 2016; Stifel and Christiaensen, 2007). These types of imputation methods typically begin by using a household consumption survey to regress household consumption expenditures on a set of household characteristics, such as household demographics, employment status, and asset and education levels. The estimated model parameters are then applied to the same characteristics collected in another survey to predict household consumption expenditures and poverty rates. Phone surveys could be used to (relatively inexpensively) collect data on these household characteristics, link these data to a previous household consumption expenditure survey, and estimate poverty using cross-survey imputation methods.
However, the validity of this approach rests on some important assumptions. First, the relationship between household consumption expenditures and its predictors should remain stable over time (Christiaensen et al., 2012). Considering relative price changes occurring as a consequence of the COVID pandemic and the conflict between Russia and Ukraine, where and when this assumption would hold remains an open question. Second, linking parameters estimated from an in-person consumption survey to household characteristics obtained from a phone survey assumes that survey mode effects do not matter (Kilic and Sohnesen, 2019). Considering the evidence presented here and other emerging work testing survey mode effects (e.g., Garlick et al., 2020), this is clearly a strong assumption requiring further validation. Third, one must always be cognizant that phone ownership is correlated with income, and lower income people with phones may be less likely to keep them turned on (and therefore answer calls), to preserve their batteries.
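The two-step logic of cross-survey imputation described above can be sketched as follows. This is a deliberately simplified Python illustration on synthetic data, not any of the cited methods: the two predictors, the coefficient values, the poverty line and all function names are hypothetical, and the sketch uses a single draw of a normally distributed error term, whereas applied work typically averages over many simulated draws.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, y):
    """OLS coefficients (intercept first) via the normal equations X'X b = X'y."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(beta, rows):
    """Linear prediction for each row of characteristics."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]


def poverty_headcount(log_consumption, log_poverty_line):
    """Share of households below the poverty line."""
    return sum(c < log_poverty_line for c in log_consumption) / len(log_consumption)


def simulate_households(n):
    """Synthetic (household size, head's years of education) pairs."""
    return [(random.randint(1, 8), random.randint(0, 16)) for _ in range(n)]


random.seed(0)

# Step 1: a consumption survey with log per capita consumption and characteristics.
base_x = simulate_households(500)
base_y = [7.0 - 0.10 * s + 0.05 * e + random.gauss(0, 0.3) for s, e in base_x]
beta = fit_ols(base_x, base_y)
residuals = [y - p for y, p in zip(base_y, predict(beta, base_x))]
sigma = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - len(beta)))

# Step 2: a later survey (for example by phone) that collected characteristics only.
phone_x = simulate_households(500)
imputed = [p + random.gauss(0, sigma) for p in predict(beta, phone_x)]
headcount = poverty_headcount(imputed, log_poverty_line=6.5)
print("imputed poverty headcount:", headcount)
```

The sketch makes the first two assumptions discussed above concrete: the coefficients estimated in step 1 are taken to still hold when step 2 is fielded, and the characteristics in step 2 are taken to be measured in the same way regardless of survey mode.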
Finally, it would be useful to experiment with split questionnaire designs in a phone survey setup. In this method, respondents are randomly assigned fractions of the full questionnaire and the missing data are then imputed using multiple imputation techniques (Raghunathan and Grizzle, 1995). Recent applications of a split questionnaire design with in-person surveys suggest that the approach can produce reliable consumption and poverty estimates with considerably shorter interview durations (Pape, 2021; Pape and Mistiaen, 2015). Whether split designs could be used to generate low-bias estimates of poverty incidence with phone surveys remains an open question.

Data availability
Part of the data are publicly available; the remainder are in the process of being made available. Code will be made available once the review process is complete.
28 However, limited and unequal access to phones can be a major obstacle to administering representative phone surveys in rural areas. For example, in Ethiopia, only 40 percent of rural households have access to a phone, and those that do tend to be more educated and wealthier (Wieser et al., 2020). Furthermore, rural households tend to be larger than urban households, potentially exacerbating bias related to household size.
Note: Both phone and in-person surveys included two types of food consumption modules that differed in the order in which the food groups appeared in the questionnaire. This table shows the order of food groups in both questionnaire types.
Note: Based on ordinary least squares regression. Unit of observation is household; N = 795. All regressions included household level controls (household size, indicator variable for male-headed households, and head's education in years) and sub-city fixed effects. Dots quantify the difference in household per capita consumption-expenditure (in birr) when the phone survey method is used relative to when the in-person method is used. The difference is measured as a percent of the mean household per capita consumption-expenditure value reported in the in-person group. Capped bars are 95% confidence intervals, calculated from standard errors clustered at the enumeration area level.
Note: Ordinary least squares regression. Unit of observation is food item consumed (or not) in each household. Number of food items is 118 and number of households is 795 resulting in 93,810 observations. Dependent variable takes the value 1 if the household reported having consumed the item in the past week, zero otherwise. 0/1 = binary variable. Standard errors are clustered at the food item level and reported in parentheses. Statistical significance denoted with * p < 0.10, **p < 0.05, ***p < 0.01.
Note: Dependent variable is Euclidean distance to the distribution predicted by Benford's law. Unit of observation is the unit in which the quantity consumed was reported (one for each group). Coefficients measure Z-scores. Standard errors are clustered at the food item level and reported in parentheses. Statistical significance denoted with * p < 0.10, **p < 0.05, ***p < 0.01.
0.288 0.290 n/a n/a
R² within n/a n/a 0.224 0.224
R² between n/a n/a 0.600 0.652
R² overall n/a n/a 0.286 0.290
Note: Ordinary least squares regression. Unit of observation is household. Dependent variable is (ln) household total per capita consumption (in birr). Household level controls include household size (number of members), indicator variable for male-headed households, and household head's education in years. Enumerator characteristics include enumerator's age, level of education, and survey experience (number of surveys involved in since September 2019). Standard errors are clustered at the enumeration area level and reported in parentheses. Statistical significance denoted with * p < 0.10, **p < 0.05, ***p < 0.01.
Table 3. The maroon hollow dots are equivalent OLS estimates for 21 different sub-samples when one enumerator is dropped from the dataset. The capped vertical lines represent the corresponding 95% confidence intervals.