Household electricity access in Africa (2000–2013): Closing information gaps with model-based geostatistics

Household electricity access data in Africa are scarce, particularly at the subnational level. We followed a model-based geostatistics approach to produce maps of electricity access between 2000 and 2013 at a 5 km resolution. We collated data from 69 nationally representative household surveys conducted in Africa and incorporated nighttime lights imagery as well as land use and land cover data. The information produced here can aid understanding of how electricity access changed in the region during this 14-year period. The resolution and the continental scale make it possible to combine these data with other sources for socio-economic applications at both local and regional levels.


C Gaussian Process Regression
Let f_x ∼ GP(M, K) be a Gaussian process with domain X, mean function M and covariance K. The common modeling assumption in GP models is that the response variables {y_i}_{i=1}^n are noisy observations of the latent variables {f_{x_i}}_{i=1}^n, and that the noise follows a zero-mean Gaussian distribution [1]. Hence

y_i = f_{x_i} + ε_i,   ε_i ∼ N(0, σ²),

where σ² is regarded as the noise variance.
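As a concrete illustration of this setup, the following numpy sketch computes the GP posterior given noisy observations y_i = f_{x_i} + ε_i. The squared-exponential covariance and all numbers are illustrative choices for this sketch, not the covariance or data used in the paper.

```python
import numpy as np

def sq_exp_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential covariance (illustrative stand-in for K)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, Xs, noise_var=0.01):
    # Posterior mean and variance of f at test points Xs (zero prior mean M)
    K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))
    Ks = sq_exp_kernel(X, Xs)
    Kss = sq_exp_kernel(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss - v.T @ v)
    return mean, var

# Noisy observations y_i = f(x_i) + eps_i of a latent function
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=20)
mean, var = gp_posterior(X, y, np.array([[2.5]]))
```

The Cholesky-based solve avoids explicitly inverting K, which is the standard numerically stable route for exact GP inference.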
In the case of the geostatistical models implemented in this work, X represents a two-dimensional space and {x_i}_{i=1}^n are a set of spatial indices that vary continuously in the domain X. The spatial process f_x is a random field that follows a multivariate Gaussian distribution with mean M and covariance K. This methodology is not limited to the assumption of additive Gaussian noise. Other families of distributions can be used, as long as the latent variables are related to the response through some monotonic transformation of the form φ : R → Y, where Y represents the space of the response variable [2][3][4][5][6].
A binomial noise model, rather than Gaussian, provides an adequate representation of the stochasticity observed in the survey data modeled. Such data correspond to the number of households with access to electricity out of a total number of households in a village. We can assume an inverse logit transformation that shrinks the real space into the interval (0, 1), so that

y_i ∼ Binomial(n_i, φ⁻¹(f_{x_i})),

where n_i represents the total number of households associated with the i-th village.

B1 Table. Survey points used. For each country-year we present above in square brackets the Survey ID, below at the left the total number of data points, and below at the right the number of data points considered valid and used. The country names are abbreviated according to their ISO 3166-1 alpha-3 code as defined in Table B2 in S1 File.
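A minimal sketch of this observation model, with hypothetical latent values and household counts; the inverse logit plays the role of φ⁻¹:

```python
import numpy as np

def inv_logit(eta):
    # Maps the real line into (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

rng = np.random.default_rng(1)
eta = rng.normal(0.0, 1.0, size=5)     # latent values f_x (illustrative)
n = np.array([30, 45, 12, 60, 25])     # households per village (hypothetical)
p = inv_logit(eta)                     # probability of electricity access
y = rng.binomial(n, p)                 # households with access, out of n
```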

D Stochastic Partial Differential Equation and Integrated Nested Laplace Approximation
Spatial models commonly rely on covariance functions that depend on the Euclidean distance d = ||x_i − x_j||. The Matérn covariance, for example, is defined as

C(d) = σ² (2^{1−ν} / Γ(ν)) (κ d)^ν K_ν(κ d),

where κ and ν are non-negative parameters, and K_ν is the modified Bessel function of the second kind and order ν.

B2 Table. ISO 3166-1 alpha-3 codes. This list only contains countries from which we analyzed georeferenced survey data.
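The Matérn covariance can be sketched with scipy's modified Bessel function K_ν. The parameterization below (scale κ, order ν) is the common SPDE convention and is an assumption of this sketch, not taken from the paper's code:

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(d, kappa=1.0, nu=1.0, sigma2=1.0):
    # Matern covariance as a function of Euclidean distance d (d >= 0)
    d = np.asarray(d, dtype=float)
    out = np.full(d.shape, sigma2)     # limit at d = 0 is sigma^2
    nz = d > 0
    kd = kappa * d[nz]
    out[nz] = sigma2 * (2 ** (1 - nu) / gamma(nu)) * kd ** nu * kv(nu, kd)
    return out

d = np.linspace(0.0, 5.0, 6)
C = matern_cov(d, kappa=1.0, nu=1.0)   # decays monotonically from sigma^2
```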
Exact inference with a GP regression has a computational complexity of O(n³), where n is the number of training cases. Such a cost arises from the need to invert the covariance matrix. Thus, large datasets prevent the use of O(n³) algorithms. An alternative is provided by the stochastic partial differential equation (SPDE) approach [7]. The SPDE approach approximates the Gaussian random field (assuming a Matérn covariance) with a Gaussian Markov random field, allowing the use of a sparse precision matrix and reducing the complexity to O(n^{3/2}). The SPDE approach can be applied for both stationary and non-stationary processes. In the latter case, used in this work, the starting point of the approximation is the equation

(κ(x)² − Δ)^{α/2} (τ(x) f_x) = w_x,

where κ(x) and τ(x) are parameters that vary in space, Δ is the Laplacian, α controls the smoothness (here we assumed α = 2, which is the default value in R-INLA), and w_x is a Gaussian spatial white noise process. The nominal approximations to the variance and range of the spatial correlation are given by

σ² = 1 / (4π κ² τ²)   and   r = √8 / κ.

The range is defined as the distance at which the spatial correlation is close to 0.1. The SPDE representation also makes it possible to use the Integrated Nested Laplace Approximation (INLA) to handle the non-linear transformation on the latent variable [8]. The INLA model computes the posterior marginals of the latent field through a step-wise implementation of a simplified Laplace approximation with a cost of O(n² log n). We used the SPDE and INLA algorithms provided by the R-INLA package [9,10].
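Under the stated assumptions (α = 2 in two dimensions, hence ν = α − d/2 = 1), the nominal variance and range reduce to simple closed forms. A small sketch, with illustrative values of κ and τ:

```python
import math

def nominal_variance(kappa, tau, nu=1.0, d=2):
    # sigma^2 = Gamma(nu) / (Gamma(nu + d/2) * (4*pi)^(d/2) * kappa^(2*nu) * tau^2)
    return math.gamma(nu) / (math.gamma(nu + d / 2)
                             * (4 * math.pi) ** (d / 2)
                             * kappa ** (2 * nu) * tau ** 2)

def nominal_range(kappa, nu=1.0):
    # Distance at which the spatial correlation drops to about 0.1
    return math.sqrt(8 * nu) / kappa

kappa, tau = 2.0, 0.5
sigma2 = nominal_variance(kappa, tau)   # 1 / (4*pi*kappa^2*tau^2) when nu=1, d=2
r = nominal_range(kappa)                # sqrt(8) / kappa
```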

E Sensitivity Analysis
Statistical inference is affected by the random displacement of cluster locations. To mitigate this effect, DHS has proposed some guidelines for using their data along with spatially defined predictors or covariates [11]. These guidelines cover three specific cases: distance-based analyses, raster-based analyses and point-in-polygon analyses. The model we propose falls in the raster-based category, as this is the format in which the covariate information we are using is stored: a regular grid of values in which each value represents a pixel area. In this case, the DHS guidelines suggest that covariates are not computed by direct cell extraction (the pixel value that corresponds to the GPS location), but as an average value across a 5 km buffer around the reported GPS location.
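A rough sketch of the buffer-averaging idea on a raster grid. The pixel size, the grid-index circle standing in for a geodesic buffer, and all values are simplifying assumptions of this sketch; a real workflow would use a GIS library:

```python
import numpy as np

def buffer_mean(raster, row, col, radius_px):
    # Mean of raster cells whose centers fall within radius_px of (row, col)
    rows, cols = np.ogrid[:raster.shape[0], :raster.shape[1]]
    mask = (rows - row) ** 2 + (cols - col) ** 2 <= radius_px ** 2
    return raster[mask].mean()

# Hypothetical 1 km pixels: a 5 km buffer spans 5 pixels
rng = np.random.default_rng(2)
raster = rng.uniform(0, 63, size=(100, 100))   # e.g. NTL digital numbers
direct = raster[50, 50]                        # direct cell extraction
buffered = buffer_mean(raster, 50, 50, radius_px=5)
```

The buffered value smooths out single-pixel noise, which is what makes the covariate robust to the cluster displacement described above.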
We performed a simulation analysis to assess the impact of the location displacement. We generated a synthetic dataset of 10,000 observations according to the pseudo-code shown in Table E1 in S1 File. First, we fitted a GLM to a sample of the survey data and assumed that the residuals were given by a spatial random field. Second, we used these residuals and the GLM fitted values as the inverse logit transformation of the mean of a binomial random process.
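These two steps can be sketched as follows. The fit here is a crude least-squares fit on the empirical logit, a stand-in for the paper's GLM, and all sizes and coefficients are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
m, q = 200, 3                            # observations, covariates (illustrative)

# Sample observations {y_i, n_i} (synthetic stand-ins for survey data)
n = rng.integers(20, 80, size=m)         # households per cluster
Z = rng.normal(0, 1, size=(m, q))        # buffered covariates z_ij
beta_true = np.array([0.8, -0.5, 0.3])
p_true = 1 / (1 + np.exp(-(Z @ beta_true)))
y = rng.binomial(n, p_true)

# Fit a GLM; here a least-squares fit on the empirical logit
# (a crude stand-in for binomial IRLS, for illustration only)
p_hat = np.clip(y / n, 0.01, 0.99)
beta_hat, *_ = np.linalg.lstsq(Z, np.log(p_hat / (1 - p_hat)), rcond=None)

# Residuals on the proportion scale, to be treated as a spatial field
eps = y / n - 1 / (1 + np.exp(-(Z @ beta_hat)))
```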
As this is merely a spatial phenomenon, we did not consider time in this part of the analysis. The synthetic dataset was based on a sample of the available survey data between 2008 and 2012. We sampled across multiple years to ensure a larger spatial coverage of the data, but constrained the sample to a window of four years to reduce variation due to time.
E1 Table. Synthetic data generation. Steps followed to generate the synthetic dataset for sensitivity analysis.
Pseudo code for data simulation
1  Sample m observations {y_i, n_i} from data.   # y_i: positives; n_i: total
2  For each observation i ← 1 to m:
3    For each covariate j ← 1 to q:              # q: number of covariates
4      Compute z_ij using a 5 km buffer          # z_ij: covariate j at point i
5  Fit a GLM and estimate {β_j}.
   # β_j: regression parameter j
6  Compute the error vector ε̂ = [y_i/n_i − ŷ_i/n_i].
7  For each observation i ← 1 to m:

For the purposes of this exercise we assumed that the synthetic data locations, which correspond to the reported survey locations, were not displaced. We then generated a new obfuscated dataset (with displaced locations) based on these data, following the pseudo-code of Table E2 in S1 File. The DHS methodology for data protection displaces cluster locations up to 2 km in urban areas and up to 5 km in rural areas, with 1% of the locations displaced up to 10 km. Our algorithm did not displace the locations using distances measured in the metric system, but in coordinate degrees. We assumed that 1 degree is equivalent to 111 km, which is the approximate ratio of kilometers per degree at the Equator. That means that the offset we used in this simulation is larger the farther from the Equator. In addition, for simplicity we did not make any distinction between urban and rural clusters and applied a displacement of up to 5/111 degrees for 99% of the data points and up to 10/111 degrees for the remaining 1%. Fig. E1 in S1 File shows the displacement of a sample of points with respect to their assumed original locations, and a histogram of the displacement magnitudes applied to the whole dataset. Note that the proportion of points tends to increase as the offset increases, because the area of the disc around the original location grows with the distance from it.
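The displacement scheme just described can be sketched as below. Sampling the radius as r_max·√U makes the draw uniform over the disc, consistent with the observation that larger offsets are more frequent; the coordinates and seed are illustrative:

```python
import math
import random

DEG_PER_KM = 1.0 / 111.0        # approximate degrees per km at the Equator

def displace(lon, lat, rng):
    # Up to 5 km for 99% of points, up to 10 km for the remaining 1%
    r_max = (5 if rng.random() < 0.99 else 10) * DEG_PER_KM
    # radius ~ r_max * sqrt(U) gives a uniform draw over the disc
    r = r_max * math.sqrt(rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return lon + r * math.cos(theta), lat + r * math.sin(theta)

rng = random.Random(4)
new_lon, new_lat = displace(30.0, -1.5, rng)
```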
Pseudo code for location displacement (Table E2 in S1 File, fragment)
3  If u < 0.99:
4    Define radius r = 5/111
5  Else:
6    Define radius r = 10/111
7  Define a buffer of radius r around x_i    # x_i: GPS coordinates of point i
8  Randomly sample a location within the buffer

We fitted a geostatistical model to both datasets, non-obfuscated and obfuscated, according to

y_i ∼ Binomial(n_i, p_i)   and   φ(p_i) = Σ_j τ_j LULC_ij + Σ_j β_j z_ij + f_{x_i},

where φ is the logit transformation, τ_j are the coefficients of the binary variables indicating the LULC category, z_ij are the continuous covariates (IAP, NTL and PIA), β_j are the coefficients of the continuous covariates, f_{x_i} is the spatial random field, and x_i are the spatial reference points. Fig. E2 in S1 File shows a comparison of the marginal distributions of the fixed-effects parameters when using the obfuscated and non-obfuscated datasets. The true parameters are also shown with a vertical red line. In general, the true parameters are not covered by areas of high density of the marginal distributions, which reflects that both cases miss the right value. The distributions of all parameters with both datasets present very similar dispersion. In addition, for 6 of the 10 parameters estimated there is a clear overlap between the marginal distributions of the two models. This suggests that results tend to be consistent and not highly affected by the location displacement. Fig. E3 in S1 File shows a comparison of the marginal distributions of the spatial hyperparameters when using obfuscated and non-obfuscated datasets. In this case we do not have a ground truth to compare with. The marginal distribution of κ is shifted to the right when using obfuscated data. We will not dwell on the comparison of marginal distributions of this parameter, since its interpretation is not intuitive. We will expand instead on the nominal variance and the nominal range, which depend on κ and are easier to interpret. The nominal variance of the model with obfuscated data is larger, which means a higher variability of the spatial random field and is possibly driven by the noise due to the location displacement.
The nominal range of the spatial covariance is smaller when modeling obfuscated data. This means that the radius within which two points present a correlation above 0.1 is shorter when the data have been displaced. This reduction in the range prevents correlating points that would be assumed to be correlated if using the range of the non-obfuscated data. As a result, the model for obfuscated data relies less on the proximity of observations. The latter seems adequate considering that we are not certain about the true location of the data. Once again, this lack of certainty may be a driver of the larger nominal variance. Fig. E4 in S1 File shows a diagram explaining the effect of the reduction in the range. The red points represent the actual locations of three observations. The radius of the blue circle represents a hypothetical large nominal range. All points within the blue circle are correlated with A (ρ > 0.1, to be more precise). Such is the case of point C, but not of point B. The magenta dots represent potential displaced locations of points B and C. For the sake of this example we do not displace point A. If we use the large nominal range, point A will still be correlated with most of the displaced locations of C. In addition, A will also be assumed to be correlated with some displaced locations of B. However, if instead we use a short nominal range (the radius of the yellow circle), A will be considered to be correlated only with a few displaced locations of C, and it will no longer be correlated with any of the displaced locations of B. In addition to the comparison of the marginal distributions, we compared the odds ratios of the fixed effects. First, we compared the true values vs the estimates with obfuscated data (see Table E3 in S1 File); second, we compared the estimates using non-obfuscated data vs obfuscated data (see Table E4 in S1 File). In both tables we see that the odds ratios are consistent. Only NTL and PIA present ratios larger than one.
The rest of the odds ratios are smaller than one. Moreover, the difference between the ratios of the obfuscated-data estimates and the base of comparison (true values or non-obfuscated estimates) is in general small relative to the magnitude of such base estimates. These tables show that the random displacement does not substantially perturb the fixed-effects estimates.
E3 Table. Odds ratio estimates (true vs displaced). Comparison of the odds ratios used to generate the synthetic data and their mean estimates using displaced locations.
Finally, in Fig. E5 in S1 File we compare the target value estimates: in panel A, we compare the estimates using the non-obfuscated data vs the true target values; in panel B, the estimates using the obfuscated data vs the true values; in panel C, the estimates using the obfuscated data vs the estimates using the non-obfuscated data. By target values we mean the logit of the probability of having electricity access; this is the η_i value on the left-hand side of Eq S8. Panels A and B show that the target estimates are aligned around the true values, and panel C shows that the estimates obtained using exact and displaced locations are consistent with each other. Another way of viewing this is that the target value estimates with displaced locations are an approximation of the estimates that we could obtain with the exact locations. As expected, the estimates when using obfuscated data are more spread out than when using non-obfuscated data. However, the difference is not large. In fact, the sum of squared errors (SSE) for the data in panel A is 624.54, while the SSE for the data in panel B is 659.01. Thus, using obfuscated data increases the SSE by only 5.52%.
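The 5.52% figure follows directly from the two reported SSE values:

```python
sse_exact = 624.54        # SSE, estimates from non-obfuscated data (panel A)
sse_obfuscated = 659.01   # SSE, estimates from obfuscated data (panel B)
increase = (sse_obfuscated / sse_exact - 1.0) * 100.0   # percent increase
```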

F Model Selection
We compared four different models. Model 1 is a GLM with all the variation explained by the covariates of land cover type, IAP, PIA and NTL. Model 2 is a geostatistical model that extends Model 1 by adding a spatial random field. Model 3 extends Model 2 by adding the year of observation as a covariate. Model 4 extends Model 2 by incorporating an autoregressive component to the random field across time.
The performance of these models was compared using the conditional predictive ordinate (CPO) as the criterion. The CPO is defined as the probability of observing the value y_i given the rest of the values. This is expressed as

CPO_i = π(y_i | y_{−i}),

where y_{−i} = {y_j}_{j≠i}. Note that the CPO is computed using leave-one-out cross-validation. For this part of the analysis we used only 21,080 observations, which correspond to 70% of the survey points with displaced locations. Table F1 in S1 File shows a summary of the model comparison. The model with the largest sum of CPO is Model 3, and therefore this was the one used for estimating the probability of electricity access. Model 3 can be expressed as

y_i ∼ Binomial(n_i, p_i)   and   φ(p_i) = Σ_j τ_j LULC_ij + Σ_j β_j z_ij + f_{x_i},

where φ is the logit transformation, τ_j are the fixed effects of the LULC categories, z_ij are the remaining covariates (IAP, PIA, NTL and time), β_j are the regression parameters of such covariates, f_{x_i} is the spatial random field, and x_i, t_i are the space and time reference points. NTL and PIA have odds ratios larger than one, which reflects a positive association with the probability of electricity access. The difference in the magnitude of the ratios is due to the scale of the variables: the NTL dynamic range goes from 0 to 63, while PIA values are constrained between 0 and 1. On the other hand, the impervious surface category and the variable IAP have odds ratios smaller than one, which describes a negative association. This is counterintuitive, as we expect urbanization processes to be positively associated with electricity access. However, the variable that dominates this relationship is PIA, whose association is positive.
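A brute-force illustration of leave-one-out CPO for a deliberately simple model: a single shared binomial proportion, refitted without each observation in turn. This is not the paper's spatial model (and INLA obtains CPO without explicit refitting); the data are hypothetical:

```python
import math

def binom_pmf(y, n, p):
    # Binomial probability mass function
    return math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)

def cpo(ys, ns):
    # Brute-force leave-one-out CPO for a single shared binomial proportion
    out = []
    for i in range(len(ys)):
        y_rest = sum(y for j, y in enumerate(ys) if j != i)
        n_rest = sum(n for j, n in enumerate(ns) if j != i)
        p_hat = y_rest / n_rest            # MLE refitted without observation i
        out.append(binom_pmf(ys[i], ns[i], p_hat))
    return out

ys = [12, 30, 7, 25]                       # hypothetical positives
ns = [20, 40, 15, 30]                      # hypothetical totals
cpos = cpo(ys, ns)
cpo_sum = sum(cpos)                        # models ranked by the sum of CPO
```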

G Model Validation
Once the model was trained, we predicted the probability of a household having electricity at each of the locations in V1. We then classified the locations as with or without electricity. The datasets used contain information about the proportion of households with electricity within each cluster (see Table 1), as opposed to a binary classification. We simulated binary classes according to this proportion following the pseudo-code of Table G1 in S1 File.

Pseudo-code fragment from Table G1
4  Simulate c ∼ Bernoulli(y_i/n_i)
5  C.append(c)

Using the synthetic classes of V1 we computed the accuracy, precision, sensitivity and confusion matrices of our predictions. Results are shown in Tables 2 and 3.
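The class simulation and the metrics can be sketched as follows, with hypothetical cluster data and predictions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Cluster-level data: y_i households with electricity out of n_i (hypothetical)
y = np.array([18, 2, 40, 5, 30])
n = np.array([20, 25, 45, 50, 32])
p_pred = np.array([0.85, 0.10, 0.92, 0.15, 0.88])   # model predictions (illustrative)

# Simulate a binary class per cluster from the observed proportion
c_true = rng.binomial(1, y / n)
c_pred = (p_pred >= 0.5).astype(int)

# Confusion matrix counts
tp = int(((c_pred == 1) & (c_true == 1)).sum())
tn = int(((c_pred == 0) & (c_true == 0)).sum())
fp = int(((c_pred == 1) & (c_true == 0)).sum())
fn = int(((c_pred == 0) & (c_true == 1)).sum())

accuracy = (tp + tn) / len(c_true)
precision = tp / (tp + fp) if tp + fp else float("nan")
sensitivity = tp / (tp + fn) if tp + fn else float("nan")
```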