MODEL-BASED PREDICTION FOR SMALL DOMAINS USING COVARIATES: A COMPARISON OF FOUR METHODS

We consider methods for model-based small area estimation when the number of areas with sampled data is a small fraction of the total areas for which estimates are required. Abundant auxiliary information is available from the survey for all the sampled areas and, through an external source, for all areas. The goal is to use the auxiliary variables to predict the outcome of interest for all areas. We compare areal-level random forests and LASSO approaches to a frequentist forward variable selection approach and a Bayesian shrinkage method using a horseshoe prior. Further, to measure the uncertainty of estimates obtained from random forests and the LASSO, we propose a modification of the split conformal procedure that relaxes the assumption of exchangeable data. We show that the proposed method yields intervals with the correct coverage rate, and this is confirmed through a simulation study. This work is motivated by Ghanaian data available from the sixth Ghana Living Standards Survey (GLSS) and the 2010 Population and Housing Census, in the Greater Accra Metropolitan Area (GAMA) region, which comprises eight districts that are further divided into enumeration areas (EAs). We estimate the areal mean household log consumption using both datasets. The outcome variable is measured only in the GLSS, for 3 percent of all the EAs (136 out of 5,019), and 174 potential covariates are available in both datasets. In the application, among the four modeling methods considered, the Bayesian shrinkage method performed best in terms of bias, mean squared error (MSE), and prediction interval coverages and scores, as assessed through a cross-validation study. We find substantial between-area variation, with the estimated log consumption showing a 1.3-fold variation across the GAMA region. The western areas are the poorest, while the Accra Metropolitan Area district has the richest areas.
KEY WORDS: High-dimensional auxiliary information; Model selection; Prediction intervals; Random forests; Split conformal inference.

MOTIVATION
In 2015, the United Nations (UN) released their 2030 agenda for sustainable development goals (SDGs), consisting of 17 goals, the first of which is to end poverty worldwide (United Nations 2015). For this first SDG, the UN made seven guidelines explicit, including the implementation of "poverty eradication policies" at a disaggregated level. To that end, producing reliable and high-resolution spatial estimates of socioeconomic status and income inequality is fundamental to help decision makers prioritize and target certain areas for decentralized interventions (Elbers et al. 2002). These detailed maps empower local communities to understand their situation compared to their neighbors, which also helps when planning interventions (Bedi et al. 2007).
Statement of Significance

We propose a procedure to compute prediction intervals that is valid for a variety of modeling methods (including random forests and the LASSO) and that relaxes the assumption of exchangeable data. We compare the performance of random forests and the LASSO with forward variable selection and a Bayesian method to produce small area estimates when a large proportion of the areas are not sampled.

In Ghana, household surveys are collected every 5-7 years to measure the living conditions of households across Ghanaian regions and districts and to monitor poverty. To keep track of the Ghanaian population's wealth, the equivalized consumption is recorded for the sampled households. Although household income is not directly measured, the equivalized expenditure is an alternative that allows decision makers to assess a household's standard of living (Johnson et al. 2005). This measure corresponds to the household consumption scaled by a weight based on the number of members in the household. We aim to estimate the equivalized consumption at a disaggregated level, to help communities, civil society organizations, and policymakers better understand the distribution of households' living standards in Ghana, in order to prioritize certain areas when implementing poverty eradication policies. The sixth Ghana Living Standards Survey (GLSS), conducted in 2012-2013, was the last household survey carried out prior to the new UN SDGs agenda. The fifth GLSS had shown that inequalities had increased since 2006. In particular, although overall poverty decreased nation-wide, the wealthiest decile of the population consumed 6.8 times more than the poorest (Cooke et al. 2016). A downside of these household surveys is that the sampling design is stratified two-stage cluster sampling, which only allows for reliable survey sampling estimates at the district level at best. Ghana is divided into ten regions, which comprised 170 districts in 2010 or, at a finer level, around 38,000 enumeration areas (EAs). Producing reliable estimates at the EA level would further help the authorities in their policy decisions (Corral et al. 2022).
We analyze data from the sixth GLSS for the Greater Accra Metropolitan Area (GAMA), which consists of eight districts. The GLSS used a stratified two-stage cluster sample in which strata are defined by an urban or rural indicator. The first-stage sample of clusters (EAs) was selected with probability proportional to size. Within the sampled EAs, 15 households were systematically sampled. For each sampled household, we have a detailed assessment of consumption as well as information on education, employment, and assets, totaling 174 auxiliary variables. This yields a sample of 136 EAs out of the 5,019 in this Ghanaian region. Given that a small proportion of the areas are observed, a model-based prediction approach is sensible (Pfeffermann 2013; Tzavidis et al. 2018; Ghosh 2020; Erciulescu and Opsomer 2022; Hogg et al. 2023). Additionally, the sampled EAs are anonymized, which means it is unknown which 136 of the 5,019 EAs are represented in the survey. Finally, we have data available from the 2010 Ghanaian census for all EAs in the GAMA. Among others, the same 174 variables are measured in this census and in the sixth GLSS. The aim of this work is to produce estimates, with uncertainty, of the mean log household consumption at the EA level in the GAMA. Note that, as is common in the literature on poverty mapping, we focus on the equivalized consumption on the log scale to model symmetrical data (Elbers et al. 2003; Nguyen et al. 2017). If needed, the estimates could be transformed back to the original scale.
Given the large number of auxiliary variables relative to the number of sampled EAs, we assess the performance of random forests and the LASSO (which performs variable selection) in estimating the mean household log consumption at the EA level in the GAMA. For the sake of comparison, we also consider a forward variable selection approach in the frequentist framework and a Bayesian shrinkage method using a horseshoe prior. For all four approaches, we adopt EA-level models. Due to the nature of the motivating Ghanaian datasets, where only a small proportion of the areas are sampled and those areas are anonymized, synthetic small area estimators are of interest. Further, we propose a modification of the split conformal (SC) procedure to compute prediction intervals for the random forest and LASSO predictions while relaxing the assumption of exchangeable responses, which is necessary because of the complex sampling design.
This paper is organized as follows. Section 1.1 briefly reviews the literature on small area estimation (SAE) and variable selection in the frequentist, Bayesian, and machine learning frameworks. Section 2 describes the four methods that will be compared and the proposed procedure to produce prediction intervals for estimates obtained through random forests and the LASSO. Section 3 shows the results from two simulation studies: first, section 3.1 presents a comparison between the proposed modified SC and the original SC procedures; then, section 3.2 provides a comparison between the four methods that perform variable selection. Section 4 discusses the results from the four methods applied to the Ghanaian datasets. Finally, section 5 concludes the paper with a discussion.
1.1 Literature Review

SAE concerns the estimation of area-level summaries when data are sparse or non-existent in selected areas (Rao and Molina 2015). This area of research in survey sampling has greatly evolved in the last 50 years (Pfeffermann 2002, 2013; Rao and Molina 2015; Ghosh 2020). Tzavidis et al. (2018) point out that the use of SAE by national statistical institutes and other organizations to produce official statistics reflects this increasing popularity; notable examples include the povmap software developed by the World Bank (Elbers et al. 2003; World Bank 2015) and the Small Area Income and Poverty Estimates project carried out by the US Census Bureau (Census Bureau 2018).
In survey sampling, the design-based framework is distinguished from the model-based framework. Design-based methods, also called randomization methods, assume the variable of interest is fixed in the finite population, while the randomness comes from the sampling process. Direct (weighted) estimators have favorable design-based properties in large samples and rely only on the sampling weights and the recorded responses within each sampled area to produce areal estimates. Hence, estimates for non-sampled areas are missing. Additionally, data sparsity will yield imprecise direct estimates at the areal level. Similarly, data sparsity within areas may lead to imprecise model-assisted estimates; these approaches also fall under the umbrella of design-based inference. Model-assisted methods are design-based approaches that model the responses to gain precision but are still design consistent (Särndal et al. 2003). An alternative is to use model-based approaches, where the responses are no longer assumed fixed but are treated as random variables, modeled using auxiliary information and/or random effects. In this framework, it is common to use exterior sources of information to augment the auxiliary information from the sample to the entire finite population; for example, using information obtained from censuses. Tzavidis et al. (2018) describe a two-step approach to produce model-based small area estimates. First, a model is fitted using the survey responses and survey auxiliary variables. Then, the outcome is predicted for the entire finite population according to the estimated model parameters and finite population auxiliary information.
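The two-step approach above can be sketched in a few lines. The following Python illustration is our own (the paper's computations are done in R); the covariates, coefficients, noise level, and sample sizes are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: M areas in total, of which only m are sampled.
M, m, p = 200, 30, 3

# Areal covariate means, known for every area (e.g., from a census).
X_all = rng.normal(size=(M, p))
beta_true = np.array([1.0, -2.0, 0.5])
y_bar = X_all @ beta_true + rng.normal(scale=0.1, size=M)  # true areal means

sampled = rng.choice(M, size=m, replace=False)

# Step 1: fit a model with the survey responses and survey auxiliary variables.
beta_hat, *_ = np.linalg.lstsq(X_all[sampled], y_bar[sampled], rcond=None)

# Step 2: predict the outcome for all areas from census auxiliary information.
y_pred = X_all @ beta_hat
```

The same template applies with any fitted model in step 1; only the prediction rule in step 2 changes.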
Abundant auxiliary information may be measured in the sample, for the sampled areas, and through exterior sources, for all the areas of the region of interest. It may therefore be necessary to select a subset of covariates to model the response variable in the presence of high-dimensional auxiliary information. In this way, precision can be increased, as unnecessary auxiliary variables are not included. The inference procedure for model-based approaches can be performed under the frequentist or Bayesian frameworks, or with flexible models fitted via machine learning techniques.
Machine learning methods are becoming more popular in the survey sampling community; see, for example, Wang et al. (2014) and Breidt and Opsomer (2017). However, it is not straightforward to perform inference and assess the estimates' uncertainty under these approaches. For example, the bootstrap does not work for non-smooth targets such as LASSO estimates (Dezeure et al. 2015). Among machine learning methods, random forests (Breiman 2001) can be fitted to unit-level or area-level data for a flexible approach. Random forests are a collection of regression trees that recursively partition the responses into increasingly homogeneous subgroups (nodes), based on covariate splits. Random forests potentially have the benefit of accommodating non-linear relationships and complex interactions, and they naturally select variables through these covariate splits. Each individual regression tree is fitted on a bootstrap sample of the original dataset. There is a growing literature on methods to measure the uncertainty of random forest point estimates, for example using different jackknife approaches (Wager et al. 2014; Steinberger and Leeb 2016; Wager and Athey 2018) or V-statistics (Zhou et al. 2021). However, these subsampling procedures have drawbacks, including their computational overhead and their unclear application to survey data. Recently, Zhang et al. (2019) proposed so-called out-of-bag (OOB) prediction intervals, which are computed from quantiles of the random forest OOB prediction errors. These denote the difference between a data point's outcome and its point estimate, obtained from a random forest grown without said data point. In simulation studies, Zhang et al. (2019) show that their proposed method performs similarly to the SC approach proposed by Lei et al. (2018). The SC approach may be used to compute prediction intervals for any modeling method (e.g., linear models or random forests) and is a recent addition to the survey sampling literature (Bersson and Hoff 2024; Wieczorek 2023). To compute prediction intervals for random forest estimates with the SC method, the original dataset is first split into two subsamples. A random forest is trained on one subsample, and point estimates and their associated prediction errors are obtained for the other subsample. Then, the intervals are computed based on the empirical quantiles of the prediction errors from the second subsample. Note that while the OOB method proposed by Zhang et al. (2019) only estimates prediction intervals for random forests, the SC method can potentially be applied to any modeling procedure used to obtain point estimates. A common feature of all these prediction interval methods is that the data are assumed to be exchangeable (Angelopoulos and Bates 2023). This is a strong assumption and is not usually true for data gathered from a complex survey design.
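As a concrete illustration of the original SC procedure under exchangeability, the following Python sketch (our own; a simple least-squares predictor and simulated data stand in for any modeling method) builds a level-(1 − α) prediction interval:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, alpha = 400, 5, 0.1

X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Step 1: randomly split the data into a training half and a calibration half.
idx = rng.permutation(n)
train, calib = idx[: n // 2], idx[n // 2:]

# Step 2: fit any point predictor on the training half (here, least squares).
coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Step 3: absolute residuals on the held-out calibration half.
resid = np.abs(y[calib] - X[calib] @ coef)

# Step 4: conformal quantile, the ceil((n2 + 1)(1 - alpha))-th smallest residual.
n2 = len(calib)
k = min(int(np.ceil((n2 + 1) * (1 - alpha))), n2)
q = np.sort(resid)[k - 1]

# Step 5: the interval for a new point x0 is the prediction plus or minus q.
x0 = rng.normal(size=p)
pred = float(x0 @ coef)
lower, upper = pred - q, pred + q
```

Replacing the least-squares fit in step 2 with a random forest or a LASSO leaves steps 3-5 unchanged, which is what makes the method model-agnostic.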
Inference procedures for model-based approaches can also follow the frequentist or the Bayesian paradigms. In these frameworks, variable selection is an important yet contentious research topic. In the frequentist framework, two-step procedures are common. Variables are first iteratively selected (forward selection) or removed (backward elimination) to model the outcome, based on the optimization of some criterion (e.g., AIC, BIC, $R^2$). Then, a final model that includes only the selected covariates is fitted to the data. In SAE, it is common to select variables by comparing models through some criterion (e.g., AIC or BIC, or survey sampling adjusted versions); see, for example, Han (2013), Rao and Molina (2015), and Lahiri and Suntornchost (2015). In the frequentist framework, regularization methods have also been proposed in the literature, such as ridge regression and the LASSO (Tibshirani 1996, 2011; McConville et al. 2017). These methods apply constraints to the regression parameters. However, in the case of the LASSO, these constraints yield estimates of the model parameters whose uncertainty is difficult to estimate, especially in a survey setting. In a simulation study, Lei et al. (2018) show that their proposed SC method performs well in computing prediction intervals for predictions obtained through the LASSO, when the data are exchangeable.
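To make the LASSO constraint concrete, here is a minimal coordinate-descent implementation in Python (our own illustrative sketch; the paper itself uses glmnet in R and chooses λ by 10-fold cross-validation, whereas here λ is simply fixed and the data are simulated):

```python
import numpy as np

def soft_threshold(z, g):
    # S(z, g) = sign(z) * max(|z| - g, 0), the LASSO shrinkage operator.
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            partial = y - X @ beta + X[:, j] * beta[j]  # residual excluding j
            beta[j] = soft_threshold(X[:, j] @ partial, n * lam) / col_ss[j]
    return beta

# Simulated data with two active covariates (coefficients are hypothetical).
rng = np.random.default_rng(6)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[0], beta_true[1] = 2.0, -3.0
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = lasso_cd(X, y, lam=0.1)  # lam would normally come from cross-validation
```

The soft-thresholding step is what sets irrelevant coefficients exactly to zero, which is how the LASSO performs variable selection; it is also what makes the estimator non-smooth and its uncertainty hard to assess.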
In the Bayesian framework, variable selection is conducted by imposing informative priors on the model parameters. Multiple shrinkage priors have been proposed in the literature, for example, Bayesian ridge regression and the Bayesian LASSO (Hans 2010). In the former, a Gaussian prior is assigned to the regression parameters, while a double-exponential distribution is used in the latter. It can be shown that, under the respective priors, computing the maximum a posteriori estimates of the parameters results exactly in ridge-type and LASSO-type estimators (Reich and Ghosh 2019). A more recent popular approach is the use of the horseshoe prior (Carvalho et al. 2010), which places more prior weight near zero than a normal or double-exponential distribution (Datta and Ghosh 2013; Porwal and Raftery 2022).

METHODS
Let a region be divided into $M$ non-overlapping areas, $A_c$, $c = 1, \dots, M$. Denote the number of units in $A_c$ by $N_c$, with outcomes $y_{ck}$, $k = 1, \dots, N_c$. The main goal is to estimate the areal mean $\bar{y}_c = (1/N_c) \sum_{k=1}^{N_c} y_{ck}$ for all areas $c = 1, \dots, M$, using a sample of $n_c$ units taken from areas $c = 1, \dots, m$. Denote the set of area and household indices included in the sample by $s$, and denote the set of sampled units in the $c$-th area by $s_c$, $c = 1, \dots, M$. Let $f_c = n_c / N_c$ be the sampling fraction within each area. For any variable $a$, let $\bar{a}_c = (1/N_c) \sum_{k=1}^{N_c} a_{ck}$, $c = 1, \dots, M$, be the population areal mean, and $\bar{a}_c^{(s)}$ and $\bar{a}_c^{(ns)}$, $c = 1, \dots, M$, the areal means for the sampled (superscript $(s)$) and non-sampled (superscript $(ns)$) units, respectively. For all $M$ areas, the estimation target may be decomposed as
$$\bar{Y}_c = f_c\, \bar{Y}_c^{(s)} + (1 - f_c)\, \bar{Y}_c^{(ns)}. \qquad (1)$$
To estimate $\bar{y}_c$, the non-sampled mean $\bar{Y}_c^{(ns)}$ must be predicted. The uncertainty of $\hat{\bar{Y}}_c$ may be measured using prediction intervals $\mathrm{PI}_c(1-\alpha)$ of level $(1-\alpha)$%, such that
$$P\big(\bar{Y}_c \in \mathrm{PI}_c(1 - \alpha)\big) \geq 1 - \alpha. \qquad (2)$$
Note that for a non-sampled area $c'$, $f_{c'} = 0$ and the estimator reduces to the model prediction $\hat{\bar{Y}}_{c'}^{(ns)}$.
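The decomposition of the areal mean into its sampled and non-sampled parts can be checked with a small numerical example (the area size, sample size, and means below are hypothetical):

```python
# Decomposition (1): the areal mean splits into sampled and non-sampled parts.
N_c, n_c = 200, 15            # hypothetical area size and within-area sample size
f_c = n_c / N_c               # sampling fraction, here 0.075
ybar_s = 4.2                  # observed mean of the 15 sampled units
ybar_ns_hat = 4.0             # model-based prediction of the non-sampled mean
ybar_c_hat = f_c * ybar_s + (1.0 - f_c) * ybar_ns_hat

# For a non-sampled area, f_c = 0 and the estimator is just the model prediction.
```

Only the second term requires a model; the first is observed directly from the survey.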

Random forests and the LASSO are considered to estimate $\bar{Y}_c^{(ns)}$ in the model-based framework. For the sake of comparison, we also consider a forward variable selection approach in the frequentist paradigm and a Bayesian shrinkage method.

The four modeling approaches assume there are $p$-dimensional covariates available from the sample, $\{x_{ck};\ c, k \in s\}$, as well as areal covariate means, $\bar{x}_c$, $c = 1, \dots, M$, which are known for all the areas of the finite population. Such information may be obtained from a census. Inference is carried out at the areal level. In this model-based framework, at the unit level, the finite population response values $y$ are assumed to be an independent and identically distributed realization of superpopulation random variables $Y$, whose sampled and non-sampled moments, at the area level, satisfy
$$E\big[\bar{Y}_c^{(s)}\big] = E\big[\bar{Y}_c^{(ns)}\big], \qquad V\big[\bar{Y}_c^{(s)}\big] = \sigma^2 / n_c, \qquad V\big[\bar{Y}_c^{(ns)}\big] = \sigma^2 / (N_c - n_c). \qquad (3)$$
Therefore, inference is conducted using the areal-level sample means $\bar{y}_c^{(s)}$ and covariate means $\bar{x}_c^{(s)}$, $c = 1, \dots, m$.

Random Forest and LASSO Approaches
First, we consider a random forest prediction approach. This non-parametric method makes no further assumption on (3). Following Breiman (2001), random forest point estimates are the average over $B$ point estimates obtained from training $B$ independent regression trees on $B$ bootstrap versions of the original sample. Each regression tree partitions the bootstrap response values based on splitting rules applied to covariates. A random forest algorithm is described in appendix C, in the supplementary data online. Hence, with a random forest procedure, the estimator (1) becomes
$$\hat{\bar{Y}}_c = f_c\, \bar{y}_c^{(s)} + (1 - f_c) \sum_{c'=1}^{m} w_{c'}(\bar{x}_c)\, \bar{y}_{c'}^{(s)}, \qquad (4)$$
where the weights $w_{c'}(\cdot)$ result from the random forest procedure described in appendix C, in the supplementary data online. Second, we consider the LASSO to predict the areal non-sampled means, assuming a linear relationship between the covariates and the outcome, while performing variable selection: the coefficients are estimated by minimizing the penalized least-squares criterion $\sum_{c=1}^{m} \big(\bar{y}_c^{(s)} - \bar{x}_c^{(s)\top} \beta\big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$. Note that the shrinkage penalty parameter $\lambda$ is fixed after a 10-fold cross-validation, seeking the smallest test MSE. Therefore, the estimator (1) becomes
$$\hat{\bar{Y}}_c = f_c\, \bar{y}_c^{(s)} + (1 - f_c)\, \bar{x}_c^{(ns)\top} \hat{\beta}. \qquad (5)$$

To measure the uncertainty associated with the random forest and LASSO predictions (4) and (5), we propose a modification of the SC prediction intervals of Lei et al. (2018), relaxing the assumption of exchangeable sampled and non-sampled data points, since under the complex sampling design the sampled and non-sampled areal means are not exchangeable. Hence, in the proposed modified SC procedure, we assume the mean structures to be similar and allow the variances to be scaled differently, as is the case in (3). Specifically, in this context of a complex sampling design, we assume the variance is independent of the sample strata: the unit-level variance, $\sigma^2$, is assumed fixed across the strata, and the sampled and non-sampled areal-level variances vary only with the number of sampled and non-sampled units, $n_c$ and $N_c - n_c$, respectively. We propose scaling the residuals computed in the original SC procedure before computing the empirical quantile needed for the prediction intervals; that quantile is then rescaled when computing the prediction intervals. The proposed scaled SC procedure can be described through the following steps: (1) randomly split the sampled areas into two subsamples; (2) train a random forest or a LASSO approach on the first subsample; (3) compute the scaled absolute residuals on the second subsample; (4) compute the empirical quantile of these scaled residuals; (5) let the prediction interval be the point estimate plus or minus the appropriately rescaled quantile. Hence, for random forest or LASSO predictions (4) or (5), the uncertainty (2) is obtained from these scaled SC intervals. Appendix A in the supplementary data online provides a proof of the $(1-\alpha)$% marginal coverage of the proposed scaled SC prediction intervals.
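The five steps can be sketched as follows. This Python illustration is our own reading of the procedure under the stated variance structure: areal residuals are put on a common scale by multiplying by the square root of $n_c$, and the conformal quantile is rescaled by the square root of $N_c - n_c$ for a non-sampled mean; a least-squares working model stands in for the random forest or LASSO, and all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, alpha, sigma = 300, 4, 0.2, 1.0

# Hypothetical areal data: sampled means with variance sigma^2 / n_c.
n_c = rng.integers(10, 40, size=m)        # sampled units per area
N_c = rng.integers(100, 500, size=m)      # area sizes
xbar = rng.normal(size=(m, p))
mu = 10.0 + xbar @ np.array([1.0, -1.0, 2.0, 0.5])
ybar_s = mu + rng.normal(size=m) * sigma / np.sqrt(n_c)

# Steps 1-2: split the areas and train a working model on the first half.
idx = rng.permutation(m)
I1, I2 = idx[: m // 2], idx[m // 2:]
A = np.column_stack([np.ones(len(I1)), xbar[I1]])
coef, *_ = np.linalg.lstsq(A, ybar_s[I1], rcond=None)

def predict(x):
    return coef[0] + x @ coef[1:]

# Step 3: scaled absolute residuals -- sqrt(n_c) puts them on one common scale.
resid = np.abs(ybar_s[I2] - predict(xbar[I2])) * np.sqrt(n_c[I2])

# Step 4: empirical conformal quantile of the scaled residuals.
k = min(int(np.ceil((len(I2) + 1) * (1 - alpha))), len(I2))
q = np.sort(resid)[k - 1]

# Step 5: interval for a non-sampled mean, rescaled by the standard-deviation
# factor sqrt(N_c - n_c) implied by the variance structure in (3).
c0 = I2[0]
center = predict(xbar[c0])
half = q / np.sqrt(N_c[c0] - n_c[c0])
lower, upper = center - half, center + half
```

The scaling is what restores exchangeability: after dividing out the per-area standard deviation factors, the residuals are on a common scale even though the raw areal means are not.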

Forward Selection
For comparison, we consider a frequentist method with the commonly used forward approach, with AIC as the variable selection criterion. Model (3) is completed by assuming the errors are normally distributed. To predict $\bar{Y}_c^{(ns)}$, the forward variable selection approach (hereafter, the forward selection approach) is a two-step procedure. First, a subset of $K$ covariates $z$ is selected among the available $x$'s. To that end, linear models are iteratively fitted, adding one covariate at a time based on the resultant AIC value. Using these selected covariates, a linear model is fitted, $\bar{y}_c^{(s)} \sim N\big(\bar{z}_c^{(s)\top} \eta,\ \sigma^2 / n_c\big)$, $c = 1, \dots, m$, to estimate $\eta$, $V(\eta)$, and $\sigma$. The steps required to run this forward approach are detailed in appendix B in the supplementary data online. The estimator and uncertainty (1) and (2) become, respectively, $\hat{\bar{Y}}_c = f_c\, \bar{y}_c^{(s)} + (1 - f_c)\, \bar{z}_c^{(ns)\top} \hat{\eta}$ and the associated normal-theory prediction interval, where $q_\alpha$ denotes the $\alpha$-level quantile of the $N(0, 1)$ distribution. Note that this does not account for the uncertainty in the selected covariates.
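The first step of the forward selection approach can be sketched as follows (an illustrative Python implementation of our own, with a Gaussian AIC computed up to an additive constant; the paper's own steps are in appendix B):

```python
import numpy as np

def aic(rss, n, k):
    # Gaussian AIC up to an additive constant: n * log(RSS / n) + 2 * (k + 1),
    # counting k slopes plus an intercept.
    return n * np.log(rss / n) + 2 * (k + 1)

def forward_select(X, y):
    # Greedy forward selection: add the covariate that lowers AIC the most,
    # stopping when no addition improves the criterion.
    n, p = X.shape

    def rss(cols):
        A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.sum((y - A @ coef) ** 2))

    selected, remaining = [], list(range(p))
    best = aic(rss(selected), n, 0)
    improved = True
    while improved and remaining:
        improved = False
        score, j_best = min((aic(rss(selected + [j]), n, len(selected) + 1), j)
                            for j in remaining)
        if score < best:
            best, improved = score, True
            selected.append(j_best)
            remaining.remove(j_best)
    return sorted(selected)

# Hypothetical data: only covariates 0 and 2 affect the outcome.
rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.3, size=n)
selected = forward_select(X, y)
```

In the second step, a linear model restricted to `selected` would be refitted to obtain the point estimates and their normal-theory intervals; as noted above, this ignores the uncertainty in the selection itself.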

Bayesian Shrinkage Approach
Finally, a Bayesian approach is considered, where all the available covariates, $x$, are used in a single step to model the outcome, while applying the horseshoe prior (Carvalho et al. 2010) to the regression parameters. Similar to the forward approach, a normal distribution is further assumed for (3): $\bar{y}_c^{(s)} \sim N\big(\bar{x}_c^{(s)\top} \beta,\ \sigma^2 / n_c\big)$, $c = 1, \dots, m$, with priors $\beta_j \mid \lambda_j, \tau \overset{ind}{\sim} N\big(0, \lambda_j^2 \tau^2\big)$, $j = 1, \dots, p$, and $\sigma, \tau, \lambda_1, \dots, \lambda_p \sim HC(0, 1)$, where $HC(a, b)$ stands for a half-Cauchy distribution with location $a$ and scale $b$. In this prior, $\tau$ corresponds to the global shrinkage and $\lambda_j$ to the local shrinkage. Then, inference is conducted through the posterior distribution, approximated through a Markov chain Monte Carlo (MCMC) procedure. The estimator and its uncertainty, (1) and (2), are then obtained from the posterior distribution of $\bar{Y}_c$.
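The shrinkage behavior of the horseshoe prior can be illustrated by direct simulation from the prior itself (a Python sketch of our own; the dimension and number of draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
p, draws = 10, 5000

def half_cauchy(size):
    # |Cauchy(0, 1)| draws, i.e., HC(0, 1) with location 0 and scale 1.
    return np.abs(rng.standard_cauchy(size))

# beta_j | lambda_j, tau ~ N(0, lambda_j^2 * tau^2), tau and lambda_j ~ HC(0, 1).
tau = half_cauchy(draws)            # global shrinkage, one per draw
lam = half_cauchy((draws, p))       # local shrinkage, one per coefficient
beta = rng.normal(size=(draws, p)) * lam * tau[:, None]

# The prior piles mass near zero yet keeps very heavy tails.
frac_near_zero = float(np.mean(np.abs(beta) < 0.1))
```

Most draws are tiny while a few are very large, which is exactly the behavior that shrinks noise coefficients toward zero while leaving strong signals nearly untouched.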

SIMULATION STUDY
This section presents two simulation studies that assess the performance of the proposed scaled SC procedure and compare the four modeling methods. Section 3.1 focuses on the proposed scaled SC method, which computes prediction intervals while relaxing the assumption of exchangeable data points. In section 3.2, different generating models and sampling designs are studied to compare the four model selection methods described in section 2.
Inference is performed in R. The random forests of $B = 1{,}000$ trees are trained using the ranger package (Wright et al. 2022). For each simulation scenario, the random forest hyperparameters are fixed after a cross-validation study of different values. The code to conduct the proposed scaled SC procedure for random forest estimates is available in appendix D in the supplementary data online. The LASSO method is conducted through the glmnet package, using the cv.glmnet function to define the optimal shrinkage penalty parameter. Bayesian inference is performed with the NIMBLE package (de Valpine et al. 2017). Convergence of the MCMC chains is assessed through trace plots, effective sample sizes, and the $\hat{R}$ statistic (Gelman and Rubin 1992).

Simulation Study: Scaled SC Procedure
To assess the performance of the proposed scaled SC procedure, five model-based simulation scenarios are considered: $R = 500$ finite populations are created, and the different simulation scenarios correspond to the various sampling designs applied to those finite populations. Assume that each finite population consists of $M = 500$ areas of sizes $N_c$, $c = 1, \dots, M$, with $\min_c(N_c) = 50$ and $\max_c(N_c) = 500$. For $c = 1, \dots, M$ and $k = 1, \dots, N_c$, the response variable is generated from a model with six unit-level covariates, $x_1, \dots, x_6 \overset{i.i.d.}{\sim} N(0, 1)$, which are the same across the finite populations. In this model-based framework, the same areas and units are sampled across the different finite populations. From each finite population, a sample is drawn according to five sampling designs, which constitute the simulation scenarios: two stratified designs (scenarios 1 and 2), a one-stage cluster design (scenario 3), and two two-stage cluster designs (scenarios 4 and 5). The proportion of sampled areas is higher in the stratified sampling designs, as all areas are selected. Hence, the areal-level inference is conducted on more data points in the first two scenarios than in the remaining three, and we expect any modeling method to perform better in these two scenarios. The one-stage and two-stage designs all yield $m = 500$ areal-level responses. The difference between these last three scenarios is in the sampling fraction within areas. Out of the five simulation scenarios considered, the fourth is the closest to the Ghanaian data analyzed in section 4.
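A finite population and a two-stage sample of the kind described above can be generated as follows (a Python sketch of our own; the intercept, coefficients, and error scale are placeholders rather than the paper's exact generating model):

```python
import numpy as np

rng = np.random.default_rng(4)
M = 500
N_c = rng.integers(50, 501, size=M)   # area sizes between 50 and 500

# Unit-level generating model with six i.i.d. N(0, 1) covariates; the intercept,
# coefficients, and error scale below are hypothetical.
beta = np.array([1.0, -1.0, 2.0, -1.0, 2.0, 1.0])
population = []
for c in range(M):
    x = rng.normal(size=(N_c[c], 6))
    y = 10.0 + x @ beta + rng.normal(scale=0.5, size=N_c[c])
    population.append((x, y))

# Two-stage design: sample m areas, then a fixed number of units per area.
m, n_within = 100, 15
areas = rng.choice(M, size=m, replace=False)
units = {c: rng.choice(N_c[c], size=n_within, replace=False) for c in areas}
ybar_s = np.array([population[c][1][units[c]].mean() for c in areas])
```

A stratified design would instead set `areas = range(M)`, and a one-stage design would take every unit within each sampled area.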
For each simulation scenario and in each sample, the estimates described in (1) are computed using four methods: a linear model that includes the correct six covariates; a linear model that omits $x_4$, $x_5$, and $x_6$; a random forest method that considers all six covariates to grow the trees; and a linear LASSO model. After conducting a cross-validation study, the random forest hyperparameters were set to $mtry = 2$ and $nodesize = 5$. For each scenario, 50 percent, 80 percent, and 95 percent prediction intervals (see (2)) are computed following the SC and the proposed scaled SC procedures in each sample, for each modeling method. These non-parametric methods may be applied to any modeling approach. The objective of this simulation study is to assess whether the proposed scaled SC method yields valid coverage rates for the prediction intervals.
All results are shown in figure 1. The observed coverages for each scenario, method, and interval level are shown in the left panel and the interval widths in the right panel. The simulation scenarios are identified by 1-5, as described above. Further, we differentiate the results for the sampled and non-sampled areas (Yes/No, respectively). Scenarios 1 and 2 present results for sampled areas only, as all areas are sampled in a stratified sample. Scenario 3 shows the results for non-sampled areas because the target is estimated exactly in the sampled areas, since all units are selected in a one-stage sample. Therefore, in the third scenario, we measure the predictions' uncertainty only in the non-sampled areas.
In the first scenario and for the sampled areas in the fourth scenario, $n_c$ and $N_c - n_c$ are equal. Therefore, the data points are exchangeable, and the SC and proposed scaled SC approaches coincide. In these scenarios, both methods yield the right coverages of the prediction intervals, regardless of the modeling method. In terms of interval widths, the linear model with an incorrect set of covariates leads to the widest intervals under both SC procedures. The linear model with the correct covariates and the LASSO method yield the narrowest intervals regardless of the SC method.
For all other sampling schemes, $n_c \neq N_c - n_c$. Therefore, the sampled and non-sampled means, $\bar{Y}_c^{(s)}$ and $\bar{Y}_c^{(ns)}$, are not exchangeable and have differently scaled variances, $\sigma^2 / n_c$ and $\sigma^2 / (N_c - n_c)$, respectively. In these cases, the original SC intervals obtained for all four modeling methods do not attain the right coverages: the original SC leads to under-coverage of the prediction intervals.
On the other hand, the proposed scaled SC procedure produces prediction intervals with the right coverages, regardless of the interval level and modeling method. In particular, when fitting a linear model with the LASSO constraint or with the right mean structure, the scaled SC intervals have exactly the right coverages. When modeling with a non-parametric random forest approach or with a linear model assuming the wrong mean structure, the scaled SC prediction intervals show slight errors in the coverage rate.
In terms of interval width, the proposed scaled SC intervals tend to be a little wider than their original SC counterparts for all interval levels. Regardless of the simulation scenario, the SC intervals and proposed scaled SC intervals obtained for the random forest estimates tend to be narrower than the ones obtained for the linear estimates using the incorrect set of covariates.
Finally, appendix E in the supplementary data online shows the results from the equivalent simulation study in the design-based framework. In this case, the finite population is assumed fixed and different samples are taken. Similar results are obtained: when $n_c \neq N_c - n_c$, the proposed scaled SC procedure yields the correct coverages, while the original SC approach produces under-covering prediction intervals.

Simulation Study: Prediction Methods Comparison
To compare the performance of the random forest and the LASSO methods with the frequentist forward variable selection and the Bayesian shrinkage approaches, as described in section 2, a model-based simulation study is conducted considering three generating models for the outcome and five sampling designs. Similar to section 3.1 and appendix E in the supplementary data online, a design-based counterpart to this simulation study is shown in appendix F in the supplementary data online, producing similar results. We replicate $R = 100$ times the creation of three finite populations of $M = 1{,}000$ areas of sizes $N_c$, with $\min_c(N_c) = 50$ and $\max_c(N_c) = 500$, as follows: (A) $y_{ck} \sim N\big(20 + x_{ck}^\top \beta,\ 0.5^2\big)$, where the covariates are such that $x_{ck} \sim N_{100}(0, I)$ and with coefficients $\beta^\top = (1, -1, 2, -1, 2, 1, 2, 1, -1, 1, 0, \dots, 0)$; (B) as in (A) but with correlated covariates and $\beta^\top = (1, -1, 2, -1, 2, 1, 2, 1, -1, 1, 0, \dots, 0)/10$; (C) a model with a non-linear relationship between the outcome and the covariates. The covariates are the same across the replicated populations and the randomness comes from the $y$'s. Populations A and B assume a linear relationship between the outcome and the first ten covariates. In scenario B, however, the strength of the association is weak, and the covariates are correlated. Population C is inspired by Scornet (2017) and assumes a non-linear relationship between the outcome and covariates. Throughout this simulation study, areas are indiscriminately termed "areas" or "EAs." From each finite population, a sample is drawn following two sampling schemes: (1) (Stratified) select all $m = M$ areas and, within each area, sample $n_c = 15$, $c = 1, \dots, m$, units; (2) (Two-stage) sample $m = M/2$ areas and, within each area, sample $n_c = 15$, $c = 1, \dots, m$, units.
These simulation scenarios were motivated by the Ghanaian data analyzed in section 4, where only 15 households were sampled within the selected areas. Additional sampling designs are considered in appendix G in the supplementary data online, where a higher number of units is sampled within the selected areas, and in appendix H in the supplementary data online, where a simulation study aims to emulate the Ghanaian data analyzed in section 4. For each scenario, the estimates and their prediction intervals are computed as described in section 2. Further, for each scenario, the estimates and their uncertainty are also computed assuming anonymized EAs. In this context, the modeling methods are trained on the sample and predictions are obtained ignoring which areas have been sampled, that is, assuming $f_c = 0$, $c = 1, \dots, M$, at the prediction stage. We conducted the latter study because the available Ghanaian data analyzed in section 4 do not identify the sampled EAs. The random forest hyperparameters are set following a cross-validation study conducted for each simulation scenario. A random forest with hyperparameters set to $(mtry, nodesize) = (10, 5)$ is found to perform the best for both sampling schemes in populations A and B. In population C, we set $(mtry, nodesize) = (70, 200)$ and $(mtry, nodesize) = (70, 9)$ for the stratified samples and the two-stage samples, respectively. In all scenarios, the Bayesian approach runs through an MCMC procedure with two chains of 5,000 iterations, including a burn-in period of 2,500 iterations, with practical convergence validated for these values by assessing trace plots, effective sample sizes, and $\hat{R}$ statistics (Gelman and Rubin 1992). For a particular finite population and sampling scheme, running the random forest over 100 replicates takes 55 minutes, the LASSO takes 4 minutes, the forward selection method takes 7 minutes, and the Bayesian approach, 2.5 hours.
The methods' performances are compared via four measures: the mean absolute bias, the MSE, the coverage rates of the prediction intervals, and the interval proper scores (Gneiting and Raftery 2007). Smaller values of the interval proper scores are preferred, indicating narrow intervals with coverage close to the nominal rate. Additionally, we extract which covariates have been selected by each method. Note that in the Bayesian framework, a covariate is said to be selected when its coefficient's posterior 95 percent credible interval does not include zero. For the random forest approach, a variable is selected when the p-value related to its variable importance (Altmann et al. 2010) is smaller than 0.05. The variable importance is computed based on results from random forests fitted with permutations of the set of covariates.
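The interval proper score for a central (1 − α) prediction interval is the interval width plus penalties, scaled by 2/α, for the distance by which the interval misses the outcome. A minimal sketch of this score, our own implementation following the definition in Gneiting and Raftery (2007):

```python
import numpy as np

def interval_score(lower, upper, y, alpha):
    """Proper score for a central (1 - alpha) prediction interval
    (Gneiting and Raftery 2007): interval width plus 2/alpha times
    the distance by which the interval misses the outcome y."""
    lower, upper, y = map(np.asarray, (lower, upper, y))
    width = upper - lower
    below = (2.0 / alpha) * np.clip(lower - y, 0, None)  # y falls below the interval
    above = (2.0 / alpha) * np.clip(y - upper, 0, None)  # y falls above the interval
    return np.mean(width + below + above)
```

Averaging this score over areas and replicates gives the summary reported in the figures: the score rewards narrow intervals that still cover the outcomes.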
Figure 2 shows the covariates selected by each method for each model and sampling design. When the association between the covariates and the outcome is linear (A and B), regardless of the sampling design, the forward selection approach tends to adequately select the true auxiliary information. However, it also tends to select irrelevant variables: each unimportant covariate is selected about 20 percent of the time by the forward selection method. In scenario A, the LASSO and Bayesian approaches also select the right covariates 100 percent of the time, while rarely including redundant covariates. When the covariates are correlated, the LASSO and Bayesian methods tend to miss the right covariates 20-70 percent of the time, depending on the sampling design. In scenarios A and B, the random forest method misses the right covariates 10-50 percent of the time, while it always captures the correct set when the association is non-linear. The LASSO selects one out of two correct covariates about 80 percent of the time in this third population. Both the forward selection and Bayesian approaches miss the correct set of covariates in scenario C almost 100 percent of the time and include irrelevant variables.
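The permutation-based importance p-values used for the random forest selection can be sketched as follows. This follows a common variant of the Altmann et al. (2010) procedure that refits the forest on permuted outcomes and compares each observed importance to the resulting null distribution, using scikit-learn as a stand-in; the paper's exact permutation scheme and forest implementation may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def importance_pvalues(X, y, n_perm=50, seed=0):
    """Permutation p-values in the spirit of Altmann et al. (2010):
    refit the forest on permuted outcomes to build a null distribution
    of importances, then report the share of null importances at least
    as large as the observed one (with the usual +1 correction)."""
    rng = np.random.default_rng(seed)
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    observed = rf.feature_importances_
    null = np.empty((n_perm, X.shape[1]))
    for b in range(n_perm):
        rf_b = RandomForestRegressor(n_estimators=100, random_state=b).fit(
            X, rng.permutation(y))
        null[b] = rf_b.feature_importances_
    return (1 + (null >= observed).sum(axis=0)) / (1 + n_perm)
```

A covariate would then be flagged as selected when its p-value falls below 0.05, mirroring the rule described above.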
Figure 3 shows the absolute biases multiplied by 100, MSEs, prediction intervals' coverages, and proper scores for all methods, generating models (A-C), and sampling schemes. The results for the sampled and non-sampled areas are differentiated by red and black symbols, respectively. Only results for the sampled EAs are produced for the stratified sampling design, as all areas are sampled. The results assuming anonymized EAs are distinguished from those in which we know which areas have been sampled by circle and cross symbols, respectively. The results shown in figure 3 are also provided in tables I.1-I.8 of appendix I in the supplementary data online.
For all performance measures, the four modeling approaches yield similar results whether or not it is known which areas have been sampled. For example, in population C with a two-stage sampling design and regardless of the modeling method, the MSE results over the anonymized sampled EAs are not worse than the results over the non-sampled EAs. This result is reassuring for the analysis of the Ghanaian data in section 4, where the sampled EAs are anonymized.
In terms of bias, all methods are virtually unbiased, with mean absolute biases between zero and 0.08, regardless of the population and sampling design. In the linear scenarios, the random forest tends to yield slightly higher mean absolute biases compared with the forward, LASSO, and Bayesian methods, which is sensible as this non-parametric approach cannot exploit the correct (linear) mean structure.
In terms of MSE, there is no apparent difference between the LASSO, forward selection, and Bayesian methods, for all simulation models and sampling schemes. These three methods, which fit linear models, yield slightly smaller MSEs than the random forest approach when the association between the covariates and the outcome is strongly linear (A). For scenario C, however, the random forest produces smaller MSEs than the three linear modeling approaches: its MSEs are about one-third of those of the other three methods in population C, regardless of the sampling scheme. This result is explained by the fact that the random forest approach is a non-parametric method that adapts better to the non-linear setting.
The prediction intervals computed for all four methods in each sampling scheme for populations A and B yield the right coverages, with a slight undercoverage for the random forest approach. This may be due to the incorrect mean structure that is fitted to the data. These intervals are wider for the random forest method in model A, for both sampling designs, as deduced from the proper interval scores. When the relationship between the outcome and the covariates is non-linear (C), we observe that all four modeling methods yield under-coverage in both sampling designs. The random forest method, which accommodates a non-linear relationship, leads to prediction intervals with slightly higher coverage rates than the other three methods, in particular across the sampled areas, but still misses the targeted rates by about 30 percent. Note, however, that the random forest approach produces prediction intervals with smaller proper scores than the other three modeling methods.

AREAL LOG CONSUMPTION PREDICTION IN THE GAMA
In this section, the four modeling methods described in section 2 are applied to the data for the GAMA in Ghana. Using the sixth GLSS and the 2010 Ghanaian census, a complete map of the mean equivalized consumption (on the log scale) is produced across the M = 5019 EAs for each method. Note that in the household survey, only m = 136 EAs have been sampled. To provide estimates in the missing areas, the response values are modeled using the p = 174 auxiliary variables, which are measured in both available datasets and have been scaled for computational efficiency.
For the random forest approach, due to the small sample size m and following Hastie et al. (2009), a cross-validation study on the survey data was run to set the hyperparameters to B = 1,000 trees grown with mtry = 25 and nodesize = 3. The Bayesian approach required two MCMC chains of 100,000 iterations, including a burn-in period of 50,000 iterations and a thinning factor of 50. Convergence was attained, as assessed by the trace plots, effective sample sizes, and R̂ statistics smaller than 1.1. Wakefield et al. (2020) point out the importance of including the design variables in model-based SAE methods. To that end, the urban indicator, which corresponds to the sampling strata, is added to all four modeling methods. In the forward selection approach, this inclusion means that the urban indicator is added to the vector of selected covariates, even if it was not selected in the first step. In the Bayesian shrinkage and LASSO methods, it means there is no shrinkage applied to the regression coefficient that corresponds to the urban indicator. Finally, in the random forest approach, it means that the urban indicator is part of the variables considered for each covariate split. Figure 4 presents the covariates that were selected by each method. Despite all methods including the urban indicator, only the random forest finds it relevant, with a p-value for its variable importance smaller than 0.05. Additionally, figure 4 shows that the horseshoe prior leads to only one variable whose coefficient's posterior 95 percent credible interval does not include zero. The LASSO approach selects about 6 percent of the available covariates (11 variables), while the forward selection and random forest methods each select more than 12 percent of the variables (21 and 22, respectively). The variable indicating whether a household's floor is made of cement or concrete is selected by all four methods.
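The cross-validation used to set (mtry, nodesize) can be sketched with scikit-learn, where `max_features` and `min_samples_leaf` play the roles of mtry and nodesize. The data, grid, and settings below are placeholders of our own, not the survey data or the grid actually searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.standard_normal((136, 20))                      # placeholder for the survey covariates
y = X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(136)  # placeholder outcome

# mtry maps to max_features and nodesize to min_samples_leaf in sklearn;
# the grid values are illustrative only.
grid = {"max_features": [5, 10, 20], "min_samples_leaf": [1, 3, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    grid, cv=5, scoring="neg_mean_squared_error").fit(X, y)
best = search.best_params_
```

The selected pair `best` would then be used to grow the final B-tree forest on the full sample.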
Figure 5 shows the mean log consumption areal estimates and their 95 percent prediction intervals' widths for each of the four methods. Among the four methods, the random forest approach yields the most homogeneous point estimates across the EAs, with a prediction range between eight and nine, compared with a range of seven to ten for the other three methods. This can further be seen in figure 6, which compares the predictions obtained using each method for each EA. The prediction interval widths are shown across the EAs in figure 5 and compared between the modeling methods in pairwise scatter plots in figure 7. The prediction intervals computed for the linear approach with forward variable selection are the narrowest. As expected, the widths of the intervals obtained through the proposed scaled SC approach for the random forest and LASSO predictions behave similarly: the widths share a common form in which the quantile d_α is the only quantity that differs between the LASSO and random forest approaches. Note that in this analysis, the scaled SC procedure divides the dataset into two halves, consequently computing the necessary residuals and quantile based on only m/2 = 68 data points.
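For context, the original SC procedure of Lei et al. (2018), on which the proposed scaled variant builds, can be sketched as follows: fit on one half of the data, take the empirical (1 − α) quantile d_α of the absolute residuals on the other half, and widen the point predictions by d_α. The LASSO penalty, seed, and data split below are illustrative, and the paper's rescaling of the residuals is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def split_conformal(X, y, X_new, alpha=0.05, seed=0):
    """Original split conformal (Lei et al. 2018): fit on S1, compute
    absolute residuals on S2, and return y_hat +/- d_alpha, where
    d_alpha is the empirical (1 - alpha) quantile of those residuals."""
    rng = np.random.default_rng(seed)
    m = len(y)
    perm = rng.permutation(m)
    S1, S2 = perm[: m // 2], perm[m // 2 :]       # equal halves, as in the paper
    model = Lasso(alpha=0.1).fit(X[S1], y[S1])    # illustrative penalty
    resid = np.abs(y[S2] - model.predict(X[S2]))
    k = int(np.ceil((len(S2) + 1) * (1 - alpha))) # conformal quantile index
    d_alpha = np.sort(resid)[min(k, len(S2)) - 1]
    pred = model.predict(X_new)
    return pred - d_alpha, pred + d_alpha
```

With m = 136, each half holds 68 points, which is why the residuals and the quantile d_α in the application rest on only 68 observations.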
Finally, to determine which method performs the best in this particular data application, an eight-fold cross-validation study is conducted. The four methods are compared in terms of mean absolute bias, MSE, coverages, and proper scores of the 50 percent, 80 percent, and 95 percent prediction intervals in table 1. These performance measures are computed as described in section 3.2, with the number of replicates R corresponding to the eight folds and the total number of areas M becoming the number of sampled EAs. The Bayesian shrinkage approach performs the best among the four methods we consider, yielding the smallest bias, MSE, and interval scores, and attaining the right coverage rates of the prediction intervals. On the other hand, the prediction intervals obtained for the forward selection approach lead to substantial undercoverage. A cross-validation study was also conducted in which the four methods were fitted without forcing the inclusion of the urban indicator. The results are not shown in this paper, as the performance of the four methods in terms of bias, MSE, coverage, and proper score of the prediction intervals was similar to that shown in table 1. The Bayesian shrinkage method considered in this paper consists of applying the horseshoe prior to the regression coefficients. Other priors could have been considered, such as a Bayesian ridge prior. A cross-validation study was conducted with the Bayesian ridge approach for the GAMA sample. Because the results were similar to the ones shown in table 1, obtained with the horseshoe prior, in terms of bias, MSE, coverage, and proper score of the prediction intervals, they are not presented in this paper.
On the original scale, on average, we find that the Bayesian shrinkage consumption estimates among the richest 10 percent are 2.3 times larger than those among the poorest 10 percent. We also find that the urban EAs, which make up 92 percent of all EAs, are not uniformly distributed across the estimated consumption deciles: only 79 percent of the EAs in the poorest decile are urban, versus 91 percent in the richest decile. Following Dong and Wakefield (2021), we rank the EAs from poorest to richest, based on the Bayesian shrinkage point estimates, to identify the EAs where interventions should be prioritized. In particular, in this Bayesian framework, we obtain each EA's posterior ranking distribution by ranking the point estimates at each MCMC iteration. Figure 8 shows the posterior ranking distributions for five of the 10 percent poorest EAs and five of the 10 percent richest EAs. Additionally, the right-hand side of figure 8 maps the 10 percent poorest and richest EAs. We find that the Greater Accra South district, which corresponds to the western EAs in figure 8, contains most of the poorest EAs, while the Accra Metropolitan Area district, which corresponds to the southern EAs in figure 8, is the richest. Figure 8 further shows that the 500 poorest EAs' ranking distributions overlap, which seems to indicate that there is a need to intervene across all of the poorest 500 EAs.

Estimates of the mean log equivalized consumption are computed for all EAs in the GAMA. Additionally, prediction intervals are computed for all EA estimates to measure their uncertainties. The LASSO and forward variable selection methods select more than 10 percent of the auxiliary variables, while the Bayesian horseshoe model yields posterior credible intervals that do not include zero for only one coefficient. The random forest procedure estimates a smoother map of the mean log consumption than the other three approaches.
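Obtaining each EA's posterior ranking distribution from the MCMC output amounts to ranking the areas at every iteration. A minimal sketch of our own, assuming the draws are arranged as an iterations-by-areas array:

```python
import numpy as np

def posterior_ranks(draws):
    """Rank areas from poorest (rank 1) to richest at each MCMC
    iteration; column c of the result is area c's posterior ranking
    distribution.  `draws` has shape (iterations, areas)."""
    # double argsort converts each row of values into 0-based ranks
    ranks0 = np.argsort(np.argsort(draws, axis=1), axis=1)
    return ranks0 + 1
```

Histograms of each column then give the posterior ranking distributions shown in figure 8, and overlapping histograms across areas signal that the ranking among those areas is uncertain.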
A cross-validation study conducted on the sample data shows that the Bayesian shrinkage method performs the best, among the four methods considered, on this particular dataset. Finally, before fitting random forests to the different datasets, cross-validation studies were run to help set the hyperparameters. These hyperparameters are the number of regression trees included in the forest, the number of variables considered at each step when growing the trees, and the final node sizes. This step could be improved, as other hyperparameters could have led to better-performing random forests. For further discussion on the selection of random forest hyperparameters, see McConville and Toth (2019) and Dagdoug et al. (2023). On the other hand, the proposed scaled SC procedure used to compute prediction intervals for the random forest and LASSO estimates relies on an equal split of the data points to grow a forest and compute prediction errors. In the data application of this paper, this implies that the prediction interval limits are based on 68 data points. This partition, suggested by Lei et al. (2018) for the original SC algorithm, could be revisited to attempt to narrow down the resulting intervals.
The sample {1, …, m} is split into two equal-sized datasets; denote the resulting two sets of area indices by S_1 and S_2. The observed sampled means are modeled through ȳ_c^(s) | β, σ ~ ind. N(x̄_c^(s)T β, σ^2/n_c). The ℓth element of the MCMC posterior predictive sample is computed from β^(ℓ) and σ^(ℓ), the ℓth elements in the MCMC samples. The α-level empirical quantiles from the posterior predictive sample are denoted Ȳ_c,lower^(ns)(α) and Ȳ_c,upper^(ns)(α).

Figure 1. Coverages and Widths of the Prediction Intervals (PI) Obtained From the Proposed Scaled and Original Split Conformal (SC) Procedures for the Four Modeling Methods and Across the Five Simulation Scenarios (1-5). Yes: coverages and widths across the sampled areas; No: coverages and widths across the non-sampled areas.

Figure 2. Covariate Selection Frequency for Each Method Across the Six Simulation Scenarios. Left of the vertical dashed line: true covariates used in the generating models.

Figure 3. Mean Absolute Bias, MSE, Coverages, and Proper Scores of the Prediction Intervals, Obtained for Each Method Across the Six Simulation Scenarios. RF: Random forest approach.

Figure 4. Selected Covariates for Each Method When Modeling the Log Equivalized Consumption in GAMA.

Figure 5. Estimated Mean Log Equivalized Consumption in the GAMA EAs (Left) and Widths of the Corresponding 95 Percent Prediction Intervals (Right) Obtained From Each Modeling Method. RF: random forest.
Figure 6. Pairwise Comparison of the Areal Estimates Obtained From Each of the Four Methods: Forward Selection, LASSO, Bayesian Shrinkage, and Random Forest.

Figure 8. Left: Histograms of the Posterior Ranking Distributions of Five of the 10 Percent Poorest EAs (Left Column, Red) and Five of the 10 Percent Richest EAs (Right Column, Green), as Estimated From the MCMC Samples Obtained for the Bayesian Shrinkage Approach. Right: Map of the Greater Accra Metropolitan Area Highlighting the 500 Poorest EAs (Red) and the 500 Richest EAs (Green). There are a total of 5019 EAs in the study region.
The 136 sampled EAs are divided into eight rural EAs and 128 urban EAs. Hence, in this eight-fold cross-validation study, 17 EAs are removed from the sample at a time (one rural and 16 urban EAs), the four methods are fitted on the remaining 119 EAs, and predictions are obtained for the 17 removed ones.