Social cartography and satellite-derived building coverage for post-census population estimates in difficult-to-access regions of Colombia

Effective government services rely on accurate population numbers to allocate resources. In Colombia and globally, census enumeration is challenging in remote regions and where armed conflict is occurring. During census preparations, the Colombian National Administrative Department of Statistics conducted social cartography workshops, where community representatives estimated numbers of dwellings and people throughout their regions. We repurposed this information, combining it with remotely sensed buildings data and other geospatial data. To estimate building counts and population sizes, we developed hierarchical Bayesian models, trained using nearby full-coverage census enumerations and assessed using 10-fold cross-validation. We compared models to assess the relative contributions of community knowledge, remotely sensed buildings, and their combination to model fit. The Community model was unbiased but imprecise; the Satellite model was more precise but biased; and the Combination model was best for overall accuracy. Results reaffirmed the power of remotely sensed buildings data for population estimation and highlighted the value of incorporating local knowledge.


Introduction
Census omissions due to geographic inaccessibility disproportionately affect Indigenous populations and cultural minorities as well as vulnerable populations living with insecurity due to armed conflict (Fein 1990;Car-Hill 2013;Dias and Verona 2018).Incomplete enumeration of these populations creates challenges for planning essential services, such as healthcare, education, and housing.Because achieving a full-coverage national population census is challenging even for the most developed countries, there is an increasing trend towards less costly methods that rely on administrative records and household surveys to supplement census field enumerations (Ericksen and Kadane 1986;Myrskyla 1999;Jardim 2001;Valente 2010).Even post-enumeration surveys (Hogan and Wolter 1988;Breiman 1994;UN 2010), which are used to estimate census omissions, are themselves dependent on access to regions where census coverage is being assessed.When entire regions are difficult to access and administrative records are incomplete, these approaches may be less effective for assessing coverage errors (which are needed to provide accurate and complete demographic estimates).
In situations where census counts are outdated or incomplete, modelled population estimates can provide a relatively low-cost alternative for obtaining up-to-date population estimates (UNFPA 2020).This provides stop-gap support in planning essential services for undercounted populations, with the ultimate aim of informing future census planning to achieve full-coverage enumeration.There is a degree of uncertainty associated with all modelled population estimates-dependent on population characteristics, input data, and modelling strategy-and recent work has focused on implementing statistical methods that accurately account for this estimation uncertainty (Leasure et al. 2020).Building on previous methods for modelled population estimates (Mossoux et al. 2018;Wardrop et al. 2018;Weber et al. 2018;Engstrom et al. 2020), these hierarchical Bayesian methods have been extended to use various types of input data, ranging from routine household surveys to satellite-derived building footprints and other geospatial data (Dooley et al. 2021;Boo et al. 2022).The increasing availability of remotely sensed maps of human settlements and building footprints is providing a valuable source of information for estimating populations with fine-grained spatial resolution, particularly in regions that are difficult to access (Palacios-Lopez et al. 2021;Kashyap et al. 2022).
In Colombia there are areas, mainly in the Amazonía, Orinoquía, and Pacífica regions, which are characterized by their poor accessibility, low population density, large territorial extent, and dense forest.In addition, some have security problems, mostly because of armed conflict.The sum of these conditions results in greater challenges for both the planning and operation of routine household surveys and the decennial Population and Housing Unit Census of Colombia.In addition, administrative records in these areas are often incomplete, and the administrative boundaries between municipalities are not well defined.In response to these obstacles, the National Administrative Department of Statistics in Colombia (DANE) implemented a data collection method called 'routes' (rutas in Spanish) for the 2018 Population and Housing Unit Census.
The routes method consisted of working groups travelling through the territory, along rivers, bridleways, or logging roads that encompassed an area of influence containing each of the existing communities and settlements.The routes were developed using information from a series of social cartography workshops (see Paulston and Liebman 1994 for a discussion of the social cartography concept) and other sources, such as the third National Agricultural Census, territorial planning documents, and administrative development plans (DANE 2014(DANE , 2021a(DANE , 2021b(DANE , 2022a)).The social cartography workshops collected information directly from community representatives about the locations and basic characteristics of difficult-to-access population settlements: for instance, the approximate numbers of housing units and people (DANE 2014).
We combined community knowledge obtained from social cartography workshops with building maps derived from satellite imagery, as well as other geospatial covariates, to estimate total population sizes for locations in Colombia that were not fully accessible to census enumerators; this affected mainly enumerations of minority ethnic groups in remote locations.Our primary objectives were to: (1) Estimate total population sizes and numbers of buildings for each census enumeration area not fully covered during the census; (2) Provide robust estimates of uncertainty with our estimates; and (3) Assess the relative contributions of local knowledge and remote-sensing observations to the accuracy of modelled population estimates.
To achieve these objectives, we developed a bespoke hierarchical Bayesian statistical model that was trained using full-coverage census enumerations from nearby areas.

This work was approved by the Ethics and Research
Governance Online committee at the University of Southampton (ERGO 61486 and 72234).All data were aggregated and fully anonymized so that individuals could not be identified at any stage of analysis.The data and Bayesian model code for all analyses are provided at doi.org/10.17605/OSF.IO/ DW4VR (Sanchez-Cespedes et al. 2022).

Data
Population and housing census.We used counts of people and dwellings from the 2018 Population and Housing Unit Census of Colombia, primarily from the Amazonía, Orinoquía, and Pacífica regions.In the census, municipalities from the study regions were divided into operational coordination areas called routes, and each of these was divided into operational units, which we refer to here as census enumeration areas.The enumeration areas were the spatial unit of analysis for our statistical models.In total, there were 394 routes, consisting of 1,302 enumeration areas spanning 145 municipalities (out of 1,121 nationally) and 23 administrative departments (out of 33 nationally).On average (standard deviation in parentheses) there were 3.3 (±1.9) enumeration areas per route, 8.9 (±11.3) per municipality, and 56.6 (±68.4) per department.
During the census fieldwork, the number of enumerated properties was verified and controlled by a geographic monitoring system that assigned a colour to each enumeration area according to the percentage of expected properties from the census frame that were actually enumerated by census workers (DANE 2014(DANE , 2021a(DANE , 2021b(DANE , 2022a)).Enumeration areas with 90 per cent or more on this indicator were coded as green, those in a range of 0-90 per cent were coded as orange, and units that were not visited were coded as grey (Figure 1).We used the green enumeration areas (n = 508) to train the models because these areas were considered to be fully enumerated, whereas the orange and grey enumeration areas (n = 628 and 166, respectively) were not fully enumerated and hence needed estimates of total population.

Social cartography workshops. Ethnic minority
groups occupy approximately 35 million hectares, one-third of the national territory of Colombia, with many living in regions that are difficult to access.To involve these groups in the census activities, DANE implemented 90 social cartography workshops with ethnic community representatives -66 with Indigenous communities and 24 with Afro-Colombian communities-which were held between 2011 and 2014 for the National Agricultural Census and updated in 2016 and 2017 for the 2018 Population and Housing Unit Census (DANE 2014(DANE , 2021a(DANE , 2021b(DANE , 2022a)).The objective of the social cartography workshops was to establish the locations of ethnic minority communities and their characteristics to support operational planning for the census (e.g.number of census takers and supervisors, costs, and times).To achieve successful community engagement, 14 separate agreements were reached between DANE, Afro-Colombian organizations, and Indigenous organizations (DANE 2014).
The ethnic minority organizations oversaw the logistical aspects of the social cartography workshops and summoned community representatives.The workshop participants were selected by each organization as community leaders who were knowledgeable about the populations in these areas.The organizations guided participants in producing estimates of numbers of dwellings, families, and people living in each community (Figure 2, left-hand panel), alongside documenting logistical constraints for census enumerators in accessing these remote communities.During the workshops, the DANE Social cartography & remote sensing to support census 5 team led cartographic exercises to help participants locate their communities on the map.Working in small groups based on Indigenous reservations, community councils, and geographic zones, the groups mapped communities and estimated numbers of dwellings, families, and people by reaching consensus within their small groups.These exercises identified and located 12,067 communities: 8,010 Indigenous, 3,200 Afro-Colombian, and 857 colonist.This information was used to construct the sampling frame for the census in these remote areas, alongside information from the 2014 National Agricultural Census and municipal development plans (DANE 2021a(DANE , 2021b(DANE , 2022a)).

Remotely sensed building coverage (hectares).
We used estimates of total building area per 90 m grid cell, obtained from World Settlement Footprint 3D (Figure 2, right-hand panel; Esch et al. 2020;Esch et al. 2022).This provided essential information about where buildings were located in these remote areas and gave an indication of how many buildings were likely in each location.These data were derived from Sentinel-1 and Sentinel-2 satellite imagery collected at 10 m spatial resolution between 2017 and 2019 with full coverage of Colombia, in combination with 12 m digital elevation data and radar imagery collected by the TanDEM-X mission.
The estimated building areas were validated using building models with very high resolution (<50 cm), which are available for 19 regions worldwide (Esch et al. 2022).The accuracy assessments showed a slight bias towards overestimation globally, with mean errors (ME) ranging from −6.48 to 12.99 per cent.Cartagena, Colombia, was included as one of the validation sites: its estimated building areas were the least biased of all validation sites, with an ME of 0.29 per cent.Building area estimates were also the most accurate in Cartagena with a mean absolute error (MAE) of 6.52 per cent and root mean squared error (RMSE) of 8.98 per cent.For comparison, MAE ranged from 6.52 to 17.29 per cent and RMSE ranged from 8.98 to 23.79 per cent globally.It should be noted that most of the validation sites were urban areas, whereas our study was focused on remote rural communities.Two of the validation sites were rural areas of Bavaria, Germany (ME = 0.93 per cent, MAE = 6.79 per cent, RMSE = 10.24 per cent), and Gyeonggi, South Korea (ME = 3.06 per cent, MAE = 9.93 per cent, RMSE = 14.19 per cent), which were below or near the average accuracy values among the validation sites.
Other geospatial covariates.We included a set of six additional geospatial covariates that had full coverage across the study area and were likely to be correlated with population densities.Our final set of covariates was selected from a larger set of covariates based on expert opinion and avoiding the inclusion of correlated covariates in the model.The six geospatial covariates, x k,i , that we selected were: (1) school density; (2) poverty index; (3) elevation; (4) night-time lights; ( 5) distance to populated centres; and ( 6) total area of the census enumeration area.Covariates were defined as the mean values within each enumeration area (except for ( 6)); these were then log transformed, scaled, and centred.
School densities were calculated for every 100 m grid square based on school locations obtained from the Ministry of National Education of Colombia.The poverty index represented the proportion of households in each route that were determined to have unsatisfied basic needs (INDEC 1984;Feres and Mancero 2001) based on their responses to the census questionnaire (DANE 2020).Digital elevation data were obtained at 30 m resolution from NASA's Shuttle Radar Topography Mission (Farr et al. 2007)

Statistical analysis
We chose a hierarchical Bayesian modelling framework to take advantage of its flexibility to develop bespoke model structures for our data and also to account for uncertainty in population estimates.Accounting for uncertainty is essential for any population estimates in these remote areas where information is scarce, because the uncertainty intervals may provide important context when using population estimates for decision-making (e.g.planning government services, health initiatives, household surveys, and census activities).
We compared four hierarchical Bayesian models using a consistent base model structure and set of geospatial predictor variables across models.We varied whether or not we included additional predictors derived from the social cartography workshops and remotely sensed buildings data, to isolate the contributions of these two sources towards improving model fit.These were all hierarchical models with two levels: one level to estimate the number of buildings and a second level to estimate the total population (i.e.aggregate counts for each enumeration area, not building-specific estimates).The directed acyclic graph (Figure 3) illustrates both sub-models and relationships between all parameters and data in the model (Tables 1 and 2).Models were fitted using training data from 489 census enumeration areas (out of 508 green areas), selected because they were fully enumerated during the census, hosted social cartography workshops, and were located in regions where some enumeration areas were not fully enumerated.This approach included an unavoidable assumption that relationships between predictors and populations were the same in enumerated areas as in undercovered areas.
Base model.The Base model for total population, P, in enumeration area i was: where B i is the number of buildings (occupied or not) and r i is the average number of people per building (log scale).We included a log-normal Social cartography & remote sensing to support census 7 regression on r i with a random intercept by administrative department, a d , and municipality, d m , along with the effects, b k , of six geospatial covariates, x k,i , selected a priori (Table 1).The random intercept by department a d estimates the average number of people per building (log scale) for department d (assuming covariates equal zero), while the term d m estimates deviations from this average for each municipality, m, within a department.The residual variance term s 1 quantifies variation in r i (people per building) that is not explained by the model.
The priors for all models are provided in a separate subsection later.
The Base model for buildings, B i , was: )  1 and 2.
Source: Authors' own.where A i is the total area (hectares) of enumeration area i, and u i is the average number of buildings per hectare (log scale).The remaining parameters are comparable to those in population sub-model (1a), and this sub-model includes the same set of geospatial covariates, x i .The number of buildings, B i , was observed during the census in accessible areas, and the model estimated this parameter for inaccessible areas.We include a dot above the parameter symbols to distinguish them from population submodel (1a).
Our three additional models differed in whether or not they included local knowledge from community workshops, remotely sensed building coverage from satellite imagery, or both.One of the key challenges was estimating the number of buildings B i , and we had several resources at our disposal to inform this portion of the model.The census recorded counts of buildings, although with incomplete coverage in some areas.We also had community-based estimates of the number of dwellings, D i , from the social cartography workshops, as well as satellite-based measurements of building coverage, C i .
Community-based model.This model used information gathered from the social cartography workshops to help inform the models of population and buildings: where I i , D i , and F i , are the numbers of individuals, dwellings, and families, respectively, reported to be in enumeration area i during the social cartography workshops.We used these reports to help estimate average numbers of people per building, r i .The Community model estimated building counts, B i , as a function of the community-based estimates of dwellings, D i , and the total area, A i , of each enumeration area: where u i is defined as the number of buildings B i per hectare A i .Note that the full specifications for random intercept parameters a d and d m are not shown here, but they were the same as in the Base model.
Satellite-based model.This model used the Base model for total population: The number of buildings, B i , was estimated as a function of the remotely sensed building coverage, C i , and the total area, A i , of each enumeration area: Notice that unlike in the Community model, building density, f i , is now defined as the building count per hectare of building coverage, C i , rather than the building count per total area of the enumeration area, A i (i.e.like the Community model parameter The satellite-based estimates of building coverage strongly constrain the portion of each enumeration area where buildings may be present.
Combined model.This model was the same as the Community model for estimating total population: where I i , D i , and F i , are numbers of individuals, dwellings, and families reported during the social cartography workshop to be within enumeration area i.
The building sub-model combined communitybased estimates of dwellings, D i , with the satellitebased estimates of building coverage, C i , in an attempt to better approximate the observed total building counts, B i , from the census: Priors, implementation, and diagnostics.All priors used in these models were designed to be minimally informative within a realistic range of parameter values: b, g, m, ḃ, ġ, ṁ Normal(0, 3) s 1 , s 2 , s 3 , ṡ1 , ṡ2 , ṡ3 Uniform(0, 3). ( The same priors were used across all models to ensure comparability of results.We chose uniform priors for standard deviations rather than the half-Cauchy priors suggested by Gelman (2006), to avoid a long tail that included unrealistic parameter space on the log scale.We implemented statistical models using JAGS software (Plummer 2003;Eddelbuettel 2021) from the R statistical programming environment (R Core Team 2020) with the runjags and coda packages (Plummer et al. 2006;Denwood 2016).Model convergence was assessed using the potential scale reduction factor (PSRF) statistics (Gelman and Rubin 1992).All models were run until they achieved PSRF < 1.1, indicating convergence (Brooks and Gelman 1998).
We used randomized 10-fold out-of-sample crossvalidation to assess model fit and robustness of uncertainty intervals.This iterative procedure involved fitting models to subsets of the data that each excluded a random 10 per cent of the locations and then predicting values for these out-of-sample locations to assess prediction accuracy.This procedure was repeated 10 times, omitting a different subset of the data each time, until all data had been withheld once.We used out-of-sample predictions, ŷi , to estimate the following measures: bias = mean ŷi − y i ŷi imprecision = sd ŷi − y i ŷi inaccuracy = mean ŷi − y i ŷi These out-of-sample fit statistics were calculated for the response variables (i.e.population, P i , and buildings, B i ) and used for comparing models.Robustness of uncertainty intervals was assessed by calculating the proportion of out-of-sample observations that fell within their 95 per cent prediction intervals, with the expectation that about 95 per cent of observations should fall within the prediction intervals.

Results
All models achieved convergence, including 10-fold cross-validation models.Uncertainty intervals appeared robust, if not a bit conservative, because they contained approximately the expected proportion of out-of-sample observations, suggesting appropriately specified error structures for the models (Table 3).For prediction intervals of less than 95 per cent, a greater than expected proportion of out-of-sample observations fell within the prediction intervals indicating that prediction intervals may be conservative at these (wider than necessary) uncertainty levels.We provide maps of predicted populations and building counts from the Combined model in Figure 4 for all of the census enumeration areas where the routes method was conducted, to show the geographic variation in model outputs.

Model comparison
We compared models in terms of bias, imprecision, inaccuracy, and percentage variance explained, r 2 (Figure 5).The Base model explained 51.4 per cent of variance in building counts and 54.3 per cent of variance in population counts observed during the census (bottom panel).The Base model included a set of geospatial predictors that was incorporated in all models but did not include local knowledge from social cartography workshops or remotely sensed buildings.All models contained a positive bias for estimates of total population and building counts, although the degree of bias varied between models (Figure 5).This positive bias was most pronounced for enumeration areas with the lowest population sizes.
The Satellite model included satellite-derived estimates of building coverage for every 90 m grid square.This information increased the variance explained to 53.4 per cent for building counts and 56.8 per cent for population counts.Compared with the Base model, the remotely sensed building coverage helped primarily to reduce the imprecision of estimated building counts.It also slightly reduced the bias of building counts and population estimates compared with the Base model.
The Community model included estimates of people, families, and dwellings provided by local community members during the social cartography workshops.This information increased the variance explained to 64.1 per cent for buildings and 66.2 per cent for total population.While this was a noticeable increase in variance explained compared with the Base and Satellite models, it is important to note that this model produced more imprecise population estimates than any other model (i.e. more random noise).Conversely, the Community model produced the least biased estimates of building counts and population estimates of any model.
The Combined model explained the largest proportion of variance in out-of-sample observations, with 65.1 per cent of variance in building counts and 67.9 per cent of variance in population counts explained.This model was the most precise of any model, although it gave slightly more biased population estimates than the Community model.The Combined model showed the highest overall accuracy (i.e. a measure that incorporates both bias and imprecision) of any model we tested.The predicted values and prediction intervals from this model are plotted against out-of-sample observations in Figure 6.This shows that model predictions performed reasonably well for unobserved locations and that the prediction intervals accurately represented uncertainty in the population estimates.

Covariate effects
Estimated covariate effects on the expected values of buildings per hectare ( u i ) and buildings per built hectare ( f i ) are shown in Figure 7, while covariate effects on expected values of people per building ( r i ) are shown in Figure 8.One trivial result from Figure 7 that is important to note was that covariates Social cartography & remote sensing to support census 11 containing the unit of area used as the denominator for buildings per hectare (i.e. total area, A i , or building coverage, C i ) always had significant negative effects as expected.We also want to emphasize that strong covariate effects do not necessarily imply causality, because these data were observational rather than experimental.We defined 'significant' effects as b estimates where at least 95 per cent of the marginal posterior mass was either above or below zero.Another general pattern worth noting was that the effects of geospatial covariates from the Base model were found to be very similar (although not always identical) in the other models.The poverty index and elevation always had significant positive relationships with numbers of people per building, r i .Numbers of schools always had a significant positive relationship with buildings per hectare, u i .Distance to city centre showed a slight positive relationship with numbers of people per building, and the poverty index had a slight positive relationship with building per hectare.In models that did not contain satellitebased estimates of building coverage, the intensity of night-time lights had a slight positive relationship with buildings per hectare, but this slight effect was not present in models that included remotely sensed building areas.
In the Community model, the numbers of dwellings per hectare reported during the social cartography workshops had a significant positive effect on expected values of buildings per hectare, u i .Reported values of individuals per family had a significant positive effect on expected values of people per building, r i , but reported values of individuals per dwelling did not.This latter result may have been due to correlation (r = 0.65) between the two covariates.
In the Satellite model, total building coverage had a significant negative effect on expected values of buildings per hectare, f i , as expected, because building density was defined in this model as buildings per hectare of building coverage.The proportion of the total enumeration area covered by buildings had a The Combined model included a covariate measuring dwellings per building coverage that combined information from the social cartography workshops with information from remotely sensed buildings.This covariate had a significant positive relationship with expected values of buildings per hectare, f i .The covariate of reported individuals per family from the social cartography workshops did not have a significant effect on people per building in the Combined model although it did in the Community model.

Discussion
We have demonstrated a novel approach, combining information from space-based Earth observations with local knowledge gathered from social cartography workshops to fill census gaps in locations where access was challenging for fieldworkers.We were encouraged by the degree to which local knowledge contributed to model fit, and it was reassuring that we were able to fine-tune population estimates based on the relatively imprecise information on remotely sensed buildings.On one hand, the Community model exhibited the most unbiased estimations for both population sizes and building counts; on the other hand, the Satellite model increased precision of population estimates compared with the Community model.When both types of information were used simultaneously, we obtained unbiased estimates similar to the Community model along with increased precision and accuracy, achieving the highest r-squared across all of the models.
The social cartography workshops in Colombia provide a powerful example of engaging potentially undercounted communities with the census process.Community engagement and social mapping exercises are already used to gather information to support planning for censuses and household surveys (Marcil et al. 2016;Green et al. 2020;Open Street Maps 2022), but we are not aware of a previous example where the data collected have been directly used in population estimation to help address census omissions.The methodological framework that we proposed used local knowledge to improve population estimates, which will guide appropriate resource allocation for essential services back into these communities.To account for the subjective nature of social cartography exercises, our approach incorporated objective information from remote sensing and other geospatial data, and the model was fitted to full-coverage census enumerations from nearby locations to ensure rigorously produced population estimates.It is important to incorporate knowledge of estimation uncertainty into decision-making processes that are based on modelled population estimates (UNFPA 2020), and this is particularly relevant for remote locations where data are sparse.Our hierarchical Bayesian modelling approach provided robust estimates of uncertainty similar to previous work (Leasure et al. 2020;Dooley et al. 2021;Boo et al. 2022).The current model differed from previous examples because it included a sub-model that explicitly estimated building counts for inaccessible locations.This was necessary because we did not have enumerations of buildings from the census cartography nor from remote-sensing data; our remotely sensed building data (Esch et al. 2022) measured building coverage for each 90 m grid cell but did not include individual building footprints.Because of the hierarchical nature of the statistical model, the uncertainty around our population estimates also accounted for uncertainty in building estimates.
High-resolution building footprints are available from a variety of sources with global coverage, but costs are often prohibitive.These data sets are increasingly becoming openly available (e.g.Google 2022; Microsoft 2022) or crowdsourced with incomplete coverage (e.g.Geofabrik GmbH 2018; OpenStreetMap 2022), but full-coverage high-resolution building footprints are not yet openly available globally.Our approach addressed this limitation in Colombia by using census-based building counts from fully accessible census Figure 7 Covariate effects ( ḃ and ġ) on the expected values of buildings per hectare ( u i ) or buildings per built hectare ( f i ) for all four models, Colombia Notes: Covariates are defined in the Methods section and Table 1: Schools (x 1 ), Poverty (x 2 ), Elevation (x 3 ), NightLights (x 4 ), DistToCenter (x 5 ), and Area (x 6 ).BldgCover refers to building coverage.Source: See Data subsection for information on data sources used to calculate values in this figure .Social cartography & remote sensing to support census 15 enumeration areas to train a sub-model to estimate building counts using satellite-based estimates of building coverage (Esch et al. 2022), communitybased estimates of numbers of dwellings (DANE 2014), and other geospatial covariates.We would expect estimation uncertainty for our modelled population estimates to be reduced if high-resolution building footprints were available, and opportunities are now arising to pursue this option (Microsoft 2022).
We included a small set of geospatial covariates that was consistent across all of our models so that we could isolate the influences of data from social cartography workshops and remotely sensed building coverage.We evaluated many geospatial covariates before finalizing the set of covariates presented here, but covariate development and selection was beyond the scope of the current study.However, it is important to note that the selection of covariates must be dependent on data availability and the specific context of the population estimation.For example, the intensity of night-time lights may not be a good predictor of populations in remote regions where electricity is not commonly available, whereas it otherwise may provide valuable information.We chose a small set of orthogonal covariates using the best available data for these remote regions of Colombia, but additional work may be able to uncover additional covariates that could improve model fit.Notes: Covariates are defined in the Methods section and Table 1: Schools (x 1 ), Poverty (x 2 ), Elevation (x 3 ), NightLights (x 4 ), DistToCenter (x 5 ), and Area (x 6 ).Source: See Data subsection for information on data sources used to calculate values in this figure.

Conclusions
While emphasis is often placed on new technologies such as satellite remote sensing to fill data gaps, we have provided evidence to serve as a reminder that innovative technologies are sometimes most effective when combined with traditional low-tech sources of information, such as local knowledge obtained through community engagement.We have highlighted the importance of social cartography workshops to engage potentially undercounted communities of Colombia in the census process.The statistical approach that we demonstrated incorporated community-based estimates of numbers of dwellings, families, and people with satellite-derived estimates of building coverage and other geospatial covariates to estimate building counts and population sizes in remote regions of Colombia where a full-coverage census enumeration was not possible.This project has provided a step forward in the science of modelled population estimates to support censuses and highlighted the value of community engagement as well as government-academic partnerships in searching for innovative solutions for real-world challenges.

Figure 1
Figure 1 Maps showing percentage of expected properties from the census frame that were actually enumerated by census workers: rural and urban areas, Colombia 2018 Notes: Left-hand panel: In rural areas, census enumeration areas in which 90 per cent or more of expected properties were enumerated during census fieldwork are shown in green, areas where 0-90 per cent were enumerated are shown in orange, and areas where no property was enumerated are shown in grey.Right-hand panel: In urban areas, colours are the same as for rural areas, but the indicator was the percentage of expected dwellings in a census enumeration area.Source: The methodology for determining census coverage is described by DANE (2022a).The image is taken from the Geovisor tool used by DANE to monitor census coverage.

Figure 2
Figure 2 Maps of the study area in Colombia, 2018, showing two important predictor variables: Left-hand panel shows community-based estimates of population size in each enumeration area; Right-hand panel shows remotely sensed building coverage for 90 m pixels Source: National boundaries were obtained from Global Administrative Areas (GADM 2019) and the subnational boundaries from DANE (2022b).The maps were created using ESRI ArcGIS pro v.2.5.Community-based estimates of population size in each enumeration area were obtained from social cartography workshops.Remotely sensed building coverage for 90 m pixels was obtained from World Settlement Footprint 3D.

Figure 3
Figure 3 Directed acyclic graph (DAG) showing relationships between data (squares) and parameters (circles) Notes: The hierarchical model structure has a sub-model to estimate counts of buildings (B) that feeds into a sub-model of population (P).Solid lines indicate stochastic relationships, while dashed lines indicate deterministic relationships.Blackfilled nodes were not included in every model.Key parameters included people per building (r) and buildings per hectare (u).Parameters and data are defined in Tables1 and 2. Source: Authors' own.

Figure 4
Figure 4 Predicted counts from the Combined model of population (left-hand panel) and building counts (right-hand panel) for all census enumeration areas where the routes method was conducted, Colombia 2018

Figure 5
Figure5Comparisons of model fit across four models (Base, Satellite, Community, and Combined) for the two response variables (population and buildings) in each hierarchical model Notes: Fit statistics were calculated using out-of-sample predictions from 10-fold cross-validations.Bias, imprecision, and inaccuracy are reported as proportions of the predicted values.R-squared values quantify the proportion of variance explained by each model.Note that x-axes do not start at zero.Source: See Data subsection for information on data sources used to calculate values in this figure.
Social cartography & remote sensing to support census 13

Figure 6
Figure 6 Model fit for the Combined model showing out-of-sample model predictions vs observed data from census enumeration areas that were fully enumerated (≥90 per cent coverage), Colombia 2018

Figure 8
Figure 8 Covariate effects (b and g) on the expected values of people per building ( r i ) for all four models, Colombia

Table 1
Definitions of symbols for data 8 Lina Maria Sanchez-Cespedes et al.

Table 2
Definitions of symbols for parameters

Table 3
Proportion of out-of-sample observations that were within each model's prediction intervals Note: Approximately 95 per cent of observations (i.e.0.95) should fall within well-specified 95 per cent credible intervals (CI).Source: See Data subsection for information on data sources used to calculate values in this table.