Spatial interpolation using areal features: A review of methods and opportunities using new forms of data with coded illustrations

This paper provides a high-level review of different approaches for spatial interpolation using areal features. It groups these into those that use ancillary data to constrain or guide the interpolation (dasymetric, statistical, street-weighted, and point-based), and those that do not, but instead develop and refine allocation procedures (area to point, pycnophylactic, and areal weighting). Each approach is illustrated by being applied to the same case study. The analysis is extended to examine the opportunities arising from the many new forms of spatial data that are generated by everyday activities such as social media, check-ins, websites offering services, microblogging sites, and social sensing, as well as intentional VGI activities, both supported by ubiquitous web- and GPS-enabled technologies. Here, data on residential properties from a commercial website were used as ancillary data. Overall, the interpolations using many of the new forms of data perform as well as traditional, formal data, highlighting the analytical opportunities as ancillary information for spatial interpolation, and for supporting spatial analysis more generally. However, the case study also highlighted the need to consider the completeness and representativeness of such data. The R code used to generate the data, to develop the analysis, and to create the tables and figures is provided.

Table 1: Summary of the major areal interpolation approaches

Simple areal weighting (Goodchild and Lam, 1980): area is proportional to population. Study area: London, Ontario, Canada. Advantages: simple; no auxiliary information needed. Disadvantages: the homogeneity assumption, that the census population is distributed evenly within a source zone, is a variation of the ecological fallacy; the area selected from the source zone may have a different population density from the average population density of the source zone.

Pycnophylactic interpolation (Tobler, 1979): smooth pycnophylactic interpolation. Study area: Ann Arbor, MI, US. Advantages: generates a smooth surface; the total volume of each census division is preserved on the interpolated surface. Disadvantages: unlike topography, population is not a continuously observed phenomenon.

Area-to-point interpolation without ancillary information (Martin, 1989): centroid-based method with a kernel density algorithm. Study area: Cardiff, UK. Advantages: generates a smooth surface; no auxiliary information needed; transforms the polygons into points. Disadvantages: the placement of the control point has a significant impact on the resulting surface; in some cases the geometric centroid lies outside the boundary of the polygon, generating a questionable result; the total volume of each census division may not be preserved on the interpolated surface. Variants: Bracken and Martin (1989), a variant of inverse distance weighting (South Wales, UK); Kyriakidis (2004), a geostatistical framework for area-to-point spatial interpolation; Krivoruchko, Gribov, and Krause (2011), areal interpolation with a kriging-based method, which needs no auxiliary information and overcomes several computational problems, such as how to handle polygons of vastly different sizes and how to analyse polygons that are overlapping or disjoint.

Dasymetric (Wright, 1936): dasymetric map using USGS Quad Map. Study area: Cape Cod, MA, US. Advantages: more accurate than simple areal weighting; distributes population into different land use classes (populated, urban, or rural areas); the method is mature and stable. Disadvantages: needs auxiliary information; assumes the population is distributed evenly within a target zone; accuracy is dependent on the resolution of the auxiliary information.

Illustrations can be found in Diggle (1983) and in Brunsdon and Comber (2018). Over the last three decades, various approaches for areal interpolation have been developed based on different assumptions about the underlying distribution of the variables to be interpolated, for example, densities or counts. These can be grouped into two broad categories: methods that use ancillary (or auxiliary) data to control, inform, guide, and constrain the reallocation process from source to target zones, and methods that do not, but instead rely solely on the target and source zone properties (Hawley & Moellering, 2005; Langford, 2006; Zhang & Qiu, 2011).

| Areal interpolation without ancillary information
There are three basic approaches for transforming source zone values to target zone values using the properties of the source and target zones alone: those based on some measure of proportionality such as area (areal weighting), those that also seek to smooth the allocation to target zones to minimise discontinuities between adjacent zones (pycnophylactic interpolation), and those that seek to take advantage of the ability of point-based, geostatistical methods to create continuous surfaces (area-to-point interpolation).

| Areal weighting
Choices for areal interpolation are limited if the source zones and their attributes are the only information available.
In this situation, simple methods may be used, the best known of which is the area-weighting approach. This allocates the source zone attributes proportionately to the target zones based on the area of their intersection (Goodchild & Lam, 1980; Lam, 1983). Area weighting is inherently volume preserving (the source zone values are maintained if the target zone values within a source zone are summed) and is easily implemented using polygon overlay operations. Thus, the method is incorporated into most GIS software packages (Xie, 1995) and is widely used in practice (Langford, 2006). Recent examples of research using this approach include an examination of methods to overcome changes in census boundary structures (Logan, Xu, & Stults, 2014) and the interpolation of election results to new target zones (Goplerud, 2016). Goplerud (2016) noted that the method worked well when interpolating election results across boundary changes for six different countries, with mean absolute errors in the range of 2% to 3%. The disadvantage of this method is obvious: it assumes the relationship between the source zone attribute being interpolated and the target zone areas to be spatially homogeneous (Goodchild & Lam, 1980), an assumption that is rarely true in the real world. However, in the absence of ancillary data, it remains a reasonable solution (Xie, 1995; Tapp, 2010).
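The allocation rule can be sketched in a few lines. The following is an illustrative Python sketch (not the paper's R implementation, which uses the sf package); it assumes the source/target intersection areas have already been computed by a polygon overlay, and the function and variable names are hypothetical.

```python
def areal_weighting(source_values, intersection_areas, source_areas):
    """Areal weighting (Goodchild & Lam, 1980): allocate each source zone
    value to the target zones in proportion to the area of overlap.

    source_values: one count per source zone
    intersection_areas: matrix, [i][j] = area of the intersection of
        source zone i with target zone j
    source_areas: total area of each source zone
    Returns one estimate per target zone.
    """
    n_targets = len(intersection_areas[0])
    estimates = [0.0] * n_targets
    for i, value in enumerate(source_values):
        for j in range(n_targets):
            # The share allocated to target j is the fraction of source
            # zone i's area that falls inside target j
            estimates[j] += value * intersection_areas[i][j] / source_areas[i]
    return estimates
```

Because the fractions allocated from each source zone sum to 1 (assuming the target zones exhaust each source zone), summing the target estimates recovers the source totals, which is the volume-preserving property noted above.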

| Pycnophylactic interpolation
Pycnophylactic interpolation was first proposed by Tobler (1979) and takes a slightly different approach to areal weighting.
It iteratively interpolates the source zone attribute to the target zones, aiming to avoid sharp discontinuities between neighbouring target zones whilst preserving the overall mass or volume of the counts in the source zones. The process generates a smooth surface in the target zones, which are frequently raster cells, from polygon-based source zones. Each iteration improves the smoothness of adjacent target zone values across the study area by adjusting the allocation to each target zone, using the weighted average of each target zone's nearest neighbours, whilst preserving the source zone total (also referred to as mass or volume). The number of nearest neighbours used and the number of iterations determine the overall level of smoothing, and choosing them is a subjective process (Hay, Noor, Nelson, & Tatem, 2005).
Pycnophylactic interpolation is an elegant solution to the problem of generating a continuous surface from discontinuous data, although it does assume that no sharp boundaries exist in the distribution of the data (Hay et al., 2005). This may not always be the case in reality, for example, when target zones are divided by linear features (rivers, railways, roads) or are adjacent to waterbodies; in these cases, sharp discontinuities might be expected (for example, in popular riverside developments). Nevertheless, the method has been adopted in many applications (Kounadi, Ristea, Leitner, & Langford, 2018; Monteiro, Martins, & Pires, 2018) as well as in hybrid approaches (Comber, Proctor, & Anthony, 2008).
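The iterate-smooth-then-rescale loop described above can be sketched on a raster grid. This is an illustrative Python sketch rather than the pycno R package used in the paper: it uses a simple 4-neighbour mean for smoothing and a multiplicative rescaling to re-impose each source zone's total after every pass; all names are hypothetical.

```python
import numpy as np

def pycnophylactic(zone_ids, zone_totals, n_iter=100):
    """Pycnophylactic interpolation in the spirit of Tobler (1979).

    zone_ids: 2D integer array mapping each raster cell to its source zone
    zone_totals: dict of zone id -> count to be preserved
    Returns a 2D array of smooth cell values whose per-zone sums match
    zone_totals (the pycnophylactic, volume-preserving constraint).
    """
    dens = np.zeros(zone_ids.shape, dtype=float)
    # Start from a uniform allocation within each source zone
    for z, total in zone_totals.items():
        mask = zone_ids == z
        dens[mask] = total / mask.sum()
    for _ in range(n_iter):
        # Smooth: replace each cell with the mean of its 4 neighbours
        padded = np.pad(dens, 1, mode="edge")
        smooth = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                  padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        smooth = np.maximum(smooth, 0.0)  # counts cannot be negative
        # Re-impose the volume constraint zone by zone
        for z, total in zone_totals.items():
            mask = zone_ids == z
            s = smooth[mask].sum()
            if s > 0:
                smooth[mask] *= total / s
        dens = smooth
    return dens
```

Tobler's original adjustment is additive rather than multiplicative; the rescaling used here is a common simplification that preserves the same constraint.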

| Area-to-point interpolation
A third type of areal interpolation is point-based areal interpolation (Bracken & Martin, 1989; Martin, 1989), an extension of point interpolation (Bracken & Martin, 1989; Lam, 1983; Xie, 1995). A control point for each source zone is identified (usually the centroid) and a density value is assigned to that point. The value is interpolated to a regular grid of points using one of the point interpolation methods, such as kriging or inverse distance weighting (Martin, 1989; Xie, 1995). Then, the density value for each grid cell is converted back to a count and the count values are summed over the intersecting target zones. Lam (1983) noted that the resulting target zone values depend greatly on the choice of the control point, which has a significant impact on the grid surface. In some cases, the source zone geometric centroid may lie outside the source zone boundary, generating a questionable result (Xie, 1995; Tapp, 2010). In others, the centroid location may not adequately describe the distribution within the source zone, which may be better captured using a population-weighted centroid (Martin, 1989), for example. A further problem is that point-based interpolators are not volume preserving and need to be rescaled (Bentley et al., 2013).
To solve this problem, Martin (1996) modified the original centroid-based algorithm to ensure that the populations reported for target zones are constrained to match the overall sum of the source zones. Recent examples of applications using area-to-point approaches include downscaling climate models outputs (Poggio & Gimona, 2015), urban data modelling (Anda, Erath, & Fourie, 2017), and handling mobile data streams (Kaiser & Pozdnoukhov, 2013).
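The two steps above, interpolating from control points to a grid and then rescaling in the spirit of Martin's (1996) modification so that target totals match source totals, can be sketched as follows. This is illustrative Python, not the bespoke R code from the paper; all names are hypothetical.

```python
import numpy as np

def idw_area_to_point(centroids, densities, grid_points, power=2.0):
    """Inverse-distance-weighted interpolation of source zone densities,
    attached to control points (usually centroids), onto a grid of points."""
    centroids = np.asarray(centroids, dtype=float)
    grid_points = np.asarray(grid_points, dtype=float)
    densities = np.asarray(densities, dtype=float)
    out = np.empty(len(grid_points))
    for k, p in enumerate(grid_points):
        d = np.linalg.norm(centroids - p, axis=1)
        if np.any(d == 0):
            # Grid point coincides with a control point: use its value
            out[k] = densities[d == 0][0]
        else:
            w = 1.0 / d ** power
            out[k] = np.sum(w * densities) / np.sum(w)
    return out

def rescale_to_sources(grid_values, grid_zone, zone_totals):
    """Rescale grid values so that the cells falling within each source
    zone sum to that zone's original count (the volume constraint)."""
    grid_values = np.asarray(grid_values, dtype=float).copy()
    grid_zone = np.asarray(grid_zone)
    for z, total in zone_totals.items():
        mask = grid_zone == z
        s = grid_values[mask].sum()
        if s > 0:
            grid_values[mask] *= total / s
    return grid_values
```

Without the second step, summing the IDW surface over a source zone generally does not recover that zone's count, which is the volume-preservation problem noted above.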

| Illustration: Interpolation without ancillary information
The different areal interpolation approaches (in this section and Section 2) are illustrated using a common case study: estimating house counts for each cell in a 500-m grid from US Census tracts for New Haven, Connecticut, USA. The census data are included within the newhaven data in the GISTools R package (Brunsdon & Chen, 2014), with the tracts as source zones and the 500-m grid as target zones. The interpolations were developed in R, and the code and data used to generate all the figures and results in this paper are provided in Data S1.
For interpolation without ancillary information, areal weighting was undertaken using a function in the sf R package (Pebesma et al., 2018), pycnophylactic interpolation using the pycno R package (Brunsdon, 2014), and bespoke code was written for the area-to-point interpolation using inverse distance weighting. Figure 1 shows the original data with the interpolation results, which are also summarised in Table 2.
Each of the three interpolation approaches shows broadly the same pattern, but with clear differences. Considering Figure 1 and Table 2 together, it is evident that the area-to-point estimates have higher maximum values and considerable spatial discontinuity between adjacent target zones. In contrast, the pycnophylactic house estimates have lower maximum values but display a distinctly polycentric pattern, with smooth but steep gradients from areas of high population to areas with lower (even zero) population allocations. The areal weighting estimates have shallower gradients between areas with high and low allocated populations, and lower maximum values than area-to-point. The median values in Table 2 characterise some of these differences.

| Methods using ancillary information
The main problem with spatially unconstrained interpolation approaches is that they are likely to allocate source zone data inappropriately, for example, estimating population in areas where people do not live. Ancillary information with some relationship to the source zone variable can be used to constrain or inform allocation to target zones (Liu et al., 2008). Interpolations undertaken in this way can result in allocations that better reflect actual distributions, because the ancillary data are closely related to the data being interpolated. Population distributions, for example, are closely related to features such as residential land use. A number of algorithms using ancillary data have been developed, informed by the increasing variety of different data types (Langford, 2013). These methods have been extensively applied to population interpolation (Cromley et al., 2012; Langford, 2007; Mennis, 2003; Reibel & Agrawal, 2007), socioeconomic variable estimation (Eicher & Brewer, 2001; Goodchild, Anselin, & Deichmann, 1993; Mennis & Hultgren, 2006), and to handle changing historical administrative boundaries (Gregory, 2002; Mennis, 2016).
Methods for areal interpolation using ancillary data can be grouped into four sets of approaches: those which apply areal masks to inform interpolation (dasymetric mapping), those using road networks to allocate populations along target zone road segments (street-weighting method), those which establish a statistical relationship between the ancillary data and the source zone data to guide allocation (statistical approaches), and those which use point data as ancillary information (point-based approaches).
Figure 1: A choropleth map of the New Haven census tracts, with 500 m grid cells and interpolation results (AW, areal weighting; Pycno, pycnophylactic; AtP, area to point)

| Dasymetric mapping
The dasymetric interpolation approach is the most cited of the methods that use ancillary information (Langford, 2013). Ancillary data include areal features and linear or point features with buffers. It was first proposed as a cartographic technique to address some of the issues associated with choropleth mapping. Mennis (2009) provides a comprehensive overview of the origins of the dasymetric approaches, linking back to 19th century dasymetric maps of population (Semenov-Tian-Shansky, 1928) and the work of Wright (1936). An accessible introduction can be found in Mennis (2003), who defines dasymetric mapping as "areal interpolation that uses ancillary (additional and related) data to aid in the areal interpolation process" (p. 32). It guides the redistribution of source zone values to target zones using auxiliary information as a spatial control. Dasymetric approaches can identify areas to include in or exclude from the interpolation process (population data, for example, are excluded from nonresidential areas), or highlight areas that might be expected to have higher or lower population densities than others (Cromley et al., 2012). In early work, the most commonly used ancillary information was areal masks related to land use classified from remotely sensed data. In the 1990s, the Leicester group (David Unwin, David Maguire, Mitchel Langford, Peter Fisher) published a series of methods papers informed by urban land use data derived from satellite imagery. Langford and Unwin (1994) and Fisher and Langford (1996) demonstrated the improvements in areal interpolation achieved using dasymetric mapping, as have a number of authors subsequently (Eicher & Brewer, 2001; Langford, 2006; Mennis, 2003; Mennis & Hultgren, 2006).
The simplest dasymetric approach is to create binary masks of areas that are included or excluded from the interpolation process. Binary dasymetric approaches (Fisher & Langford, 1996) would, for example, exclude nonresidential areas in target zones when interpolating population data. However, population density may vary in different land use classes (Lin, Cromley, Civco, Hanink, & Zhang, 2013). Categorical dasymetric approaches assign different proportions of the total population to different land classes (Eicher & Brewer, 2001) or select target zones that are homogeneous with regard to a specified land class (Mennis, 2003). Mennis and Hultgren (2006)  agricultural land use dataset in the United Kingdom using Ordnance Survey data to mask out nonagricultural features (urban and woodland areas, buffered rivers and roads). Lu et al. (2010) and Sridharan and Qiu (2013) used LiDAR data to estimate populations at building level. Leyk et al. (2013) developed a maximum entropy dasymetric model using USGS national land cover data and multiple attributes from the population census to generate correlations between ancillary variables and population. Nagle et al. (2014) extended this to incorporate uncertainty into the modelling process.
There are a number of potential problems with dasymetric approaches. First, the performance of any given dasymetric approach has been found to vary substantially in different study areas, with no single technique consistently outperforming all others (Zandbergen & Ignizio, 2010). Second, although dasymetric mapping can provide a more spatially informed interpolation, the implementation of such approaches imposes greater demands in terms of ancillary data requirements (Cromley et al., 2012). Auxiliary information is often derived from remote sensing images or land use data, which are not always available (Sadahiro, 2000), especially in developing countries (Yang, Jiang, Luo, & Zheng, 2012). Third, the increases in computational cost can limit the wider applicability of the methods, as large quantities of polygons, raster cells, or both have to be processed (Zhang & Qiu, 2011). Dasymetric approaches using remote sensing data also require an understanding of multispectral signatures, image classification techniques, etc. that may be outside the analyst's skill set (Langford, 2013), and they assume population density to be homogeneous in each land use class, whether binary or categorical masks are used. Fourth, they are inherently subject to the ecological fallacy or modifiable areal unit problem (MAUP; Openshaw, 1984) and can generate different results depending on the scales of target and source zones. Finally, with any dasymetric approach, the quality and relevance of the ancillary data, and any intrinsic relationships with the source zones, have a critical influence on the representativeness of the target zone estimates.
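Despite these caveats, the binary variant is straightforward to implement. The sketch below is illustrative Python with hypothetical names, not the paper's R code; it assumes the area of residential (unmasked) land in each source/target intersection has already been computed from the binary mask, after which allocation proceeds exactly as in areal weighting, but using only the unmasked area.

```python
def binary_dasymetric(source_values, residential_overlap):
    """Binary dasymetric interpolation in the spirit of Fisher and
    Langford (1996): areal weighting restricted to the inhabited mask.

    residential_overlap: matrix, [i][j] = area of *residential* land in
        the intersection of source zone i with target zone j
    """
    n_targets = len(residential_overlap[0])
    estimates = [0.0] * n_targets
    for i, value in enumerate(source_values):
        masked_total = sum(residential_overlap[i])
        if masked_total == 0:
            continue  # no residential land in this source zone
        for j in range(n_targets):
            # Share is proportional to residential overlap, so fully
            # masked-out target zones receive nothing
            estimates[j] += value * residential_overlap[i][j] / masked_total
    return estimates
```

A target zone whose overlap with a source zone is entirely nonresidential receives none of that zone's count, which is the behaviour the binary mask is intended to enforce.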

| Street-weighting method
The street-weighting method (Xie, 1995) uses street network data. Several variants of the methodology exist, the simplest of which uses the network length within the source zone and distributes population uniformly along the street segments within its boundaries. The linear features are then intersected with the target zones, and an estimated population count is derived by summing the population along each road segment within the target zone boundary. Reibel and Bufalino (2005) tested Xie's algorithm and showed it to be more accurate than simple areal weighting. A significant difference between areal and street weighting is the way that the weighting is applied: in areal weighting, the intersecting areas drive the allocation; in street weighting, this is done by the length of the intersecting linear objects.
This approach performs well in urban areas, with regularly spaced streets, but has been found to underperform in rural areas with fewer streets and residences located at irregular intervals on them (Tapp et al., 2010).
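The street-weighted allocation differs from areal weighting only in the weights used. A minimal sketch (illustrative Python, hypothetical names, not the paper's R code), assuming the street length of each source zone falling within each target zone has already been measured from the intersected network:

```python
def street_weighting(source_values, street_lengths):
    """Street-weighted interpolation after Xie (1995): each source zone
    count is spread uniformly along its street network, so a target zone
    receives the fraction of the count equal to its fraction of the
    source zone's total street length.

    street_lengths: matrix, [i][j] = length of source zone i's streets
        falling inside target zone j
    """
    n_targets = len(street_lengths[0])
    estimates = [0.0] * n_targets
    for i, value in enumerate(source_values):
        total_len = sum(street_lengths[i])
        if total_len == 0:
            continue  # no streets in this source zone: nothing to allocate
        for j in range(n_targets):
            estimates[j] += value * street_lengths[i][j] / total_len
    return estimates
```

The uniform-density-along-streets assumption is what makes the method underperform in rural areas, where residences are irregularly spaced along sparse networks.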

| Statistical and geostatistical methods
Statistical areal interpolation methods use ancillary data in conjunction with statistical techniques to establish functional relationships between the spatial distribution of the ancillary data and the spatial distribution of the source zone data to be interpolated (Flowerdew & Green, 1991; Goodchild et al., 1993; Reibel & Agrawal, 2007; Lin, Cromley, & Zhang, 2011). A regression analysis is usually conducted at the source zone level to model the variable of interest from other source zone attributes. The resulting model is then applied to each target zone to predict values using target zone attributes. Some refinements to this approach have been proposed. Flowerdew and Green (1991, 1994) used an expectation-maximization (EM) algorithm (Dempster et al., 1977) to model relationships between population density and socioeconomic variables. Harvey (2002) applied an iterated regression procedure as a least-squares approximation of the EM algorithm to model population from satellite image pixels classified as residential. Statistical methods typically generate a model which is then uniformly applied to target zones across the whole study area (Xie, 1995). To provide greater spatial nuance in the allocation to target zones, Cromley et al. (2012) proposed a quantile regression approach to provide estimates conditioned on local parameters rather than global ones, and Schroeder and Van Riper (2013) developed a geographically weighted EM algorithm. This allowed the allocation to target zones to vary spatially, depending on the local model coefficient estimates for the various land use categories related to built-up areas. The major limitation of such statistical approaches is that they depend heavily on the availability of detailed control variables (such as residential land use) at the target zone level and on the assumption that the variable of interest follows a known or quantifiable statistical distribution, which can limit their wider application (Zhang & Qiu, 2011).
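The basic global, unconstrained regression version can be sketched as follows, assuming counts of ancillary classes (for example, land cover cells) have been tabulated over both source and target zones. This is an illustrative Python sketch with hypothetical names, not the EM or geographically weighted refinements discussed above.

```python
import numpy as np

def regression_interpolation(source_ancillary, source_values, target_ancillary):
    """Statistical areal interpolation: fit an ordinary least-squares
    model of the source zone counts on source zone ancillary covariates,
    then apply the fitted coefficients to the same covariates measured
    over the target zones.

    source_ancillary: (n_sources, k) covariates per source zone
    source_values: n_sources counts to be modelled
    target_ancillary: (n_targets, k) covariates per target zone
    Returns n_targets predicted counts.
    """
    # Design matrices with an intercept column
    X = np.column_stack([np.ones(len(source_ancillary)),
                         np.asarray(source_ancillary, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, np.asarray(source_values, dtype=float),
                               rcond=None)
    Xt = np.column_stack([np.ones(len(target_ancillary)),
                          np.asarray(target_ancillary, dtype=float)])
    # Note: unconstrained predictions can be negative, as observed for
    # the statistical approach in the case study
    return Xt @ beta
```

Because the model is global and unconstrained, a large negative coefficient (such as the Grassland/Herbaceous estimate in the illustration below) can produce negative target zone predictions.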
Geostatistical interpolation methods were originally designed for point interpolation and have been applied in areal interpolation because of their ability to accommodate spatial autocorrelation in the modelling process. Kyriakidis (2004) established a theoretical framework for interpolation based on spatial cross-correlation between areal and point variables using cokriging. Kyriakidis and Yoo (2005) applied this approach to a synthetic image dataset. Wu and Murray (2005) developed a cokriging areal interpolation approach with pixel-level variance estimations from the impervious surface fraction representing roads, roofs, etc. Liu et al. (2008) further extended this model to disaggregate the residuals from the regression of population density with built-up and vegetation compositions, and Meng et al. (2013) combined multivariate regression with kriging to improve spatial prediction accuracy for weakly correlated auxiliary variables. Geostatistical methods are inherently pycnophylactic (they preserve volume and handle process spatial heterogeneity), but Griffith (2013) noted that kriging and cokriging involve variable transformations which may introduce errors into the final population estimates. Geostatistically based areal interpolation remains an emerging approach, whose theoretical framework has to date been developed and tested mainly with applications using simulated data and satellite imagery.

| Point-based ancillary information
Areal interpolation approaches with point-based information use the point locations to guide the interpolation to target zones (Zhang & Qiu, 2011). Tapp (2010) used address points as ancillary information to predict population, and the results showed significant reductions in target zone estimate error compared to other methods. Harris and Chen (2005) used postcode points with the population surface modelling technique proposed by Martin (1989) and Bracken and Martin (1989) to estimate population density. Zhang and Qiu (2011) used school locations to interpolate population with classic density models, and Langford (2013) used primary school locations and bus stop points. Point data have a much simpler structure than polygon or line data and do not require topological information to be determined to represent spatial relationships (Zhang & Qiu, 2011). Many different types of point data are widely available from different portals and databases, supporting the interpolation of different types of variables to target zones.
Because they represent features at discrete, dimensionless locations, there is no risk of the ecological fallacy or the MAUP, in contrast to area-based ancillary information, which implicitly assumes even within-zone distributions (Tapp, 2010). As new data sources emerge (see below), there are increased opportunities for point-based methods to complement methods using polygon- or line-based ancillary data.
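A minimal sketch of point-based allocation (illustrative Python, hypothetical names, not the paper's R code), assuming the ancillary points have already been counted within each source/target intersection:

```python
def point_based_interpolation(source_values, point_counts):
    """Point-based areal interpolation: ancillary points (address points,
    buildings, property listings, etc.) drive the allocation, so a target
    zone receives a share of each source count proportional to its share
    of that source zone's points.

    point_counts: matrix, [i][j] = number of ancillary points in the
        intersection of source zone i with target zone j
    """
    n_targets = len(point_counts[0])
    estimates = [0.0] * n_targets
    for i, value in enumerate(source_values):
        total_pts = sum(point_counts[i])
        if total_pts == 0:
            continue  # no ancillary points observed in this source zone
        for j in range(n_targets):
            estimates[j] += value * point_counts[i][j] / total_pts
    return estimates
```

Note that a source zone containing no ancillary points allocates nothing, so the result is only volume preserving where the point data are complete; this is exactly the completeness issue that arises with the web-scraped property data in the case study below.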

| Illustration: Interpolation with ancillary information
The four areal interpolation approaches using ancillary information are illustrated in Figure 2 using the newhaven census tracts (source zones) and the 500-m polygon grid (target zones) described above. Bespoke R code was written for each of these analyses and to produce the maps in Figure 2 and the summaries in Table 3, and is provided in Data S1.
The interpolation approaches using ancillary information to constrain and guide them show broadly similar patterns, but with some subtle differences. Considering Figure 2 and Table 3, two characteristics stand out. First, the negative values and flatter distribution of the estimates from the statistical approach. This is to be expected: a simple linear regression model was constructed and then used to predict target zone house counts. One of the model coefficient estimates (for Grassland/Herbaceous) was negative and high (an order of magnitude greater than the other two predictor variables), with the result that the predictions for some areas will inevitably be negative. The second striking feature is the high degree of homogeneity in the spatial distributions of the other three approaches: there are similar patterns of discontinuity and gradients between high and low target zone areas. This indicates a generic benefit of including some kind of relevant additional information to guide the interpolation. Their statistical distributions are similar to the pycnophylactic and areal weighting approaches (Figure 1, Table 2).

Figure 2: The interpolation results for approaches using ancillary data (Dasy, dasymetric; Street, street-weighted; Stat, statistical; Point, point-based)

Table 3: Summaries of the distributions of the house estimates from the interpolation approaches using ancillary information
The ancillary data used in the different interpolations (with the outline methods) were as follows:
• Dasymetric: Parks data from the City of New Haven data portal, plus features labelled as "land use," "amenity," and "coastline" extracted from OpenStreetMap to mask these areas out. This was used as input to an areal weighting approach.
• Street weighted: road linear features were extracted from OpenStreetMap. The proportion of the source zone streets in each target zone was determined and used to allocate house estimates.
• Statistical: data from the 2011 National Land Cover Dataset were downloaded from the USGS portal. Three land cover classes were used to train a linear regression model: "Developed, High Intensity," "Developed, Medium Intensity," and "Grassland/Herbaceous." Counts of these were created over source zones and target zones. The model was trained on the source zone counts and then used to predict houses over the target zones.
• Point-based: features labelled with "building" were extracted from OpenStreetMap. The proportion of the source zone buildings in each target zone was determined and used to allocate house estimates.
The ancillary data are shown in Figure 3, and full details of how they were obtained (including the code for extracting them), manipulated, and applied in the areal interpolations are provided in Data S1.

| NEW FORMS OF DATA
Interpolation approaches using ancillary data generate more accurate results than approaches that do not include such data, although with some questions about their generalisability (e.g., Zandbergen & Ignizio, 2010). Remotely sensed imagery, road networks, and land use/cover are the most commonly used ancillary data for interpolating population (Lin & Cromley, 2015). However, ancillary data can be expensive or unavailable. To overcome this, approaches have been developed that use the many new forms of data as ancillary data. For example, in developing countries, especially rural areas, high-resolution spatial datasets are rarely available, and some research has used Google Earth imagery (Yang et al., 2012), taxation data (Jia & Gaughan, 2016; Kar & Hodgson, 2012), and social media data (Yu, Li, Zhu, & Plaza, 2018) to extract auxiliary information. New forms of data able to support spatial interpolation are available from three general sources as follows:
1. Open data released by government bodies and national mapping agencies;
2. Online service providers, particularly property sales and rentals for population interpolation, but also commercially produced but freely available Point-Of-Interest (POI) data; and
3. Data generated by citizens through social media posts and check-ins, as well as citizen sensing and volunteered geographic information (VGI) activities such as OpenStreetMap, supported by mobile personal devices with web- and GPS-enabled technologies.
There is necessarily some overlap between the groups; for example, social media check-ins are frequently made at commercially produced POIs. However, these data sources reflect the increasing volumes of data that are routinely created and collected, with location attached, as part of our everyday lives (all data are spatial now), and the ease with which data are uploaded and shared via open and queryable repositories.
Many national mapping agencies have been forced to respond to this open data explosion: the alternative is to lose not only their users but also their primacy and, ultimately, their funding. In the United Kingdom, for example, this has resulted in open and free access to high-quality national mapping agency data, providing highly consistent ancillary data resources to support dasymetric interpolation (Langford, 2013). User-contributed information and VGI provide alternative sources of ancillary data for dasymetric interpolation, as used in a number of studies (Bakillah et al., 2014; Kunze & Hecht, 2015; Geiß et al., 2017). Although there are potential quality issues, these provide valuable data sources that can complement official and commercial data (Goodchild, 2007; Bakillah et al., 2014). Other geographic data generated by social networks are also emerging as a further source of ancillary data, with, for example, Lin and Cromley (2015) using microblogging posts (tweets) to guide interpolation, although not without some issues related to the representativeness of the sample. Their results indicated that using geo-located tweets as ancillary data did not perform as well as methods using traditional data, although for specific age groups with a high percentage of Twitter users, it improved prediction. Other work has demonstrated interpolation enhancements with mobile phone data (Liu, Peng, Wu, Jiao, & Yu, 2018) and POI data (Ye et al., 2018) as input to dasymetric approaches.

Figure 3: The ancillary data used to constrain and guide the different interpolation approaches, each with an OpenStreetMap backdrop (© OpenStreetMap contributors). The land cover shading shows the Developed, High Intensity (black), Developed, Medium Intensity (grey), and Grassland/Herbaceous (orange) classes, counts of which in each target zone were used as model inputs. The scale and orientation of the maps can be derived from Figures 1 and 2.
Undoubtedly, the robustness of the interpolation depends on the choice of ancillary data, as well as the interpolation methodological approach.

| Illustration: Interpolation with new forms of data
To illustrate this, property data (rentals and for sale) for New Haven were downloaded from www.zillow.com in January 2019. Each record included the latitude and longitude of the property. Property locations were used as input to a point-based interpolation in the same way as that described in Section 2.2. The results are shown in Figure 4 and summarised in Table 4. The details and the code used to do this are in Data S1.
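The allocation step of such a point-based interpolation can be sketched in base R as follows. This is a minimal illustration, not the code in Data S1: the data frame `pts`, the zone labels, and the counts are invented, and it assumes each property point has already been spatially joined to its source and target zone (e.g. with `sf::st_join`).

```r
# Point-based allocation sketch: each source zone's count is shared among
# target zones in proportion to the number of ancillary points (here,
# scraped property locations) falling in each source-target overlap.
# Illustrative inputs: pts carries the source zone and target grid cell
# of each property point, as would result from a spatial join.
pts <- data.frame(
  src = c("A", "A", "A", "B"),    # source zone containing each point
  tgt = c("g1", "g1", "g2", "g2") # target grid cell containing each point
)
src_counts <- c(A = 90, B = 40)   # households in each source zone

# points per source-target pair, converted to within-source weights
tab <- as.data.frame(table(src = pts$src, tgt = pts$tgt))
tab <- tab[tab$Freq > 0, ]
tab$w <- tab$Freq / ave(tab$Freq, tab$src, FUN = sum)

# allocate and sum to target zones (volume preserving by construction,
# since the weights sum to 1 within each source zone)
tab$est <- src_counts[as.character(tab$src)] * tab$w
tapply(tab$est, tab$tgt, sum)
```

Note that target zones containing no points receive no allocation at all, with the whole source zone count going to the cells that do contain points: this is exactly why gaps in the coverage of an informal point dataset propagate directly into the interpolated surface.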
The striking features of the spatial and statistical distributions of the household estimates generated using the property website data are the differences from the results in Tables 2 and 3 and Figures 1 and 2, and the lower median values. These indicate that some of the target zones contained no properties listed on the website from which the data were taken, with the original source zone counts allocated to those that did. Interestingly, the spatial distribution is similar to the point-based approach (Figure 2) that used OpenStreetMap building point data. This reinforces the need to carefully evaluate and consider the representativeness of many of the new forms of informal spatial data that are available for use in such analyses.
The OpenStreetMap data in this case are relatively complete, but similar problems of completeness and representativeness might be expected if they were used for a study area with poor OpenStreetMap coverage.
It is instructive to compare the results of the interpolation approaches and different ancillary data. The pair-wise correlations, distributions, and pair-wise scatterplots of the target zone estimated populations from each of the eight interpolations are shown in Figure 5. All of the correlations are significant and generally are in the range 0.70 to 0.95. The distributions have a broadly similar form, with a large number of lower estimates tailing off to a smaller number of higher ones. The exceptions are the Statistical and Area-to-Point approaches, which consequently have the lowest correlations. This suggests that these approaches generate noticeably different predicted populations from the others.

FIGURE 5 Correlations between the different interpolation approaches with significance (*** indicates p value <.001), pairwise scatterplots, and distributions (AW, areal weighting; Pycno, pycnophylactic; AtP, area to point; Dasy, dasymetric; Street, street-weighted; Stat, statistical; Point, point-based; Web, property data from the web)
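A minimal sketch of this kind of comparison is given below. The estimates are simulated rather than taken from the case study (the real code is in Data S1), with column names following the abbreviations in Figure 5; the `Stat` column is deliberately given more noise to mimic a method that departs from the others.

```r
# Sketch of the Figure 5 comparison: pairwise correlations between the
# target-zone estimates produced by different interpolation methods.
# Simulated estimates stand in for the real ones.
set.seed(1)
base <- runif(100, 0, 200)               # common underlying target-zone values
ests <- data.frame(
  AW   = base + rnorm(100, sd = 10),     # areal weighting
  Dasy = base + rnorm(100, sd = 10),     # dasymetric
  Stat = base + rnorm(100, sd = 60)      # a deliberately noisier method
)

round(cor(ests), 2)                      # pairwise correlation matrix
cor.test(ests$AW, ests$Dasy)$p.value     # significance for one pair
```

A full pairwise plot of correlations, scatterplots, and distributions such as Figure 5 can be produced from a data frame like `ests` with, for example, `pairs()` in base R.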

| CONCLUSIONS
This paper reviews and summarises the main approaches used in the spatial interpolation of areal features. It separates these into those that include ancillary information to constrain or guide interpolation and those that do not. Each approach was illustrated using data for the 29 census tracts in New Haven (CT). Data of household numbers from these source zones were interpolated to a 500-m polygon grid (target zones), and ancillary data from a range of sources were used to constrain the interpolation, including areal, linear, and point-based features. Additionally, data of properties for sale and rent were scraped from a commercial website and used to illustrate the potential utility of the many new forms of data as inputs to dasymetric approaches and related interpolation algorithms.
There is an ever-increasing amount of data available that could be used to support spatial analysis more generally, as well as methods of spatial disaggregation. For formal data, availability is being driven by open access initiatives and data portals, which are opening up databases that were once the preserve of national mapping agencies and government. The volumes of informal data are increasing as well. These are generated by the everyday activities of citizens and businesses through social media (check-ins, etc.), websites offering services, microblogging, and social sensing, as well as VGI activities such as OpenStreetMap, all supported by ubiquitous web- and GPS-enabled technologies. However, the use of such data presents new challenges, particularly around data quality and the representativeness of the data relative to the process of interest (Comber, Mooney, Purves, Rocchini, & Walz, 2016). Formal data created by national mapping agencies and served through open data portals come with assurances of quality, experimental design, metadata conforming to standards, and documentation. These are critically lacking in many new forms of data, requiring the user to explicitly evaluate the suitability of the data for their intended application (Comber, Fisher, Harvey, Gahegan, & Wadsworth, 2006; Comber, Fisher, & Wadsworth, 2008). This paper evaluated ancillary data from a number of traditional and informal sources to illustrate different areal interpolation methods. The case study using data from a property website highlighted the need to consider the representativeness of such data before using it as ancillary data. However, generally, a correlation analysis showed that new forms of data can perform as well as traditional data.
This indicates the opportunities afforded by including such data, with health warnings, as ancillary information for spatial interpolation and to support spatial analysis more generally.