A comparative assessment of the statistical methods based on urban population density estimation

Abstract Population density is important spatial information for addressing the use of and access to land resources in cities under the Sustainable Development Goals. This is because spatial data support appropriate spatial policies at the relevant spatial scale and help predict how much land will be consumed in the future. The study aims to compare and evaluate regression tools in the context of estimating the population density difference. The three analysis tools used are Random Forest-Based Classification, Multiple Linear Regression, and Geographically Weighted Regression. The sampling area covers cities across Türkiye. Comparative results showed that the two most important descriptive variables in the Random Forest-Based Classification model are the density difference of the new developed area and connectivity. The three main explanatory variables of the Multiple Linear Regression model are centrality, vehicle ownership, and accessibility. The results of the Multiple Linear Regression model (a non-spatial model) and the Geographically Weighted Regression model (a spatial model) were found to be quite similar. The importance of accessibility and connectivity is more evident in the Multiple Linear Regression model, whereas the Random Forest-Based Classification model highlights the density values in the new development areas.


Introduction
Density can be defined on various scales by the number of people per hectare or per square kilometre, the number of residential units per hectare, and the ratio of ground area. Population density is associated with the population distribution and shape or volume of an urban area. It has always been at the centre of urban problems and the process of urbanisation (McFarlane 2016). Population density guides various land use policies (Hasse and Lathrop 2003; Jongman et al. 2012; Salvati 2013; Guastella et al. 2019; Angel et al. 2021). For example, a low-density spatial growth development policy can trigger urban sprawl (Pendall 1999; Robinson et al. 2005; Faour 2015; Tikoudis et al. 2022).
Most developed, developing and less developed countries are now adopting more compact urban form policies (OECD 2012, 2018). Population density scales vary, such as a single residential plot, a residential subdivision, or a neighbourhood, and there are different measurement methods (Churchman 1999; Galster et al. 2001; Frenkel and Ashkenazi 2008; Angel et al. 2020). Using different scales and different density metrics can make assessment difficult for spatial policymakers. However, density metrics by scale can offer the opportunity for evaluation under the same conditions, such as social, economic, political, cultural, geographic, and environmental. It is also important to note that the density is influenced by external variables rather than just being measured. Therefore, in addition to intensity definitions, more intensity literacy and assessment are needed, as highlighted in Dovey and Pafka (2014).
Some studies focus on density as a parameter in spatial cross-sectional studies of urban planning (Camagni et al. 2002; Schwarz 2010; Magliocca et al. 2012; Lowry and Lowry 2014; Angel et al. 2021). These studies relate to the mobility advantages or disadvantages of urban densities in terms of reducing the carbon footprint (Churchman 1999; Holden and Norland 2005), efficient transportation (Burton et al. 2003), public health, efficient land use (Williams et al. 2000; Alexander and Tomalty 2002; Burton et al. 2003), economic structure (Ewing 1997; Churchman 1999), and energy (Owens 1992; Anderson et al. 1996; Yang et al. 2012). Therefore, the anatomy of population density is involved in the whole process of urbanisation. It is necessary to evaluate how population density will affect urban relations and to develop planning policies according to density trends.
Population density is more than a unit of measurement; it is an influencing factor in social, economic and environmental issues within the urban environment, such as energy supply, land consumption, resource use, and decisions on new development areas (Cohen and Gutman 2007; Ng 2010; Boyko and Cooper 2011). Moreover, it is a substantial measurement parameter for various calculations such as efficient urban energy use (Rickwood et al. 2008), energy used in transportation (Yang et al. 2012), gasoline consumption (Gillham 2002), quality of life of residents (Güneralp et al. 2017), determination of land values (Tikoudis et al. 2022), and the development of local policies on urban growth management (Rodriguez et al. 2006). When population density is higher, the population occupies a smaller footprint, which reduces the proportion of land transformed from rural to urban (Li et al. 2003). In cities with more compact development, many factors such as transportation, neighbourhood, and infrastructure costs can be used sustainably. Population density, therefore, is defined as a sustainability variable for urban areas. Relationships between urban expansion and transportation outcomes relate to density, land use mix, vehicle ownership, transportation modes, street accessibility, and centrality. Studies have shown that per capita vehicle use increases as density decreases in urban areas (Ewing et al. 2003; Yang et al. 2012). It is stated that density and vehicle use have a close relationship with each other (Gillham 2002).
Malpezzi (2013) measured the population densities of some countries (Table 1). Population density in the urban area ranges between 6 and 390 people per hectare (Malpezzi 2013). The extreme diversity in the average population density of cities is obvious, and the difference in their spatial growth varies with geography. Therefore, the population density distribution should be taken into account for planning policies, development policies, economic policies, and social policies (Ewing et al. 2003; Tikoudis et al. 2022). Additionally, it is essential to develop accurate models to predict the changes in population density.
The regulation of population density as a means of establishing and maintaining urban equality has been a standard implementation for urban planning (Dovey and Pafka 2014). This research is based on predicting density values by taking into account that macro (gross) population density trends from the past will change in the same direction because of the persistence of external factors (Pendall 1999; Fulton et al. 2001; Ewing et al. 2003; Schneider and Woodcock 2008; Bhatta 2010; Pereira et al. 2013; Silva et al. 2017; Sharifi 2019; Yılmaz and Terzi 2021) such as transportation alternatives, socio-economic development, and geographical location. In addition, upper-scale population density estimates do not support the disclosure of subscale densities such as building blocks, plot scale, and estimation from the floor area (Angel et al. 2020). Therefore, the methods used in the study have a viable structure on all urban scales. In other words, microscale (net) population density assessments are out of the scope of this research.
The aim of the study is to compare and evaluate regression tools in the context of estimating the population density difference. The study applies three statistical tools: Random Forest-Based Classification (RFC), Multiple Linear Regression (MLR) and Geographically Weighted Regression (GWR) (Sections 2.3 and 3). Various methods are available for predicting population growth through spatial and temporal data (Wu et al. 2005; Georganos et al. 2021; Credit 2022). However, recently, there has been growing interest in Ensemble Learning in machine learning. Random Forest, a non-parametric ensemble prediction model built on decision trees, is one of the quantitative models that can be used to predict population density. Based on the claim that such models yield more consistent results (Anderson et al. 2014), it is compared with the regression models used in this research. Multiple Linear Regression, a traditional non-spatial regression analysis, is among the most common analysis tools (Shen 2009; Olive 2017; Tranmer et al. 2020). Furthermore, Geographically Weighted Regression, a traditional spatial regression analysis that estimates the effects of explanatory variables on a dependent variable (Fotheringham et al. 2002), is also used. The novelty of this study lies in expanding the set of methods for urban population density estimation with a view to future density allocation decisions. The models have been used to analyse eighty-one cities in Türkiye. It is then discussed which statistical method will guide which spatial policies through the descriptive variables selected in the study. It is shown that population density estimates vary with the evaluation method, and which variables are affected.

Study area
Türkiye, as a developing country in the Middle East, has eighty-one administrative provinces of various sizes (Figure 1). The population was approximately 83 million in 2019 and the area of Türkiye is 785,347 km². The world's average population density was about 60 people per square kilometre in 2018, while that of Türkiye was 107 people per square kilometre (World Bank 2018). Urbanisation rates are constantly increasing throughout the country, and following the 1980s approximately 50% of the population were living in cities (TURKSTAT 2019).

Data collection
Increasing population density has further reduced the transformation from rural to urban areas (Li et al. 2003). The urban expansion that occurs with the emergence of this transformation brings with it the decentralisation caused by transportation, land use diversity, vehicle ownership, transportation modes, accessibility and fragmented development.
Changes in per capita vehicle use as a result of the relationships between urban expansion and transportation (Gillham 2002; Ewing et al. 2003; Yang et al. 2012) and differentiation in infrastructure costs (Burchell et al. 2005; Yamagata and Seya 2013; Guastella et al. 2019; Angel et al. 2020; Tikoudis et al. 2022) reveal the consequences of increasing or decreasing the number of built-up patches (Ewing et al. 2003). The explanatory variables of this research are based on the literature background and created in the context of macroscale population density relations. The variables used, detailed in Table 2, are the socio-economic development levels of the cities based on the economy, their location based on the topographic structure, accessibility and connectivity based on built-up patches, and infrastructure costs. Data for 1990 and 2018 are used for all analyses.
The sample of the study covers eighty-one provinces. Figure 2 illustrates the population density difference observed between 1990 and 2018 in Türkiye. The population density data for this period are used as the dependent variable in the study. The data contain no missing values. The explanatory variables of the model created to estimate population density are (i) POP_D, population density in new developed area; (ii) D_DIF, the density difference of the new developed area; (iii) CENTR, centrality; (iv) V_OWN, vehicle ownership; (v) ACCESS, accessibility; (vi) CONCT, connectivity; (vii) SOC_ECO, socio-economic development index; (viii) LOC, location; and (ix) SD, standard deviation of patches in urban area. Table 3 shows the descriptive statistics of the dependent and explanatory variables. The analyses, coloured using the Jenks natural breaks classification method, are visualised in Figure 2. When the analysis of 'population density in new developed area' is examined, the western part of Türkiye shows a largely homogeneous population density variation. The northern part and the eastern regions have the lowest and highest density changes. According to the analysis of 'the density difference of the new developed area', population densities in inhabited areas that were established from 1990 to 2018 show sprawl development or compact development. To put this into context, there are differences between the east and west of the country. While the eastern region is developing more widely, the western region has developed more compactly.
On centrality, a heterogeneous distribution is observed throughout the country. However, cluster regions are observed in the western part, in the central region and in the eastern part. In vehicle ownership, the highest vehicle ownership is seen in the southwest region and the north-central region, while the least ownership is largely in the south-east region. Accessibility values include the size of the urban area and the street connectivity effect. In this context, the spot areas with the lowest value are located in the eastern and central-northern regions of the country. Additionally, connectivity values show a heterogeneous distribution throughout the country. However, the socio-economic development index change based on the data of Dinçer et al. (2003) and The Ministry of Development (2013) has divided the country into three regions, namely the western, central and south-eastern regions. The regions where the decrease in index values is seen are the south-eastern region and the metropolitan cities of the country, Istanbul, Izmir, Ankara, Adana, and Gaziantep. The location variable classifies 28 cities as coastal and the rest as inland. The standard deviation of patches in urban areas shows the extent to which the built-up areas within the provincial boundaries differ. According to this variable, there is a distinctive difference between the southern and northern parts of the country. The northern part has smaller patches while the southern part has larger patches.
Estimation is an important part of spatial data science. ArcGIS and SPSS provide various estimation tools to support analyses. Therefore, the data of the study have been analysed using Forest-based Classification and Regression, Multiple Linear Regression, and Geographically Weighted Regression, as provided by ArcGIS Pro software and IBM SPSS Statistics software.

Random forest-based classification and regression (RFC)
Two widely used ensemble approaches for classification trees (Dietterich 2000a, 2000b) are boosting (Schapire et al. 1998) and bagging (Breiman 1996). Breiman (2001) created the random forest by adding a layer of randomness to bagging; it compares well with many other classifiers and is among the most powerful algorithms available (Mitchell 1997; Genuer). The method is a bagged classification and regression tree (CART) method. It comes with various advantages, such as the reduction in variance and the improvement of predictive accuracy, and disadvantages, such as a complex model structure and the interpretation of variable importance. Differences in the variables to be predicted arise through the layouts of the trees (for the mathematical details, see Breiman (1996, 2001), Smyth et al. (2015) and Grömping (2009)). Recently, random forests have been considered in many scientific fields thanks to their contribution to the evaluation of variable effects (Breiman 2001; Rodriguez-Galiano et al. 2015; Georganos et al. 2021; Shang et al. 2021; Talebi et al. 2022). Examples of these fields are remote sensing, geoscience, spatial cases, and the estimation of spatial data.
In the study, the RFC method is used to create an estimation model with many variables. The RFC model properties are as follows: the number of trees is 1000; the leaf size is 5; the tree depth range is 5-15; the mean tree depth is 9; the percentage of training data available per tree is 100; the number of randomly sampled variables is 3; and the percentage of training data excluded for validation is 10. The model's out-of-bag (OOB) errors are an indicator that helps validate the model. They also show the performance gained as the number of trees in the model grows. The model dataset is randomly subdivided into 90% for the training set and 10% for the test set (Smyth et al. 2015).
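To make the role of these settings concrete, the following Python sketch grows a small bagged regression forest from scratch. It is an illustration under stated assumptions, not the ArcGIS Pro implementation used in the study: the data are synthetic, the function names (`build_tree`, `fit_forest`, `oob_mse`) are hypothetical, classic bootstrap sampling is assumed, and only 50 trees are grown to keep the run short (the study used 1000). The leaf size of 5, the 3 randomly sampled variables, the maximum depth of 15 and the 10% hold-out mirror the values quoted above.

```python
import random
from statistics import mean, pvariance

def sse(values):
    """Sum of squared deviations from the mean (the split cost)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, idx, n_try, min_leaf, max_depth, rng, depth=0):
    """Grow one regression tree on the (bootstrap) row indices in idx."""
    if depth >= max_depth or len(idx) < 2 * min_leaf:
        return mean(y[i] for i in idx)                # leaf: average response
    best = None
    for f in rng.sample(range(len(X[0])), n_try):     # random subset of variables
        for t in sorted({X[i][f] for i in idx})[:-1]:  # candidate thresholds
            left = [i for i in idx if X[i][f] <= t]
            right = [i for i in idx if X[i][f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, f, t, left, right)
    if best is None:
        return mean(y[i] for i in idx)
    _, f, t, left, right = best
    args = (n_try, min_leaf, max_depth, rng, depth + 1)
    return (f, t, build_tree(X, y, left, *args), build_tree(X, y, right, *args))

def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are tuples."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(X, y, n_trees, n_try, min_leaf, max_depth, seed=0):
    """Bagging: every tree is grown on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]
        tree = build_tree(X, y, boot, n_try, min_leaf, max_depth, rng)
        forest.append((tree, set(boot)))              # remember in-bag rows
    return forest

def predict(forest, row):
    return mean(predict_tree(tree, row) for tree, _ in forest)

def oob_mse(forest, X, y):
    """Out-of-bag error: each row is predicted only by trees that never saw it."""
    errors = []
    for i, row in enumerate(X):
        preds = [predict_tree(tree, row) for tree, used in forest if i not in used]
        if preds:
            errors.append((mean(preds) - y[i]) ** 2)
    return mean(errors)

# Synthetic stand-in for 81 provinces with 9 explanatory variables (made-up values).
rng = random.Random(42)
X = [[rng.random() for _ in range(9)] for _ in range(81)]
y = [10 * row[0] + 5 * row[1] + rng.gauss(0, 0.5) for row in X]

# Hold out 10% of the rows for validation, train on the remaining 90%.
order = list(range(len(X)))
rng.shuffle(order)
n_val = round(0.10 * len(order))
val, train = order[:n_val], order[n_val:]
X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]

forest = fit_forest(X_tr, y_tr, n_trees=50, n_try=3, min_leaf=5, max_depth=15, seed=1)
oob = oob_mse(forest, X_tr, y_tr)
print("OOB mean squared error:", round(oob, 3))
print("Variance of the training response:", round(pvariance(y_tr), 3))
```

Comparing the out-of-bag error with the variance of the response gives a rough sense of how much variation the forest explains without touching the held-out rows.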

Multiple linear regression (MLR)
Multiple linear regression is among the most common analysis tools used for estimation. The regression statistically measures and estimates the effects of explanatory variables on a dependent variable. Thus, MLR includes more than one explanatory variable. The Multiple Linear Regression model is formulated as in Equation (1) (Shen 2009; Olive 2017; Tranmer et al. 2020):

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in} + \varepsilon_i \qquad (1)$$

where $i$ indexes the observed values; $y_i$ is the dependent variable; $x_{i1}, \dots, x_{in}$ are the explanatory variables; $\beta_1, \dots, \beta_n$ represent the slope coefficients of each explanatory variable; $\beta_0$ is the y-intercept, which is the constant term; and $\varepsilon_i$ is the model's error term, known as the residual. The MLR is a non-spatial model.
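As a minimal sketch of how the coefficients of such a model can be estimated, the Python code below solves the ordinary least squares normal equations by Gaussian elimination. This is an assumption about the estimation step (the study ran MLR in standard software), and the function names and the small data set are hypothetical. The response is generated without noise from known coefficients so that the fit can be checked by eye.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                        # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_mlr(X, y):
    """Return [b0, b1, ..., bn] minimising the sum of squared residuals."""
    Z = [[1.0] + list(row) for row in X]                # prepend intercept column
    k = len(Z[0])
    XtX = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    Xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    return solve(XtX, Xty)

# Made-up values for three explanatory variables (for illustration only).
X = [[1, 2, 0], [2, 1, 1], [3, 5, 2], [4, 3, 1], [5, 8, 4], [6, 4, 3]]
true_coef = [2.0, 0.5, -1.5, 3.0]                       # b0, b1, b2, b3
y = [true_coef[0] + sum(b * x for b, x in zip(true_coef[1:], row)) for row in X]

coef = fit_mlr(X, y)
print([round(b, 6) for b in coef])
```

Because the synthetic response contains no error term, the estimated coefficients should reproduce the generating values up to floating-point rounding.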

Geographically weighted regression (GWR)
Geographically Weighted Regression (GWR) is one of the spatial methods and uses a linear regression (Fotheringham et al. 2002). The GWR model captures the spatial relationships between the statistically significant variables. It generates statistical relationships between dependent and explanatory variables within the spatial bandwidth, which is one of the significant local parameters in the GWR model. It uses the Variance Inflation Factor (VIF) to test collinearity among independent variables. The GWR as a statistical model has a set of observations $\{x_{ij}\}$ for $i = 1, \dots, n$ cases and $j = 1, \dots, k$ explanatory variables, a set of dependent variables $\{y_i\}$, and a set of location coordinates $\{(u_i, v_i)\}$ for each case. The Geographically Weighted Regression model is formulated as in Equation (2) (Fotheringham et al. 2002; Brunsdon et al. 2010):

$$y_i = \beta_0(u_i, v_i) + \sum_{j=1}^{k} \beta_j(u_i, v_i)\, x_{ij} + \varepsilon_i \qquad (2)$$

where $\beta$ (coefficients) is computed by the regression tool, reflecting the relationship and the strength of each explanatory variable to the dependent variable; $\varepsilon$ (residuals, residual errors) is the part of the dependent variable that is not explained by the method.
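The local nature of the GWR coefficients can be sketched for the simplest case of a single explanatory variable, where the weighted least squares fit at each location has a closed form. The Python code below is illustrative only: the Gaussian kernel, the fixed bandwidth of 1.5, the grid of coordinates and the function name `gwr_local_fit` are assumptions, not the settings of the study.

```python
import math

def gwr_local_fit(coords, x, y, bandwidth):
    """Fit y = b0(u, v) + b1(u, v) * x at every observation location."""
    results = []
    for (u0, v0) in coords:
        # Gaussian distance-decay weights around the regression point
        w = [math.exp(-0.5 * (math.hypot(u - u0, v - v0) / bandwidth) ** 2)
             for (u, v) in coords]
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        b1 = sxy / sxx                                     # local slope
        results.append((ybar - b1 * xbar, b1))             # (local intercept, slope)
    return results

# Synthetic 6 x 6 grid of locations; the true slope grows from west to east.
coords = [(float(i), float(j)) for i in range(6) for j in range(6)]
x = [((2 * i + 3 * j) % 7) + 1.0 for i in range(6) for j in range(6)]
y = [1.0 + (1.0 + 0.2 * u) * xi for (u, _), xi in zip(coords, x)]

local = gwr_local_fit(coords, x, y, bandwidth=1.5)
print("south-west corner (b0, b1):", tuple(round(b, 3) for b in local[0]))
print("north-east corner (b0, b1):", tuple(round(b, 3) for b in local[-1]))
```

A smaller bandwidth makes each local fit depend on fewer neighbours, while a very large bandwidth makes every local fit approach the single global regression.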

Results
This section provides the results of the RFC, the MLR, and the GWR models.

Random forest-based classification and regression (RFC) estimation model
In the RFC model, fitted with 1000 trees, the percentage of variation explained is about 42% of the variability in population density. The density difference of the new developed area has the highest importance at 33%, which means that it is the most useful variable for estimating the population density in the province. The importance of the other variables is centrality at 16%, connectivity at 14%, population density in new developed area at 13%, number of vehicles per thousand people at 7%, socio-economic development index at 5%, location at 4%, accessibility at 4%, and the standard deviation of patches in urban area at 4%.
The R² value measures how well the model performs. The validation R² value is a better indicator of model performance than the training R². In the estimation model of this multivariate study estimating population density, the R² value in the training data (regression diagnostics) is 0.938, with an accuracy of approximately 94%. It is 0.802, with approximately 80% accuracy, in the validation data (regression diagnostics). The explanatory variable range diagnostics of training and validation are reported for the Forest-based Classification and Regression model used to predict new population density. Values are reported based on 1000 trees within each forest (Table 4). The estimated explanatory variable range diagnostics list the range of values covered by each explanatory variable in the datasets used to train and validate the model. For example, the number of vehicles per thousand people in the dataset used to train the model ranges from −0.41 to 158.88, and in the dataset used to validate the model, from 10.06 to 148.84% of the population density in the new developed area values used to train the model are used to validate the model. To minimise extrapolation, the validation values in Table 4 have been revised as density estimates.
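One simple reading of a range diagnostic is the share of the training range that the validation data also cover. The Python sketch below computes that share for the vehicle ownership ranges quoted above. The function name `percent_overlap` and this particular definition are assumptions; the software's exact definition may differ, so the result need not match the shares reported in Table 4.

```python
def percent_overlap(reference, other):
    """Share (in %) of the reference range [lo, hi] that is also covered by other."""
    lo = max(reference[0], other[0])
    hi = min(reference[1], other[1])
    width = reference[1] - reference[0]
    return 100.0 * max(0.0, hi - lo) / width

# Ranges of vehicles per thousand people, as quoted in the text.
training_range = (-0.41, 158.88)
validation_range = (10.06, 148.84)

share = percent_overlap(training_range, validation_range)
print(f"{share:.1f}% of the training range is covered by the validation data")
```

A low share would indicate that the model is validated on a narrow slice of the conditions it was trained on, which is one reason such diagnostics are inspected before trusting the validation R².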
In terms of the estimated population density changes, the values of some provinces differ. A partial increase is observed in the estimated results of the observed density values. The estimated difference in population density is less than that observed in population density. However, while the urban area expands at the same rate and the population in the province increases at the same rate, the difference in population density is in the same range of values. Results show that for the population density difference, excluding Istanbul, Bartın and Şırnak provinces, the estimated and observed values are almost equal in the provinces with values of 25 and below (Figure 3).

Multiple linear regression (MLR) estimation model
Table 5 shows that the most important explanatory variables in the MLR model are centrality, vehicle ownership, and accessibility. The explanatory variables that are not statistically significant, indicated by a p-value greater than 0.05, are connectivity, socio-economic development index, location, and standard deviation of patches in urban area. In addition, VIF is used to test collinearity among the explanatory variables. The variables of population density in new developed area and the density difference of the new developed area have high VIF values, which are 15.22 and 16.24, respectively. Since these VIF values are greater than about 10 (indicating redundancy among the explanatory variables), these two variables are not suitable for the MLR model. The results of the MLR model are visualised in Figure 3.
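The VIF of variable j equals 1/(1 − R_j²), where R_j² comes from regressing that variable on all the others; equivalently, it is the j-th diagonal element of the inverse correlation matrix. The Python sketch below uses the second form. The three columns are made-up numbers (two nearly collinear, one loosely related), and the function names are hypothetical; it is not a reproduction of the values in Table 5.

```python
from statistics import mean

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    centred = []
    for col in columns:
        m = mean(col)
        centred.append([v - m for v in col])
    norms = [sum(v * v for v in c) ** 0.5 for c in centred]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centred[i], centred[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]

def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def vif(columns):
    """VIF_j is the j-th diagonal element of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Made-up columns: col_b is almost a copy of col_a, col_c is only loosely related.
col_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
col_b = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]
col_c = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

vifs = vif([col_a, col_b, col_c])
for name, value in zip(["col_a", "col_b", "col_c"], vifs):
    flag = "redundant (VIF > 10)" if value > 10 else "acceptable"
    print(f"{name}: VIF = {value:.2f} -> {flag}")
```

In this toy data the two near-duplicate columns are flagged together, which mirrors the way a pair of closely related variables can both exceed the threshold.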

Geographically weighted regression (GWR) estimation model
The GWR model yields an R² value of 0.88 and an adjusted R² value of 0.86. The variables of population density in new developed area and the density difference of the new developed area indicate redundancy among the explanatory variables, with VIF values above 7.5. Table 6 shows the results of the GWR model. The results are similar to those of the MLR model. The results of the GWR model are illustrated in Figure 3.
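For orientation, the relationship between R² and adjusted R² can be sketched with the textbook formula for a global regression. The Python code below is an assumption-laden illustration: GWR software normally bases the adjustment on an effective number of parameters, which the text does not report, so the count of 9 explanatory variables and 81 provinces used here is only a stand-in, and the function names are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    m = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - m) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n_obs, n_params):
    """Textbook adjustment for a global model with n_params explanatory variables."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_params - 1)

# Stand-in values: R2 = 0.88 from the text, 81 provinces, 9 explanatory variables.
adj = adjusted_r2(0.88, n_obs=81, n_params=9)
print(f"adjusted R2 under these assumptions: {adj:.3f}")
```

The adjustment penalises additional parameters, so the adjusted value is always at or below the unadjusted R² when at least one explanatory variable is present.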

Discussion
In this study, quantitative estimation methods are compared using variables selected for their importance to population density change in future spatial planning decisions. Figure 3 illustrates a spatial comparison of the results of all three models. The results of the study reveal that, compared to the observed population density values, the estimates of the MLR and GWR models are higher than those of the RFC model. Results of both models show regions where provinces with estimated density increases are clustered spatially (Figure 4).
Current population density and the density in new developed area are influenced by the population density policy (Pendall 1999; Fulton et al. 2001). As population density increases, land consumption per person decreases (Pendall 1999; Ewing et al. 2003; Robinson et al. 2005; Yang et al. 2012; Faour 2015; Tikoudis et al. 2022). The results of this research support the idea that land consumption can be decreased if future 'density allocation' (Tikoudis et al. 2022) decisions are adapted to MLR results.
The important variables in the RFC model are the density difference of the new developed area, centrality, and connectivity. Variables with significant relationships in the MLR and GWR models are centrality, vehicle ownership, and accessibility. Ewing et al. (2003) state that the relationships between urban sprawl and transportation outcomes relate to density, land use mix, vehicle ownership, modes of transport, street accessibility, and centrality. This statement supports the results of this study. Although it is difficult to implement policies on the selection of vehicle ownership and modes of transport, the implementation of spatial policies on centrality, accessibility and population density allocation in new development areas may be among the possible decisions. The centrality variable is common to RFC, MLR, and GWR. The centrality identified in the study means that when the density increases, the distances between the settlements may decrease. The centrality variable is also supported by Pereira et al. (2013). Ewing et al. (2003) note that vehicle ownership can be explained largely by population density; however, in this study, it is found that while vehicle ownership is significant in the MLR model, it is less important for the RFC model. The variables are seen as important for explaining population density.
Many urban density studies have been carried out in the planning literature, but few sources were found that estimate population density or urban density using new statistical tools and methods such as machine learning, deep learning, and spatial statistics (Wu et al. 2005; Georganos et al. 2021; Credit 2022), which can provide comparative forecast analyses with different methods. There is no single way to measure and estimate population density (Angel et al. 2020); thus, the estimation methods used in this study are not the only way to measure it.
In 2018, Türkiye's population density was above the world average of 60 people per square kilometre, with a value of 107 people per square kilometre (World Bank 2018). According to the results of this study, from 1990 to 2018, Türkiye's average gross population density decreased by about 180 people per hectare, which means that a non-compact form of growth has been seen. This urban growth is one of the higher land consumption growth types (Pendall 1999; Robinson et al. 2005; Ng 2010; Faour 2015; Tikoudis et al. 2022). As Tikoudis et al. (2022) highlight, the strong correlation between density and urban growth plays a decisive role in the population density estimation comparisons and density allocations discussed in this research.
The density policy, therefore, refers to a more compact or a lower density development compared to the built-up area. For centrality, accessibility, and connectivity, the urban area is shaped by the road network (Bhatta et al. 2010); therefore, vehicle ownership, road density in urban area, and a significant increment in travel by vehicles trigger spatial planning policies for urban expansion (Ewing et al. 2003; Pereira et al. 2013). Infrastructure investment or ownership is affected by socio-economic opportunities, which can change geographic locations (Yılmaz and Terzi 2021).
Population density is important spatial information for addressing the use of and access to land resources in cities under the Sustainable Development Goals (Ehrlich et al. 2018). However, according to Tikoudis et al. (2022), urban population density estimation is among the important guides in making future 'density allocation' decisions on the macroscale.

Conclusions
This study examines RFC, MLR and GWR models and population density estimates both spatially and quantitatively, in which dependent and explanatory variables are included. Eighty-one city samples in Türkiye were examined to better understand how these models perform in prediction and whether the inclusion of spatial variables increases model performance. In the findings, it is observed that the confirmation rate increases as the number of trees increases for RFC, and that different explanatory variables are needed for the MLR model. The two most important descriptive variables in the RFC model are the density difference of the new developed area and connectivity. On the other hand, the three main explanatory variables of MLR are centrality, vehicle ownership, and accessibility. While the RFC model is more consistent in density values in the new development areas, the importance of accessibility and connectivity is seen as more consistent in the MLR model.
Estimating population density and whether the urban area will be expanded or become more compact will guide the production of prevention strategies for future spatial planning policies. Thus, it can be used to estimate how much land will be consumed in the future. Therefore, in spatial planning policies, it is essential that decision-makers and policymakers obtain population estimates through various methods.
Ultimately, population density is a hierarchical and systematic structure. However, it is a useful tool to guide land use policy where the density value will be established from a holistic perspective. The introduction of new models to understand and predict density shows the importance of alternative methods. The models therefore support a new approach to explaining the impact of population density on the planning process. Success in estimating population density will be possible by measuring population density with more parameters.

Disclosure statement
The author declares no conflicts of interest.

Figure 1. Location map of Türkiye (prepared by the author).

Figure 2. Descriptive analyses (prepared by the author based on the explanations in Table 2).

* Percent of overlap between the ranges of the training data and the input explanatory variable.
** Percent of overlap between the ranges of the validation data and the training data.

Figure 3. Population density difference observed between 1990 and 2018 (upper-left); estimated difference of population density in the RFC model (upper-right), the MLR model (bottom-left), and the GWR model (bottom-right) (people per square kilometre) (prepared by the author).

Figure 4. Comparison of observed values and the performances of the RFC model, the MLR model, and the GWR model based on the difference in population density (people per square kilometre) (prepared by the author).

Table 1. Population density in some countries.

Table 2. Description of the variables used in the study.

Table 3. Descriptive statistics of observed values of the dependent and the explanatory variables.

Table 4. Range diagnostics of estimated explanatory variables.

Table 5. Statistics of explanatory variables based on the MLR model.

Table 6. Statistics of explanatory variables based on the GWR model.