A hybrid model for high spatial and temporal resolution population distribution prediction

ABSTRACT Accurate prediction of population distribution is crucial for numerous applications, from urban planning to epidemiological modelling. Using one week of data collected from multiple open sources, including telecommunication activity, weather, points of interest, buildings, roads, and land use in Milan, Italy, we develop a hybrid method combining cellular automata (CA) and long short-term memory (LSTM) to predict population distribution at fine temporal and spatial granularity. Specifically, a convolutional autoencoder and LightGBM are applied to identify missing building types based on the pedestrian shed. The LSTM learns the transition rules of the CA, and Shapley additive explanations (SHAP) values are used for variable importance analysis. Results demonstrate that the combination of convolutional autoencoder and LightGBM is effective for building type prediction. The proposed model for population distribution prediction outperforms LSTM, the combination of CA and a neural network, and the combination of CA and LightGBM by at least 5-10%. A variable importance analysis reveals that temporal variables are the most significant for prediction, followed by spatial and natural variables. The hour of the day, housing-related variables, and type of precipitation are the most important variables in their respective categories.


Introduction
From 2010 to 2020, the world population increased from 7 to 7.8 billion (Gu, Andreev, and Dupre 2021). During this process, advancements in traffic systems have led to increased human mobility, which strains urban resources. Urban infrastructure development, such as urban planning and demand response, typically requires an accurate estimation of population distribution at a fine scale (Lu et al. 2010). For example, information and communications technology (ICT) infrastructure has evolved rapidly in the past decade to provide more diversified and customised services to people in urban and rural areas. Furthermore, numerous applications, including network diversity and economic development (Eagle, Macy, and Claxton 2010), micro-scale urban structure analysis (Louail et al. 2014), land use organisation modelling (Lenormand et al. 2015), and prompt public health policy-making, require high-resolution population distribution data. Therefore, a fine-scale estimation of population distribution is needed; technically, however, predicting population distribution at small spatial and temporal granularity is challenging.
Advances in communication and positioning technologies make it possible to generate massive mobile signalling data. There are 128 mobile cellular subscriptions per 100 people in Italy, which makes mobile signalling highly representative for high spatial and temporal resolution population distribution. Mobile signalling data have been used to study people's mobility patterns at a large spatial scale (González, Hidalgo, and Barabási 2008; Alessandretti, Aslak, and Lehmann 2020). In addition, several population data products, such as WorldPop developed by the University of Southampton, UK, have been built on mobile signalling data.
Another challenge is that the distribution of people is affected by various factors, such as building distribution (Zhao et al. 2020), roads (Cooper et al. 2018), weather conditions (Huang et al. 2015), and time (Eon et al. 2018). These factors can be divided into spatial and temporal factors. However, most studies have focused on only a subset of buildings, roads, time, and weather; a comprehensive analysis of each variable's contribution to prediction performance has not been thoroughly discussed.
Model-based methods are widely used to solve natural evolution problems and show satisfactory performance. However, acquiring the simulation rules of these models is difficult. With the development of computational intelligence, data-driven methods have been widely used to capture nonlinear relations from real-world data (Bzdok, Altman, and Krzywinski 2018).
The aim of this study is to take advantage of both model-based and machine learning methods to improve prediction performance and quantify the contribution of each influencing factor. The contributions can be summarised as follows: (1) A framework combining CA and LSTM was developed to improve the prediction performance of population distribution at high spatial and temporal resolution.
(2) The geometry and spatial characteristics of buildings were considered for the first time to predict building type.
(3) The contribution of various influencing factors on population distribution prediction was quantified.
The remainder of this paper is organised as follows. Section 2 reviews research on the population distribution prediction and the evaluation of related factors. Section 3 describes the study framework, data collection and pre-processing, and the proposed method. Section 4 provides the data analysis results, building type prediction, population distribution prediction, and variable importance analysis. Finally, the conclusions and directions for future work are presented in Section 5.

Literature review
This section discusses the factors influencing population distribution and compares state-of-the-art techniques for population distribution prediction, which are classified into two groups: model-based or data-driven methods. The current practices and challenges of hybrid prediction techniques are then further discussed.

Influencing factors of population distribution
Population distribution is highly dynamic and challenging to predict because of the complex nature of people's movement (Vu, Do, and Nahrstedt 2011). The literature has explored factors influencing people's movement, most of which can be divided into three categories: built environment, natural environment, and time-related factors.

Built environment
A built environment is a human-made environment that surrounds humans and provides a setting that shapes human activity (Mahmoud 2018). Zhao et al. (2020) analysed the relationship between city centre activity and the built environment and showed that building type and density were strongly correlated with human activity.
Notably, point of interest (POI) is an essential spatial feature reflecting the built environment. POIs consist of rich information about places on a map, such as the location, name of urban services, and contact information. Regions with different POIs have different populations. For example, people gather in regions with companies during working hours; by contrast, at the same time, few people remain in regions with entertainment facilities. Yuan, Zheng, and Xie (2012) analysed the relationship between human mobility, POIs, and functional regions, and different human mobility patterns were identified in regions with different POI combinations. Wang, Zhou, and Zhao (2015) applied a modular neural network to predict pedestrian flow in Beijing transportation hub areas by considering the spatial distribution of land use types and building facilities.
However, the information provided by POIs is less rich than that from buildings: POIs represent spatial features only as geographical points, without geometrical information such as area, which significantly influences people's movement in a city. For example, a large shopping mall attracts more people than a food court. Considering that most human activities occur inside buildings, Huang et al. (2021) disaggregated United States census records to generate a 100 m population grid and analysed the relationship between population and building information. The results showed that building count and building size were the most influential factors of population distribution; however, building type and its geometry were not included in the analysis, and the temporal granularity was not fine enough for analysing highly dynamic population distribution.

Natural environment
The natural environment represents the natural characteristics of the surrounding districts, in which weather conditions and topography have a significant impact on human mobility. Huang et al. (2015) demonstrated this phenomenon based on the daily pedestrian flow in Tiananmen Square in China, showing the impact of natural environment on human mobility.
Topography is another essential environmental factor. Prior work analysed nighttime light data, net primary vegetation productivity, and average slope data to improve the fitting performance of population distribution at the city level, demonstrating the importance of the ecological environment and topographic factors in population estimation. Eon et al. (2018) studied water-related factors such as distance to rivers and lakes and found that water accessibility had a substantial effect on population distribution.

Temporal factors
Time plays a dominant role in shaping human behaviour. This phenomenon was systematically discussed by Eon et al. (2018), who summarised its impetus as a system of practice. Cooper et al. (2018) studied the town centre street layout to predict the change in pedestrian flow at various times during one day in Cardiff, United Kingdom. Duives, Wang, and Kim (2019) proposed a cell-sequence derivation method to divide continuous GPS data into discrete cells. A gated recurrent unit, an LSTM variant, was applied to this dataset to predict the movement of crowd flows, demonstrating that the prediction accuracy of population distribution could be further improved by considering the temporal pattern derived from historical data.

Population distribution prediction
The methods used in studies on population distribution prediction can be divided into three categories: conventional methods, model-based methods, and machine learning (ML) methods. A comparison of their advantages and disadvantages is provided in this section.

Conventional methods
Data resources used in conventional methods are commonly derived from surveys. Methods used to analyse these data are traditional statistical methods, such as linear regression (LR). Lutz, Sanderson, and Scherbov (1997) proposed a probabilistic approach to predict population distribution in 13 regions worldwide by considering expert opinions on the trends of fertility, mortality, and migration. Zakria and Muhammad (2009) applied the autoregressive integrated moving average model (ARIMA), which combined difference, moving average, and autoregressive analysis, to improve the prediction of population density in Pakistan.
However, the data used in these studies have several limitations, including large spatiotemporal granularity and limited study scale. Additionally, the conventional methods used in the analysis lead to poor prediction performance in some circumstances because of the limited capability to process big data. Big data has become widely and readily available; thus, rich information on human mobility can be mined from big data resources, including anonymised mobile and social media data.

Model-based methods
Model-based population estimation methods build models from an assumed rule of human behaviour and prior knowledge to simulate human movement; they are less dependent on historical data than conventional and ML methods.
Low accuracy and high computation cost are the top two limitations of citywide population distribution prediction. Akagi et al. (2018) proposed a model based on the collective graphical model (CGM) to overcome these difficulties. The main idea was to model transition rules while considering departure probability, gathering score, and geographical distance. Previous studies also considered the gravity model: Gu et al. (2019) combined eigenvector spatial filtering (ESF) with a negative binomial gravity model using data derived from censuses and population sampling surveys to analyse interprovincial migration flows. However, the gravity model is less effective for high-resolution prediction because it cannot easily capture complicated relationships and lacks an evolutionary process. CA is another effective method for simulating human behaviour. CA views each individual or region as a cell and applies a transition rule to change the state of each cell by considering the influence of neighbouring cells. Compared with other methods, CA has an evolutionary mechanism, which makes it suitable for simulating natural evolution. Lu et al. (2017) incorporated characteristics of pedestrian group behaviours into a CA model to model crowd evacuation.
Model-based methods are less dependent on historical data and have strong interpretability. However, these methods require extensive prior knowledge, which is difficult to acquire. In addition, identifying suitable simulation rules is difficult.

ML methods
The development of computational intelligence encourages researchers to apply ML to different domains (Bock et al. 2013). ML methods perform well in modelling nonlinear relationships and are thus popular in predicting population distribution. Robinson, Hohman, and Dilkina (2017) used convolutional neural networks (CNN) to predict the population of 1 km² cells in the United States and compared the result with census data at the county level to validate the proposed model. Mourad et al. (2019) proposed an attention spatiotemporal inception network (ASTIR) to predict crowd flow, which improved prediction accuracy by further considering short-term, long-term, and periodical dependencies. Recently, neural networks have been built with more neurones, layers, and functions than before. Liu, Piao, et al. (2021) proposed AttConvLSTM, in which a convolutional LSTM (ConvLSTM) acquires temporal information while maintaining spatial information, and an attention mechanism focuses on the change in crowd flow in important regions. Chen, Wan, et al. (2019) attempted to predict people's real-time movement by applying random forest to data including cellular data, POI, time, and the traffic network.
However, ML methods are not directly interpretable and offer little capability for extrapolation beyond observed conditions, because they lack rule-based reasoning.

Integration of model-based method and ML
Considering the limitations of model-based and ML methods, attempts have been made to combine them to offset their respective weaknesses. Reichstein et al. (2019) reviewed the application of ML in Earth system science and identified hybrid models coupling physical process models with data-driven ML models as a promising research direction.
Because CA can simulate the law of natural evolution and the outcomes can be directly explained by the mathematical model, CA has been integrated with ML in various fields (Itami 1994). Because of its capability in spatial simulation, CA has been widely used to model the spatial complexity and self-organising characteristics of dynamic processes (Okwuashi and Ndehedehe 2021). Recently, recurrent neural networks have been combined with CA to capture temporal information. Liu, Xiao, et al. (2021) combined LSTM with CA to simulate urban expansion, which overcame the overfitting, gradient explosion, and vanishing problems caused by the long-term dependence of time series data. Devi and Mohan (2020) proposed a long short-term memory CA (LSTMCA) classifier to predict changes in stock value. This model applied CA to compute the similarity between word embeddings, using an LSTM as a classifier.
Although the combination of CA with ML has been applied in various fields, it is seldom used in population distribution prediction at fine spatiotemporal resolution, and three limitations have been identified: (1) Unlike land use change, population distribution is highly dynamic and is influenced by many observable and hidden factors.
(2) Population distribution has a high level of spatial heterogeneity.
(3) Data, such as building type and detailed building geometry, which have essential roles in population distribution modelling, are neither easily nor widely available.
A more effective model needs to be developed to improve the precision of population distribution prediction by overcoming these limitations, thereby improving the understanding of citywide population distribution at a fine spatiotemporal resolution. Table 1 compares conventional, model-based, ML, and hybrid methods.

Methodology
As shown in Figure 1, the study comprises four parts: (1) data collection and pre-processing, (2) building type prediction, (3) population distribution prediction, and (4) performance and importance analysis. In the first part, eight types of data (building, POI, land use, spatial grid, road, weather, mobile phone signalling, and time) were collected from four data sources. Next, building information was enriched by combining it with POIs within a comfortable walking distance; land use, road network, and mobile phone signalling data were allocated to grid cells; and weather data were reorganised and aligned with mobile phone signalling data according to their temporal resolution. Finally, data analysis was applied to explore the correlation between mobile phone signalling and the other data. Because building type is not available for all buildings, it was predicted based on each building's typology and its surrounding POIs. A convolutional autoencoder (CAE) was used to estimate a one-dimensional representation of the building outline without losing excessive information. This representation was then combined with POIs within a comfortable walking distance from the building as the input for LightGBM to predict unknown building types. The third part introduces a hybrid method integrating CA and LSTM to predict population distribution (indicated by mobile phone signalling) based on building, land use, road network, and weather conditions. Finally, the model's performance was analysed and compared with that of other ML models, and a feature importance analysis was conducted to highlight the contribution of each feature.

Data collection and pre-processing
In Section 2.1, several influencing factors were summarised, but no study comprehensively compares the influence of all factors in population distribution prediction. The study area is Milan, the second-largest city in Italy. Milan covers 181.67 km² and had more than 1.37 million residents in 2019. The Italian government has been devoted to the development of open data; therefore, the data used in this study were collected from four publicly available sources (Table 2). POI, building, and road data were collected from OpenStreetMap (https://www.openstreetmap.org), an open-source, editable map service that provides detailed geographic data. POI data consist of location and type; building data comprise location, outline, and building type; and the road network consists of shape and type. As shown in Figure 2, the road network in Milan resembles a spider web. Most footways are in the city centre, and track roads are distributed in the periphery of the city. Land use data for 2012 were collected from Copernicus (https://land.copernicus.eu), the European Union's Earth observation programme, which provides data on land, the marine environment, climate change, emergency management, and security. The land use data for Milan show that residential areas, leisure, and public services are mainly located in the city centre, and agricultural and industrial areas are located in suburban areas (Figure 3). The European Centre for Medium-Range Weather Forecasts (ECMWF, https://www.ecmwf.int) records historical weather information worldwide each hour. The weather information comprises temperature, dew point, humidity, wind direction, wind speed, pressure, and type of precipitation. Mobile phone signalling data were collected from the mobile phone usage activity dataset issued by the Telecom Italia Big Data Challenge 2014. It records all received and sent short messages, incoming and outgoing calls, and internet usage activity every hour in Milan.
Mobile signalling is sampled data, but its coverage ratio is large according to the World Bank. The data have been widely used in human behaviour analysis (Barlacchi et al. 2015; Chin et al. 2019; Liu, Long, et al. 2021). In this study, mobile signalling was used as a proxy for population distribution. Weather and mobile phone signalling data were collected from 2013-11-01 to 2013-11-07, and the temporal granularity was processed to one hour. The data publisher divided Milan and its neighbouring areas into 5680 square grid cells, each of size 235 m × 235 m. Figure 4 shows the total population distribution for seven days in Milan, indicating that most people are located in the city centre and surrounding towns. Figure 5 depicts the data pre-processing steps. For building data, the percentage of each building type occupied in every grid cell was calculated, and entropy was introduced to represent the diversity of building types in each cell. High entropy reflects a diversified building type mix in a cell and indicates a high level of neighbourhood activity. For land use, the area of each land use type was subdivided into each grid cell; more land use types in a grid cell reflect a higher diversity of people's behaviour patterns. The road network is another crucial factor, and its connectivity is derived to reflect the possible mobility with neighbouring cells; it is defined as the number of roads connected to the neighbouring grid cells. Each road density was also calculated for each cell. Other data used to predict population distribution include the day of the week, holidays, and the hour of the day.
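The building-type entropy used above can be sketched as the Shannon entropy of the type shares in one grid cell; the function name and the natural-log base here are our own illustrative choices, not taken from the paper:

```python
import math
from collections import Counter

def building_type_entropy(building_types):
    """Shannon entropy of the building-type mix in one grid cell.

    building_types: list of type labels for buildings in the cell.
    Returns 0.0 for an empty or single-type cell.
    """
    counts = Counter(building_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log(p)  # -sum p * ln(p)
    return entropy

# A mixed-use cell scores higher than a single-use cell.
mixed = building_type_entropy(["residential", "retail", "office", "retail"])
uniform = building_type_entropy(["residential"] * 4)
print(mixed > uniform)  # True
```

The higher the entropy, the more diversified the building mix in the cell, matching the interpretation given in the text.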

Building type prediction
As discussed in Section 2, buildings play a crucial role in shaping human behaviour. In this section, the unknown building types in the dataset (nearly 30% of buildings) were predicted. Building type prediction involves building features and the associated surrounding environment. Using building images alone may lead to low prediction accuracy (Hoffmann, Abdulahhad, and Zhu 2022); hence, building outline and neighbouring POI data were combined for the prediction. The process can be divided into two parts: building outline embedding and building type prediction from surrounding POIs.

Building outline embedding
In this study, building outlines were selected as features for predicting building types, and a CAE was used to acquire a low-dimensional representation of building outlines. The CAE is designed to extract image features and transform them into embeddings that retain nearly all image information and can be used as input for ML methods. It uses convolution operations to preserve the positional relationships between image pixels, rendering it suitable for building outline extraction (Zhang 2018). The CAE combines CNNs and autoencoders; it consists of convolutional, max-pooling, and fully connected layers. It uses fewer parameters to capture the key characteristics of an image and consistently outperforms the backpropagation neural network. Autoencoders are unsupervised ML methods that map the input to a latent representation by applying the function h = s(Wx + b), where s is an activation function, W is the weight matrix, and b is the bias. Autoencoders reconstruct the input x as x′ and attempt to reduce the difference between them.
CAE is widely used because of its strong ability to process images (Guo et al. 2017; Mirjalili et al. 2018; Suganuma, Ozay, and Okatani 2018). A CAE consists of two parts, an encoder f_E and a decoder f_D, and it aims to minimise the mean squared error between its input and output:

$$L = \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - f_D\big( f_E(x_i; W_E, b_E); W_D, b_D \big) \right\|^2 \quad (1)$$

where x_i is the i-th input, W_E and W_D are the parameters of the encoder and decoder, respectively, and b_E and b_D are the biases of the encoder and decoder, respectively.
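The encode-decode-reconstruct loop of Equation (1) can be illustrated with a toy, fully connected autoencoder in NumPy. The paper's CAE is convolutional; the dense layers, sigmoid activation, and dimensions below are simplifying assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 16-pixel "outline" vector compressed to a 4-dim embedding.
d_in, d_hid = 16, 4
W_E, b_E = rng.normal(size=(d_hid, d_in)) * 0.1, np.zeros(d_hid)
W_D, b_D = rng.normal(size=(d_in, d_hid)) * 0.1, np.zeros(d_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    # h = s(W_E x + b_E): the latent embedding of the outline
    return sigmoid(W_E @ x + b_E)

def decode(h):
    # x' = s(W_D h + b_D): the reconstruction
    return sigmoid(W_D @ h + b_D)

X = rng.random((8, d_in))  # 8 toy "outline" vectors
recon = np.array([decode(encode(x)) for x in X])
# Equation (1): mean squared reconstruction error over the samples.
loss = np.mean(np.sum((X - recon) ** 2, axis=1))
print(round(float(loss), 4))
```

Training would adjust W_E, b_E, W_D, and b_D by gradient descent on this loss; only the forward pass and the loss of Equation (1) are shown here.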

Building type prediction
The building type prediction process is illustrated in Figure 6. POI is an important feature indicating the type of built environment, and the influence of POIs on building type only takes effect within a certain distance of each building. The pedestrian shed, commonly set at 400 m, is the distance people prefer to walk before driving or choosing public transport, and it has been widely accepted as a comfortable walking distance (Weinstein Agrawal, Schlossberg, and Irvin 2008; Wolch, Wilson, and Fehrenbach 2005). In this study, POIs within a comfortable walking distance, building outline embeddings, location, and floor area are the inputs to LightGBM.
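As a sketch of the pedestrian-shed filtering step, the following hypothetical helper counts POIs of each type within 400 m of a building, assuming coordinates in a projected metric CRS (e.g. UTM) so that Euclidean distance approximates ground distance:

```python
import math

PEDESTRIAN_SHED_M = 400.0  # comfortable walking distance from the text

def pois_within_shed(building_xy, pois, radius=PEDESTRIAN_SHED_M):
    """Count POIs of each type within `radius` metres of a building.

    pois: list of (x, y, poi_type) tuples in a metric CRS.
    Returns a dict mapping POI type to count.
    """
    bx, by = building_xy
    counts = {}
    for x, y, poi_type in pois:
        if math.hypot(x - bx, y - by) <= radius:
            counts[poi_type] = counts.get(poi_type, 0) + 1
    return counts

pois = [(100, 0, "cafe"), (350, 100, "school"), (600, 0, "cafe")]
print(pois_within_shed((0, 0), pois))  # {'cafe': 1, 'school': 1}
```

In the paper's pipeline, such per-type counts would be concatenated with the outline embedding, location, and floor area to form the LightGBM feature vector.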
LightGBM is a gradient-boosting decision tree algorithm suitable for classification problems. It uses an improved histogram algorithm to divide continuous values into k intervals, from which split points are selected. In addition, LightGBM uses a leaf-wise growth strategy to improve accuracy while limiting growth depth. Gradient-based one-side sampling, introduced in LightGBM, considers only data with large gradients to reduce computation cost. Finally, the method uses exclusive feature bundling to bind many mutually exclusive features into a single feature to achieve dimension reduction. Thus, unlike other gradient-boosting decision tree methods, LightGBM accelerates the training process without losing accuracy and has been widely used in classification and regression problems. Several studies have demonstrated the superiority of LightGBM over other tree-based models such as random forest and XGBoost (Huang 2021; Ke et al. 2017; Wang, Zhang, and Zhao 2017).
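The histogram idea, dividing continuous feature values into k intervals so that split points are searched only at the k−1 bin boundaries instead of at every distinct value, can be sketched as follows. Equal-frequency binning is one plausible choice for illustration; LightGBM's actual binning differs in detail:

```python
import numpy as np

def histogram_bins(values, k=4):
    """Assign continuous feature values to k roughly equal-frequency bins.

    Returns (bin indices, bin boundaries). A split search then only needs
    to consider the k-1 boundaries rather than every distinct value.
    """
    # Interior quantiles give roughly equal-population bins.
    edges = np.quantile(values, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(edges, values, side="right"), edges

values = np.array([0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 3.0, 4.0])
bins, edges = histogram_bins(values, k=4)
print(bins)  # [0 0 1 1 2 2 3 3]
```

With 8 distinct values reduced to 4 bins, the split search evaluates 3 candidate thresholds instead of 7, which is the source of LightGBM's histogram speed-up.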

Population distribution prediction
This section introduces an integrated method combining CA and LSTM to predict population distribution. Figure 4 shows that population distribution has aggregation effects: densely populated areas are surrounded by other densely populated areas, which makes CA, designed to model the effect of a cell's neighbouring areas, well suited to the problem. Population distribution is a typical time series, in which data at the current time step are highly correlated with data at past time steps. Because of its specific design, LSTM can 'memorise' past data to support the prediction of current data. Thus, LSTM was chosen as a significant component of the hybrid model.
CA is one of the most widely used spatially explicit models to accurately capture spatial characteristics and simulate the law of natural evolution (Itami 1994). CA consists of four elements: (1) grid cells; (2) the neighbourhood, the neighbouring cells that affect the change in the target cell; (3) the state of a cell, which indicates population distribution in this study; and (4) transition rules, which determine how neighbourhoods influence the state of the target cell. All cells change their states according to the rules at each time step. The new states are assigned to the grid cells, and the procedure is repeated until the termination conditions are satisfied (Clarke 2014). Two elements need to be determined: the neighbourhood and the transition rules. The most widely used neighbourhood is the Moore neighbourhood, whose shape is a rectangle, as shown in Figure 7(a); however, a diamond-shaped neighbourhood is more suitable in this study because the distance between the centre cell and each neighbour cell in different directions is equal, as shown in Figure 7(b). Optimised neighbourhood sizes are evaluated to improve computation efficiency without losing accuracy. The transition rule has a significant impact on prediction performance because it represents how the state of a cell changes according to its neighbours' states. A classic example is Conway's Game of Life, whose transition rules are summarised as follows: (1) If a cell is surrounded by three live cells, in the next time step, the cell's state is live.
(2) If a cell is surrounded by two live cells, in the next time step, the cell's state does not change.
(3) Under all other conditions, the cell's state is dead in the next time step. LSTM is a practical, scalable ML model for time series data, with a solid capability to learn temporal characteristics (Greff et al. 2016). The most crucial part of LSTM is the memory cell, which remembers information from the past. In this study, the LSTM receives the states of neighbouring cells, including the building, land use, road, time, and weather data shown in Figure 5. The output is the predicted population distribution of the target cell.
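The three Game of Life rules above can be written as a single synchronous update over the grid (a minimal NumPy sketch for illustration, not part of the proposed model):

```python
import numpy as np

def life_step(grid):
    """One synchronous update of Conway's Game of Life on a 2-D 0/1 grid.

    Rules as stated in the text: 3 live neighbours -> live,
    2 live neighbours -> unchanged, otherwise -> dead.
    Cells beyond the border are treated as dead (no wrap-around).
    """
    padded = np.pad(grid, 1)
    # Sum of the 8 Moore neighbours for every cell.
    neighbours = sum(
        padded[1 + di : 1 + di + grid.shape[0], 1 + dj : 1 + dj + grid.shape[1]]
        for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)
    )
    return np.where(neighbours == 3, 1, np.where(neighbours == 2, grid, 0))

# A horizontal "blinker" oscillates to a vertical one after one step.
blinker = np.zeros((5, 5), dtype=int)
blinker[2, 1:4] = 1
print(life_step(blinker))
```

All cells are updated from the same snapshot of the grid, which is the synchronous update scheme the CA description above assumes.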
The process of combining LSTM and CA is illustrated in Figure 8. CA can be seen as the upper-level framework of the hybrid model: it defines the neighbouring cells to be considered and their interaction, which enables the model to consider spatial information. The transition rule in our model is not as simple as that in the Game of Life, so the LSTM can be seen as a method to construct a complex transition rule. In the training process, the state of each cell at a time step includes data on building, land use, road, weather, time, and population distribution. The inputs of the LSTM are the neighbouring cells' states, and the label is the population distribution, which is used to update the state of the central cell. In the prediction process, the model first predicts population distribution using the states of neighbouring cells, the result then updates the initial state, and the model predicts again. The process iterates until the termination rule is satisfied.
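The skeleton of this predict-update-repeat loop, with a diamond (Manhattan-distance) neighbourhood, might look as follows. The `transition` callable stands in for the trained LSTM; the smoothing rule used in the example is purely illustrative and carries none of the paper's learned behaviour:

```python
import numpy as np

def diamond_neighbours(grid, i, j, r=1):
    """States of cells within Manhattan distance r (diamond neighbourhood)."""
    h, w = grid.shape
    return [grid[i + di, j + dj]
            for di in range(-r, r + 1) for dj in range(-r, r + 1)
            if 0 < abs(di) + abs(dj) <= r
            and 0 <= i + di < h and 0 <= j + dj < w]

def ca_iterate(pop, transition, steps):
    """Roll the CA forward. Each step, every cell's next state is produced
    by `transition` (the learned LSTM in the paper; any callable here) from
    the cell's own state and its diamond neighbourhood; states are updated
    synchronously and fed back in as the next step's input, as in Figure 8."""
    for _ in range(steps):
        nxt = np.empty_like(pop)
        for i in range(pop.shape[0]):
            for j in range(pop.shape[1]):
                nxt[i, j] = transition(pop[i, j], diamond_neighbours(pop, i, j))
        pop = nxt  # predicted states become the next step's input
    return pop

# Stand-in transition rule: relax each cell towards its neighbourhood mean.
smooth = lambda s, nb: 0.5 * s + 0.5 * (sum(nb) / len(nb))
pop0 = np.array([[0.0, 8.0], [8.0, 0.0]])
print(ca_iterate(pop0, smooth, steps=3))
```

In the full model, the per-cell state would also carry building, land use, road, weather, and time features, and `transition` would consume a time window of neighbour states rather than a single snapshot.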

Prediction performance
Two performance indexes were used for building type prediction: accuracy and F1-score. Accuracy is the ratio of correctly classified samples to the total number of samples. The F1-score balances precision and recall, and the weighted-average method is used to aggregate them over types. The formulas for the two indexes are as follows:

$$\text{Accuracy} = \frac{\sum_{i=1}^{n_t} TP_i}{M_s}$$

$$\text{F1} = \sum_{i=1}^{n_t} \frac{m_i}{M_s} \cdot \frac{2 P_i R_i}{P_i + R_i}, \qquad P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}$$

where n_t is the number of types, TP_i is the number of correct predictions of type i, M_s is the number of samples, FP_i is the number of samples incorrectly predicted as type i, m_i is the number of samples of type i, and FN_i is the number of samples of type i incorrectly predicted as other types. Three performance indices were used for population distribution prediction: root mean squared error (RMSE), mean absolute error (MAE), and R². RMSE represents the standard deviation of the residuals between the predicted and observed population distributions. MAE is the mean of the absolute differences between the predicted and observed values. R² reflects the degree of fit of the model to the data. The formulas for the three indices are as follows:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)^2}, \qquad \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{Y}_i - Y_i|, \qquad R^2 = 1 - \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}$$

where Ŷ_i is the predicted value of the i-th sample, Y_i is the true value of the i-th sample, N is the number of samples, and Ȳ is the average value of Y_i.
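The three regression indices can be computed directly from their definitions, for example:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of the residuals."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_pred - y_true)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 41.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

RMSE penalises large residuals more heavily than MAE, which is why the two are reported together.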

Variable importance analysis
Analysing the importance of each variable is necessary to deepen the understanding of their contributions to prediction performance. This study applied a feature importance analysis method, namely Shapley additive explanations (SHAP). SHAP describes the performance of an ML method based on game theory. It defines the model output as a linear addition of the input variables:

$$g(x') = f_0 + \sum_{i=1}^{M} f_i x'_i \quad (11)$$

where f(x) is the original model, g(x′) is the explanation model, x′ is a simplified input related to x via a mapping function x = h_x(x′), x = (x_1, x_2, ..., x_M), M is the number of variables, and f_i is the attribution parameter of variable i.
According to (Lundberg and Lee 2017), there is only one solution for f i in Equation (11), and it has three desirable properties.
Property 1. (Local accuracy): The explanation model matches the original model, g(x′) = f(x), when x = h_x(x′); here f_0 = f(h_x(0)) represents the model output with all simplified inputs toggled off.
Property 2. (Missingness): If the input feature is x ′ i = 0, its importance is f i = 0. Property 3. (Consistency): If a model is changed to increase the impact of the variable, the importance of the feature never decreases.
With these three properties, f_i can be calculated using the following formula:

$$f_i = \sum_{z' \subseteq x'} \frac{|z'|! \, (M - |z'| - 1)!}{M!} \left[ f_x(z') - f_x(z' \setminus i) \right]$$

where z′\i represents z′ with z′_i set to zero. Among the five methods for SHAP value calculation, the expected gradient, an extension of the integrated gradients method (Lundberg and Lee 2017), is the most suitable for deep ML methods. It is based on an extension of Shapley values to infinite-player games (Aumann-Shapley values). This study applied the expected gradient to analyse the contribution of each variable.
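For small M, the classic Shapley formula can be evaluated exactly by enumerating subsets (a brute-force sketch; the expected-gradient method used in this study approximates this quantity for deep models, where 2^M enumeration is infeasible):

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, M):
    """Exact Shapley values for an M-player cooperative game.

    value_fn(S) maps a frozenset of feature indices to the model output
    with exactly those features "toggled on". Enumerates all subsets,
    so it is only feasible for small M.
    """
    players = range(M)
    phi = [0.0] * M
    for i in players:
        others = [j for j in players if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                # Weight |S|! (M - |S| - 1)! / M! from the formula above.
                weight = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
                phi[i] += weight * (value_fn(S | {i}) - value_fn(S))
    return phi

# Toy additive model f(x) = 2*x0 + 3*x1 evaluated at x = (1, 1);
# absent features are set to 0 (the "toggled off" baseline).
f = lambda S: 2.0 * (0 in S) + 3.0 * (1 in S)
print(shapley_values(f, 2))  # [2.0, 3.0]
```

For an additive model, each feature's Shapley value equals its individual contribution, which also illustrates the local accuracy property: the attributions sum to f(x) − f_0.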

Building type prediction
Each building outline was rendered as an image of size 256 × 256 × 1 to preserve image quality and facilitate CAE processing. The CAE was used to transform these images into embeddings. The structure of the CAE is illustrated in Figure 9. The CAE has five layers, and batch normalisation was applied in each layer to avoid gradient vanishing and explosion. A parametric rectified linear unit was selected as the activation function. Examples of the reconstruction are shown in Figure 10.
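The downsampling path of such an encoder can be traced with simple shape arithmetic. The kernel size, stride, and padding below are assumptions for illustration (the paper only specifies five layers with batch normalisation and PReLU); with stride-2 layers, a 256 × 256 input shrinks to an 8 × 8 feature map before flattening into the embedding:

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    # Spatial size after one conv layer: floor((size + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

def encoder_shapes(input_size=256, n_layers=5):
    # Trace the spatial resolution through five stride-2 encoder layers
    # (kernel/stride/padding values here are illustrative assumptions).
    sizes = [input_size]
    for _ in range(n_layers):
        sizes.append(conv_out(sizes[-1]))
    return sizes
```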
As mentioned in Section 3.2.2, building outline embeddings and POI information within pedestrian sheds were used as inputs to LightGBM for building type prediction. This study applied a 3-fold grid search to determine the optimal parameters of LightGBM, including the number of leaves, maximum depth, and learning rate. Table 3 shows the range of the grid search and the optimal parameters.
The ratio of the training and testing sets was 7:3; the resulting F1-score was 0.902, and the accuracy was 86.8%. Figure 11 shows the predicted building type distribution in Milan.

Data analysis

Figure 12 depicts the relationship between temperature, building density, and population distribution: the red (left) part is the distribution of the number of people in low-density building areas at each temperature, and the blue (right) part is the corresponding distribution in high-density building areas. Low density indicates that buildings cover less than 1% of the grid cell area, and high density indicates that buildings cover more than 20% of it; these thresholds correspond to the 25th and 75th quantiles. The number of people is higher in high-density building areas than in low-density building areas. The Mann-Whitney U test was used to compare the two distributions; the p-values for all temperatures are smaller than 0.05 (from 0 to 1.98 × 10⁻⁹²), indicating that the population distributions in high-density and low-density building areas differ at all temperatures. Figure 13(a-o) depicts the population distribution in different land use areas. Each line corresponds to grid cells in which a single land use covers more than 70% of the cell area. The first four days were working days, and the last three days were the All Saints holidays. For most areas, working days show similar patterns, and the population is smaller during holidays, especially in the daytime. In addition, in areas with fixed behavioural patterns, such as residential, agricultural, and public service areas, the population distribution is more stable than in areas with high personnel mobility, such as airports and hospitals. Moreover, work-, residence-, and road-related areas have a higher population than other areas.

For comparison, Figure 13(p) shows the population distribution of each grid cell. Without conditioning on land use, the difference between workdays and holidays and the daily behaviour of people in different grid cells are difficult to summarise. Figure 14 shows the dew point, humidity, wind speed, pressure, and population distribution for the weather data. Population distribution is inversely proportional to dew point, and humidity shows a similar pattern to dew point; wind speed, in contrast, is proportional to population distribution. The relationship between pressure and population distribution is more complicated, with the highest population occurring when pressure is approximately 29.3 in. Hg. Figure 14(e) indicates that wind direction and temperature also significantly impact population distribution. The cosine similarity of grid cells on different days of the week was computed for several types of land use (Figure 15). For most areas except military areas and flower gardens, the patterns in the first three days are similar because of the All Saints holidays; the last four days show another pattern
because they are workdays. Thus, holidays are another important factor that should be included in the model.
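The density-stratified comparison can be sketched with SciPy's Mann-Whitney U test on synthetic stand-ins for the per-cell people counts (the Poisson rates below are illustrative, not the Milan data):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic per-cell people counts at one temperature: high-density
# building cells tend to hold more people than low-density ones.
low_density_counts = rng.poisson(lam=5, size=200)
high_density_counts = rng.poisson(lam=20, size=200)

# Two-sided Mann-Whitney U test, as used in the paper; p < 0.05
# indicates that the two distributions differ significantly.
stat, p_value = mannwhitneyu(low_density_counts, high_density_counts,
                             alternative="two-sided")
```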

Population distribution prediction
As introduced in Section 3.3, a hybrid method combining CA and LSTM was used to predict population distribution. This study applied Bi-LSTM because it captures the combined information of the input sequence in the forward and backward directions. A fully connected layer follows the two Bi-LSTM layers. Another important parameter of Bi-LSTM is the time step, which was determined by autocorrelation in this study. Autocorrelation measures the similarity of a particular time signal with itself on a time scale (Roy, Dey, and Chatterjee 2020). The similarity reflects how many prior points can be used to explain the current point. The formula is as follows:

$$r_k = \frac{\sum_{i=1}^{N-k} (x(i) - \bar{x})(x(i+k) - \bar{x})}{\sum_{i=1}^{N} (x(i) - \bar{x})^2}$$

where $x(i)$ is the original dataset, $i$ is the time point, $k$ is the time lag, $\bar{x}$ is the mean value of the dataset, and $N$ is the number of time points. Figure 16 shows the result of the autocorrelation of all grid cells, where the blue area is the 95% confidence interval. Figure 16 indicates that the lag of all grid cells is 4. Therefore, 4 was chosen as the time step in Bi-LSTM.
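The lag selection can be sketched directly from the autocorrelation formula. The confidence band below uses the common large-sample approximation ±1.96/√N (an assumption; the paper does not state how the band in Figure 16 is computed):

```python
import numpy as np

def autocorrelation(x, k):
    # Sample autocorrelation at lag k:
    # r_k = sum_{i=1}^{N-k} (x_i - xbar)(x_{i+k} - xbar) / sum_{i=1}^{N} (x_i - xbar)^2
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    num = (np.sum((x[:-k] - xbar) * (x[k:] - xbar)) if k > 0
           else np.sum((x - xbar) ** 2))
    return num / np.sum((x - xbar) ** 2)

def significant_lag(x, max_lag=24):
    # Largest lag whose autocorrelation exceeds the approximate 95%
    # confidence band +/- 1.96/sqrt(N) (the blue area in Figure 16).
    band = 1.96 / np.sqrt(len(x))
    lags = [k for k in range(1, max_lag + 1)
            if abs(autocorrelation(x, k)) > band]
    return max(lags) if lags else 0
```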
In this study, five models were compared with the proposed integrated model. The first model used the mean value of the previous four hours' population distribution as a baseline. Because CA is mostly used to predict discrete variables, it is difficult to specify the transition rule when predicting continuous variables with CA; ML methods are therefore used to learn the complicated transition rule, the most widely used being the backpropagation neural network (BPNN). Thus, the second model used the combination of CA and BPNN, which considers less temporal information than the first model. The third model combined CA with a convolutional neural network (CNN), and the fourth combined CA with LightGBM, a tree-based model. The fifth model used only LSTM, which disregards spatial relationships. Six-day records were used for training. The input information includes building footprint, building type, land use, road, time, and weather, and the population number is the model's output. Observations on the last day were used for validation with the evaluation methods introduced in Section 3.4. Cross-validation was applied to all models to find the optimal parameters and avoid underfitting and overfitting. The training process of the proposed model and the model performances are shown in Figures 17 and 18; the proposed model outperformed the other popular models in terms of RMSE, MAE, and R². Thus, the proposed method, which integrates model-based and ML mechanisms, improves the performance of population distribution prediction.
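The first model above can be sketched as a simple rolling mean; this is an illustrative reimplementation of the stated rule, not the authors' code:

```python
import numpy as np

def rolling_mean_baseline(series, window=4):
    # Baseline predictor: the forecast for hour t is the mean of the
    # previous `window` hourly observations (window = 4 matches the
    # autocorrelation-selected time step).
    series = np.asarray(series, dtype=float)
    preds = np.full(series.shape, np.nan)
    for t in range(window, len(series)):
        preds[t] = series[t - window:t].mean()
    return preds
```

For the hourly counts [1, 2, 3, 4, 5, 6], the first valid forecast is at hour 4 (the mean of 1, 2, 3, 4 = 2.5).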

Variable importance analysis
As introduced in Section 3.5, SHAP was used to quantify the contribution of each influencing variable. All variables were divided into seven categories: neighbour, historical information, weather, land use, building, road, and current time information. The neighbour category consists of neighbour information and the location of the current grid cell. The historical information category includes the population distribution in the four previous steps (t-1 to t-4). The current time category consists of holidays and the order of hour. The importance of the 35 most important variables is presented in Figure 19. The order of hour is the most significant variable, with an importance of 0.224. Historical data also have a significant effect on population distribution: all historical variables are ranked high (larger than 0.04), and their contribution decreases from t-1 to t-4.
Types of precipitation are the most important factor in weather-related variables, followed by humidity, temperature, wind direction, dew point, and pressure. Wind speed is the least important factor.
Resident-related variables, including residential buildings, residential land use, and residential roads, are more important than the other variables in the building, land use, and road categories, which indicates that population distribution is closely related to housing-related factors. For neighbour data, the location of the grid cell is the most important, and farther neighbouring cells are more important than closer ones because people can travel a long distance within an hour. Notably, the connectivity of roads with neighbouring cells plays a more significant role in prediction than other road-related variables because the movement of people between areas is highly dependent on roads.
In addition, a more comprehensive analysis was conducted in which the seven categories above were grouped into spatial, temporal, and natural variables (Figure 20). A comparison shows that historical information is the most important category, accounting for 29.7% of the contribution to performance, because the movement of people is closely related to data at prior times (Section 4.3). Current time and weather data rank second and third, respectively. The least important category is building, followed by land use; the importance of each is less than 10%. Temporal variables are the most important group, with an importance of 0.531, followed by spatial and natural variables. The importance of temporal variables can be explained by the system of practice (Eon et al. 2018), which holds that routines form through the continual reproduction of everyday practices. Trip rules such as commuting demand and individual travel preferences can be seen as a system of practice: people prefer to travel to a certain place at a certain time, which raises the importance of temporal features. In addition, population distribution is time-series data with the characteristic of continuity; the population distribution in the current hour is therefore highly dependent on that in previous hours.
Figure 19. Importance of each variable (n1 to n12 refer to the codes of neighbour cells shown in Figure 7).
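The grouping into category-level shares, as in Figure 20, amounts to summing per-feature mean |SHAP| values and normalising. The feature names and numbers below are hypothetical placeholders (only "order of hour" = 0.224 comes from the paper), so the shares do not reproduce Figure 20:

```python
# Illustrative per-feature mean |SHAP| values (hypothetical numbers,
# except order_of_hour = 0.224 as reported in the text).
feature_importance = {
    "order_of_hour": 0.224, "holiday": 0.05,
    "pop_t-1": 0.09, "pop_t-2": 0.07, "pop_t-3": 0.05, "pop_t-4": 0.04,
    "precipitation_type": 0.06, "humidity": 0.04,
    "residential_building": 0.05, "residential_land_use": 0.04,
    "residential_road": 0.04,
}
categories = {
    "temporal": ["order_of_hour", "holiday",
                 "pop_t-1", "pop_t-2", "pop_t-3", "pop_t-4"],
    "natural": ["precipitation_type", "humidity"],
    "spatial": ["residential_building", "residential_land_use",
                "residential_road"],
}

# Normalised share of each category's contribution
total = sum(feature_importance.values())
category_share = {c: sum(feature_importance[f] for f in fs) / total
                  for c, fs in categories.items()}
```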
The spatial data available for Milan are ideal for modelling population distribution; such data are, however, not commonly available for other cities. If the hybrid model were applied to those cities, the analysis conducted in this study would provide guidance on which types of data should receive more attention and how the model may perform with missing data. Addressing these issues may point to ways of improving the transferability of the proposed model.

Conclusion and discussion
In this study, we proposed a hybrid model combining CA and LSTM to predict population distributions at high spatial and temporal resolution. The proposed method addresses the limitations of previous methods, such as the low accuracy of conventional methods, the low interpretability of ML methods, and the high dependence on prior knowledge required by model-based methods. Multi-source data in Milan, Italy, were collected to validate the model. First, a CAE, which extracts embedding information from building outlines, was combined with LightGBM to predict unknown building types. Next, building information, land use, road network, weather, and time were collected and analysed to understand the relationship between the selected variables and population distribution. Subsequently, a hybrid method combining CA and LSTM, which considers both spatial and temporal information, was adopted to predict population distribution. Finally, a variable importance analysis based on SHAP values quantified and ranked the contribution of each variable.
The major findings are summarised as follows: (1) Building type prediction indicates that considering building outlines, POI information, and pedestrian sheds can achieve satisfactory performance.
(2) Compared with CA + BPNN and LSTM, the combination of CA and LSTM, which considers both spatial and temporal information, improves prediction accuracy by approximately 5%-10%. (3) The variable importance analysis shows that temporal data are the most important for prediction and natural data the least important. Among all variables, the order of hour, a temporal variable, has the highest importance. In the spatial category, housing-related variables are the most important. In addition, types of precipitation contribute the most to predictions in the natural category.
The outcomes of this research can provide a basis for improving the accuracy of population distribution prediction and help in variable selection for further research. In contrast with prior deep-learning-based population prediction models, the mechanism and outcomes of this study can be explained. This work can also provide guidance on the types of data that should be prioritised in population distribution prediction. The proposed model demonstrates that combining model-based and machine learning methods improves the prediction performance of population distribution: the evolutionary mechanism makes CA suitable for predicting population distribution, and LSTM learns the complex transition rule. The variable importance analysis quantifies the importance of the influencing factors introduced in previous studies.
Although the proposed model shows satisfactory performance, the RMSE and MAE are larger than 2 and need to be improved. The variable importance analysis shows that temporal information is the most important; thus, more temporal information should be extracted to improve performance. Feature extraction methods such as time-domain and frequency-domain analysis can be applied to obtain time signatures of population distribution, which reflect temporal changes and population distribution patterns, and these can be added to the model in future work. Mobile signalling is not a fully accurate estimate of population distribution; its accuracy can be affected by network failures and frequent phone use, so government statistics, such as the proportion of mobile phone users and average daily mobile phone usage, will be used for calibration. This study focuses only on Milan as a case study, and improving the generalisation capability of the model so that similar performance can be achieved in other cities is important. At present, the model must be retrained when applied to other cities, which requires substantial computational resources; thus, transfer learning mechanisms such as fine-tuning can be applied to the LSTM component of the proposed model in the future.