Estimating PM2.5 Concentrations Using Spatially Local Xgboost Based on Full-Covered SARA AOD at the Urban Scale

The adverse effects caused by PM2.5 have drawn extensive concern and it is of great significance to identify its spatial distribution. Satellite-derived aerosol optical depth (AOD) has been widely used for PM2.5 estimation. However, the coarse spatial resolution and the gaps caused by data deficiency impede its better application at the urban scale. Additionally, obtaining accurate results in unsampled spatial areas when PM2.5 ground sites are insufficient and distribute sparsely is also a challenging issue for PM2.5 spatial distribution estimation. This paper aimed to develop a model, i.e., spatially local extreme gradient boosting (SL-XGB), combining the powerful fitting ability of machine learning and optimal bandwidths of local models, to better estimate PM2.5 concentration at the urban scale by using Beijing as the study area. This paper adopted simplified high-resolution MODIS aerosol retrieval algorithm (SARA) AOD at 500 m resolution as the major independent variable, hence, ensuring the estimation can be operated at a fine scale. Moreover, the extreme gradient boosting (XGBoost) model was adopted to fill the gaps in SARA AOD, thus improving its availability. Then, based on full-covered SARA AOD and other multisource data, the SL-XGB model, integrating multiple local XGBoost models and particular optimal bandwidths, was trained to estimate PM2.5 concentration. For comparison, SL-XGB and two other models, XGBoost and geographically weighted regression (GWR), were evaluated by 10-fold cross validation (CV). The sample-based CV results reveal that the SL-XGB performed the best as assessed through R2 (0.88), root mean square error (RMSE = 24.08 μg/m3) and mean prediction error (MPE = 16.90 μg/m3). Additionally, SL-XGB also performed the best in the site-based CV with a R2 of 0.86, a RMSE of 26.15 μg/m3 and a MPE of 17.97 μg/m3, which shows its good spatial generalization ability. These results demonstrate that SL-XGB can better simultaneously handle non-linear and spatial heterogeneity issues despite spatially limited data at the urban scale. As far as the PM2.5 concentration distribution was concerned, it presented a gradient increase in PM2.5 concentrations from the northwest to the southeast in Beijing, with abundant spatial details. Overall, the proposed approach for PM2.5 estimation showed outstanding performance and can support preventive pollution control and mitigation at the urban scale.

to characterize the relationship between AOD and PM 2.5 when both the dimension and amount of data are large. In recent years, machine learning models are becoming increasingly popular because of their impressive ability in terms of fitting data and the feasibility of processing complex data. To date, machine learning models such as random forest, extreme gradient boosting (XGBoost) and generalized regression neural networks (GRNNs) have been applied and some of them outperformed statistical models [20,36,37]. These machine learning models are global models, which means that they deal with the whole data directly and neglect the spatial relationship between sampling data points. However, as a geographical phenomenon, the distribution of PM 2.5 is significantly affected by spatial factors such as meteorology and land use. Moreover, the estimation of PM 2.5 by exploiting the relationship between AOD and PM 2.5 over an extensive area is implemented based on insufficient sampling PM 2.5 ground sites. Therefore, the spatial heterogeneity of the relationship between AOD and PM 2.5 cannot be well captured by global machine learning methods, which limits improvement in the accuracy of estimation.
In this paper, our goal is to propose a spatially local machine learning model that can capture the spatial heterogeneity of the relationship between AOD and PM 2.5 to estimate PM 2.5 accurately at the urban scale. We first used XGBoost model to fill the gaps in the cloud-removal SARA AOD for obtaining the full-covered AOD. Then, the spatially local XGBoost model (SL-XGB) we proposed was trained to estimate PM 2.5 concentrations. We also compared SL-XGB with global XGBoost and GWR using both sample-based and site-based 10-fold CV to examine the effect of our strategy.
The paper is organized as follows: Section 2 introduces the data, related methods and the introduction of SL-XGB; Section 3 mainly demonstrates the result of full-covered SARA AOD and the performance of SL-XGB in PM 2.5 estimation; Section 4 discusses some advantages and limitations of this work by comparing it with some other studies; Section 5 is the conclusion of this paper.

Study Area
Beijing is the capital of China, located in the North China Plain and surrounded by Taihang Mountain and Yanshan Mountain. Beijing has 16 districts, with Dongcheng, Xicheng, Haidian, Chaoyang, Fengtai, Shijingshan as the major urban areas and Mentougou, Yanqing, Daxing, Tongzhou, Miyun, Fangshan, Huairou, Changping, Shunyi and Pinggu as the suburban areas, as shown in Figure 1. Beijing has been suffering from PM 2.5 pollution for a long time, with increasing emissions, vehicle exhausts and residential energy consumption under rapid urbanization. Therefore, it is urgent to estimate fine and accurate PM 2.5 spatial distribution in Beijing for PM 2.5 pollution control.
Remote Sens. 2020, 12, x FOR PEER REVIEW 3 of 20 dimension and amount of data are large. In recent years, machine learning models are becoming increasingly popular because of their impressive ability in terms of fitting data and the feasibility of processing complex data. To date, machine learning models such as random forest, extreme gradient boosting (XGBoost) and generalized regression neural networks (GRNNs) have been applied and some of them outperformed statistical models [20,36,37]. These machine learning models are global models, which means that they deal with the whole data directly and neglect the spatial relationship between sampling data points. However, as a geographical phenomenon, the distribution of PM2.5 is significantly affected by spatial factors such as meteorology and land use. Moreover, the estimation of PM2.5 by exploiting the relationship between AOD and PM2.5 over an extensive area is implemented based on insufficient sampling PM2.5 ground sites. Therefore, the spatial heterogeneity of the relationship between AOD and PM2.5 cannot be well captured by global machine learning methods, which limits improvement in the accuracy of estimation. In this paper, our goal is to propose a spatially local machine learning model that can capture the spatial heterogeneity of the relationship between AOD and PM2.5 to estimate PM2.5 accurately at the urban scale. We first used XGBoost model to fill the gaps in the cloud-removal SARA AOD for obtaining the full-covered AOD. Then, the spatially local XGBoost model (SL-XGB) we proposed was trained to estimate PM2.5 concentrations. We also compared SL-XGB with global XGBoost and GWR using both sample-based and site-based 10-fold CV to examine the effect of our strategy.
The paper is organized as follows: part 2 introduces the data, related methods and the introduction of SL-XGB; part 3 mainly demonstrates the result of full-covered SARA AOD and the performance of SL-XGB in PM2.5 estimation; part 4 discusses some advantages and limitations of this work by comparing it with some other studies; part 5 is the conclusion of this paper.

Study Area
Beijing is the capital of China, located in the North China Plain and surrounded by Taihang Mountain and Yanshan Mountain. Beijing has 16 districts, with Dongcheng, Xicheng, Haidian, Chaoyang, Fengtai, Shijingshan as the major urban areas and Mentougou, Yanqing, Daxing, Tongzhou, Miyun, Fangshan, Huairou, Changping, Shunyi and Pinggu as the suburban areas, as shown in Figure 1. Beijing has been suffering from PM2.5 pollution for a long time, with increasing emissions, vehicle exhausts and residential energy consumption under rapid urbanization. Therefore, it is urgent to estimate fine and accurate PM2.5 spatial distribution in Beijing for PM2.5 pollution control.    (Figure 1) in the study area established by Beijing Environmental Protection Bureau. In order to record the seasonal continuity (here, March, April and May were viewed as spring; June, July and August were viewed as summer; September, October and November were viewed as autumn; December, January and February were viewed as winter), we downloaded hourly PM 2.5 data spanning from March, 2016 to February, 2017 of Beijing from the China Environmental Monitoring Center (CEMC, http://106. 37.208.233:20035/). Then daily average PM 2.5 data were calculated further in order to match independent variables temporally.

Satellite-Based AOD
An AOD retrieval algorithm with a 500 m spatial resolution, SARA, developed by Bilal et al. [38], was adopted in our study. SARA AOD products were retrieved based on aerosol properties derived from the Aerosol Robotic Network (AERONET, https://aeronet.gsfc.nasa.gov/) and MODIS swath products (https://ladsweb.nascom.nasa.gov/) including MOD03 geolocation data, MOD02HKM-calibrated radiance data and MOD09 surface reflectance data (from Terra satellite). SARA shows the advantage of high accuracy in a complex atmospheric environment [39] and has been validated in Beijing and surrounding areas [40]. There are two AERONET sites in Beijing-the Beijing-CAMS site and the Beijing site. We applied the AOD data for the Beijing-CAMS site for SARA AOD retrieval and the AOD data for the Beijing site for validation ( Figure S1). Considering the negative effect of clouds, we used MODIS cloud mask product MOD35 to remove the cloud-contaminated pixels. This operation would produce much missing AOD data. Adding the effect of some deficiency in AERONET observation data, the number of available days for SARA AOD retrieved was 197 in total. To improve the availability and reliability of SARA AOD, we applied XGBoost regression to fill the gaps and obtained the full-covered SARA AOD. More details about XGBoost are introduced in Section 2.3.1.

Geographical Data
Geographical data contained several types: (1) the normalized difference vegetation index (NDVI), (2) a digital elevation model (DEM), (3) land use, (4) socioeconomic data including population (POP), gross domestic product (GDP), the density of residential areas and restaurants (DRR), the density of industries (DI) and road length (RL). MODIS Terra level-3 16 days average NDVI products (MOD13A1) with a spatial resolution of 500 m were downloaded from the LAADS website (https: //ladsweb.nascom.nasa.gov/). DEM products with a spatial resolution of 90 m were downloaded from the Shuttle Radar Topographic Mission (http://srtm.csi.cgiar.org/). Land use data including classification of forest, farmland, shrub land, water bodies, impervious land, and bare land with a spatial resolution of 30 m was obtained from the Resource and Environment Data Cloud Platform (REDCP) (http://www.resdc.cn). Coverages of water bodies (CWB), impervious land (CII), farmland (CFA) and forest (CFO) were further calculated based on land use data. Population raster data, GDP raster data and road shapefile data were also obtained from REDCP (http://www.resdc.cn). RL was calculated further based on road shapefile data. DRR and DI were calculated using kernel Remote Sens. 2020, 12, 3368 5 of 20 density analysis in Arcgis10.6 software based on Point of Interest (POI) data downloaded from Amap (https://www.amap.com/). Here, RL, DRR and DI represented the intensity of traffic, residential activity and industrial activity, respectively, which affect the emission of PM 2.5 [41,42].

Data Preparation
Some data pre-processes were operated for further experiments. All raster data were projected to the UTM_Zone_49N projection coordinated system using ENVI5.3 software and resampled to a 500 m × 500 m defined grid followed by SARA AOD. Then PM 2.5 concentration records were extracted to match the corresponding defined grids where the observation sites are located. All information on variables is presented in Table S1.

Extreme Gradient Boosting (XGBoost) Regression
Extreme gradient boosting (XGBoost) regression is an ensemble machine learning algorithm that is widely used in data mining with excellent performance [43]. It is the improved version of the gradient boosting decision tree (GBDT) algorithm. It supports parallel computing and can also randomly select independent variables such as random forest (RF), so it is more efficient than other boosting machine learning models. In addition, in contrast to some other machine learning models such as RF, XGBoost has a more complex structure and introduces regularization items in loss function to control against overfitting so that it can better handle complex data. Therefore, for work with large amounts of data and multidimensional influencing factors such as AOD gap fill and PM 2.5 estimation, XGBoost is the more appropriate choice. XGBoost has also been applied to the estimation of pollutants and it shows better performance than some other statistical and machine learning models [44][45][46].
As a boosting method, XGBoost is the combination of many decision trees and each tree is generated by prior trees based on residuals. Each decision tree will generate a predicted value by split operation based on different independent variables. Then, after the construction of all decision trees, the sum of generated values will be the final predictions. The process can be described as Equation (1), where h m (x) is the decision tree in iteration m, F m−1 (x) is the sum of the predictions of m−1 previous trees and γ m is the learning rate in iteration m.
For constructing h m (X), the parameters θ m are obtained by optimizing the sum of loss function L(θ m ) and the regularizing item Ω(θ m ). This process is described as: We used XGBoost to obtain full-covered SARA AOD before estimating PM 2.5 . Since AOD represents the column of aerosol from the Earth's surface to the atmosphere, it is greatly affected by meteorological factors, so six types of collected meteorological data mentioned above were used as the main independent variables. Referring to some previous studies [30,31], we also used the day of year (DOY), the NDVI, a DEM, POP and land use coverages as the auxiliary variables to indicate the effects of time and land surface on AOD. A grid search technique was carried out to search for the optimal hyper parameters for XGBoost model (Table S2). Given that AOD varies significantly across different seasons in Beijing [31], we constructed XGBoost models for 4 seasons, respectively, and compared them with the annual model which was trained based on the whole period data.
As a comparing method, XGBoost was also applied for PM 2.5 estimation. In addition to full-covered SARA AOD and all the indicators mentioned in Section 2.2, DOY and season (use dummy variables to Remote Sens. 2020, 12, 3368 6 of 20 represent 4 seasons) were also used as independent variables for considering the daily and seasonal variance of PM 2.5 .

Geographically Weighted Regression (GWR)
Geographically weighted regression (GWR) is a linearly local model considering spatial heterogeneity and can better reveal the spatially non-stationary relationship between independent variables and dependent variables [47]. In fact, GWR constructs a local original least square (OLS) regression model in each sampling position. Some previous studies show that the relationship between AOD and PM 2.5 varies significantly and discontinuously in space due to changes in surface context [19,21]. They applied GWR in the estimation of PM 2.5 by using AOD as the main independent variable and found that it outperform other traditional statistical models.
In this paper, we used GWR as a comparison method for PM 2.5 estimation. Before GWR, collinearity detection was operated to exclude variables with multicollinearity by using Variance Inflation Factor (VIF) as the measurement [48]. Here, we set the threshold of VIF as 7.5, which means that variables with a VIF larger than 7.5 would be excluded. Thus, compared with using XGBoost for PM 2.5 estimation, few temporally constant variables such as "GDP" were removed when constructing the GWR model (Table S3). The form of GWR can be represented as Equation (3) shows, where (u i , v i ) represents the geographically sampling position, a 0 (u i , v i ) is the interception of (u i , v i ), m is the number of independent variables, a k (u i , v i ) is the corresponding parameters of (u i , v i ) and ε(u i , v i ) is the corresponding error item. Bandwidth is an important parameter in GWR. The samples out of the bandwidth range would not be used in local model training [47,49]. An adaptive bandwidth strategy was used in this study for better performance.

Spatially Local XGBoost (SL-XGB)
Compared with the regression process of obtaining the gap-filled SARA AOD mentioned in Section 2.3.1, constructing a model to fit the relationship between AOD and PM 2.5 is more challenging. The available PM 2.5 data was only obtained from 35 ground sites and they do not distribute uniformly in the study area (densely in urban areas and sparsely in suburb areas). If we use a global model to predict PM 2.5 concentrations in the whole study area, it may produce some biases because spatial heterogeneity in the relationship between AOD and PM 2.5 cannot be revealed by globally constant coefficients. GWR has great applicability in describing local characteristics of the relationship between AOD and PM 2.5 . However, some recent studies pointed out that GWR is constructed based on the assumption of linearity, so it cannot capture the non-linearity in the relationship between AOD and PM 2.5 like machine learning methods such as RF and XGBoost [13,20].
In order to better address the issue of both spatial heterogeneity and non-linearity, we proposed the spatially local XGBoost (SL-XGB) model. The strategy of SL-XGB is similar to GWR but it uses XGBoost in the place of OLS regression as the kernel regression method in each local model. The concept of "bandwidth" in GWR is introduced into our model. For each local site, the data out of the range of the bandwidth will not be used for local model training. To better account for the spatial variance of the data, the adaptive bandwidths are adopted in SL-XGB, which means that each local model has its own particular bandwidth. Specifically, the optimal local bandwidth distance of a local model is selected from among all distances from all other sites to the local site. Firstly, to obtain the optimal bandwidth of each local model, root mean square errors (RMSE) under different distances are calculated by establishing different XGBoost models. These models use the data for other sites in the range of corresponding distances as the training set and the data for the local site as the test set (not added in the training set). Then, the bandwidth distance with the minimal RMSE will be selected as the optimal local bandwidth. After that, the local XGBoost models can be established based on the data in the range of the optimal bandwidth. To predict PM 2.5 concentrations in the non-sampling locations, we integrated the average predictions of all local models based on the geographically Bi-Square function. As a result, predictions in overlap areas can also be calculated by utilizing two or more neighboring local models. Equation (4) shows the form of the geographically Bi-Square function and Equation (5) shows the predictions of PM 2.5 in position i. In Equation (4), W(i, j) is the weight of the local XGBoost model j, d ij is the distance between the position i and position j, and b j is the optimal bandwidth of local XGBoost model j. In Equation (5), (u i , v i ) represents the geographical coordinates of position i, n is the number of all sampling ground sites, X i represents all independent variables in position i, and F j (X i ) is the prediction value of jth local XGBoost model in position i. Figure 2 displays the structure and schematic of SL-XGB.
Remote Sens. 2020, 12, x FOR PEER REVIEW 7 of 20 locations, we integrated the average predictions of all local models based on the geographically Bi-Square function. As a result, predictions in overlap areas can also be calculated by utilizing two or more neighboring local models. Equation (4) shows the form of the geographically Bi-Square function and Equation (5) shows the predictions of PM2.5 in position . In Equation (4), ( , ) is the weight of the local XGBoost model , is the distance between the position and position , and is the optimal bandwidth of local XGBoost model . In Equation (5), ( , ) represents the geographical coordinates of position , n is the number of all sampling ground sites, represents all independent variables in position , And ( ) is the prediction value of th local XGBoost model in position . Figure 2 displays the structure and schematic of SL-XGB.

Model Evaluation
We applied a sample-based 10-fold CV technique for validating the result of the gap-filled SARA AOD and PM2.5 estimation [50]. In sample-based CV, the entire data set was equally split into 10 folds randomly, with nine folds as the training set and the other fold as the test set for evaluating the performance of model fitting and predicting. This process was repeated 10 times until all folds were tested. Moreover, in order to examine the spatial generalization ability of SL-XGB, we further used site-based CV in PM2.5 estimation. Similar to sample-based CV, site-based CV used 10% of the sites' data as the test set and other 90% of the sites' data as the training set, which can show the prediction ability in different geographical environments. Three statistical indicators including root mean square error (RMSE), mean predicted error (MPE, the mean absolute error of prediction) and coefficient of determination (R 2 ) were used as the measurements of model performance. XGBoost also provides variable importance to evaluate the effect of variables on the result. It is determined by the frequency of variables that are used as tree splits in all decision trees. Given that our model included many local XGBoost models, average variable importance was calculated furthermore. All experiments were conducted in the environment of Python 3.7.

PM2.5 and AOD Data Set Description
The minimum, maximum, mean and standard deviation of all variables are presented in Table  S3. The maximum and mean PM2.5 concentrations for ground-level sites in the study period are 506 and 78.35 μg/m 3 , respectively. According to the Chinese standard of ambient air pollution [51], the

Model Evaluation
We applied a sample-based 10-fold CV technique for validating the result of the gap-filled SARA AOD and PM 2.5 estimation [50]. In sample-based CV, the entire data set was equally split into 10 folds randomly, with nine folds as the training set and the other fold as the test set for evaluating the performance of model fitting and predicting. This process was repeated 10 times until all folds were tested. Moreover, in order to examine the spatial generalization ability of SL-XGB, we further used site-based CV in PM 2.5 estimation. Similar to sample-based CV, site-based CV used 10% of the sites' data as the test set and other 90% of the sites' data as the training set, which can show the prediction ability in different geographical environments. Three statistical indicators including root mean square error (RMSE), mean predicted error (MPE, the mean absolute error of prediction) and coefficient of determination (R 2 ) were used as the measurements of model performance. XGBoost also provides variable importance to evaluate the effect of variables on the result. It is determined by the frequency of variables that are used as tree splits in all decision trees. Given that our model included many local XGBoost models, average variable importance was calculated furthermore. All experiments were conducted in the environment of Python 3.7.

PM 2.5 and AOD Data Set Description
The minimum, maximum, mean and standard deviation of all variables are presented in Table S3. The maximum and mean PM 2.5 concentrations for ground-level sites in the study period are 506 and Remote Sens. 2020, 12, 3368 8 of 20 78.35 µg/m 3 , respectively. According to the Chinese standard of ambient air pollution [51], the mean PM 2.5 concentration is at the range of "lightly polluted" (75 µg/m 3 -115 µg/m 3 ) and the maximum PM 2.5 concentration is higher than twice the maximum "severe pollution" concentration standard (250 µg/m 3 ). Thus, the air pollution in Beijing is extremely serious and it is necessary to estimate the continuous distribution of PM 2.5 .
We also calculated the minimum, maximum, mean and standard deviation of seasonal and annual AOD coverage (Table S4). Although some previous studies pointed out that SARA AOD has relatively high coverages compared with other AOD products [24,52], the annual coverage of SARA AOD ranged from 27% to 47% and the average coverages of four seasons in our study are only 45%, 42%, 35% and 33% for spring, summer, autumn and winter, respectively.

Missing AOD Filling
We constructed five XGBoost models for missing AOD filling based on the data for four seasons and the whole study period. Table 1 shows the sample-based CV results of five models, where "N" represents the number of training samples. The four seasonal models achieve high accuracy, with an R 2 ranging from 0.90 to 0.94 and a RMSE ranging from 0.07 to 0.15. The results suggest that the complicated relationship between AOD and other predictive variables can be investigated by XGBoost. Aerosol characteristics are significantly affected by meteorological fields and vary distinctly in different seasons [53]. Hence, the performance of the annual model, with an R 2 of 0.86 and a RMSE of 0.15, is not as good as that of the four seasonal models. Therefore, the four seasonal models were applied to obtain full-covered AOD for PM 2.5 estimation.  Figure 3 shows the difference between the cloud-removal SARA AOD and the gap-filled SARA AOD in 18 December 2016 (a specific example) and the difference between the annually averaged spatial distribution of cloud-removal AOD and gap-filled AOD. In terms of (a) and (b), the gaps in the cloud-removal SARA AOD are restored and the gap-filled SARA AOD has strong spatial continuity. As for the annual results, the gap-filled AOD values are generally higher than those in the cloud-removal AOD due to the absence of AOD data, which is mostly caused by cloud cover. When the cloud fraction is high, the humidity will increase, which promotes the growth of aerosols [29]. In terms of spatial distribution, both gap-filled AOD and cloud-removal AOD have similar spatial patterns, with an increasing trend from northwest to southeast. That is because the terrain is flatter and there are more urban regions distributed in the southeastern areas of Beijing [54]. In addition, the gap-filled AOD is relatively smoother than cloud-removal AOD, which suggests that the gap-fill process can restore more spatially continuous details. Remote Sens. 2020, 12, x FOR PEER REVIEW 9 of 20

PM2.5 Estimation Model Performance
The fitting and sample-based validation performance of GWR, XGBoost and SL-XGB are shown in Table 2 The density scatter plots of two types of CV (sample-based and site-based) performance are displayed in Figure 4. The total number of points recorded in all the sub-graphs in Figure 4 is 12,620. SL-XGB performed best in both types of CV, with the highest R 2 (0.88 and 0.86 in sample-based CV and site-based CV, respectively) and lowest RMSE and MPE. In addition, the superior performance of SL-XGB in site-based CV illustrates that SL-XGB can better adapt to complex geographical environments than the other two models and is of better spatial generalization ability. While GWR takes spatial heterogeneity into consideration, it is difficult to characterize the complicated relationship between PM2.5 and independent variables, so the performance is the worst in both types of CV. Moreover, adopting the strategy of "local regression" in non-linear machine learning has proven to be an improvement for PM2.5 estimation because of the better performance of SL-XGB than that of XGBoost. In fact, the "local" approach we applied was aimed at filtering the samples with large spatial heterogeneity compared to local samples. Although there may be little overfitting in SL-XGB due to relatively few training samples in the few local models, the overall accuracy is still better than that in XGBoost.
The sample-based CV RMSE spatial distribution of PM2.5 ground sites is given by Figure 5. Generally, all three models performed better in central urban areas than in suburban areas because of the high density of site distribution. XGBoost and SL-XGB performed better than GWR in most sites, indicating the powerful fitting ability of machine learning. The RMSEs of SL-XGB and XGBoost in suburb areas are in the between 25 and 30 μg/m 3 . However, SL-XGB performs better in central urban areas, with most RMSEs ranging from 10 to 20 μg/m 3 , because the sub-models in urban areas

PM 2.5 Estimation Model Performance
The fitting and sample-based validation performance of GWR, XGBoost and SL-XGB are shown in Table 2  The density scatter plots of two types of CV (sample-based and site-based) performance are displayed in Figure 4. The total number of points recorded in all the sub-graphs in Figure 4 is 12,620. SL-XGB performed best in both types of CV, with the highest R 2 (0.88 and 0.86 in sample-based CV and site-based CV, respectively) and lowest RMSE and MPE. In addition, the superior performance of SL-XGB in site-based CV illustrates that SL-XGB can better adapt to complex geographical environments than the other two models and is of better spatial generalization ability. While GWR takes spatial heterogeneity into consideration, it is difficult to characterize the complicated relationship between PM 2.5 and independent variables, so the performance is the worst in both types of CV. Moreover, adopting the strategy of "local regression" in non-linear machine learning has proven to be an improvement for PM 2.5 estimation because of the better performance of SL-XGB than that of XGBoost. In fact, the "local" approach we applied was aimed at filtering the samples with large spatial heterogeneity compared to local samples. Although there may be little overfitting in SL-XGB due to relatively few training samples in the few local models, the overall accuracy is still better than that in XGBoost.
can exploit more spatially homogenous data, which is in favor of describing the local relationship between AOD and PM2.5.    The sample-based CV RMSE spatial distribution of PM 2.5 ground sites is given by Figure 5. Generally, all three models performed better in central urban areas than in suburban areas because of the high density of site distribution. XGBoost and SL-XGB performed better than GWR in most sites, indicating the powerful fitting ability of machine learning. The RMSEs of SL-XGB and XGBoost in suburb areas are in the between 25 and 30 µg/m 3 . However, SL-XGB performs better in central urban areas, with most RMSEs ranging from 10 to 20 µg/m 3 , because the sub-models in urban areas can exploit more spatially homogenous data, which is in favor of describing the local relationship between AOD and PM 2.5 .
Bandwidth, defining the optimal scale for prediction, is a very important parameter in each local model of SL-XGB. Thus, the effect of bandwidth distance on SL-XGB performance needs to be further explored. Here, a sensitive analysis based on polynomial fitting was conducted to explore the relationship between bandwidth distance and the corresponding overall RMSE, which is calculated by using local sites' data as the test sets. We divided the 35 sites in Beijing into urban sites (17 sites) and suburban sites (18 sites) according to the administrative division that they are located in (sites in Dongcheng, Xicheng, Haidian, Chaoyang, Fengtai, Shijingshan are urban sites and the rest are suburb sites) and the results are given in Figure 6. The trend in the two fitting curves is similar, decreasing first and then rising. The results demonstrate that the data from sites that are far away from the local site will result in negative impacts on local model training. However, it when the bandwidth is too small, it is difficult to take advantage of the insufficient training data for capturing the localized relationship between PM 2.5 and independent variables, so the accuracy is also low. The optimal bandwidth is different for two types of sites-approximately 20 km for urban sites and approximately 50 km for suburban sites. This is because the density of urban sites is higher than that of suburban sites and abundant information used for local model training can be collected in a smaller bandwidth.
In addition, the overall optimal RMSE of urban sites is lower than that of suburban sites, which also demonstrates that it significantly increases accuracy by utilizing spatially homogeneous data.   Bandwidth, defining the optimal scale for prediction, is a very important parameter in each local model of SL-XGB. Thus, the effect of bandwidth distance on SL-XGB performance needs to be further explored. Here, a sensitive analysis based on polynomial fitting was conducted to explore the relationship between bandwidth distance and the corresponding overall RMSE, which is calculated by using local sites' data as the test sets. We divided the 35 sites in Beijing into urban sites (17 sites) and suburban sites (18 sites) according to the administrative division that they are located in (sites in Dongcheng, Xicheng, Haidian, Chaoyang, Fengtai, Shijingshan are urban sites and the rest are suburb sites) and the results are given in Figure 6. The trend in the two fitting curves is similar, decreasing first and then rising. The results demonstrate that the data from sites that are far away from the local site will result in negative impacts on local model training. However, it when the bandwidth is too small, it is difficult to take advantage of the insufficient training data for capturing the localized relationship between PM2.5 and independent variables, so the accuracy is also low. The optimal bandwidth is different for two types of sites-approximately 20 km for urban sites and approximately 50 km for suburban sites. This is because the density of urban sites is higher than that of suburban sites and abundant information used for local model training can be collected in a smaller bandwidth. In addition, the overall optimal RMSE of urban sites is lower than that of suburban sites, which also demonstrates that it significantly increases accuracy by utilizing spatially homogeneous data.

Variable Importance in SL-XGB
The values of variable importance in all local models were averaged and the top 10 important variables are shown in Figure 7. Among them, PBLH and AOD are the two most important variables with variable importance being 0.26 and 0.18, respectively. Here, PBLH made more contribution than AOD. This may be related to the frequent occurrence of haze events in Beijing, when the PM2.5 concentration is more sensitive to PBLH [55]. Moreover, PBLH was the mean of four time points and its temporal scale may be more similar to that of PM2.5 data. The variable importance value of V10 is higher than U10, so wind in a south-north direction has more impact on PM2.5 than wind in an eastwest direction. This may be due to the great amounts of pollutants from Hebei province, which is in the south of Beijing, drifting to Beijing through the wind in a south-north direction. Season and DOY are ranked as the 3rd and the 4 th most important independent variables, respectively, demonstrating the strong seasonal and daily variations of PM2.5. Due to high multicollinearity, these two variables were not used in the GWR model and this shows that machine learning models have better inclusiveness for variables. CII also has a relatively important impact on the final result, which

Variable Importance in SL-XGB
The values of variable importance in all local models were averaged and the top 10 important variables are shown in Figure 7. Among them, PBLH and AOD are the two most important variables with variable importance being 0.26 and 0.18, respectively. Here, PBLH made more contribution than AOD. This may be related to the frequent occurrence of haze events in Beijing, when the PM 2.5 concentration is more sensitive to PBLH [55]. Moreover, PBLH was the mean of four time points and its temporal scale may be more similar to that of PM 2.5 data. The variable importance value of V10 is higher than U10, so wind in a south-north direction has more impact on PM 2.5 than wind in an east-west direction. This may be due to the great amounts of pollutants from Hebei province, which is in the south of Beijing, drifting to Beijing through the wind in a south-north direction. Season and DOY are ranked as the 3rd and the 4th most important independent variables, respectively, demonstrating the strong seasonal and daily variations of PM 2.5 . Due to high multicollinearity, these two variables were not used in the GWR model and this shows that machine learning models have better inclusiveness for variables. CII also has a relatively important impact on the final result, which indicates the close relationship between urbanization and PM 2.5 . However, most spatially geographical variables contributed little to the result, because they are temporally constant.
Remote Sens. 2020, 12, x FOR PEER REVIEW 12 of 20 indicates the close relationship between urbanization and PM2.5. However, most spatially geographical variables contributed little to the result, because they are temporally constant.

Seasonal and Annual PM2.5 Distribution
Seasonal and annual PM2.5 spatial distribution estimated by SL-XGB are presented in Figure 8. Due to the high resolution of SARA AOD, the PM2.5 estimations obtained provide more spatial details than the results from some previous studies with a "km" level resolution [56,57]. In comparison, the use of full-covered SARA AOD provides smoother results with less noise interference [24,25]. PM2.5 pollution was most serious in winter, with the highest average PM2.5 concentration reaching 145 μg/m 3 in some areas. This figure is several times greater than the "good" standard according to National Ambient Air Quality Standard of China (35 μg/m 3 ). The overall average PM2.5 concentration in summer is as low as 35 μg/m 3 in some northern areas, although there is still some light pollution in the southeastern areas. This suggests a gradient increase in PM2.5 concentrations from the northwest to the southeast in all seasons. This pattern is similar to the spatial distribution of AOD, which also indicates a strong relationship between AOD and PM2.5. There are two reasons for the high PM2.5 concentrations in southeastern areas: one is that urban areas are mainly located in southeastern areas, so more pollutants are produced there. The other is that there are many highemission factories in Hebei province which are adjacent to these areas, thus external pollutants also cause an adverse impact. While in in northwestern areas, the main land use types are forest and grassland, thus improving air pollution in a sense.

Seasonal and Annual PM 2.5 Distribution
Seasonal and annual PM 2.5 spatial distribution estimated by SL-XGB are presented in Figure 8. Due to the high resolution of SARA AOD, the PM 2.5 estimations obtained provide more spatial details than the results from some previous studies with a "km" level resolution [56,57]. In comparison, the use of full-covered SARA AOD provides smoother results with less noise interference [24,25]. PM 2.5 pollution was most serious in winter, with the highest average PM 2.5 concentration reaching 145 µg/m 3 in some areas. This figure is several times greater than the "good" standard according to National Ambient Air Quality Standard of China (35 µg/m 3 ). The overall average PM 2.5 concentration in summer is as low as 35 µg/m 3 in some northern areas, although there is still some light pollution in the southeastern areas. This suggests a gradient increase in PM 2.5 concentrations from the northwest to the southeast in all seasons. This pattern is similar to the spatial distribution of AOD, which also indicates a strong relationship between AOD and PM 2.5 . There are two reasons for the high PM 2.5 concentrations in southeastern areas: one is that urban areas are mainly located in southeastern areas, so more pollutants are produced there. The other is that there are many high-emission factories in Hebei province which are adjacent to these areas, thus external pollutants also cause an adverse impact. While in in northwestern areas, the main land use types are forest and grassland, thus improving air pollution in a sense. Remote Sens. 2020, 12, x FOR PEER REVIEW 13 of 20 Figure 8. The spatial distribution of the seasonally and annually averaged PM2.5 in Beijing.

The Effect of the AOD Gap-Fill Process on PM2.5 Estimation
To explore the impact of the AOD gap-fill process on PM2.5 estimation, we also applied the SARA AOD with gaps to predict PM2.5 spatial distribution as the comparison. For predictions in grids where ground sites are located, the annually averaged PM2.5 concentration using the AOD gap-fill process (called "GF" hereafter) was 80 μg/m 3 . The average PM2.5 concentration without using the gap-fill process (called "NGF" hereafter) was lower, 64 μg/m 3 . The true annually averaged PM2.5 concentration of ground observation sites is 78 μg/m 3 , which is closer to the GF result. This means that the result will be underestimated without using the AOD gap-fill process. Figure 9 presents the annually averaged spatial differences between GF and the NGF (GF minus NGF) in the whole study area and the average spatial differences between the true PM2.5 concentration and NGF and GF, respectively, in PM2.5 sites. The results show that the difference in northern areas is small, while the difference in southern areas is large-28 μg/m 3 -likely as a result of deficiencies in the data caused by the misidentification of cloud masks. On the one hand, there are many bright urban areas which are often misidentified as clouds in the south of Beijing. On the other hand, the haze events from Hebei province often pass by this area via the North China Plain, so that some data in severely polluted areas would be ignored without using the gap-fill process [24,58]. In terms of the comparison with true PM2.5 concentrations, basically the absolute differences of NGF are higher than that of GF in most ground sites, which also suggests the essential role of the gap-fill process in our study.

The Effect of the AOD Gap-Fill Process on PM 2.5 Estimation
To explore the impact of the AOD gap-fill process on PM 2.5 estimation, we also applied the SARA AOD with gaps to predict PM 2.5 spatial distribution as the comparison. For predictions in grids where ground sites are located, the annually averaged PM 2.5 concentration using the AOD gap-fill process (called "GF" hereafter) was 80 µg/m 3 . The average PM 2.5 concentration without using the gap-fill process (called "NGF" hereafter) was lower, 64 µg/m 3 . The true annually averaged PM 2.5 concentration of ground observation sites is 78 µg/m 3 , which is closer to the GF result. This means that the result will be underestimated without using the AOD gap-fill process. Figure 9 presents the annually averaged spatial differences between GF and the NGF (GF minus NGF) in the whole study area and the average spatial differences between the true PM 2.5 concentration and NGF and GF, respectively, in PM 2.5 sites. The results show that the difference in northern areas is small, while the difference in southern areas is large-28 µg/m 3 -likely as a result of deficiencies in the data caused by the misidentification of cloud masks. On the one hand, there are many bright urban areas which are often misidentified as clouds in the south of Beijing. On the other hand, the haze events from Hebei province often pass by this area via the North China Plain, so that some data in severely polluted areas would be ignored without using the gap-fill process [24,58]. In terms of the comparison with true PM 2.5 concentrations, basically the absolute differences of NGF are higher than that of GF in most ground sites, which also suggests the essential role of the gap-fill process in our study.

Discussion
This study presented a novel approach based on full-covered SARA AOD for PM2.5 estimation. The results suggest a relatively high spatial resolution and abundant spatial details, which can be implicated well at the urban scale. Compared with previous studies based on multisource data including satellite AOD for PM2.5 estimation, our study has some advantages as follows. First, PM2.5 estimation with a 500 m spatial resolution was derived for the urban scale. Many previous studies pay more attention to the PM2.5 estimation at a large scale, such as at the national scale. However, for the PM2.5 pollution studies in a specific area, some errors will be induced if using national data for estimation. This is because PM2.5 is significantly affected by local geographical environments and more negative effects of spatial heterogeneity will be introduced. Many PM2.5 estimation models based on the data for a certain city or an area have shown good performance, so it is worth developing a specific PM2.5 estimation model when investigating PM2.5 pollution at a small scale, such as the urban scale [25,59]. Thus, studies at a relatively small scale are also important. Compared with some previous studies using satellite AOD products at the urban or the urban agglomeration scales [57,60], our study has a finer spatial resolution at 500 m by utilizing SARA AOD data. Additionally, compared with studies using a data fusion method for higher resolution [23], SARA AOD is retrieved directly based on MODIS products, so that it can be applied relatively easily. Therefore, the series of operations we applied are more practical.
Another strength of our study is the gap-fill process by XGBoost. Gaps in AOD have a significantly negative impact on the PM2.5 estimation. As shown in the analysis in Section 3.6, not adopting a gap-fill process leads to significant biases. To address AOD data deficiency, previous studies often adopted three types of methods to fill the gaps. The first is to utilize the daily, weekly, or seasonal relationship between AOD and PM2.5 [61,62]. A disadvantage of this kind of method is the poor accuracy because it is difficult to capture the relationship between AOD and PM2.5 without inducing other independent variables. The second is to use interpolation or imputation such as kriging interpolation (KI) methods and multiple imputation (MI) [29,53]. It is convenient to implement these methods in order to fill deficient AOD but they cannot guarantee accuracy when there is a great amount of deficient data. Recently, machine learning methods were applied in the AOD gap-fill process. For example, some studies used RF based on multisource data such as

Discussion
This study presented a novel approach based on full-covered SARA AOD for PM 2.5 estimation. The results suggest a relatively high spatial resolution and abundant spatial details, which can be implicated well at the urban scale. Compared with previous studies based on multisource data including satellite AOD for PM 2.5 estimation, our study has some advantages as follows. First, PM 2.5 estimation with a 500 m spatial resolution was derived for the urban scale. Many previous studies pay more attention to the PM 2.5 estimation at a large scale, such as at the national scale. However, for the PM 2.5 pollution studies in a specific area, some errors will be induced if using national data for estimation. This is because PM 2.5 is significantly affected by local geographical environments and more negative effects of spatial heterogeneity will be introduced. Many PM 2.5 estimation models based on the data for a certain city or an area have shown good performance, so it is worth developing a specific PM 2.5 estimation model when investigating PM 2.5 pollution at a small scale, such as the urban scale [25,59]. Thus, studies at a relatively small scale are also important. Compared with some previous studies using satellite AOD products at the urban or the urban agglomeration scales [57,60], our study has a finer spatial resolution at 500 m by utilizing SARA AOD data. Additionally, compared with studies using a data fusion method for higher resolution [23], SARA AOD is retrieved directly based on MODIS products, so that it can be applied relatively easily. Therefore, the series of operations we applied are more practical.
Another strength of our study is the gap-fill process by XGBoost. Gaps in AOD have a significantly negative impact on the PM 2.5 estimation. As shown in the analysis in Section 3.6, not adopting a gap-fill process leads to significant biases. To address AOD data deficiency, previous studies often adopted three types of methods to fill the gaps. The first is to utilize the daily, weekly, or seasonal relationship between AOD and PM 2.5 [61,62]. A disadvantage of this kind of method is the poor accuracy because it is difficult to capture the relationship between AOD and PM 2.5 without inducing other independent variables. The second is to use interpolation or imputation such as kriging interpolation (KI) methods and multiple imputation (MI) [29,53]. It is convenient to implement these methods in order to fill deficient AOD but they cannot guarantee accuracy when there is a great amount of deficient data. Recently, machine learning methods were applied in the AOD gap-fill process. For example, some studies used RF based on multisource data such as meteorology in order to derive AOD and obtained great accuracy (sample-based CV R 2 is higher than 0.9) and full coverage [30,31]. To our knowledge, XGBoost was used to fill the deficient AOD at a fine scale for the first time. Compared with the RF model, XGBoost can guarantee high accuracy and high efficiency at the same time. Table S5 shows that XGBoost took much less time to achieve almost the same accuracy compared with RF. Thus, XGBoost may be the better choice for gap filling when there is too much deficiency in AOD.
The proposed model, SL-XGB, showed outstanding performance in PM 2.5 estimation, with a fitting R 2 of 0.93 and a CV R 2 of 0.88, which made the greatest contribution in our study. In contrast to some studies using statistical models such as GWR [63], as a machine learning model, SL-XGB has applicability in model fitting and its non-linear assumption is more in line with the relationship between AOD and PM 2.5 . Moreover, compared with some previous global models for PM 2.5 estimation [13,64], SL-XGB enhanced the prediction ability by taking the local relationship between independent variables and PM 2.5 into consideration, so it is more robust when it is difficult to capture the global relationship. Thus, the CV performance of our model is better than that in previous studies using Beijing as the study area [24,25,57] and some using Beijing-Tianjin-Hebei (BTH, including Beijing but with some more areas) as the study area [60,65,66]. The comparisons are listed in Table 3. Through combining the powerful fitting ability of machine learning as well as the optimal local bandwidths, the results show that the performance of both sample-based CV and site-based CV in SL-XGB is better than not only some statistical models such as mixed-effects models but also some global non-linear models such as deep neural network and random forest. Additionally, the spatial resolution of our study is the highest among these studies, so the results can be better utilized at the urban scale. SL-XGB is the extension of XGBoost in space when considering the spatial heterogeneity of the relationship between independent variables and PM 2.5 . Compared with only using one global XGBoost model, the predictive accuracy can be improved by utilizing several local sub-models. According to the first law of geography, for a geographical local object, the neighboring objects are more correlated with it than the objects that are far away. That is to say, for each local model, excluding the data that is far away from the local site in the training set can reduce spatial heterogeneity, which is beneficial for local model performance [67]. Therefore, bandwidth plays a very important role in model construction. As the results in Section 3.3 show, the optimal bandwidth for the urban local model is approximately 20 km and that for the suburb local model is approximately 50 km in our study. A similar experiment was also applied in investigation into the relationship between local sites' PM 2.5 concentrations and spatial neighboring sites' lagged PM 2.5 concentrations, which demonstrated that the optimal window size is also approximately 50 km [68]. Given that this study was conducted at an urban agglomeration scale with a relatively low density of ground sites, it may be considered that when sites are distributed sparsely, the PM 2.5 data for ground sites within 50 km may be more appropriate for most local model training. Additionally, it is also noted that the optimal bandwidth may vary in different areas due to the differences in geographical environments and socioeconomic development, so it deserves more discussion in the future by extending this work to more areas.
However, there are still some limitations in this study and some work needs to be finished in the future. The first is that our model did not consider the temporal heterogeneity. Some previous studies constructed local models such as geographically and temporally weighted regression (GTWR) [56,69] by introducing the concept of "spatiotemporal bandwidth" for improvement in both spatial domain and temporal domain, while the spatiotemporal relationship is still not clear for some non-parametric models such as XGBoost. Thus, in future research, we will consider temporal heterogeneity in machine learning model structures. Secondly, limited by data acquirement, PM 2.5 concentrations were estimated from only 35 sites in Beijing, which may produce some biases. Thirdly, the performance of SL-XGB on larger spatial scales or in other cities needs to be explored, so some work about this will be further conducted. Moreover, in terms of the work on the AOD gap-fill process, the spatiotemporal relationship between AOD and independent variables and some other variables such as particle emission data may be further introduced into the model for better performance. Lastly, we estimated PM 2.5 concentrations based on one year's data, so the performance of SL-XGB implemented over multiple years remains to be conducted, which may be achieved using tools such as Google Earth Engine (GEE) [3].

Conclusions
In this paper, we proposed a novel model, SL-XGB, for PM 2.5 estimation over Beijing based on full-covered SARA AOD at a 500 m spatial resolution. Firstly, full-covered AOD was obtained by XGBoost with good performance (sample-based CV R 2 in all seasons is higher than 0.9), which was very helpful for increasing the availability of SARA AOD and reducing biases in PM 2.5 estimation. Secondly, it proved that the strategy of "local regression' in GWR can be introduced in XGBoost for performance improvement when sampling points are insufficient. Compared with GWR (with a sample-based CV R 2 of 0.71 and a RMSE of 33.67 µg/m 3 ) and XGBoost (with a sample-based CV R 2 of 0.85 and a RMSE of 27.01 µg/m 3 ), SL-XGB had better applicability in estimating PM 2.5 (with a sample-based CV R 2 of 0.88 and a RMSE of 24.08 µg/m 3 ) as it can better address non-linear and spatial heterogeneity issues based on the powerful fitting ability of XGBoost and the division of optimal bandwidth. Moreover, in site-based CV, SL-XGB showed the best spatial generalization ability among the three models, with a R 2 of 0.86, a RMSE of 26.15 µg/m 3 and a MPE of 17.97 µg/m 3 . In terms of the result, the high spatial resolution of the estimation provided more spatial details of PM 2.5 distribution at the urban scale. This shows that PM 2.5 pollution in Beijing was serious, especially in winter, and pollution prevention in the southeastern areas of Beijing should be given more attention.
Despite some limitations, the present study can still provide great support for fine urban PM 2.5 exposure assessment and air pollution management and control. In the future, we will incorporate temporal heterogeneity and test the model performance on more urban areas by multitemporal data. Additionally, as more and more PM 2.5 ground sites are established in Chinese cities, the spatial heterogeneity of the relationship between AOD and PM 2.5 can be better explored and more accurate results are expected to be obtained by using our approach.