Retrieving Ground-Level PM 2.5 Concentrations in China (2013-2021) with a Numerical Model-Informed Testbed to Mitigate Sample Imbalance-Induced Biases

. Ground-level PM 2.5 data derived from satellites with machine learning are crucial for health and climate assessments, however, uncertainties persist due to the absence of spatially covered observations. To address this, we propose 15 a novel testbed using untraditional numerical simulations to evaluate PM 2.5 estimation across the entire spatial domain. The testbed emulates the general machine-learning approach, by training the model with grids corresponding to ground monitor sites and subsequently testing its predictive accuracy for other locations. Our approach enables comprehensive evaluation of various machine-learning methods’ performance in estimating PM 2.5 across the spatial domain for the first time. Unexpected results are shown in the application in China, with larger PM 2.5 biases found in densely populated regions with abundant 20 ground observations across all benchmark models, challenging conventional expectations and are not explored in the recent literature. The imbalance in training samples, mostly from urban areas with high emissions, is the main reason, leading to significant overestimation due to the lack of monitors in downwind areas where PM 2.5 is transported from urban areas with varying vertical profiles. Our proposed testbed also provides an efficient strategy for optimizing model structure or training samples to enhance satellite-retrieval model performance. Integration of spatiotemporal features, especially with CNN-based 25 deep-learning approaches like the ResNet model, successfully mitigates PM 2.5 overestimation (by 5-30 µg m -3 ) and corresponding exposure (by 3 million people • µg m -3 ) in the downwind area over the past nine years (2013-2021) compared to the traditional approach. Furthermore, the incorporation of 600 strategically positioned ground-measurement sites identified through the testbed is essential to achieve a more balanced distribution of training samples, thereby ensuring precise PM 2.5 estimation and facilitating the assessment of associated impacts in China. In addition to presenting the 30 retrieved surface PM 2.5 concentrations in China from 2013 to 2021, this study provides a testbed dataset derived from physical modeling simulations which can serve to evaluate the performance of data-driven methodologies, such as machine learning, in estimating spatial PM 2.5 concentrations for the community.


Introduction
Accurate knowledge of PM 2.5 pollution is vital for understanding its impact on human health (Lelieveld et al., 2015;Geng et al., 2021) and the climate (Mitchell et al., 1995;Bellouin et al., 2005).Satellite products provide direct measurements of aerosol loading on broad spatial and temporal scales.While the aerosol optical depth (AOD) measured by satellites reflects the total column of particulate matter, this is challenged by the complex relationship between AOD and ground PM 2.5 influenced by various factors (Hoff and Christopher, 2009), including aerosol chemical composition and vertical profiles.Compared to traditional statistics, machine learning excels in addressing nonlinearities.Therefore, numerous recent studies leverage machine learning, such as in the random forest (RF) (Hu et al., 2017), XGBoost (Xiao et al., 2018), lightGBM (Zhong et al., 2021), and deep-learning (Li et al., 2020;Yan et al., 2020;Wang et al., 2022a, b;Wei et al., 2023) models, to establish correlations between AOD and PM 2.5 , treating AOD and related factors, including meteorological variables, as features for predicting surface PM 2.5 based on ground measurements (Ma et al., 2022).However, a limitation arises as most ground measurements are concentrated in urban and polluted areas.Their main purpose is to monitor the high pollution levels to protect human health, leading to an uneven spatial distribution.It is expected that training models predominantly on urban sites will introduce an imbalance in ground-based measurements, resulting in significant uncertainties in spatially allocating surface PM 2.5 based on satellite AOD (Shin et al., 2020).This deficiency might be particularly notable in suburban areas experiencing downwind transport of PM 2.5 from urban areas (Bai et al., 2022).The discrepancy between urban and downwind sites largely lies in their vertical profiles of aerosol across the entire vertical layer.Urban sites, which have abundant emission sources such as residential areas, transportation, construction, and industries, exhibit a higher share of ground-level aerosol relative to the total AOD compared to downwind and rural areas, where pollution tends to be lifted to upper layers through atmospheric dynamics.Accurately representing the varying aerosol vertical profiles in source/urban and downwind/rural areas is crucial for retrieving ground-level PM 2.5 from the AOD.However, imbalanced training samples make the machine-learning model unable to adequately capture such variations.The traditional cross-validation methods based either on samples or sites (Dong et al., 2020), which still rely mostly on samples available in urban sites, fail to provide a comprehensive assessment of model performance across the entire prediction space.Consequently, uncertainties in PM 2.5 estimation for these areas remain unexplored, and solutions to reduce such uncertainties are yet to be developed.
To overcome these limitations, we introduce a novel testbed utilizing a numerical model, specifically a chemical transport model (CTM) such as the Community Multi-scale Air Quality Modeling System developed by the U.S. Environmental Protection Agency (EPA) and its community (Appel et al., 2013), to establish ground-truth data beyond monitoring points.This allows for the evaluation of interpolation performance using various machine-learning models and provides solutions to mitigate the uncertainties stemming from the sample imbalance problem.More specifically, we emulate traditional machine-learning methods by using CTM-simulated PM 2.5 concentrations in grid cells corresponding to ground monitoring sites as labels for training machine-learning models.Subsequently, we validate the trained model's performance in predicting PM 2.5 concentrations in other grid cells.In addition to providing a "ground truth" for assessing performance across the entire space, the CTM-simulated data act as a testbed for efficiently seeking solutions to enhance satellite-retrieval model performance.This involves optimizing features, model structures, and training samples, as depicted in Fig. 1.

Methods
The proposed testbed is implemented in a Chinese domain, utilizing 1 whole year's simulation for 2017 with a 27 km by 27 km resolution.To ensure internal consistency during training, all the feature and label data are derived from the input and output of the commonly used Community Multiscale Air Quality (CMAQ) model.The meteorological variables include U wind, V wind, humidity, 2 m temperature, the convective velocity scale, shortwave radiation, 10 m wind speed, planetary boundary layer (PBL) height, leaf area index (LAI), cloud fraction, and precipitation and are simulated by the Weather Research and Forecasting (WRF) model (Skamarock et al., 2008).The simulated AOD is calculated based on simulated PM 2.5 chemical compositions and corresponding meteorological variables across all vertical layers (Liu et al., 2010).We also include the NO 2 column density as an important feature as it is highly correlated with emission sources and can be directly observed from satellites to better represent the emission information (Martin et al., 2003).Similarly, the simulated NO 2 column density is calculated based on simulated NO 2 concentrations and corresponding meteorological variables across all vertical layers.
In addition, we conduct model training using 9-year observational data from 2013 to 2021 to evaluate potential biases under real-world conditions.This is quantified by measuring the difference in retrieved PM 2.5 between the traditional model and the improved retrieval method optimized with the proposed testbed.The dataset for this evaluation comprises MODerate-resolution Imaging Spectroradiometer (MODIS) (Remer et al., 2008) satellite observations for AOD, Ozone Monitoring Instrument (OMI) (Celarier et al., 2008) satellite data for NO 2 column density, and ground monitoring observations of PM 2.5 from the China National Environmental Monitoring Center (CNEMC, covering a total of more than 600 grid cells of 27 km by 27 km) (Kong et al., 2021).Following the same data-filling method (He et al., 2020), we conduct data filling for the satellite measurement of NO 2 column density and AOD when applying our approach to real data.The generation of testbed data and the machinelearning methods are detailed as follows.

WRF and CMAQ numerical models
In this study, we utilized version 5.2 of the CMAQ model (Appel et al., 2018), incorporating the Carbon Bond 6 (Yarwood et al., 2010) gas-phase chemistry mechanism and the AERO6 particulate matter chemistry mechanism.CMAQ, a widely recognized CTM, is renowned for its accurate simulation of air pollutant concentrations, including PM 2.5 , which is attributed to its comprehensive representation of particulate matter formations.Meteorological data were generated using the WRF model, version 3.8, configured in the same manner as our previous studies (Ding et al., 2019a, b).Emission data were obtained from the highresolution emission inventory developed by Tsinghua University (ABaCAS-EI) (Zheng et al., 2019), characterized by a spatial resolution of 27 km by 27 km and a temporal resolution of 1 h to match the CMAQ model.Biogenic emissions were derived from the estimation of the Model for Emissions of Gases and Aerosols from Nature (MEGAN) (Guenther et al., 2012).We conducted a thorough assessment of the performance of WRF and CMAQ in simulating meteorological variables and air pollutant concentrations, employing extensive comparisons with observational data in our previous studies (Ding et al., 2019a, b).
We align the simulated PM 2.5 concentrations from CMAQ with the CNEMC based on their respective locations, treating them as the "label" for training the machine-learning model.Though previous studies provide some validation schemes to evaluate the model's extrapolation capacity (Dong et al., 2020), for the remaining grid cells that encompass surrounding PM 2.5 areas where observations are not available, the prehttps://doi.org/10.5194/essd-16-3781-2024 Earth Syst.Sci.Data, 16, 3781-3793, 2024 dicted concentrations with machine-learning methods cannot be directly compared and examined.This paper focuses on assessing the model's performance in predicting these points, accounting for over 90 % of the total number of grid cells.The simulation data serve as the ground truth for the evaluation of the output of the machine-learning model.The WRF and CMAQ simulations were evaluated in our previous studies (Ding et al., 2019a, b), demonstrating acceptable agreement with CNEMC observations, albeit with limitations in areas where observations are available.In rural areas where no observations are available, direct comparison of CMAQ predictions with actual observations is not possible.However, the CMAQ data used in this study primarily serve to establish a testbed representing scenarios based on physical laws such as emission, diffusion, advection, and deposition.This approach contrasts with reanalysis or data fusion methods, which may deviate from these physical functions, even though they might exhibit better agreement with observations when available.

Decision-tree-based machine-learning method
This study employed three decision-tree-based machinelearning algorithms, i.e., random forest (Belgiu and Drȃguţ, 2016), XGBoost (Chen and Guestrin, 2016), and LightGBM (Ke et al., 2017), to serve as benchmark cases given their widespread use in previous studies.Additionally, Deep Forest (denoted as DeepRF in this study) (Zhou and Feng, 2019), known for its superior performance (Wei et al., 2023), was included as an additional method to be evaluated in this study.
We incorporated similar features used in the machinelearning model, including observed meteorological variables (WRF output) and land use information.The reason is that we deliberately avoided using CTM simulation results for two key reasons, while some previous studies included CTM results as additional features in training machine-learning models.First, the CTM will be applied to the testbed to evaluate the model's performance, and introducing CTM results could leak information as these results are utilized as labels and therefore cannot be used as input thereafter.Second, we aimed to propose a comprehensive CTM-free method that relies exclusively on satellite products and meteorological variations obtained from observations.This choice is motivated by the low efficiency when using the CTM and the uncertainties that this introduces.Furthermore, the only additional information provided by the CTM is related to emissions, which still suffer from uncertainties.Therefore, instead of relying on the CTM or prior emission data, we introduce the NO 2 column density.This variable is highly correlated with emission sources and can be directly observed from satellites, offering a more accurate representation of emission information.
Given our objective of assessing grid cells outside the designated label, there is no overlap between the training and test datasets.To evaluate the model's performance on the la-bels, we employ temporal validation.Specifically, the model is trained using data from only the first 25 d of each month, and the remaining days are reserved for testing.This approach helps gauge the model's effectiveness in handling temporal variations and provides a robust assessment of its performance on the specified labels.We fixed the days for training rather than selecting them randomly to ensure that all the methods use exactly the same data for training and testing, enabling a fair comparison.Random selection would still require us to fix the randomly selected days for all the methods, similar to fixing all the days at the outset.Moreover, the purpose of this study is to investigate prediction errors for grid cells not included in the training dataset.Even when using the first 25 d of the training dataset, we consistently observe similar prediction errors in other sites (similar to out-of-sample validation), regardless of which days are selected for training or testing.

Residual neural network (ResNet) method
The incorporation of spatiotemporal-neighborhood features is crucial for enhancing the model's ability to discern the evolution of vertical profiles in both urban and downwind areas (Chen et al., 2023).Beyond simply including corresponding features from the surrounding neighborhood grid cells as additional predictors of PM 2.5 concentrations at target grid cells in decision-tree-based methods, we also employ a deep-learning method, i.e., ResNet (He et al., 2016).This choice is motivated by its demonstrated advantage in handling the nonlinearity inherent in atmospheric processes, as suggested in our previous study (Xing et al., 2020).
ResNet consists of an initial layer with 128 channels and incorporates eight residual blocks.The feature maps, encompassing meteorological variables, land use information, and AOD, are fed into the conventional neural network (CNN)based structure with a 3-by-3 kernel size, as illustrated in Fig. S1 in the Supplement.Additionally, we incorporate the feature into the previous and next days to help capture the transport flow of pollutants in the model.The training loss will concentrate solely on points corresponding to the monitoring sites, generating predictions exclusively for these specific locations given the scattered nature of the labels.As a result, predictions for other points will be entirely out of sample, relying on data from the same locations as the monitoring sites.
One thing should be noted: all machine-learning methods use the same input features to ensure a fair comparison.The only difference is that the features for the new proposed methods (e.g., ResNet) include data from the neighborhood (nearby grid cells and the previous or next time steps) in addition to the local grid and time data.
Throughout the training phase, we employed the mean squared error (MSE) loss function, conducting a total of 1000 epochs, which demonstrated sufficient effectiveness in achieving robust performance during both training and test-Earth Syst.Sci.Data, 16, 3781-3793, 2024 https://doi.org/10.5194/essd-16-3781-2024 ing.The learning rate started at 0.0001 and underwent linear decay, reaching zero by the conclusion of the training process.Additionally, we utilized the Adam optimizer (Kingma and Ba, 2014) to enhance the convergence of the model.

Results
3.1 Imbalance in site distribution leads traditional methods to overestimate downwind PM 2.5 To explore uncertainties in traditional machine-learning methods, we initially adhere to their typical design, relying exclusively on local features within each grid cell.This approach involves utilizing only the feature data from the same location as the target grid cell.The model trained with the RF method successfully captures the spatial distribution of PM 2.5 , showing elevated levels in eastern China and lower levels in the west (Fig. 2a-b).It exhibits acceptable performance for the label grid cells during validation (Fig. 2c-d: R 2 = 0.98 and RMSE = 5.28 µg m −3 in the training dataset, R 2 = 0.81 and RMSE = 16.1 µg m −3 in the test dataset).However, considerable errors are observed across space, particularly in polluted regions with high baseline PM 2.5 concentrations (Fig. 2e).Positive biases (i.e., predictions greater than those of CMAQ) increase with the distances from the monitoring sites, even as PM 2.5 concentrations decrease.This suggests an overestimation in predictions for downwind areas away from the monitoring sites (Fig. 2f).This is mainly attributed to the traditional model's difficulty in discerning variations in vertical profiles between urban and suburban areas.Training is primarily focused on urban areas, where pollution is concentrated near the surface due to ground-level emission sources.In contrast, pollution in downwind areas is transported aloft.Therefore, the model, trained on urban sites where ground-level pollution from AOD is more prominent, failed to accurately capture the diverse aerosol vertical profiles in source/urban and downwind/rural areas.This discrepancy resulted in overestimations in downwind areas (as illustrated in Fig. 2g).
Contrary to traditional expectations, significant absolute errors mostly occur in eastern China rather than in the west due to the large baseline concentration, even though the relative error is smaller (about 20 %) (Fig. S2 in the Supplement).While the east has more densely located monitoring sites, these are primarily situated in urban centers.This imbalance in site distribution, combined with much higher concentrations, results in substantial biases in eastern China.
Similar phenomena are observed in three other benchmark models that have been applied in previous studies, i.e., Xg-Boost, LightGBM, and DeepRF.All of these models demonstrate robust performance in both training and testing at the monitoring sites (R 2 >0.8 and RMSE <16.2 µg m −3 in the test cases, as depicted in Fig. S3 in the Supplement).However, they display similar uncertainties in downwind PM 2.5 , with significant errors occurring in the surrounding grid cells of the monitoring sites rather than in remote sites where concentrations are relatively low.Clearly, we can conclude that the uneven distribution of sites introduces considerable biases into PM 2.5 estimation when using traditional methods that rely on local features.

Inclusion of spatiotemporal-neighborhood features
improves the surrounding PM 2.5 prediction over traditional approaches As previously discussed, the ineffectiveness of a machinelearning model trained on imbalanced samples can be attributed primarily to insufficient information regarding the spatial variation of vertical profiles from the source to the downwind area.To enhance the integration of crucial information regarding vertical profiles, we introduce spatiotemporal-neighborhood features into the model.This addition aims to empower the model with the ability to distinguish between urban and downwind areas.Leveraging the CNN-based structure, known for its effectiveness in exploring nonlinear relationships between neighboring grid cells, we opt for the widely used deep-learning ResNet model.This choice facilitates the establishment of nonlinear relationships between predicted PM 2.5 concentrations and multiple spatiotemporal-neighborhood features.In contrast to traditional methods that solely focus on single time features, our approach incorporates both preceding and succeeding time features to enhance a model's capacity to discern differences between urban and downwind grid cells.This inclusion is motivated by the fact that plume transport is predominantly influenced by flow dynamics represented by the variation in the temporal neighborhood (before and after) features.Our previous studies also demonstrated the effectiveness of linking grid cells to time series information on PM 2.5 estimation, underscoring the rationale for this inclusive approach (Teng et al., 2023;Ding et al., 2024).Additionally, considering that AOD measured by satellites, such as MODIS, captures only a single time step while predictions are made for daily averages, the inclusion of extra time step information proves beneficial in capturing a broader temporal context compared to a single time snapshot (the model is called ResNet-time).
The results indicate that, while the spatial pattern of PM 2.5 predicted with ResNet closely resembles that of other models (Fig. 3a), it significantly enhances model performance in predicting PM 2.5 for both the training dataset (reducing RMSE from 4 to 2 µg m −3 and increasing R 2 slightly) and the test dataset (reducing RMSE from 14 to 8 µg m −3 and increasing R 2 from 0.8 to 0.9).This improvement can be attributed to the incorporation of both spatial and temporal features.The performance of the traditional RF model is enhanced by replacing it with the ResNet model, and this improvement is further amplified by including temporal features (previous and next time steps) in the ResNet-time model (see Fig. 3b).The incorporation of surrounding features in the ResNet-time model significantly mitigates both absolute and https://doi.org/10.5194/essd-16-3781-2024 Earth Syst.Sci.Data, 16, 3781-3793, 2024 relative errors in eastern China across the spatial domain (see Figs. 3c and S2).However, some deterioration is observed in the west, which is primarily attributable to limited samples.The model, becoming more complex, lacks sufficient training samples in the west, leading to overfitting in that region.The inclusion of spatiotemporal-neighborhood features also significantly improves the performance of traditional benchmark models by incorporating corresponding features of the surrounding eight neighborhood grid cells and the temporal (before and after) neighborhood information as additional predictors of PM 2.5 concentrations at the target grid cells.Improvements are observed in both the training and test datasets across all four benchmark models, as depicted in Fig. S4 in the Supplement.Notably, all the models demonstrate a reduction in RMSE after integrating spatiotemporalneighborhood features, especially for the downwind area (within a distance of one to three grid cells) (see Fig. 3d).However, performance is barely improved or even worsens in faraway sites (distance of more than four grid cells) due to the limitations of the training samples.Even the ResNet-time model demonstrates better performance, primarily in eastern China, where the distance to the monitoring sites is within zero to two grid cells.The performance is slightly worse in the western region, where the distance to the monitoring sites exceeds four grid cells (Fig. 3d).The "new" method, applied to the original tree-based method, also shows superior performance compared to the original, although it performs slightly worse in western China.Clearly, enhancing the training sample is crucial for further improving the model predictions, as discussed in the following.

Balancing site distribution is crucial for improving
the prediction for the entire space of PM 2.5 Utilizing the ResNet-time model, we explore the correlation between model errors and the distance to the monitoring sites, together with the concentrations at the nearest monitor.Notably, significant errors were observed in sites within a distance of two grid cells (refer to Fig. S5 in the Supplement) and those near monitors exhibiting high concentrations (refer to Fig. S6 in the Supplement).Consequently, two criteria, i.e., (1) the baseline concentration in nearby monitoring sites (referred to as "B-conc") and (2) the distance from monitoring sites (referred to as "D-site"), are established to select the potential samples to refine predictions across the spatial domain.
Three sample groups are delineated based on these criteria: 1. B-conc >30 µg m −3 and D-site of one to five grid cells; 2. B-conc within 20-30 µg m −3 and D-site of two to five grid cells; and Earth Syst.Sci.Data, 16, 3781-3793, 2024 https://doi.org/10.5194/essd-16-3781-2024This design not only focused on the area suffering from large impacts of pollution but also allowed the selection of sites in remote regions with moderate baseline concentrations, as illustrated in Fig. S7 in the Supplement.To enhance the representativeness of the chosen sites, random selections are independently conducted in each of the three groups, encompassing 10 % (∼ 300 sites, half of the existing sites), 20 % (∼ 600 sites, equal to the existing sites), 30 % (∼ 900 sites, 1.5 times the existing sites), 40 % (∼ 1200 sites), 70 % (∼ 2100 sites), and all of the samples (∼ 3000 sites).The testbed developed in this study enables an efficient evaluation of the model's performance by training it with these additional sites.
The results indicate that an increase in the number of training samples effectively enhances a model's performance in PM 2.5 estimation, with RMSEs continuously decreasing as the number of samples increases (see Fig. 4a).This improvement is primarily observed in the downwind area, while the performance at the monitoring sites deteriorates due to the original model being overfitted to these specific sites (Fig. 4b).The rate of improvement diminishes after the inclusion of 20 % of the samples, implying that just doubling the current ground monitors wisely can effectively balance the training samples to ensure the accuracy of PM 2.5 estimations (RMSE reduced by 20 %-30 %).As illustrated in the example with 20 % sample inclusion in Fig. 4c-e, it is recommended that more than half of the new sites be set up in eastern China, where PM 2.5 concentration is high.Additionhttps://doi.org/10.5194/essd-16-3781-2024 Earth Syst.Sci.Data, 16, 3781-3793, 2024 ally, it is suggested that 10 % of the sites be set up in remote areas that are influenced by transport from heavy pollution regions but lack nearby ground measurements.The inclusion of additional sites proves effective in significantly reducing prediction errors across the entire spatial domain, leading to much closer agreement with the ground truth (Fig. 2a) in the PM 2.5 spatial pattern.
It is important to acknowledge that errors may be influenced by factors beyond site distribution problems, such as systematic errors arising from insufficient features.Baseline errors are referenced to those trained with all points using ResNet, amounting to within 1.7 µg m −3 (Fig. S8 in the Supplement).Similarly, training with all points may increase errors in monitoring sites, as the original model might be overfitted to these sites rather than representing the overall situation (Fig. 4b).

Potential biases and optimized site selections under real-world conditions
While ground measurements are unavailable for the entire space, we conducted the evaluation using both the traditional RF method and the ResNet-time model developed previously with satellite data.Both models were trained using real-world satellite data and ground monitoring PM 2.5 observations during 2013-2021, and their differences can be considered part of the potential biases associated with the influence of incorporating spatiotemporal features for enhancing the model's ability to identify vertical structures.
The results suggest that both models effectively replicate the time series of monthly mean PM 2.5 concentrations across monitoring sites from 2013 to 2021 (see Fig. 5a-b).However, considerable disparities emerge in their predictions for other areas (Fig. 5c).The new predictions using the ResNettime model generally exhibit lower PM 2.5 concentrations, particularly in the northern and western regions (Fig. 5df), with a more significant impact observed as the distance to the ground monitoring sites increases (Fig. 5j).A notable discrepancy in the population-weighted PM 2.5 concentration is observed in eastern China, which has a large population, implying that the errors also applied to human health assessment (Fig. 5g).Since the ResNet-time model demonstrates superior performance compared to the traditional RF model in both training (reducing RMSE from 6 to 2.5 µg m −3 and increasing R 2 from 0.9 to 1.0) and test data (reducing RMSE from 20 to 15 µg m −3 and increasing R 2 from 0.4 to 0.6), it appears that traditional methods might significantly overestimate PM 2.5 concentrations (by 5-30 µg m −3 ) and PM 2.5 exposure in suburban and rural areas by 3 million people • µg m −3 (Fig. 5k) due to the sample imbalance problem throughout 2013-2021.Similar results are also suggested in the other three benchmark models (Figs.S9-S10 in the Supplement).The actual errors might be even larger, as the inclusion of spatiotemporal-neighborhood features in the ResNet-time model can only mitigate a portion of the errors.
Incorporating a significant number of additional sites is necessary to balance the training samples and further reduce the uncertainties.Following the two previously defined criteria (i.e., B-conc and D-site), three groups of samples are selected.Similarly, 20 % of the samples (631 in total, close to the number of existing sites) in each group are proposed as potential additional sites in the future, as presented in Fig. 5lm.Group 1 (30 % of the total number of add-on sites) is primarily situated in polluted regions (B-conc >60 µg m −3 , D-site one to five grids), encompassing areas such as the Beijing-Tianjin-Hebei region and the desert region in the west.Group 2 (40 % of the total number of add-on sites) represents sites with a moderate distance to existing monitoring sites with a heavier pollution level, compared to Group 1 (B-conc within 40-60 µg m −3 , D-site two to five grid cells).Lastly, Group 3 (30 % of the total number of add-on sites) represents sites located far away from existing monitoring sites with a low pollution level (B-conc <40 µg m −3 , D-site four to five grid cells), situated in remote areas with limited influence from transport originating in polluted regions.As indicated by the previous testbed analysis, including these additional sites has the potential to reduce errors by at least 20 %, leading to more accurate machine-learning estimation of PM 2.5 concentrations with a more balanced training sample set.

Data availability
The numerical-model-informed testbed and the corresponding estimated PM 2.5 concentrations spanning the 9 years (2013-2021) can be found at https://doi.org/10.5281/zenodo.11122294 (Li et al., 2024a), and an updated version with files in NetCDF format can be found at https://doi.org/10.5281/zenodo.12636976(Li et al., 2024b).In addition to the longterm PM 2.5 dataset created using our new method, which can be used for health assessments and studying air pollution influences, we provide testbed data crucial for evaluating machine-learning-based retrieval methods, especially in scenarios where no ground-truth data are available.
The testbed dataset includes all inputs and outputs following the physical model simulation, which naturally correlates with physical laws such as emission, diffusion, advection, and deposition, representing typical conditions that any prediction method should meet.These data can be used to evaluate and compare methods using the same dataset, allowing for continuous improvement.Besides traditional crossvalidation, our proposed testbed validation is highly recommended for examining a method's predictive ability.We will continue updating the testbed data for other pollutants and with different resolutions and regions in future studies.

Discussion and conclusions
Amidst the advancements in satellite products and machinelearning techniques, ground-level PM 2.5 data have found extensive applications in health assessments and related fields.However, their uncertainties have remained unexplored due to the lack of ground-truth data covering the whole space.This study designed a physically informed testbed by leveraging CTM simulations to evaluate PM 2.5 estimation across the entire spatial domain and quantified the associated uncertainties in the PM 2.5 mapping across the whole space.Traditionally, it was believed that errors would be signifi-cant in remote areas with few or no ground-based measurements, while observation-dense regions, following the spatial interpolation principle, were expected to exhibit better accuracy.Contrary to our expectations, our findings reveal that the largest absolute biases occur differently.One reason is the heavier baseline PM 2.5 concentration, and another significant factor is the sample training imbalance problem.Ground-based measurements, designed for monitoring heavy pollution, are predominantly located in urban or industrial areas.Using these measurements as training samples misleads the machine-learning model into assuming uniform similarity to urban sites, especially in vertical structures.In reality, the vertical profile varies significantly with the flow after the pollutant is emitted from the source.This sample imbalance issue causes the machine-learning model to fail to provide accurate predictions for PM 2.5 across the entire spatial domain.The newly developed testbed also enables us to seek the best solutions, such as optimizing model structures or enhancing training samples, to improve satellite-retrieval model performance.Our results underscore the importance of incorporating spatiotemporal features to enhance the machine-learning model's ability to identify differences between urban and downwind conditions that are not explored in the recent literature.However, fully addressing the sample imbalance problem necessitates the addition of more ground monitoring sites to achieve a more balanced distribution of training samples for machine learning in China.In recent years, the Chinese government has expanded monitoring sites towards suburban areas, increasing the total number of monitoring sites by about 400 (from about 1600 in 2017 to about 2020 in 2021).While these additional samples have effectively improved PM 2.5 predictions (as presented in Fig. S11 in the Supplement), they only account for about 100 grid cells in the 27 km-by-27 km domain.According to the estimations in this study, approximately 600 grid cells are needed to locate monitoring sites in the future.Some studies incorporate CTM simulation data as an additional feature for predicting PM 2.5 , while the uncertainties of CTM hinder performance enhancement.To demonstrate this, we conducted RF predictions with CMAQ data as an additional feature, but the improvement compared to the original RF model was minimal, especially when compared to using the additional neighborhood information proposed in this study (Fig. S12 in the Supplement).Besides, compared to CTM simulations, NO 2 column density better represents emission information and can significantly enhance model performance.As illustrated in Fig. S13 in the Supplement, excluding NO 2 column data from the features used in the machine-learning model reduces its performance in predicting surface PM 2.5 , leading to even more errors due to the sample imbalance problem.
Although this study is conducted at a relatively coarse resolution of 27 km over China due to the computational burden of running a CTM at a fine resolution in a large-scale domain, the testbed method proposed here can also be applied with higher-resolution retrievals when the simulation data are available.A similar testbed study conducted in the Continental United States (CONUS) domain at 12 km resolution revealed the same imbalance problem (Zhang et al., in preparation), indicating that this issue persists at finer-resolution scales, especially in urban and industrial areas, due to spatial heterogeneity in emissions (Li and Xing, 2024) and the complexity of spatial gradients of particulate matter pollution observed at high resolution through AOD (Lin et al., 2021).At a fine resolution (e.g., 1 km), while the number of observation sites may increase slightly (eliminating the need for grouping to one 27 km grid cell like in this study), the number of grid cells to be predicted increases significantly.Therefore, it is essential to conduct similar testbed studies using 1 km CMAQ results (Tao et al., 2020) to evaluate the performance of machine-learning methods.This might be more feasible with a more comprehensive CMAQ dataset using nesting rather than specific subdomains.
This study also successfully demonstrates leveraging of the CTM to generate abundant data to test machinelearning methods, overcoming limitations associated with data availability.While derived from a numerical-modelbased testbed, it is important to acknowledge that the numerical model itself may encounter uncertainties related to emissions and chemical mechanisms, potentially leading to discrepancies with real observations.Nevertheless, the testbed serves as a specific scenario for evaluating satellite-retrieval methods, with the expectation that these methods should perform effectively in various scenarios, including those generated from CTM simulations.Therefore, the errors observed in the CTM-based testbed also imply their existence when applied to real data.Additionally, although this study primarily focuses on the analysis of PM 2.5 in China, the identified errors may extend to other pollutants and countries as a whole, particularly when facing similar sample imbalance problems (i.e., lacking suburban or rural representative sites).Leveraging the testbed developed in this study can be immensely helpful in examining uncertainties in other pollutants and countries or in other geoscience applications facing similar sample imbalance challenges.
Author contributions.SL and JX designed the experiments and carried them out.YD helped with the data processing.JSF helped with the manuscript review.SL prepared the manuscript with contributions from all the co-authors.
Competing interests.The contact author has declared that none of the authors has any competing interests.
Disclaimer.Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper.While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.
Acknowledgements.The authors are grateful for the free use of the MODIS-AOD, OMI-NO 2 column density, and ground monitoring observations provided by the China National Environmental Monitoring Center.

Figure 1 .
Figure 1.The proposed testbed for evaluation of satellite-retrieval surface PM 2.5 concentration.

Figure 2 .
Figure 2. Performance of the monitor-located RF model in predicting surface PM 2.5 levels.The comparison includes the spatial distribution of surface PM 2.5 (a: ground truth; b: model prediction), PM 2.5 levels at monitoring sites (c: label values; d: predicted values), the error distribution across space (e) and distance (f), and the vertical structure of PM 2.5 concentration (g) across the distances from the monitors, with their spatial distribution shown in panel (h).

Figure 3 .
Figure 3. Improvement after implementing surrounding grid cell features.The panels show (a) prediction of surface PM 2.5 using ResNet, (b) comparison with the random forest (RF) model at the monitoring sites, (c) error comparison between ResNet and RF (where blue indicates better performance by ResNet and red indicates worse performance), and (d) error comparison across the distances to the monitors for all the models, including the new addition of surrounding features to each model's prediction.

Figure 4 .
Figure 4. Improved performance through the integration of additional sites using the ResNet model.The comparison includes model performance when adding points across the overall domain (a) and across distances to the monitoring sites (b), the scenario of adding 20 % more sample details to the locations of specific sites (black dots represent the 619 original monitoring sites) (c), the prediction of the surface PM 2.5 concentration in this scenario (d), and the corresponding reduction in errors (e).

Figure 5 .
Figure 5. Improved estimation of PM 2.5 and the related exposure across China using satellite products and ground observations from 2013 to 2021.The comparison of predicting the time series of monthly mean PM 2.5 concentrations from 2013 to 2021 is conducted using the original RF model and the new ResNet-time model based on monitoring sites (a) and other grid cells (b), with the difference between the new and original predictions shown in panel (c).The prediction of the 9-year average PM 2.5 concentration over 2013-2021 is compared using the new ResNet-time model (d) and the original RF model (e), with their differences being in absolute concentrations (f) and populationweighted concentrations (g).The influence of considering spatiotemporal features in the new model on the performance at the monitoring sites is shown for RMSE (h), R 2 (i), PM 2.5 prediction across the distances to the monitors (j), and PM 2.5 exposure (k, where suburban/rural represents areas within a distance of five grid cells from the monitoring sites).Potential grid cells as additional sites to improve the accuracy of downwind PM 2.5 prediction are shown with their spatial locations (l) and the selection of each group (m).