Introduction

Surface visibility (SV), often used as a proxy for ambient air quality, exhibits strong association with national transportation and human health1,2,3,4. In recent decades, severe and frequent low-SV episodes related to air quality during events such as haze and dust storms have received widespread attention5,6,7. The main factors affecting SV are relative humidity (RH) and ambient particulate matter (PM), which can affect SV by directly or indirectly changing the transmission path of solar radiation8,9. However, the relationships between PM concentration and SV are complex and nonlinear owing to variations in the size distribution, mixing state, and chemical composition of aerosol particles10,11. Moreover, this relationship is also affected by various meteorological factors (especially RH) that exhibit notable regional and seasonal differences12.

Since 2015, the China Meteorological Administration (CMA) has conducted automated observations of SV, gradually replacing the previous manual method of conducting observations only 3–6 times daily13. The automated instruments for monitoring SV measure surface scattering coefficients and can provide high-quality SV observations on an hourly basis. Currently, the CMA has equipped more than 2400 sites with such instruments, thereby providing a good opportunity to track the diurnal cycle of SV variability. However, despite the considerable efforts made to date, the relatively sparse distribution and limited number of SV monitoring sites make it difficult to meet the current urgent need for spatiotemporally seamless fine-scale monitoring of SV. Thus, seamless hourly high-resolution SV retrievals would be useful for tracking various low-SV events caused by different weather conditions or background emissions. Moreover, such data would also strongly support quantitative retrieval of aerosol-related parameters, including fine PM (PM2.5)14,15,16, coarse PM (PM10)17, and aerosol optical depth (AOD)18,19, for which SV is recognized as a key intermediate variable.

To address the dearth of SV observational data, studies in recent decades have investigated obtaining SV data indirectly. Initially, such studies used statistical models or radiation theory for indirect estimation of SV using synchronized observations of meteorological, aerosol optical, and chemical component properties as key predictors20,21. Such a method is somewhat accurate but has strong geographical dependence, which leads to discrepancies in its spatial applicability. Additionally, it also relies on online observations of chemical components, which can hinder its application to large-scale estimation. In recent years, to resolve the difficulty in obtaining SV data on a large scale, some studies indirectly retrieved the spatial distribution of SV based on spectral signals from polar-orbiting or geostationary satellites and their derivatives (e.g., AOD)22,23,24,25. However, satellite-derived SV products can have incomplete spatial coverage due to cloud interference, thereby limiting their application prospects26,27,28. Furthermore, owing to the unavailability of nighttime aerosol products, satellite-based retrieval strategies are not yet able to achieve seamless SV monitoring on a 24-h cycle. Consequently, a real-time seamless hourly SV retrieval framework for China that can be operationalized and shared with the wider community remains lacking.

In this study, a spatiotemporally seamless SV retrieval framework for China with real-time operational capability was constructed based on multisource data fusion and machine learning (ML) methods. Here, we show the development of a real-time SV retrieval (RT-SVR) model in this framework, which is an end-to-end stacked ensemble model composed of bagging, boosting, and deep network algorithms. With the support of real-time availability of multisource input data, the RT-SVR model has the capability to provide real-time output of seamless hourly SV data for China with 6.25-km spatial resolution. Real-time retrieval of high-accuracy seamless hourly SV data is beneficial both for further improving the monitoring capability of national-scale low-SV events, and for advancing the application of spatiotemporally continuous multiple aerosol-related parameter retrievals.

Results

Advantages of the stacked ensemble model

In this study, we chose not to use an independent ML algorithm to establish the complex relationships between multiple predictors and observed SV. Instead, we developed a two-layer stacked ensemble model that combines five widely used ML algorithms (i.e., MLP, RF, CatBoost, XGBoost, and LightGBM; see Methods) with robust performance and a deep residual network (Fig. 1). In the following, we explain in detail the advantages of the stacked ensemble model developed in this study compared to an independent ML algorithm through consideration of a case that occurred at 02:00 CST (China Standard Time) on December 10, 2022 (Supplementary Fig. 1).

Fig. 1: Operational procedure of the RT-SVR framework for generating seamless hour-by-hour SV data in China.
figure 1

The first and second steps are the data pre-processing and detailed model building process. The third step is the operational application of the model.

Supplementary Fig. 2a–f show the spatial distribution of SV at 02:00 CST on December 10, 2022, as estimated by the MLP, RF, CatBoost, XGBoost, LightGBM, and RT-SVR models, respectively. For the MLP model, although the estimated SV exhibits a continuous spatial pattern similar to that of the observed SV (see Supplementary Fig. 1), the model tends to be spatially over-smoothed in areas of high SV, e.g., northwestern China (Supplementary Fig. 2a). This is an inherent drawback of the MLP approach, where the bias between estimated SV and observed SV tends to be large on the tabular data29. Compared with the MLP model, the RF model provides better portrayal of the spatial details of SV in the Tibetan Plateau region, but overfitting occurs in the northwestern region (Supplementary Fig. 2b). Overall, in areas of high SV, the three boosting models (i.e., CatBoost, XGBoost, and LightGBM) successfully address the problems encountered by the MLP and RF models. However, for regions with extremely low SV, e.g., the North China Plain (NCP), the SV estimated by all three boosting models exhibits unrealistic anomalies with negative retrieval values (Supplementary Fig. 2c–e). The statistics indicate that the grid points with negative retrieval values encountered by CatBoost, XGBoost, and LightGBM over the NCP region comprise approximately 0.84%, 0.58%, and 0.63% of the total sample size, respectively. It should be noted that we revised the loss function of XGBoost to eliminate negative retrieval anomalies in the model, which explains the lower number of negative retrieval values for the XGBoost (Supplementary Fig. 2d)30.

To address the various shortcomings encountered by the different ML models, we constructed a deep residual network in the second layer using the outputs of five independent ML models as inputs to comprehensively integrate the strengths and weaknesses of the base models in the first layer (Fig. 1). Supplementary Fig. 2f depicts the spatial map of SV retrieved by the RT-SVR model coupled with a two-layer stacked structure. It is evident that the deep residual network avoids the overfitting properties of the stacking methods and is spatiotemporally capable of combining the advantages of the first layer of the base model.

Relative importance of explanatory variables

We used the permutation method to interpret the relative importance of each predictor in the five base models in the first layer of the RT-SVR model. The same approach was also applied to account for the relative contribution of the retrieval from each base model in the first layer to the final retrieval from the second layer (a deep residual network). The permutation method calculates the relative importance of each feature by shuffling it randomly during training31. The relative importance of the predictors for retrieving SV in the RT-SVR model is illustrated in Supplementary Fig. 3. In the first layer of the RT-SVR model, each base model has different usage of the predictors. As expected, surface RH, which can directly or indirectly affect fog or cloud formation and modify light scattering and thereby lead to changes in SV, accounts for approximately 40% of the overall importance of predictors among the base models. For tree-based models, RH and PM2.5 are the dominant factors affecting SV retrievals because aerosol hygroscopic growth varies under different RH conditions12. It is notable that in the MLP model, surface temperature (TEM) is the most dominant predictor. Statistically significant correlation between TEM and SV has also been identified in some previous studies. On the one hand, TEM can affect the rate of aerosol hygroscopic growth and indirectly affect SV32,33; on the other hand, TEM affects the stability of the atmospheric boundary layer, which can influence pollutant diffusion20. In addition to the predictors mentioned above, geographical factors (e.g., the normalized difference vegetation index (NDVI) and elevation) and temporal factors (i.e., day of year), which have notable spatial heterogeneity, are also critical in estimations of SV.

Evaluation of the RT-SVR model performance

The performance of the RT-SVR model was evaluated using a sample-based cross-validation (CV) approach and a hindcast-validation (HV) experiment. The k-fold CV is used most often to test model robustness. In this study, a 5-fold CV approach was chosen, i.e., all site-based samples were randomly divided into five subsets. Each time, the model was trained using data from four subsets and tested on the remaining subset. In contrast, the HV experiment evaluated the generalization ability of the model (i.e., the real prediction ability) on a completely independent dataset that was not involved in any training process of the model.

Figure 2 shows the overall fitting and the 5-fold CV results of the RT-SVR model for all 24-h periods during 2020–2021. Here, the performance of the RT-SVR model for SV retrieval is indicated using the Pearson correlation coefficient (R), root mean square error (RMSE), mean absolute error (MAE), and relative mean bias (RMB). Overall, the 5-fold CV results of the model are comparable to the fitting results, indicating that our model is robust and does not experience overfitting problems. The RT-SVR SVs estimated from the 5-fold CV, tested on approximately 41.3 million data pairs, exhibit reasonable agreement with the in situ observations (R = 0.95, MAE = 2.16 km, and RMSE = 3.17 km).

Fig. 2: The performance of the RT-SVR model.
figure 2

Density scatterplots between observed SV and (a) fitted SV and (b) estimated SV across China for sample-based fivefold CV from 2020 to 2021. CV results are generated with hourly temporal resolution. The dashed black line is the 1-to-1 line and the solid red line is the linear regression line. The number of sample matchups (N), Pearson correlation coefficient (R), slope, mean absolute error (MAE), root mean square error (RMSE), and relative mean bias (RMB) of the linear regression are shown in the lower-right corner of each panel.

To further check the performance of the model in real application scenarios, we performed an additional HV experiment to examine the model’s ability to generalize when using unused data. For the HV experiment using one-year’s (2022) continuous observations (approximately 20.8 million data pairs), the RT-SVR model still exhibits powerful generalization capabilities, albeit with slight degradation in performance (R = 0.85, MAE = 3.77 km, and RMSE = 5.28 km) (see Fig. 3i and Supplementary Fig. 4f). Similar HV experiments were also undertaken with five independent ML models in the first layer and the results are shown in Supplementary Fig. 4a–e. Overall, in addition to the improvement in performance spatially, the RT-SVR model also exhibits improved accuracy compared to all five ML models, with the lowest RMSE and the highest value of R (Supplementary Fig. 4f). Importantly, the RT-SVR model resolves the problems of anomalous negative retrievals encountered by the boosting models. Moreover, the RT-SVR model demonstrates outstanding predictive performance compared to that of other models that use satellite-retrieved AOD as the predictor22.

Fig. 3: Performance of the RT-SVR model in the hindcast-validation (HV) experiment in 2022.
figure 3

Comparison of daily SV time series from observations and from the RT-SVR model at six different sites in China: (a) Urumqi, (b) Beijing, (c) Nanjing, (d) Nagqu, (e) Wenjiang, and (f) Shenzhen. g Geographic locations of the six independent sites selected for this study. h Boxplot of observed and estimated SV at the six sites (statistics performed on the daily time series for 2022). The boxplot illustrates the median, the interquartile range, the upper (lower) whisker extending from the hinge to the largest (smallest) value no further than 1.5× interquartile range, and outliers as individual points. The box extends from lower to upper quartile values of the data, with a horizontal line at the median. The whiskers extend from the box to show the range of the data. i Density scatterplots between observed SV and estimated SV across China for the HV experiment in 2022 (results based on hourly temporal resolution).

In addition to examining the overall performance of the model, the site-scale performance was also evaluated. Supplementary Figs. 5, 6 show the spatial distribution of the site-based statistical indicators associated with the 5-fold CV and HV experiment, respectively. For the 5-fold CV (Supplementary Fig. 5), the averaged R, RMSE, and MAE across China is 0.93, 3.10 km, and 2.16 km, respectively. Among them, the number of sites with R > 0.94 accounted for 45.4% (1091 sites) of the total number of sites, the majority of which are located on the NCP. Meanwhile, those sites with relatively low R-values are mainly located on the Tibetan Plateau and in northeastern China. This is attributed to the fact that the SV sites in those regions are sparsely distributed, making it difficult to acquire sufficient information for model development (Supplementary Fig. 1). For the HV experiment, the overall distribution pattern of the above metrics is similar to that of the 5-fold CV (Supplementary Fig. 6). Statistically, the averaged R, RMSE, and MAE for the HV experiment across China is 0.80, 5.08 km, and 3.77 km, respectively, and approximately 43.9% (1055 sites) of the sites have an R-value of >0.85. Spatially, our model has greater bias in south-central and north-central China compared to that in the region of the NCP, mainly because of the more complex relationships between SV, RH, and PM2.5 in those areas34,35. The drier climatic background of the NCP allows SV variability in this region to be modulated mainly by atmospheric PM2.5 pollution. The low bias in the region of the NCP is largely attributable to the relatively accurate pollutant and meteorological fields associated with the large number of monitoring sites distributed throughout the region. We also note that a similarly high bias in the model also exists in northeastern and northwestern China, which might be related to the inadequate spatial representation of background PM10 information interpolated from the fewer observational sites in those regions (see Methods).

Consistent accuracy on the full 24-h diurnal cycle

Supplementary Figs. 7, 8 show the performance of the RT-SVR model on an hourly scale based on the sample-based 5-fold CV and HV experiments, respectively. On the CV dataset, the RT-SVR model shows superior temporal robustness and its statistical parameters do not exhibit significant fluctuations over time; the model has values of R > 0.93 and RMSE < 3.5 km at all 24-h periods (Supplementary Fig. 7). On the HV dataset, although the accuracy of the RT-SVR model is reduced compared to that on the CV dataset at all 24-h periods, the performance of the model is robust across hours (Supplementary Fig. 8).

We further selected six SV observation sites with different underlying surface types, i.e., desert (Urumqi site), basin (Wenjiang site), plateau (Nagqu site), and urban (Beijing, Nanjing, and Shenzhen sites), to further examine the performance of the RT-SVR model. Figure 3 and Supplementary Fig. 9 present the daily and hourly time series of SV at those sites retrieved by the RT-SVR model during the HV experiment in 2022, respectively. In Beijing, the estimated daily SV agrees well with the observed SV, with an R-value of 0.97 and RMSE of 2.47 km. Similarly good performance is observed in both Nanjing (R = 0.96, RMSE = 2.22 km) and Wenjiang (R = 0.94, RMSE = 2.28 km). At the other city site, i.e., Shenzhen, the model performance is slightly reduced. In contrast, there is notable reduction in model performance at the western sites of Urumqi (R = 0.85, RMSE = 2.68 km) and Nagqu (R = 0.83, RMSE = 2.02 km) compared to that of the eastern sites in China. However, it should be noted that there are also differences in model performance across seasons at different sites (Supplementary Fig. 9). For example, at the Urumqi site, larger error occurs in spring (Supplementary Fig. 9a), during which time frequent dust storm events tend to cause short-term abrupt changes in SV, leading to an under-response of the model5,36,37. At the Nanjing and Wenjiang sites, the RT-SVR model has worst performance in summer (Supplementary Fig. 9c, e), which might be related to the seasonal increase in precipitation events caused by frequent strong convective activities38,39. The combination of higher temperature and increased precipitation in summer highlights the difficulty the model has in accurately retrieving SV under conditions of high RH12. The model performs poorly at the Nagqu site in all seasons except winter (Supplementary Fig. 9d). The uncertainty of the meteorological fields in the region of the Tibetan Plateau might contribute to this level of performance. Figure 4 displays the distribution of the retrieved hourly SV averaged over the period 2020–2022 for the RT-SVR model, trained using the 2020–2022 data. The site observations are shown in Supplementary Fig. 10. Overall, the RT-SVR model captures the features of the diurnal cycle of the seamless hourly SV in China. For example, in the early morning and evening, low surface temperatures (due to radiative cooling) and high RH are conducive to fog formation, leading to reduced SV40. Additionally, during this period, such meteorological conditions coupled with low boundary layer heights favor accumulation of pollutants (especially surface PM), leading to enhanced extinction, which further reduces SV12,41. Spatially, low SV values are mainly located in the regions of the NCP, central-eastern China, Sichuan Basin, and Taklimakan Desert. In contrast, as temperatures rise during the day, surface heating leads to reduction in RH and enhancement of vertical motion, the latter of which contributes to greater dispersion of pollutants and thus improves SV. Supplementary Fig. 11 shows the multiyear averaged seasonal SV maps retrieved from the RT-SVR model during 2020–2022. During the summer, enhanced surface heating leads to vigorous convective activity in the lower atmosphere and significant precipitation, resulting in a notable improvement in SV. In contrast, during the winter, the prevalence of low-SV events escalates due to suboptimal atmospheric dispersion conditions that are compounded by the synergistic effects of adverse weather conditions and anthropogenic emissions, with this phenomenon being particularly pronounced in the central and eastern regions of China. Supplementary Figs. 1215 show the multiyear averaged hour-by-hour SV maps retrieved from the RT-SVR model in the four seasons during 2020–2022, respectively. Consistent with the patterns observed in Fig. 4, the diurnal variations of SV in China exhibit a similar pattern across different seasons, characterized by the lowest SV in the early morning or evening, and the highest SV in the afternoon. Overall, the diurnal and seasonal variations of SV in China are driven by a complex process that is influenced by a combination of factors such as topography, season, meteorological conditions, and ambient air pollution.

Fig. 4: Hourly SV distributions in China.
figure 4

Multiyear averaged hour-by-hour SV maps retrieved from the RT-SVR model during 2020–2022.

Model applications: tracking low-SV events

To fully demonstrate the model’s ability to track the entire process of occurrence, development, and dissipation of various extreme low-SV weather episodes, we considered three typical events that can have major impact on SV: fog, haze, and dust storms.

Fog event

Figure 5 shows the 3-h variability of SV retrieved from the RT-SVR model, together with the observed PM2.5 and the geopotential height, wind vector, boundary layer height, and RH from the fifth-generation ECMWF atmospheric reanalysis (ERA-5) data, during the formation of a fog event that occurred on the NCP during November 19–20, 2022. The full hour-by-hour evolution of this process is depicted in Supplementary Fig. 16. This event began after 12:00 CST on November 19, 2022. As shown in Fig. 5, this event was primarily driven by meteorological conditions because the regional PM2.5 concentrations were at low levels. Specifically, from the night of November 19 to the early morning of November 20, 2022, with the rapid increase in RH (>90%) and the reduction in the boundary layer height, a persistent foggy event affected the NCP and surrounding areas, which caused rapid reduction in the regional SV to the 100-m level. Our retrieval products well captured this process, demonstrating the ability of the RT-SVR model to reproduce extreme low-SV events.

Fig. 5: Model application to track typical low-SV events.
figure 5

Evolutionary maps of 3-h SV (first column), surface PM2.5 observations (second column), and meteorological fields (third column: geopotential height (GH) and wind field at 850 hPa; fourth column: boundary layer height (BLH); fifth column: surface relative humidity (RH)) during a fog event that occurred on November 19–20, 2022. Meteorological fields are from the fifth-generation ECMWF atmospheric reanalysis (ERA5).

Haze event

Apart from fog, increase in atmospheric aerosols can also cause rapid decline in SV. We further examined the performance of the RT-SVR model in tracking extreme low-SV events caused by anthropogenic aerosols (described as a haze event) and dust aerosols (described as a dust storm event). Supplementary Fig. 17 shows the 6-hourly variability of SV during the formation of a haze event that occurred in eastern China during January 20–22, 2021. Despite the spatial extent of the impact of this haze event being comparable to that of the fog event (see Fig. 5), its duration was longer, persisting for nearly three days.

Overall, this haze event was mainly caused by gradual accumulation of local anthropogenically derived PM2.5 under unfavorable meteorological conditions. From the early morning of January 21, 2021, gradual increase in PM2.5 (up to 240 μg m−3) was observed in eastern China, especially in southern Hebei, Shandong, Shaanxi, and Henan. Correspondingly, our retrieval product also revealed gradual decline in regional SV, reaching a minimum (approximately 500 m) at approximately 10:00 CST on January 22, 2021 (Supplementary Figs. 17 and 18). This process was mainly regulated by changes in local meteorological conditions caused by the large-scale circulation. During this period, the reduction in the planetary boundary layer height and the diminished wind speed were not conducive to the dispersion of pollutants. Meanwhile, the increase in RH further promoted the hygroscopic growth of aerosols and contributed to the increase in PM2.542,43. The formation of these unfavorable meteorological conditions collectively contributed to the rapid accumulation of PM2.5 and the subsequent rapid decline in SV. After 10:00 CST on January 22, 2021, as atmospheric conditions led to improved dispersion of pollutants and reduced RH, the regional PM2.5 concentration began to decrease, leading to improvement in SV.

Dust event

Figure 6 illustrates the evolution of a mega dust storm event that occurred in northern China during March 14–15, 2021. A recent study, based on long-term satellite observations, registered the intensity of the dust loading during this event as the strongest episode over the past two decades5. Thus, examining the retrieved SV products during such an event is beneficial for deepening our understanding of the model’s reproducibility under large-scale, dynamic, and extremely low-SV conditions.

Fig. 6: Model application to track typical low-SV events.
figure 6

Evolutionary maps of 3-h SV (first column), surface PM10 observations (second column), and meteorological fields (third column: GH and wind field at 850 hPa; fourth column: BLH; fifth column: RH) in a mega dust storm event that occurred during March 14–15, 2021.

ERA5 data showed that this dust storm event was triggered by a strong Mongolian cyclone on March 14, 2021, in conjunction with a surface cold anticyclonic system. Under the control of such powerful weather systems, a sharp difference in pressure gradient between Mongolia and northeastern China formed, inducing exceptionally strong air movement (wind speed at 850 hPa of up to 20 m s−1), which dragged dust aerosols from the dust source into the atmosphere. Subsequently, these dust aerosols moved rapidly from the west toward the east in association with the trough. In terms of timing, the dust storm entered China from northern-central Inner Mongolia at approximately 21:00 CST on March 14, 2021, and it moved rapidly toward the NCP region (Fig. 6 and Supplementary Fig. 19). Synergistic analysis showed that our SV products correspond well with ground-based PM10 observations in terms of spatial and temporal variations. Specifically, at 03:00 CST on March 15, 2021, surface PM10 concentrations increased rapidly at several sites in northern China, corresponding to sharp decline in SV in southern Inner Mongolia. At 09:00 CST on March 15, 2021, the dust belt was transported to the NCP and mixed with local PM2.5 pollution, further reducing SV. Until 12:00 CST on March 15, 2021, as dust aerosols continued to be deposited, SV in northern China began to improve, but it remained below 5 km in some areas, e.g., southern Inner Mongolia. Overall, the RT-SVR model proved itself capable of accurately tracking the formation, transport, and dissipation processes of large-scale mega dust events both spatially and temporally.

Real-time retrieval capability

We designed the RT-SVR framework with the ability to retrieve hourly SV in real time. Several advantageous factors support this technical realization. First, the two dynamic datasets involved in the model are available in real time, i.e., the PM observations and the CMA Land Data Assimilation System (CLDAS) data. In terms of the PM data, the National Meteorological Information Center releases to the public the most recent hourly national-scale observations in real time (with delay of approximately one-half hour). The assimilation system of the CMA can assimilate quality-controlled meteorological data from tens of thousands of ground-based sites to generate hourly CLDAS data in real time (with delay of approximately 40 min). Second, although the RT-SVR model is a combination of several base models, each model is lightweight and efficient, and can be deployed on a GPU. As tested on a workstation with a GeForce RTX 3090 graphics card, for each hourly SV retrieval, it was found that the entire RT-SVR framework took approximately 35–40 min (including time waiting for the PM and CLDAS data to become available) in total from the processing of the model input variables to the retrieval, of which model extrapolation took only approximately 12 s. The entire framework is expected to run even faster when running on a higher performance server. Therefore, the above conditions powerfully support the RT-SVR framework to be operationalized to provide real-time seamless hourly SV output in the future.

Discussion

In this study, an operational SV retrieval framework (i.e., RT-SVR) was developed, which takes advantage of the real-time availability of ground-based observations and CLDAS meteorological data to achieve real-time retrieval of spatiotemporally seamless SV data. The retrieval accuracy and reliability of the RT-SVR model were proven satisfactory through HV experiments on more than 20 million data pairs.

The RT-SVR model adopts a double-stacked structure combining multiple ML algorithms and deep residual neural networks, which resolves several problems encountered by individual ML algorithms, including low retrieval accuracy, negative anomalies of retrieval values, and spatial discontinuities. However, the RT-SVR model still has several limitations. First, we noted underestimation of SV levels, as suggested by the slopes of 0.9 and 0.71 between the retrievals and the observations for the CV and HV experiment (Figs. 2b and 3i, respectively). Such underestimation might be related to inaccurate information on dynamic PM pollution, insufficient spatial prediction capabilities, and uncertainties regarding the input predictors. In this study, the seamless PM2.5 and PM10 background fields were generated through the spatial inverse difference weighting interpolation technique, which caused large spatial bias in some areas with a limited distribution of air quality monitoring sites, e.g., western parts of China. The limited number of SV monitoring sites also resulted in insufficient learning of the spatial information by the model. Moreover, some uncertainties exist regarding the CLDAS data used in the model, and regarding all the other predictors (e.g., anthropogenic emission inventories); these uncertainties, in conjunction with modeling errors, contribute to the uncertainty in the final SV retrievals. It is thus difficult to quantify the uncertainties of the model and its specific sources. Here, we expect that well-estimated air pollutant concentration fields could be developed in the future to further improve model performance. As demonstrated in a previous study44,45,46,47, instead of using traditional interpolation methods, more accurate pollutant reanalysis fields can be estimated from site observations using deep learning models to learn multivariable spatial correlations from chemical transport models. Second, compared to other observational instruments, SV observation instruments are special because most have an upper limit (i.e., up to 30 km). This leads to large biases in SV under similar meteorological and PM conditions. Additionally, although rigorous quality control was implemented, automatic SV sensors inherently have different uncertainties at different threshold ranges (e.g., 10% for SV <10 km and 20% for SV >10 km)13. This calls for the need to develop distinct models to address the uncertainties that exist within the different ranges of SV observations in future studies.

Despite these limitations, the RT-SVR framework developed in this study has potential for application in several research areas. First, the seamless hourly SV product enables us to provide refined real-time monitoring of various low-SV events (e.g., fog, haze, and dust storms) at horizontal resolution of 6.25 km, which addresses the shortcomings of incomplete spatial coverage of observation sites. Second, once the complex decoupling relationships between SV and the PM2.5/PM10 and prevailing meteorological conditions have been resolved, such a product has potential to serve as a key parameter in real-time retrieval of the seamless diurnal cycle of surface PM2.5/PM10 pollution in China. This will contribute to resolving the limitations in spatial coverage and temporal resolution of current retrieval strategies for surface PM2.5/PM10 using satellite-based AOD products17,23,24,25,48,49,50. Meanwhile, it might also be used for retrieval of key aerosol property parameters such as AOD and single-scattering albedo, as implied by previous studies18,19,51,52. Moreover, with the availability of long-term SV datasets, a short-term forecast model for SV could be built in the future through ML-based spatiotemporal extrapolation techniques.

Methods

Figure 1 illustrates the entire modeling architecture for the real-time surface visibility retrieval (RT-SVR) framework, which is divided into three main steps: multisource data processing and fusing, development of the stacked ensemble model, and model application.

Multisource input data

This data uses multi-source datasets, including in situ observations, meteorological fields, emission inventories, and other ancillary data such as total population, elevation, normalized difference vegetation index (NDVI), and land cover type. The in situ observations comprised surface visibility (SV), fine particulate matter (PM2.5), and coarse particulate matter (PM10). The hourly SV observations from 2020 to 2022, recorded at approximately 2400 stations distributed throughout China, were collected from the National Meteorological Information Center (http://data.cma.cn/) of the China Meteorological Administration (CMA). To ensure the quality of the SV data, we used only those SV records that were labeled “good” after the quality control procedure. The quality control algorithm for the SV records was proposed by the National Meteorological Information Center to eliminate a few random and systematic errors in the raw data. We collected hourly surface PM2.5 and PM10 concentrations for the same period from the China National Environmental Monitoring Center network. For each site, we removed PM2.5/PM10 outliers that exceeded three standard deviations from the 1-month moving average50. The meteorological fields were taken from the CMA Land Data Assimilation System version 2 (CLDAS-V2.0) at resolution of 0.0625° × 0.0625°53. The CLDAS products are generated by fusing high-density automated surface meteorological observations, multi-satellite retrieval products, and numerical model analysis and forecast fields using multiple grid variational data assimilation techniques, and they have lower uncertainties and higher spatiotemporal resolution (spatial resolution of 6.25 km and temporal resolution of 1 h) in China compared with other similar products (e.g., Global Land Data Assimilation System) used in the international arena54,55. Currently, this product is updated in real time with a delay of <1 h (approximately 40 min). In this study, we utilized the following parameters extracted from the CLDAS dataset: temperature at 2-m height (TEM), specific humidity at 2-m height (SHU), wind speed at 10-m height (WIN), U wind component at 10-m height, V wind component at 10-m height, surface pressure (PRS), and downward surface shortwave radiation (SSRA).

To further improve the retrieval capability of the model, multisource ancillary data were used. We utilized annual land cover type and monthly NDVI data for 2020 from the Moderate Resolution Imaging Spectroradiometer (MODIS) with spatial resolution of 500 and 250 m, respectively56,57. Population datasets for 2020 with resolution of 30 arcseconds were taken from the Gridded Population of the World (GPW) version 4 dataset and calibrated using the total population reported in China City Yearbooks58. Monthly anthropogenic emission inventories for 2020 for multiple species, including primary PM2.5 and PM10, sulfur dioxide (SO2), organic carbon (OC), black carbon (BC), nitrogen oxides (NOx), ammonia (NH3), and volatile organic compounds (VOCs), were taken from the Multi-resolution Emission Inventory for China (MEIC) with spatial resolution of 0.25° × 0.25°. We also download elevation data at spatial resolution of 300 m from the 2-min Gridded Global Relief Data (ETOPO2). Apart from the features mentioned above, we also considered temporal features (day of year, month, and hour) and spatial features (longitude and latitude) in our model to reduce the effect of spatiotemporal heterogeneity.

The data fusion step in the RT-SVR framework consists of three procedures. First, we removed samples containing missing values from all datasets. Second, PM2.5/PM10 observations and other ancillary data (except for population, elevation, and land cover data; the first two are averaged within the 6.25 km grid and the last are summed within the 6.25 km grid) were interpolated to the same spatial resolution (i.e., 6.25 km) as that of the CLDAS data using the inverse distance weighting method. Finally, we sampled the predictors for every SV site based on geographic location and time using nearest neighbor interpolation. In this way, a dataset containing predictors and observed SV was constructed. The full datasets were divided into two parts: the data from 2020 to 2021 were used to train the RT-SVR model to perform sample-based 5-fold cross-validation, and the data from 2022 were used to conduct a hindcast-validation experiment to evaluate the generalization capability of the model.

Stacked ensemble model description

A stacked ensemble model (RT-SVR), incorporating multiple machine learning (ML) algorithms and a deep learning module, was developed in this study to generate seamless hourly SV in China, as illustrated in Fig. 1. Overall, the RT-SVR model consists of two structural layers. In the first layer, five base ML models are used to establish the relationships between the predictors and SV. These five base models comprise the Multilayer Perceptron (MLP)59, Random Forest (RF)60, Categorical Boosting (CatBoost)61, eXtreme Gradient Boosting (XGBoost)62, and Light Gradient Boosting Machine (LightGBM)63. We chose five base models with different principles because the purpose of stacked ensemble ML is that the diversity of base models needs to be ensured64. The MLP model is a deep network that fits the target through the backpropagation method59. The RF model is a type of bagging algorithm that reduces fitting errors through use of the bootstrap aggregating method60,65. The boosting models (CatBoost, XGBoost, and LightGBM) are currently the tree methods used most commonly for retrieving surface pollutants (e.g., PM2.5, PM10, and ozone) because the advantage of the gradient descent method can achieve better results66. In addition, in our experiments, we found that due to the complexity of the relationship between SV, PM, and meteorological elements, traditional statistical-based methods for constructing models are less effective in retrieving SV (e.g., linear regression models, generalized additive model, etc.). To achieve better performance, we performed hyperparameter tuning for each base model. Supplementary Table 1 summarizes the key parameter information of the five base models finally selected for this study.

We found that different base models have distinct advantages and disadvantages in retrieving the SV in this study, depending on their accuracy and spatial performance (see Results). Therefore, a deep residual network was developed in the second layer to fully utilize (eliminate) the different advantages (disadvantages) of the base models in the first layer. This residual network comes from the transformer model, and its fixed feature size and residual connection method can effectively resolve the problem of overfitting in the stacked ensemble model67,68.

In the residual feedforward network, the encoder block is repeated six times to fully learn the features from the different base models. We employ feature embedding and subsequent mapping in each block using the following equation:

$$G\left(x\right)=\max \left(0,x{w}_{1}+{b}_{1}\right){w}_{2}+{b}_{2}$$
(1)

where max(x) represents the Rectified Linear Unit (ReLU) function. w represents the weight, and b represents the bias in the feedforward network.

Compared to the models presented in previous studies, our model has two major advantages. In the first layer, conventional gradient boosting models typically employ the square error as the loss function, which results in the same punishment for underestimation and overestimation for the model. In experimental tests, we found that the estimated SV from the boosting model suffered from unrealistically frequent negative retrievals. To address this issue, we modified the gradient and hessian of the loss function in the boosting model:

$$\frac{\partial [L(Y|f(x))]}{\partial [f(x)]}=-2\,\bullet\, ({Y}_{i}-f({x}_{i}))$$
(2)
$$\frac{\partial [L(Y|f(x))]}{\partial [f(x)]}=\left\{\begin{array}{c}-2\alpha \,\bullet\, ({Y}_{i}-f({x}_{i})), \ f({x}_{i})\, <\, {Y}_{i}\\ -2\beta \,\bullet\, ({Y}_{i}-f({x}_{i})), \ f({x}_{i}\ge {Y}_{i})\end{array}\right.$$
(3)
$$\frac{{\partial }^{2}[L(Y|f(x))]}{\partial {[f(x)]}^{2}}=\left\{\begin{array}{c}2\alpha , \ f({x}_{i})\, <\, {Y}_{i}\\ 2\beta , \ f({x}_{i})\ge {Y}_{i}\end{array}\right.$$
(4)

Equation 2 represents the gradient of the conventional loss function used in boosting models. We add two penalty coefficients α and β in the gradient (Eq. 3) and hessian function (Eq. 4) to constraint the extreme values obtained from the boosting model. f(x) and Yi represent the retrieved and observed SV respectively. In order to make the loss function have a penalizing effect on the low values of the retrieval SV, α was always kept greater than β during the experiments. In the field of artificial intelligence, the loss function serves as a crucial metric for quantifying the disparity between the predicted outputs of a model and the target values. The optimization of ML models often involves the minimization of this loss function to enhance the model’s ability to accurately fit the training data. The gradient informs about the rate of change of the loss function at the current parameter values, indicating the direction in which the loss function is ascending or descending most rapidly. The hessian derivative provides insights into the curvature of the loss function69. During the model training process, we modified the gradient method of XGBoost, whilst keeping CatBoost and LightGBM unchanged.

Additionally, compared to the traditional stacked ensemble model constructed with a simple statistical model (e.g., generalized additive model, linear regression model) in the second layer, our meta model in the second layer has the following advantages. First, compared with simple statistical models, residual feedforward networks are better able to capture the nonlinear relationships between predictors and the target by using nonlinear activation functions and residual connections between layers70. Second, deep residual networks have been theoretically demonstrated to possess superior expressive capacity, enabling them to approximate more complex functions with greater accuracy70. This endows them with an advantage in handling large-scale and highly nonlinear datasets. Moreover, in the first layer, there are boosting models with negative outputs, which can have a large impact on the linear models. Thus, the use of this method improves the accuracy of the stacked ensemble model and maintains the same retrieval time.

Operational process of the RT-SVR framework

In the model application stage (step 3 in Fig. 1), the well-trained RT-SVR model in step 2 can be directly deployed to a cloud server, such as the CMA meteorological big data cloud platform “Tianqing”71. Benefiting from the real-time access to hourly PM2.5 and PM10 observations and CLDAS meteorological fields on the “Tianqing” platform, the RT-SVR model can generate seamless hourly SV in real time once these data are made available and automatically processed spatiotemporally to a uniform grid resolution (6.25 km) to form an input predictor matrix.