LGHAP: a Long-term Gap-free High-resolution Air Pollutants concentration 1 dataset derived via tensor flow based multimodal data fusion 2

. Developing a big data analytics framework for generating a Long-term Gap-free High-18 resolution Air Pollutants concentration dataset (abbreviated as LGHAP) is of great significance for 19 environmental management and earth system science analysis. By synergistically integrating 20 multimodal aerosol data acquired from diverse sources via a tensor flow based data fusion method, a 21 gap-free aerosol optical depth (AOD) dataset with daily 1-km resolution covering the period of 2000– 22 2020 in China was generated. Specifically, data gaps in daily AOD imageries from MODIS aboard 23 Terra were reconstructed based on a set of AOD data tensors acquired from diverse satellites, 24 numerical analysis, and in situ air quality measurements via integrative efforts of spatial pattern 25 recognition for high dimensional gridded image analysis and knowledge transfer in statistical data 26 mining. To our knowledge, this is the first long-term gap-free high resolution AOD dataset in China, 27 from which spatially contiguous PM 2.5 and PM 10 concentrations were then estimated using an 28 ensemble learning approach. Ground validation results indicate that the LGHAP AOD data are in a 29 good agreement with in situ AOD observations from AERONET, with R of 0.91 and RMSE equaling 30 to 0.21. Meanwhile, PM 2.5 and PM 10 estimations also agreed well with ground measurements, with R 31 of 0.95 and 0.94 and RMSE of 12.03 and 19.56 μg m -3 , respectively. The LGHAP provides a suite of 32 long-term gap free gridded maps with high-resolution to better examine aerosol changes in China over 33 the past two decades, from which three major variation periods of haze pollution were revealed in 34 China. Additionally, the proportion of population exposed to unhealthy PM 2.5 was increased from 35 50.60% in 2000 to 63.81% in 2014 across China, which was then reduced drastically to 34.03% in 36 2020. Overall, the generated LGHAP dataset has a great potential to trigger multidisciplinary


Introduction
Atmospheric aerosols not only impact regional climate by changing the Earth radiation budget but significantly influence air quality at the ground level (Fuzzi et al., 2015;Gao et al., 2018;Shen et al., 2020;Sun et al., 2015;Yang et al., 2020;Zheng et al., 2020).Monitoring aerosol loading in the atmosphere is thus of great significance for climate change attribution and haze pollution assessment.
Aerosol optical depth (AOD), an indicator of aerosol bulks distributed within a column of air from the Earth's surface to the top of the atmosphere, has been monitored for decades to map global aerosol loading in the atmosphere.Compared with sparsely and unevenly distributed ground-based aerosol monitoring stations (e.g., AERONET), satellite instruments can map AOD with vaster spatial coverage at even sub-hourly sampling frequency (e.g., geostationary satellite).An overview of sensors, algorithms, and AOD datasets that are widely used in the community can be found in the literature such as Sogacheva et al. (2020) and Wei et al. (2020).
Due to negative impacts of bright surface (e.g., snow cover) and clouds, as well as algorithmic restrictions, satellite AOD retrievals often suffer from extensive data gaps, significantly reducing the downstream application potential such as mapping particulate matter (PM) concentrations at the ground surface (e.g., Bai et al., 2019a;Wei et al., 2021a).Also, excessive data gaps in AOD imageries may result in large uncertainty when assessing aerosol impacts on weather and climate (Guo et al., 2017;Li et al., 2019;Zhao et al., 2020;Zheng et al., 2018).Over the years, versatile gap filling methods have been developed (e.g., Bai et al., 2016Bai et al., , 2020b;;Chang et al., 2015).Nonetheless, filling data gaps in satellite-based AOD retrievals is still challenging due to extraordinary nonrandom missing values and high aerosol dynamics in space and time.Wei et al. (2020) provided a short review of methods that have been frequently applied to deal with data gaps in AOD products.In general, merging AOD data acquired from diverse instruments and/or platforms is the most popular approach to improve AOD spatial coverage (Sogacheva et al., 2020).Statistical methods such as linear regression (Bai et al., 2019a;Wang et al., 2019;Zhang et al., 2017), inversed variance weighting (Chen et al., 2018;Ma et al., 2016;Sogacheva et al., 2020), and maximum likelihood estimate (Xu et al., 2015), are often applied to account for systematic bias among different datasets.Data fusion methods such as Bayesian maximum entropy could be applied to blend AOD products with different resolutions (Tang et al., 2016;Wei et al., 2021b).Another way is to reconstruct missing AOD values using either neighboring observations in space and time or external data sources such as AOD simulations from numerical models (Li et al., 2020;Xiao et al., 2017), even meteorological factors (Bi et al., 2018).
Although there exist a variety of gap filling methods, spatially gap free AOD datasets are still rare, particularly high-resolution AOD datasets from satellites, significantly limiting downstream applications such as PMx concentration mapping.In spite of versatile PM2.5 concentration prediction models (e.g., Di et al., 2019;Fang et al., 2016;Hu et al., 2014;Li et al., 2016;Lin et al., 2016;Liu et al., 2009;Wang et al., 2021a), to date, there are few publicly accessible PMx concentration datasets that can be used to examine haze pollution variations regionally and globally.Several typical datasets, e.g., the one generated by the Dalhousie University (van Donkelaar et al., 2010(van Donkelaar et al., , 2016)), CHAP (Wei et al., 2021a), andTAP (Geng et al., 2021), have been widely applied to advance our understanding on aerosol impacts across China and globe.However, these datasets more or less still suffer from drawbacks in spatial and/or temporal resolution, spatial coverage, and data accuracy.To meet the contemporary needs, Zhang et al. ( 2021) provided a more comprehensive review of the widely used PMx concentration mapping approaches.With a thorough review for PM2.5 concentration mapping techniques, an optimal full-coverage PM2.5 modeling scheme was proposed, in which diverse aerosol datasets were fused toward a full-coverage AOD map based on a multi-modal approach (Bai et al., 2022).In parallel with these efforts, some attempted to improve AOD data coverage over space with high accuracy by merging AODs observed at adjacent times directly (Li et al., 2022).
With such prior knowledge, the current study developed a big data analytics framework for generating a Long-term Gap-free High-resolution Air Pollutants concentration dataset (abbreviated as LGHAP hereafter), aiming at providing gap-free AOD, PM2.5 and PM10 concentration data with a daily 1-km resolution in China for the period of 2000 to 2020.Toward such a goal, multimodal aerosol data acquired from diverse sources including satellites, ground stations and numerical models were synergistically integrated via the higher order singular value decomposition (HOSVD) to form a tensor flow based data fusion framework in the current study.Full coverage PM2.5 and PM10 concentration data were then estimated on the basis of the gap-filled AOD dataset.This 21-year-long gap-free high resolution (daily/1km) aerosol dataset was then compared against ground-based AOD and PMx observations to validate the data accuracy of each product, particularly their performance in spatial pattern recognition and temporal trend assessment.These advances endorsed a better assessment of long-term variability of haze pollution in China as well as the corresponding population exposure over the past two decades.

Data sources
Table 1 provides a brief summary of the multisource datasets used in this study to generate the LGHAP dataset.As shown, six satellite-based AOD products, five numerical simulations of AOD and aerosol components, eleven meteorological factors, six datasets of ground-based AOD and air pollutants concentration measurements, as well as a set of land cover, topographic and socioeconomic parameters, were employed.Descriptions of these datasets are given in the following subsections.

Gridded aerosol products
In many previous studies, coarse AOD and/or aerosol components simulations acquired from numerical models were oftentimes used as the primary data source to help derive full-coverage AOD and/or PM2.5 concentration maps (e.g., Park et al., 2020;Wang et al., 2021b).However, due to the lack of high accuracy near real-time emission inventory, simulated AOD and/or aerosol components are often prone to large uncertainty, which could be inevitably introduced to the final PM2.5 estimations if no observational data are applied for possible bias correction.In such a research context, here we used six satellite-based AOD products with a relatively long temporal coverage (>5 years) to help better reconstruct historical AOD variations over space and time, though geostationary satellites can provide AOD observations at even hourly resolution.The reasons are twofold.On the one hand, the operational AOD product from the recent Chinese FY-4 satellite is still unavailable.On the other hand, AOD product from Hamawari-8 cannot provide observations in the northwest region of China.
The latest AOD product derived from the MODerate-resolution Imaging Spectroradiometer (MODIS) onboard Terra using the multiangle implementation of atmospheric correction (MAIAC) algorithm (Lyapustin et al., 2011(Lyapustin et al., , 2018)), was hereby used as the baseline dataset for the generation of gap free AOD maps.This AOD product has not only a finer spatial resolution (1 km) but a comparable and even better accuracy, when comparing with those derived from the Dark Target and Deep Blue algorithms (Goldberg et al., 2019;Lyapustin et al., 2018) In addition to satellite-based AOD products, numerically simulated aerosol diagnostics from MERRA-2, including AOD and aerosol components such as black carbon, organic carbon, dust and sulfate, were also applied to help reconstruct missing AOD information and to predict PM2.5 and PM10 concentrations at the ground level.The aerosol components were used here as a proxy of emission inventory when predicting PMx concentrations.Big data analytics procedures applied to these datasets will be described in section 3.

In situ AOD and air quality measurements
AOD observations from Aerosol Robotic Network (AERONET) were hereby used as the ground truth to evaluate the data accuracy of the generated gap free AOD product, as well as the learning target to infer AOD from air pollutants concentration and atmospheric visibility.Considering few valid data were provided in the Level 2.0 dataset, here we used the Level 1.5 AOD data to guarantee adequate in situ AOD data coverage in space and time.To validate the gridded AOD products in this study, each in situ AOD observation was registered with the gridded mean AOD over a 50×50 km window.
Near-surface air pollutants concentrations including PM2.5, PM10, NO2, and SO2 that were sampled at state-controlled monitoring sites were also applied, not only to help establish machinelearned regression models for PMx prediction (PM2.5 and PM10), but to infer AOD over air quality monitoring sites given their dense distributions across China.The gauged air pollutants concentration data have been released online on an hourly basis by the China National Environment Monitoring Center since the late 2013.For quality control, outliers were first detected and removed from each pollutant dataset by following the criteria used in our previous study (Bai et al., 2020a).The missing values were then reconstructed using the diurnal cycle constrained empirical orthogonal function (DCCEOF) method proposed in Bai et al. (2020b).
The 3-hour resolution atmospheric visibility data acquired from 4,052 weather stations were employed to help generate gap free AOD maps before 2014, at which in situ air quality measurements were not available.Previous studies have attempted to predict PM2.5 concentration from atmospheric visibility data with good accuracies (Liu et al., 2017), indicative of a great potential for estimating AOD.Specifically, visibility data were used as an important predictor for site-specific AOD prediction, and the resulting AOD predictions were then used as a critical prior information for reconstructing AOD distributions over space, especially over those regions without satellite AOD observations.Given the availability of abundant air quality measurements and the fact that automatic visibility sensors have been widely used across China since 2014, atmospheric visibility data after 2014 were thereby excluded to guarantee the data consistency (Li et al., 2018a).For quality control, the consistency of visibility data was examined using an outlier detection method, i.e., the annual mean should not exceed 3 times the standard deviation of data over a 5-year time window (Zhang et al., 2020).Those with apparent jumps and drifts in visibility time series were excluded.Meanwhile, visibility data on rainstorm and foggy days were eliminated as well.

Auxiliary data
As shown in Table 1, eleven meteorological factors, including air temperature at the near surface, wind speed and direction, relative humidity, surface pressure, boundary layer height, total column water vapor, downwards solar radiation, and instantaneous moisture flux, were used to help resolve nonlinear relationships between PMx and AOD, as well as to downscale AOD from MERRA-2.These data were acquired from the fifth generation ECMWF atmospheric reanalysis (ERA-5), and the first three factors were extracted at the levels of not only ground surface but 850 hpa and 500 hpa so as to indicate the vertical structure of the atmosphere.Additionally, population data from WorldPop, land cover from CLCD during 2000 to 2019 (Yang and Huang, 2021) and GLOBELAND 30 in 2020 (Chen et al., 2014), elevation data from the Global Digital Elevation Model (GDEM) version 2, as well as monthly composited 1-km normalized difference vegetation index (NDVI) from MODIS, were employed to resolve the socioeconomic and ecological contributions to haze pollutions.Properties of these datasets can be found in Table 1, and datasets with a finer resolution were upscaled to 0.01° via a cubic interpolation method.

Methodology
Toward the generation of LGHAP aerosol datasets to advance environment management and earth system science analysis, here we developed a big data analytics framework via a seamless integration of the tensor flow based multimodal data fusion with ensemble learning based PMx concentration estimation.The proposed method transformed a set of data tensors of AOD and other related datasets such as air pollutants concentration and atmospheric visibility that were acquired from diversified sensors or platforms via integrative efforts of spatial pattern recognition for high dimensional gridded data analysis toward data fusion and multiresolution image analysis, as well as knowledge transfer in statistical data mining.The proposed method consists of three major procedures in general, including multisensory data homogenization, tensor flow based AOD reconstruction, and ensemble learning for PMx concentration estimation.The analytical framework of the big data analytics is depicted in Figure 1 and described in details in the following subsections.

Multisensory data homogenization
Since a set of aerosol products with different types, resolution, and accuracies were applied to support the reconstruction of gap-free AOD imageries, harmonizing cross-platform biases and scale differences between these diversified datasets is crucial to multisensory data integration.In this study, machine-learned regression models were established to harmonize these heterogeneous aerosol datasets.A baseline dataset was first selected to be used as the learning target while other datasets were calibrated to the level of baseline dataset to make them comparable.Given finer resolution and higher proportion of data coverage in space and time, the MAIAC AOD product from Terra (AODTerra) was selected as the baseline dataset.Consequently, six machine-learned regression models were established between AODTerra and each gridded AOD product (i.e., five satellite-based AOD products plus MERRA-2 AOD simulations) using the random forest method.Meteorological factors (MET), land cover types (LULC), topographic (DEM) and population (POP) were used as covariates to help downscale these multimodal AOD products to have a resolution same as AODTerra while accounting for cross-mission biases arising from temporal and algorithmic differences.
Considering data gaps are extensive in satellite AOD products, especially over regions with vast cloud cover, providing prior AOD information over such region is thus of great value in support of the reconstruction of missing AOD values.As indicated in our recent studies, AOD can be accurately predicted from ground measured air pollutants concentration, showing an accuracy even over some satellite AOD retrievals (Li et al., 2021;Bai et al., 2021).To support AOD reconstruction over regions with less or even without valid satellite AOD observations, we attempted to infer AOD over air quality monitoring sites from in situ air pollutants concentration measurements via a machine learning approach.Similarly, machine-learned regression models were established using random forest by taking AODTerra as the learning target while ground measured air pollutants concentration, meteorological factors, land cover, and terrain information, were used conjunctively as predictors.
The transformation of ground measured air pollutants concentration data to AOD allows for providing external observational AOD data to supplement satellite observations, especially over regions suffering from significant data gaps.Since air pollutants concentration data were not available before 2013, atmospheric visibility data sampled at dense weather stations were hereby used as an alternative for site-based AOD prediction, by applying a similar prediction model as used above for air pollutants concentration.Figure S1 show the ground-based validation results of AOD inferred from atmospheric visibility and air pollutants concentration, indicative of a generally good accuracy of these inferred AOD values.All efforts led to aggregate a set of multimodal aerosol data with different properties for multisensory data fusion toward gap free AOD mapping as the next step.

Tensor flow based AOD reconstruction
The core of generating full coverage AOD imageries is to fill in data gaps in AODTerra.Previous studies have demonstrated that merging satellite AOD retrievals at adjacent time steps can help improve the observational AOD coverage at each single snapshot, while the involvement of numerical AOD simulations can help bridge AOD data gaps (Li et al., 2022;Bai et al., 2022).In this study, a tensor completion method was particularly designed and applied to fulfil the gap filling in AODTerra.
Specifically, the incomplete AODTerra imageries were deemed as the hard data (true AOD state) while other AOD datasets (e.g., the downscaled AOD datasets and site-specific AOD predictions inferred from air pollutants concentration and atmospheric visibility) were used as the soft data (complementary data) to help reconstruct AOD distribution in AODTerra via tensor flow based pattern recognition.
Detailed procedures for gap filling are outlined as follows.

Initial AOD tensor construction
Due to extensive data gaps in satellite-based AOD retrievals, it is insufficient to reconstruct all missing AOD information in AODTerra for a given date by simply merging the harmonized satellitebased AOD data synchronously.To fulfill AOD gap filling, the tensor completion method was thus applied to synergistically integrate AOD acquired from diverse sources.Consequently, creating the data tensor of AOD is of critical importance.In this study, the data tensor of AOD was constructed by incorporating not only observational AOD from both satellites and those inferred from in situ air quality indicators on the same date, but also historical AOD retrievals from MODIS instruments (AODTerra and AODAqua) and part of data from the downscaled MERRA-2 AOD (denoted as AODM2 hereafter).The latter two were applied to provide knowledge of AOD distributions over space to guide the reconstruction of missing values in AODTerra.
For the screening of historical observations resembling AODTerra distribution on the given date to be reconstructed, AODM2 was used in concert with AODTerra and site-based AOD estimations to identify similar imageries.Toward this goal, site-specific AOD estimations and 5% randomly selected downscaled AODM2 data were merged directly with valid AODTerra to form a new image on each date.
Subsequently, correlations and biases were estimated between AODTerra on the given date to be reconstructed and each newly merged historical AODTerra image.To avoid the inclusion of imageries with distinct variation patterns, only those closely resembling AODTerra on the date to be reconstructed were finally retained in terms of their correlations and biases subject to a threshold of R>0.7 and RMSE<0.2.Once sufficient historical imageries were obtained, the data tensor of AOD was constructed by compiling the observed AOD imageries on the given date with historical imageries to a three-dimension data array  ∈  !!×! " ×! # (composed of  # images with a size of  $ ×  % ).
Considering satellite AOD retrievals suffer from extensive data gaps, we injected data values of sitespecific AOD estimations and 1% randomly selected downscaled AODM2 data directly onto grids where AODTerra values missed on each specific date as prior knowledge.This not only accelerates convergence speed during the reconstruction process but avoids large reconstruction errors over regions with tremendous data gaps in satellite observed AOD imageries.

Gap filling via tensor completion
Previous studies have well demonstrated the good performance of matrix decomposition methods such as empirical orthogonal function and singular value decomposition (SVD) for missing value imputation (Bai et al., 2020b;Beckers and Rixen, 2003;Folch-Fortuny et al., 2015).However, these methods can only work on two-dimension matrix mathematically, namely the matrix domain.To integrate spatial features of AOD revealed by datasets to generate a smooth AOD distribution with complete coverage, in this study, the HOSVD, a specific orthogonal Tucker decomposition, was applied.More detailed descriptions to HOSVD can be found in the literature such as Sun et al. (2021), Tucker (1966), Kolda and Bader (2009), and Sidiropoulos et al. (2017).
In Table 2, we provided a stepwise description of the algorithm used to fill data gaps in AODTerra by integrating AOD features recognized in different imageries as the data tensor of AOD via HOSVD.
To initiate the tensor decomposition, grids with missing values in the original AOD tensor were first filled with the spatial average of valid AOD data in each individual image.Then, the AOD tensor was decomposed along each of three dimensions, while the dominant features in each dimension determined by the corresponding rank values were applied to reconstruct the data tensor.By gradually increasing the rank values and iteratively updating the initial filled values, the tensor can be reconstructed to better delineate AOD distribution over space after several iterations.
To confirm the convergence, a small portion of observational AOD values were randomly held out in advance, and the reconstructed values over these grids in each iteration were compared with these hold-out data till the difference between them lower than 0.01 (a threshold to determine convergence, a.k.a,  $ in Table 2).Meanwhile, to make the computational burden manageable, the study region (China in this study) was divided into 40 subregions (refer to Figure S2 for the spatial distribution of these subregions), and the tensor completion was then performed over each individual region.Finally, the reconstructed imageries were mosaiced to attain a national gap-free AOD map on each specific date.During this step, a smooth filter was applied to solve the boundary effect when mosaicking two adjacent maps.Specifically, data value on each overlapped grid at the boundary (50 km on the edge of subregion) was averaged via an inverse distance (the distance to the edge) weighting scheme.In the end, the mosaic AODTerra image was retained as the final gap-free AOD product.
Table 2.The proposed tensor completion algorithm for AOD distribution reconstruction in AODTerra.

PMx concentration estimation
In this study, the widely used random forest method was applied to establish regression models for PM2.5 and PM10 concentration estimation.Ground measured PM2.5 (or PM10) concentration data were used as the learning target while gap filled AOD, aerosol components (AERcomp), meteorological factors (MET), digital elevation model (DEM), NDVI, land cover information (LC), and population were used as regressors.The random forest regression model can be generally formulated as: where month is a categorical variable that was used to account for monthly varying relationships between AOD and PMx.For validation, PM2.5 and PM10 measurements from 10% of monitoring sites were randomly held out to evaluate the predictive performance of each regression model.During the training process, 500 regression trees were used in each RF model, and each tree was grown on a bootstrap sample.The learning data set was randomly divided into two parts during the training process, with 80% used as the training set while the rest 20% for testing.In order to guarantee a larger value of PM10 than PM2.5, PM2.5 estimations from Eq. ( 1) were used as one predictor in addition to factors used to predict PM2.5 when estimating PM10 concentration.Such a model can also significantly improve the prediction accuracy of PM10 given the prior PM2.5 information.

Point-surface data fusion
Ground measured PM2.5 and PM10 concentration data were further fused with their gridded estimations to enhance the data accuracy of PMx data after 2014.Here, the well-known optimal interpolation (OI) method was applied to perform point-surface fusions between two different types datasets.Please refer to Bai et al. (2022) and Li et al. (2022) for a more detailed description of the OI method used to fuse PMx concentration data.In this study, a modified scheme was developed to select neighboring observations.To avoid an isotropic interpolation effect, here we only used 30 ground observations with land cover, terrain and atmospheric conditions similar to those at the analyzed grid cell to estimate the innovation that should be assigned to the background value at the given grid.In other words, a similarity measure was first estimated between the analyzed grid cell and neighboring sites in terms of land cover, DEM, and atmospheric conditions.The 30 observations with similar background fields were then used in the OI procedure to correct possible bias in gridded PMx estimations.Such a treatment can help exclude those observations with different ambient background, e.g., one site not far from the given grid but separated by a high mountain, thereby avoiding the possible propagation of antiphase corrections to data over adjacent grids.Compared to the first seven gridded AOD datasets, the LGHAP AOD dataset has an accuracy slightly worse than the original MODIS AOD product but comparable to AODs from MISR, POLDER and MERRA-2, with R of 0.91 and RMSE equaling to 0.21 compared to ground-based AOD observations.Nevertheless, the gap-filled AOD appeared to overestimate ground-based AOD observations, and this could be due to the involvement of AODs from VIIRS and POLDER as these two products significantly overestimated ground AOD observations, which can be indicated by the proportion of data pairs above the expected error (EE).On the other hand, significant underestimations in AODM2 were not introduced to the LGHAP AOD as the former had a below EE ratio of 32.97% which was only12.27% in the latter.These results indicate that the LGHAP AOD data are more likely to resemble AOD distributions revealed by satellite observations rather than AODM2, endorsing the advantages of involving multisensory satellite AOD observations to support missing AOD reconstruction.Figure 2   In Figure 3 we presented a comparison of AOD time series between the LGHAP dataset and ground observations at three AERONET sites under different air pollution levels.As shown, the AOD time series from LGHAP are temporally continuous whereas data gaps are common in AERONET observations.Generally, AODs from LGHAP are well reconstructed with respect to the temporal variations of aerosol loading at these three sites, with R ranging from 0.77 to 0.90 and RMSE varying between 0.11 and 0.21.For illustration, Figure 4 compares the spatial distribution of original and gap filled AOD on four days with different AODTerra coverage over space.As shown, the missing AOD values were well reconstructed after gap filling, resembling a smooth and reasonable AOD distribution over space, even over regions with very limited prior AOD observations from Terra/MODIS (e.g., Figure 4d).As indicated in Figures 4a and 4c, the high AOD loading was also properly reconstructed even though no prior information was provided by AODTerra.Since AERONET AOD observations

Data accuracy of gap-free AOD in LGHAP
were not used as a data input when generating the LGHAP AOD dataset, these independent validation results clearly demonstrated the high accuracy of the LGHAP AOD product as well as a good performance of the proposed AOD gap filling approach.Since the final gap-free AOD product was generated mainly by integrating a set of data tensor of gridded AOD with AOD estimations from in situ air quality measurements, the relative contribution of each product to the final gap-free dataset is worth being investigated.In this study, a data coverage ratio weighted nonlinear correlation coefficient was proposed to examine the relative contribution of each gridded product to the LGHAP AOD dataset.The nonlinear correlation coefficient was used to assess the mutual information between two variables (Sun et al., 2021;Wang et al., 2005), while the data coverage ratio was multiplied to indicate the overall contribution of one product to the final fused dataset (refer to Text S2 for the definition of this indicator).As shown in Figure 5, the relative contribution of each gridded product varied with time and the input data sources.In the early two years (2000)(2001), the AOD distribution in gap-free imageries was determined largely by AODTerra (81%), whereas this ratio decreased to about 30% when many other products were involved, especially AOD from Aqua and PARASOL.With the advent of VIIRS and the loss of PARASOL after 2012, the relative contribution changed drastically as AOD from MODIS and VIIRS played the dominant roles in reconstructing AOD distribution.Note the relative contribution of AODM2 remained lower than 10%, indicative of the greater importance of satellite observations in generating the LGHAP AOD product.
With respect to the temporally averaged contribution in each subregion, it shows that the relative contribution of each product also varied significantly across regions.Generally, AOD from MODIS aboard Terra and Aqua played the most important role (>60%) in generating the LGHAP AOD product, except over the southwest part of the country (Tibet plateau) where AODM2 contributed most.This is largely associated with the fact that data gaps are abnormally high in satellite observations over this region because of the vast and long-lasting snow cover (refer to Figure S3 for the data integrity distribution).Consequently, AODM2 would play an important role in reconstructing AOD distribution over such regions.Note that the relative contribution of AOD estimations from in situ air quality measurements were not accounted for in the current analysis because of incomparable spatial coverage of in situ data contrast to gridded AOD products, and this does not imply the contribution of in situ AOD estimations being negligible.Overall, the results shown here clearly highlight the success of big data analytics in generating the LGHAP AOD dataset via integrative efforts from diversified data sources.for the arithmetic theory to calculate this measure).The annual mean shown outside is the national averaged contribution in each individual year while the regional mean shown on the map was averaged over the past 21-year in each subregion.

Data accuracy of PM2.5 and PM10 estimations
By taking advantage of the gap-filled AOD, daily 1-km resolution PM2.5 and PM10 concentration data in China were then estimated via an ensemble learning approach.Figure S4 shows the samplebased cross validation accuracy of two prediction models.It shows that the original daily PM2.5 prediction model had a sample-based cross validation R 2 of 0.79 and RMSE of 20.04 μg m -3 .This accuracy is comparable to our previous study (Bai et al., 2019a), but slightly worse than those reported in some recent studies (Table 4).In contrast, PM10 had a much higher prediction accuracy, with R 2 of 0.90 and RMSE of 21.06 μg m -3 for the daily product.This good performance should be attributed to the involvement of PM2.5 estimations as a predictor in the PM10 prediction model.Figure 6 shows the site-specific (held-out in advance) validation accuracy of daily, monthly, and annual mean PM2.5 and PM10 concentration in LGHAP.As shown, the site-specific validation results indicated that the final full-coverage (gap free) daily PM2.5 and PM10 concentration data are in a good agreement with groundbased measurements, with R of 0.95 and RMSE of 12.03 μg m -3 for PM2.5 while R of 0.94 and RMSE of 19.56 μg m -3 for PM10.Overall, PMx data in LGHAP are not only spatially complete with a finer resolution but have a comparable accuracy with previous studies.As shown, all these three datasets well reconstructed temporal variations of PM2.5 from 2019 to 2020.
Temporally, LGHAP and TAP are continuous while CHAP suffers from significant data gaps because no gap filling was applied when generating the dataset.Compared with the other two datasets, LGHAP PM2.5 data had a better agreement with ground-based PM2.5 measurements.This high accuracy could be partially due to the fusion of in situ PM2.5 data measured at adjacent sites via the OI method.In Figure 8 we compared the spatial distribution of PM2.5 that was reconstructed by different datasets.Compared to LGHAP and TAP, PM2.5 data from CHAP are not gap free since the spatial coverage is determined by the AOD data coverage in the MAIAC product.Compared to TAP, LGHAP PM2.5 data have a finer resolution (1 km versus 10 km), enabling us to examine PM2.5 variations in space with more details.Overall, LGHAP has a better performance in reconstructing PM2.5 spatial distributions than the other two datasets.Reasons could be attributed to the following two aspects.
Firstly, in situ PM2.5 measurements were fused with gridded PM2.5 estimations using the OI method when generating the final PM2.5 product in LGHAP.This can help correct modeling biases in original PM2.5 estimations.Secondly, a set of satellite-based AOD retrievals were incorporated when generating the full-coverage AOD product, which greatly helps reduce large biases in numerical AOD simulations, yielding more accurate PM2.5 estimations in turn.This also highlights the great advantages of using big data analytics methods to advance air pollution assessment.From the left to right, it shows in situ PM2.5 concentration measurements, CHAP, TAP, and LGHAP, respectively.
To illustrate the fine resolution of LGHAP dataset, we compared the annual mean PM10 concentration in 2019 with the proportion of impervious surface that was derived from 30-m resolution land cover data in eastern China.As shown in Figure 9, the finer resolution of LGHAP dataset enables us to easily recognize the "hot spot" regions with high PM10 loading.By referring to the impervious surface distribution on the right, we found that these hot spots are mainly over cities and towns, indicative of the presence of pollution island in urban regions.Owing to the involvement of such highresolution datasets, the spatial details of PM2.5 and PM10 can be then well recognized in LGHAP.The finer spatial resolution advantage of the LGHAP dataset can be also demonstrated by comparisons of spatial distribution of annual mean PM2.5 concentration that was revealed by four different datasets shown in Figure S6.China.During the second period, PM2.5 concentration over most regions shows a small decreasing trend except in the Ji-Lu-Yu region where an increasing trend was still observed.Apparent decreasing trend was observed over most parts of the country after 2014, indicative of significant reductions in PM2.5 loading across China.This trend distribution is in line with our previous finding that was derived using the annual mean PM2.5 concentration dataset generated by the Dalhousie University (Bai et al., 2019b).However, differences were still observed in terms of the regions where significant decreasing trends were present.Most significant decreasing trends were mainly observed in Sichuan basin and Pearl River Delta in the previous study.However, regions with drastic PM2.5 decrease were found mainly in the North China where severe haze pollution events were oftentimes reported.Similar variation patterns can be also inferred from PM10 (Figure S8) and AOD (Figure S9).Overall, the LGHAP dataset provides us a gridded perspective to better examine long-term variations of haze pollution in China during the past two decades.China.Nonetheless, PM2.5 sources in these two areas could be different.In northwest China, natural emissions could be the dominant source since very limited population resides there.In contrast, most population lives in eastern and central China with highly developed economy, and anthropogenic emissions thus might play more important roles in PM2.5 formation (Xin et al., 2015;Yang et al., 2011).
In regard to the proportion of population exposed to the ambient with PM2.5 concentration greater than 35 μg m -3 , we observed that the annual mean population ratio exposure to unhealthy PM2.5 increased gradually from 50.60% in 2000 to 65.72% in 2007.During 2007-2014, the ratio varied with small changes (<5%), whereas a drastic decline was observed after 2014, with the annual mean proportion of population exposed to unhealthy PM2.5 was reduced from 63.81% in 2014 to 34.03% in 2020, even though the total population was increased from 1.37 billion to 1.41 billion during the synchronous period.Nonetheless, more than one-third population was still exposed to unhealthy PM2.5, highlighting the requirement of further emission reduction actions to manage haze pollutions in China.

Data availability
The LGHAP dataset, consisting of gap free AOD, PM2.5, and PM10 concentration with daily 1km resolution from 2000 to 2020, are all publicly accessible.All data were provided in the NetCDF format and data in each individual year were archived in a zip file.For AOD, the dataset has a disk storage size of near 27 GB in total, which is avaiable at https://doi.org/10.5281/zenodo.5652257(Bai et al., 2021a).PM2.5 (38 GB) and PM10 (48 GB) concentration data can be acquired from https://doi.org/10.5281/zenodo.5652265(Bai et al., 2021b) and https://doi.org/10.5281/zenodo.5652263(Bai et al., 2021c), respectively.Additionally, monthly and annual mean datasets were also provided, which is publicly available at https://doi.org/10.5281/zenodo.5655797(Bai et al., 2021d) and https://doi.org/10.5281/zenodo.5655807(Bai et al., 2021e), respectively.In addition to these datasets, Python, Matlab, R, and IDL codes that can be used to read and visualize these data were provided as well.

Conclusion
In this study, a big data analytics method was developed for generating a LGHAP dataset to

Figure 1 .
Figure 1.Flowchart of the proposed big data analytics framework for generating a long-term gap-free high-resolution air pollutants concentration dataset (LGHAP), taking aerosol optical depth (AOD) and PMx (PM2.5 and PM10) concentration in China as illustration.HOSVD is an acronym of high order singular value decomposition.MET, LULC, DEM, and POP denote variables of meteorology, land use/land cover, digit elevation model, and population, respectively.
further compares the data accuracy of original AODTerra and the reconstructed data over different regions of China.It is indicative that the purely reconstructed data have an accuracy (R=0.88 and RMSE=0.26)lower than the original AODTerra (R=0.95 and RMSE=0.13)across China, especially in South China where the reconstructed data were significantly underestimated the groundbased AOD observations.Possible reasons for this effect could be attributed to extensive data gaps in satellite AOD retrievals due to frequent and extensive cloud covers over there (refer to FigureS3for the distribution of mean data integrity of AODTerra during 2000-2020), and the scarce AOD observations significantly limit the learning capacity in space and temporal domain during the tensor completion process.In other words, limited observations in satellite imageries greatly reduced the learning performance from the sparse tensor.Even though, the purely reconstructed data exhibit a bias level comparable to AOD retrievals from several satellite instruments, e.g., MISR, VIIRS, and POLDER.This demonstrates the good performance of the proposed tensor completion method in reconstructing missing AOD information.By combining the reconstructed data with original AODTerra, we obtained a 21-year-long gap free high-resolution (daily/1-km) AOD dataset with satisfying accuracy (R=0.91 and RMSE=0.21).

Figure 2 .
Figure 2. Scatter plots between ground observed and satellite-based AOD data in different regions of China.(a-e) original Terra/MODIS AOD, (f-j) reconstructed AOD, and (k-o) combined AOD between original and reconstructed data.BTH, YRD, SC, and WC refers to regions of Beijing-Tianjin-Hebei, Yangtze River Delta, South China, and West China, respectively.

Figure 3 .
Figure 3.Comparison of monthly AOD time series from LGHAP and AERONET at three different stations in China.Latitude and longitude information of each site was given in brackets.

Figure 4 .
Figure 4. Spatial patterns of the reconstructed AOD under different baseline AOD coverage ratios.In each sub-diagram, the upper panel presents the original AOD distribution from Terra/MODIS while

Figure 5 .
Figure 5. Spatiotemporal variations of the relative contribution of each gridded AOD product to the generation of LGHAP AOD dataset.The relative contribution was estimated as the data coverage ratio weighted nonlinear correlation coefficient (please refer to Text S2 in the supplementary information

Figure 6 .
Figure 6.Scatter plots between observed and estimated PM2.5 and PM10 concentration.(a-c) respectively denotes daily, monthly, and annual mean PM2.5 validation results, while (d-f) are for PM10 concentration.The ground measurements were acquired from 30 independent air quality monitoring sites that were randomly held-out before the model training.

Figure 7
Figure 7 presents a two-year-long comparison of PM2.5 concentration time series from LGHAP and two other open access datasets with PM2.5 measurements sampled at four United States Embassy in China.Since this ground-based dataset has been seldomly noticed and used, it can be applied as an independent dataset to fairly evaluate the accuracy of these three machine-learned PM2.5 estimations.

Figure
Figure 7.Comparison of PM2.5 concentration time series between LGHAP (red line) and two open datasets (blue: TAP, green: CHAP).Here, hourly PM2.5 concentrations measured by four United States Embassy in China from 2019 to 2020 (grey bar) were used as an independent PM2.5 dataset to validate these three daily products.CHAP and TAP are two open access datasets providing PM2.5 concentration that were created by Wei et al. (2021a) and Geng et al. (2021) respectively.

Figure 9 .
Figure 9.Comparison of annual mean PM10 concentration with the proportion of areas coved by impervious surface in eastern China.

Figure 11
Figure11shows the temporal variations of the proportion of land areas covered by PM2.5 concentration exceeding 35 μg m -3 (the national ambient air quality standard for 24-hour PM2.5 concentration given in GB 3095-2012).As shown in Figure11a, severe PM2.5 pollution occurred mainly during the wintertime in China, as more than one-third land areas (indicated by the blue lines) were exposed to unhealthy PM2.5 pollutants.Meanwhile, an apparent inflection was observed in 2007, after which the number of episode days decreased drastically at more than one-third land area covered by PM2.5 concentration exceeding 35 μg m -3 .According to the proportion of land area covered with

Figure
Figure11c-e presents the linear trend of PM2.5 concentration during these three specific periods, from which we observed that significant PM2.5 variations occurred mainly over eastern part of the country where resides two-thirds of the population.A near ubiquitous PM2.5 increasing trend was observed during 2000-2007, with significant increase (>1.0 μg m -3 a -1 ) mainly observed in eastern

Figure 12 .
Figure 12.Spatial distribution of population weighted PM2.5 concentration and the proportion of population exposed to PM2.5 concentration greater than 35 μg m -3 .Annual and daily LGHAP PM2.5 concentration data were used for the calculation of weighted PM2.5 and the proportion of population exposure, respectively.The diamond and red line indicate the annual mean and median population proportion, respectively.
advance research in earth system science and environment management.With integrative efforts of fusing AOD features extracted from a set of AOD data tensors and knowledge transfer in statistical data mining from diverse air quality indicators, a LGHAP aerosol dataset providing 21-year-long (2000-2020) gap-free AOD, PM2.5, and PM10 concentration data with daily 1-km resolution in China, was generated.Gap-filled AOD imageries were firstly generated by reconstructing AOD distribution in AODTerra via synergistically fusing AOD features recognized from diversified satellites and numerical models as well as in situ data through tensor completion.Compared to ground-based AOD measurements, the gap-filled AOD data exhibit a satisfying prediction accuracy and good performance in delineating AOD variations over space and time.To our knowledge, this is the first thrust of generating long-term high-resolution AOD dataset with gap free nature in China.PM2.5 and PM10 concentration data were then estimated using an ensemble learning approach by taking advantage of the generated gap-free AOD imageries.Ground validation results also indicate good accuracies of these two gridded products, showing a comparable bias level with many previous studies.Compared with other open access daily PM2.5 concentration datasets, the LGHAP PM2.5 dataset performs well due to the vantage of having gap free and fine resolution products.With this gap

Table 1 .
Summary of the data sources used in this study to generate gap free high resolution AOD and PMx concentration datasets.

Table 3
used as data input when reconstructing missing AOD information, thereby the gap-free AOD can be directly compared with in situ AOD measurements from AERONET.As indicated, all these AOD datasets are in a good agreement with in situ AOD measurements.Generally, summarizes the data accuracy of gap-free AOD dataset generated in this study.For comparison, the data accuracy of each original AOD dataset was also assessed.Since in situ AOD measurements were not

Table 3 .
Data accuracy of original and gap-free AOD datasets used and/or generated in this study.The expected error (EE) was defined as ±0.05+0.15×AODsite.

Table 4 .
Comparison of the data quality of PM2.5 from LGHAP with other related studies.