Introduction

Fine particulate matter (PM2.5) has drawn much attention in China in the past decades, mainly because of its adverse impacts on ambient visibility and public health, such as the increase in mortality and morbidity related to acute respiratory diseases and chronic cardiovascular ailments among the population1,2,3,4,5,6. The toxicity of PM2.5 could be attributed to its specific chemical compositions, among which some transition metals (e.g., Cu, Mn, and Fe) deserve more attention for their catalytical nature. Once inhaled into human lungs, they play significant roles in the production of reactive oxygen species (ROS) and the consequential oxidative stress is one probable cause for PM2.5-induced organ injuries7,8,9,10,11,12. Besides their adverse health impacts, airborne transition metals are also recognized for their involvement in the secondary formation of sulfate in the atmosphere13,14. As a result, it is of great scientific interest to investigate the spatial distributions as well as major sources of trace metals. Such knowledge will lay a foundation for formulating cost-effective emission control policies.

Despite the environmental importance of trace metals, there is still a lack of precise and comprehensive source apportionment techniques designated for trace metals in PM2.5. In the calculation of source contributions to trace metals, there exist two distinct groups of methods, namely receptor-based models and chemical transport models (CTMs). Chemical mass balance and positive matrix factorization (PMF) are the two most extensively used receptor-based models in resolving the source contributions to PM2.5 and its components15,16,17,18,19. These models depend on the detailed measurement data of particle chemical composition and offer relatively reliable source contribution estimations. However, due to the scarcity of observation data, the receptor-based methods are inadequate in providing spatially resolved source apportionment results.

In contrast, CTMs equipped with source contribution modules can effectively generate regional distributions of pollutant concentrations and determine the contributions from different source sectors20,21,22,23. This advantage comes with the demand of a large number of spatially and temporally resolved input data, including the meteorological conditions and emission inventories. Another point to note is that the accuracy of source apportionment results is greatly influenced by the rigorousness of emission inventories. While metal emission inventories in China have been reported in some literature24,25,26,27, they have not been carefully evaluated and are expected to have large uncertainties. For instance, emission inventories of several heavy metals in PM2.5 in Guangdong Province, China for 2015 are estimated in two parallel studies by Liu et al. (2018)25 and Liu et al. (2021)27, as shown in Supplementary Fig. 1. However, the overall emission amounts of Mn, Cu, and Pb differ significantly by as much as a factor of three. Recently, hybrid models have been developed to overcome the disadvantages of these two types of methods28,29,30,31,32,33 (Supplementary Note 1). In hybrid models, observation data are used to reconcile CTM outputs or even infused into CTM results for new information. In Hu et al.28 and Ivey et al.29,30, CTM source apportionment results of PM2.5 species were adjusted to better match the observation data by introducing the adjustment factors, whereas Ying et al.32 used CTM source contributions to primary PM2.5 (PPM2.5) and measurement data of trace metals to deduce the nominal source profiles and consequentially the source apportionment information of those species. These works shed light on the possibility of producing more accurate regional source apportionment results.

This study aims to establish a protocol for regional source apportionment of trace metals using observation-constrained source-specific mass fractions. Here we consider a case in which observation data from multiple sites with different pollution characteristics are available and incorporated into model development. A comprehensive comparison of our method with those previously reported hybrid models28,29,30,31,32 is described in Supplementary Note 1, with details of the new model described in Section M3. We use the observation data collected for 10 monitoring sites in the Pearl River Delta (PRD) region, China during 2015 (see Table 1 and Supplementary Fig. 2 for information of sampling locations), together with the source apportionment results of PPM2.5 from a source-oriented Community Multiscale Air Quality Modeling System (CMAQ) in this region (Supplementary Fig. 3). The PRD region, home to several mega cosmopolises such as Hong Kong, Shenzhen, Guangzhou, and Macao, is one of the most economically advanced and densely populated areas on the southern coastal line in China34, where an extensive air quality monitoring network has been installed35. With the ongoing establishment of more monitoring stations across China36,37, the hybrid model proposed in this study can be readily applied to other areas to generate region-specific source profiles as well as source contributions for primary species in PM2.5.

Table 1 Information of 10 sampling sites in the Peral River Delta Region in this study.

Results and discussion

Prediction accuracy of the hybrid model

To first demonstrate the improved prediction accuracy of the new hybrid model, Fig. 1 compares the model predictions against ambient measurements using three existing methods as well as the new hybrid model. Method A uses original PPM2.5 source contributions from CMAQ together with source profiles mapped from US EPA SPECIATE database38; method B utilizes original PPM2.5 contributions with profiles mapped from Liu et al. (2018)25. Supplementary Note 2 describes the synthesis of such source profiles in both methods A and B. Meanwhile, results from using the hybrid model in Ying et al.32 and the new model established in this study are denoted as methods C and D, respectively. Note that 10-fold cross validation, instead of the all-inclusive modelling, is used in methods C and D for model evaluation (Section M3).

Fig. 1: Model performance of four methods.
figure 1

a Seasonal averages of modelled concentrations of trace metals, POC, and EC vs. ambient measurements using four methods. (b) Mean fractional error (MFE) and mean fractional bias (MFB) for predicted daily concentrations from the four methods. Dash lines in panel (a) are one-to-one-ratio lines. Note that many dots from methods C and D are overlapped because of their close numbers.

Panel (a) in Fig. 1 gives the scatter plots of seasonally averaged predictions vs. measurements of each primary species. Data points from methods A and B tend to deviate notably from the one-to-one-ratio lines in all seasons, whereas data of methods C and D align well along the dash lines. For instance, the measured concentration of Cu is 21 ng/m3 in winter, and the predictions are 29 ng/m3 from method A, 6 ng/m3 from method B, 21 ng/m3 from method C, and 20 ng/m3 from method D (i.e., our new hybrid model). All seasonal measurements and predictions can be found in Supplementary Table 1.

Figure 1b compares the model errors (mean fractional error, MFE) and model bias (mean fractional bias, MFB). Both methods A and B give conspicuously worse predictions than the others. Between C and D, even though their model errors are very close to each other, results from the new hybrid model generally yield smaller model biases. The improved model performance of method D is a result of benefits from the revised regression model. To sum up, out of the 20 primary species analysed in this section, method D predicts 13 species with smaller model errors than method C, 16 species with less biases, and 14 species with higher correlation coefficients with the measurements. Supplementary Table 2 provides thorough information on the model evaluation metrics. Above results demonstrate the improved prediction accuracy of the new observation-constrained hybrid model.

Observation-constrained source profiles and validation

With the refined source contributions and observation data of primary species, we deduce their mass fractions in emitted PPM2.5. Averages and relative standard deviations (RSDs) from the Monte Carlo simulation are shown in Table 2. For validation, the deduced source profiles are first compared to profiles from four published literature sources, i.e., Liu et al. (2018)25, Ying et al.32, US EPA source profiles38, and Liu et al. (2017)39 in Supplementary Tables 3 to 9. Note that the non-PRD source sector is excluded in comparison since it is a geographically lumped source.

Table 2 Observation-constrained mass fractions of primary species in PPM2.5 emitted from different sources from the new hybrid model: averages (in percentage, %) and relative standard deviations (RSDs) from Monte Carlo simulations.

For the area source sector, the deduced mass fractions of POC and EC are 40.1% and 12.1%, respectively. Both are within the selected area source profiles from the US EPA (8.41–49.0 % for POC and 1.52–77.1% for EC) in Supplementary Table 3. Nor do the heavy metals’ mass fractions show much peculiarity compared to the US EPA profiles38 or those presented in Liu et al. (2018)25 (Supplementary Table 4). In terms of the deduced mobile source profile in Supplementary Table 5, both POC mass fraction (49.5%) and EC fraction (42.6%) are comparable to numbers in the light-duty diesel vehicle (LDDV) exhaust source profile in the US EPA database, while Fe (2.22%) is highly compatible to the mobile source in Ying et al.32 (1.99% for Fe), and the vehicular exhaust source in Liu et al. (2017)39 (1.89% for Fe). Comparisons of other source sectors (Supplementary Table 6 to Supplementary Table 9) are not described one-by-one here to avoid verboseness. However, we would like to comment on the notable differences between the deduced sea salt source profile and US EPA sea salt data38 in Supplementary Table 9. The discrepancies in Cl and Na mass fractions mainly come from the fact that in this study, the mass fractions are derived from filter-based daily observation data, which are affected by chlorine depletion40,41. In contrast, the US EPA sea salt data38 are compiled for fresh sea salt particles, hence cannot be directly compared with our results.

With each species, the relative uncertainties of estimated mass fractions are generally acceptable for their major sources but rather high for less significant sources. For example, results in the next section show that the major contributors to ambient Fe are area sources, power generation, and mobile sources, and the relative uncertainties of their mass fractions are all between 5% and 10%. In comparison, Mn has area sources, non-PRD emission, and power generation as its principal sources, whose mass fractions of Mn have relative uncertainties between 7% and 27%.

Additionally, the derived source profiles show satisfactory mass closures for power generation (102%), mobile (118%), marine (119%), and area (85%) sources (Supplementary Table 10). Only industry source has a rather poor mass closure. This is noteworthy considering that no additional constraints are imposed in deriving these source profiles as the mass fractions are calculated independently for every single species. Given this study’s emphasis on the source apportionment of trace metals, it is less important to guarantee mass closure in source profiles.

Estimated source contributions to trace metals and other primary species at monitoring sites

Seasonal averages of source contributions to EC, Fe, Cu, and Mn in 6 site groups are illustrated in Fig. 2. EC is included in the discussion for model verification, since it is a major primary species in PM2.5, whose measurement data are relatively reliable with well explored emission sources. The three metals are well known for their capability to induce the generation of ROS inside lung fluid and hence of epidemiological importance. Supplementary Fig. 5a to Supplementary Fig. 5m show the detailed source apportionment results of 13 primary species at the monitoring sites; each figure includes the comparisons between hybrid model predicted concentrations vs. observation data on a daily basis, seasonal averages of source contributions, and model evaluation metrics. The number of valid observations for each species varies because outliers and data below minimal detectable limits (MDLs) are excluded (Supplementary Fig. 4). Our source apportionment results and comparison with previous source apportionment studies in the same region42,43,44,45,46,47,48,49,50,51,52,53 are described below.

Fig. 2: Seasonal averages of source contributions to 4 primary species from the hybrid model and their observed concentrations in 6 site groups.
figure 2

(a) EC, (b) Fe, (c) Cu, and (d) Mn.

POC and EC

Although POC and EC are not trace elements, they provide an additional evaluation of the method used in our study. For POC, the overall MFE is 0.34 (0.30–0.44 for different site groups). Mobile (32%) and area (31%) sectors are the two major local sources, while the non-PRD contribution is approximately 12% (Supplementary Fig. 5a). For EC (overall MFE = 0.38), predictions at sites in Hong Kong are satisfactory. In comparison, predictions at non-Hong Kong sites, i.e., MC (MFE = 0.55) and NS (MFE = 0.69), are notably more deviated from the observations (Fig. 2a and Supplementary Fig. 5b). This is probably because PPM2.5 apportionment results at non-Hong Kong sites are less optimal, and the mass fractions of EC might be location-dependent among these monitoring stations. For instance, the two Macao sites lie in the vicinity of a heliport (~80 m) and are close to the Macao International Airport (~1 km), both of which are important EC sources. Averaged across all monitoring sites, mobile (43%) and marine (27%) emissions are the top two local sources of EC, while the contribution from the non-PRD region is 15%. Since both POC and EC are the major components in PPM2.5, the non-PRD contributions resolved here correspond well with the study by Lu et al.54. The high contribution of mobile and marine emissions to EC is generally consistent with those reported by Huang et al.45 (42% vehicle and 18% marine) based on daily samples across six monitoring sites in the PRD region in 2015. Wang et al.52 used the PMF analysis for sampling data at Macao and concluded that 64% of EC can be attributed to a mixed source of vehicle and helicopter emissions.

Fe

The origins of Fe in PM2.5 are rather diverse in the PRD region. Area, power generation, and mobile sources are the most important, with contributions of 41%, 26%, and 26%, followed by contributions from marine and industry sources of 7% and 3% (Fig. 2b and Supplementary Fig. 5i). Higher contributions from the mobile source are expected at the near-road site MK. While mobile source contributions are relatively low at NS, higher impacts from the area and power generation sources still cause a high Fe level at this site. The hybrid model-derived source apportionment results for Fe are in good agreement with those derived from the PMF receptor modeling, which are available for Hong Kong from the study by Chow et al.43, with 35% of Fe from vehicular emission, and 32% from industrial/coal combustion.

Cu and Mn

These two transition metal species play important catalytical roles in atmospheric aqueous chemistry7,8,9,10,11. The model-predicted Cu is in good agreement with observations, with an overall Pearson’s R equal to 0.66 and an MFE of 0.52, while the predictions at NS are appreciably worse (MFE = 0.82). On average, area (31%), industry (27%), and power plant (20%) sources constitute more than 75% of its concentration (Fig. 2c and Supplementary Fig. 5k). On the other hand, the reconstruction of Mn is acceptable in most locations with an overall MFE of 0.42, whereas the prediction error at NS (MFE = 0.61) is somewhat higher. Local sources of Mn account for about 75% of the total Mn, with major contributions from area (40%), power (17%), and marine emission (13%) (Fig. 2d and Supplementary Fig. 5h). Only a small selection of source apportionment studies in the PRD includes these two species. Fu et al.44, Huang et al.46, and Tao et al.49 all conclude that Cu and Mn are mostly attributed to traffic emission (60–74 % of Cu and 33–52 % of Mn). On the other hand, Tan et al.48 suggest that 61% of Cu and 49% of Mn are dust-related. Note that these receptor-based source apportionment results are inconsistent with the emission inventories, which have large variations among themselves. For example, emission inventories of heavy metals in Guangdong developed by Liu et al. (2018)25 imply that power (23%) and industry (45%) are the major sources for Cu, and industry (73%) for Mn. In Liu et al. (2021)27, industry (39%) and brake/tire wire (36%) are recognized as important sources for Cu, while industry (53%) is the dominant source for Mn. These disparities among different methodologies raise questions about the applicability of emission inventories of these trace metals, as well as urging the development of region-specific measurement-based source profiles. In comparison, the new model not only ensures model performance by validating results against ambient measurements (Supplementary Figs. 5k and 5h), and also provides uncertainty information in estimated source profiles (Table 2).

V and Ni

V and Ni, mainly emitted from residual oil combustion, are exclusively used as the tracers for marine ship emission in many receptor-based model studies in the PRD region42,43,44,45,46,47,49,50,51,52. According to a study on ship emissions in the PRD region, 88–91% of V and 54–63% of Ni is associated with this source49. This is consistent with our hybrid model results illustrated in Supplementary Fig. 5g and Supplementary Fig. 5j, where marine emission alone could contribute to 93–97% of total V and 74–91 % of Ni at these monitoring stations.

Zn and Pb

These two trace metals are often used as tracers for industrial sources in receptor-based model studies in the PRD region43,44,45,46,47,48,49,50,51,52,53, thus discussed together here (Supplementary Fig. 5l and Supplementary Fig. 5m). Like many other primary species, model performance at NS and MC are generally worse than others, although their overall model errors (0.44 for Zn and 0.37 for Pb) are low. Our results suggest that the industry sector is the top contributor to both Zn (33%) and Pb (42%). Additionally, non-PRD emissions also greatly influence Zn (25%) and Pb (33%) concentrations. Other source apportionment studies found that 46–92% of Zn and 38–98% of Pb are related to industrial sources in the PRD region43,45,48,50,51, whereas emission inventories estimate the industrial emissions of Zn and Pb are 59–76% and 68–74%, respectively25,27.

As for the seasonal variations, species such as Cu, Mn, Zn, and Pb have their highest concentrations in winter and lowest in summer, in consistent with their main sources being on-land sources from the north. On the contrary, elements contributed largely by marine vessels or sea salt emissions from the south reach their lowest concentrations in winter, as observed in V, Ni, and Na. Species from both sources, such as EC, do not show obvious seasonal patterns. This phenomenon has been examined in the past and linked with the monsoons in the PRD region: the north-northeast monsoons prevail from October to March, and south-southwest monsoons from April to September, leading to different seasonal patterns in concentration and source contribution variations55,56,57.

Estimated source contributions and concentrations of trace metals in the PRD region

The source contributions to PPM2.5 apportioned by CMAQ in D3 with the original emission inventory (Supplementary Fig. 4) are improved using gridded scaling factors interpolated from those determined at the monitoring sites by inverse distance weighting (see Section M3). Notably, area and industrial sources are the most prominent contributors to PPM2.5 in the PRD region (Supplementary Fig. 6). For example, in Guangzhou, Foshan, and Zhuhai, the annual average area source contribution to PPM2.5 can exceed 10 μg/m3, while industrial contribution can surpass 6 μg/m3 in Foshan. Consequentially, in 2015 the annual PPM2.5 concentration is more than 15 μg/m3 in a large area covering southern Guangzhou, Foshan, Dongguan, Zhuhai, and Shenzhen, with concentrations as high as 30 μg/m3 in some small districts.

Annually averaged source contributions to Cu and Mn from different source sectors as well as their annual concentrations in the D3 domain are depicted in Fig. 3 and Fig. 4, respectively. Similar to the results at the monitoring stations, the most important sources for Cu and Mn in the PRD region are area, power generation, and industry sources, whereas non-PRD emission also contributes a substantial portion for Mn. Figure 3 implies that for some places in Guangzhou, Foshan, Dongguan, and Zhongshan, Cu concentration exceeds 50 ng/m3. For Mn, ambient concentrations of more than 30 ng/m3 are predicted in these areas. Since Cu and Mn are recognized to induce the generation of ROS, such high concentrations of Cu and Mn raise concerns about the potential of elevated aerosol toxicity in these districts. Other species in the source profiles can be readily analyzed in the same manner.

Fig. 3: Annually averaged source contributions to Cu in PPM2.5 from its major contributors and its concentration in the PRD region in 2015.
figure 3

(a) Area sources, (b) industrial emission sector, (c) power generation emission, and (d) ambient concentration.

Fig. 4: Annually averaged source contributions to Mn in PPM2.5 from its major contributors and its concentration in the PRD region in 2015.
figure 4

(a) Area sources, (b) non-PRD emission, (c) power generation emission, (d) industrial emission sector, (e) marine vessel emission, and (f) ambient concentration.

Perspectives

Although the hybrid model results agree well with the observation data for many species at most monitoring stations in this PRD case study, the model was unable to accurately reconstruct the observation data for NS and MC groups. One plausible explanation is that the emission inventory outside Hong Kong is problematic and cannot be remedied simply with scaling factors. Even so, when the combined Hong Kong and Nansha data are used to deduce mass fractions, the results are reliable enough to show excellent performance at other sites, further demonstrating our model’s robustness. Herein, the implication for future model refinement is that an accurate emission inventory will always be a vital requirement in CTMs and any CTM-based hybrid model for source apportionment. One possible way to improve the emission inventories is to further examine the scaling factors and mass fractions constrained by the measurement data. For instance, an observation-constrained scaling factor larger than 1 is likely caused by the underestimation of its emission sources.

Finally, this case study has shown the advantages of our statistically tailored regression model33 in dealing with atmospheric data. Compared to the hybrid model by Ying et al.32, our method shows improved prediction accuracy in 10-fold cross validation. More trials could be done to test this regression method for other linear models in atmospheric sciences.

Methods

Monitoring program and observation data

Table 1 and Supplementary Fig. 2 show the geographical locations of the monitoring stations where daily observation data are collected. Briefly speaking, 24-h filter-based sampling of PM2.5 is conducted every 6 days throughout 2015 at 10 monitoring stations spread in three cities in the PRD region, including 1 site in Guangzhou (Nansha; hereafter NS), 2 sites in Macao (Taipa Grande and Porto Exterior; MC), and 7 sites in Hong Kong (Central/West, Tung Chung, Tsuen Wan, Yuen Long, Kwai Chung, Mong Kok, and Clear Water Bay). Stations in Hong Kong are divided into four groups: HKc (denoting combined HK general urban sites, including Central/West, Tung Chung, Tsuen Wan, and Yuen Long), KC (Kwai Chung; near the City’s shipping container terminal), MK (Mong Kok; downtown roadside and traffic impacted), and WB (Clear Water Bay; suburban). All data are categorized into 6 groups as per Table 1 in modeling and further analysis.

Sampling protocol and chemical analysis details have been described in the annual reports of PM2.5 speciation by the Hong Kong Environmental Protection Department (HKEPD) [e.g., ref. 58] and some previous studies43,59. In short, PM2.5 total mass concentration is determined gravimetrically using a digital balance (Sartorius AG, Model MC 5-0CE, Göttingen, Germany); water-soluble major ions (e.g., sulfate and nitrate) are extracted from the filters using double de-ionized water and measured using an Ion Chromatograph (Dionex DX-500, Thermo Fisher Scientific, MA, USA); OC and EC were analyzed using thermal/optical transmittance method on an Aerosol OCEC analyzer (Sunset Laboratory, OR, USA); and trace elements (e.g., Fe, Cu, Mn, Br, and Pb) using an energy dispersive X-ray fluorescence (ED-XRF, Epsilon 5, PANalytical, The Netherlands). The sampling program generates ca 60 sets of daily samples at each site, resulting in a total of 594 sets of daily PM2.5 speciation data (Table 1). Analytical specifications such as MDLs of each species are shown in Supplementary Table 11, and the distributions of all daily concentrations are plotted in Supplementary Fig. 4.

Regional source apportionment of PPM2.5 using CMAQ

The Weather Research and Forecasting model (WRF) v3.5 is applied for the meteorological simulation for the entire year of 2015 using the same model schemes and domain settings as the two previous studies54,60. In this project, the smallest D3 domain covers the entire PRD region with a horizontal resolution of 3 km, as shown in Fig. 4. A bottom-up 3-km resolution emission inventory of PPM2.5 and other major pollutants provided by the HKEPD is used for the D3 domain54. Outside the PRD region, the Regional Emission inventory in ASia v3.1 (REAS3)61 is used to generate anthropogenic emissions. Emissions of non-methane hydrocarbons from REAS3 are used to estimate the emissions of primary VOCs needed for the Carbon Bond 6 mechanism using selected speciation profiles from the SPECAITE database developed by the US EPA62. The biogenic emissions are generated using the Model for Emissions of Gaseous and Aerosols from Nature (MEGAN) v2.1063. Windblown dust and sea salt emissions are generated using the inline module in the CMAQ model64.

The simulation of PPM2.5 and its source apportionment are carried out via a source-oriented version of CMAQ (ver. 5.2.1) with non-reactive source-tagged tracers32,65 to determine the contributions of the area, mobile, industry, marine, and power generation sectors to PPM2.5 in the PRD region (Fig. 5). In this study, the area source sector includes emissions from residential, non-point industrial, off-road engine exhaust, fugitive road dust, and agricultural sources (Supplementary Note 2). PPM2.5 due to emissions from other regions in the D3 domain and upwind sources outside D3 is lumped into the non-PRD sector. In addition, sea salt and windblown dust contributions to PPM2.5 are also tracked as separate groups. An illustration of the spatial distributions of the annual average sector contributions to PPM2.5 is shown in Supplementary Fig. 3.

Fig. 5: The D3 domain that covers the PRD Region in the WRF-CMAQ simulation.
figure 5

City acronyms in alphabetical order are DG Dongguan, FS Foshan, GZ Guangzhou, HK Hong Kong, HZ Huizhou, JM Jiangmen, MC Macao, SZ Shenzhen, ZH Zhuhai, ZQ Zhaoqing, and ZS Zhongshan.

Observation-constrained hybrid model

The procedures in the new observation-constrained hybrid model are illustrated in Panel (a) in Fig. 6, while panel (b) shows details of the 10-fold cross validation for model verification. The steps are described below:

Fig. 6: Schematic diagrams of the observation-constrained hybrid model in this study.
figure 6

(a) flowchart of the base run, and (b) flowchart of the 10-fold cross validation to examine the prediction accuracy of the hybrid model.

Step 1. A Bayesian inference approach is employed to deduce the primary organic carbon (POC) and secondary organic carbon (SOC) levels in each observation, following the method described by Liao et al.66

Step 2. Similar to Ying et al.32, the measurement-based primary PM2.5 (PPM2.5) concentration is estimated via subtracting the secondary inorganic and organic components from PM2.5.

Step 3. CMAQ-deduced source contributions to PPM2.5 are adjusted with source-specific scaling factors (\({\beta }_{j},{j}=1,\,2,\,\ldots ,{m}\)) as per Eq. 1 to improve the agreement between the predicted total PPM2.5 and the measurement-based PPM2.532,

$$\begin{array}{c}{c}_{i}=\mathop{\sum }\limits_{j=1}^{m}{b}_{i,j}{\beta }_{j}\end{array}$$
(1)

where \({c}_{i}\) represents the measurement-based PPM2.5 concentration for the \({i}^{{th}}\) observation, and \({b}_{i,j}\) denotes the source contribution from the \({j}^{{th}}\) source sector for the \({i}^{{th}}\) observation estimated by CMAQ. The scaling factor calculation is performed individually for each site group to account for the spatial variability of the emission estimation biases. Data points with the fractional error (FE) between CMAQ-predicted PPM2.5 and measurement-based PPM2.5 larger than 0.8 are excluded.

The adjustment factors are determined via minimizing the objective function \(\widetilde{Q}\) in Eq. 2.

$$\tilde{Q}={Q}_{0}+\lambda {Q}_{p}=\mathop{\sum }\limits_{i=1}^{n}{\left(\mathrm{ln}{c}_{i}-\mathrm{ln}\mathop{\sum }\limits_{j=1}^{m}({b}_{i,j}{\beta }_{j})\right)}^{2}+\lambda \mathop{\sum }\limits_{j=1}^{m}{(\mathrm{ln}{\beta }_{j})}^{2}$$
(2)

The first part of the equation, \({Q}_{0}\), is a unique function with log-transformation, representing the difference between CMAQ-predicted and measurement-based PPM2.5. As explained by Liao et al.33, using log-transformed objective function better estimates the regression coefficients for log-normally distributed atmospheric data. \({Q}_{p}\) in Eq. 2 is the penalty term (also known as the regularization), included to overcome potential overfitting problems. Otherwise, unrealistically large or small β values are likely to merge (Supplementary Note 3). The L-curve method is utilized to find the optimal \(\lambda\) value in Eq. 267(Supplementary Note 3).

The robustness of calculated scaling factors is evaluated using a Monte Carlo simulation technique. The relative uncertainties of \({c}_{i}\) (i.e., measurement-based PPM2.5) and \({b}_{i,j}\) (i.e., CMAQ source contributions to PM2.5) in Eq. 2 are set to 20% and 40%, respectively32. In each trial, randomly generated errors are added onto \({c}_{i}\) and \({b}_{i,j}\), whence one set of scaling factors (\({\beta }_{j}\)) is determined. In Supplementary Note 3, we analyze the uncertainties of \({\beta }_{j}\) from such Monte Carlo simulation trials. Eventually, the arithmetic means of scaling factors (\({\bar{\beta }}_{j}\)) are used to produce adjusted source contributions (\({{b\text{'}}}_{i,j}\)) for ensuing computation (Eq. 3).

$$\begin{array}{c}b{\text{'}}_{i,j}={b}_{i,j}{\bar{\beta }}_{j}\end{array}$$
(3)

Step 4. The deduction of mass fraction (\({\varphi }_{j,k}\)) of a give primary species (\(k\)) in the \({j}^{{th}}\) source emitted PPM2.5 follows Eq. 4,

$$\begin{array}{c}{c}_{i,k}=\mathop{\sum }\limits_{j=1}^{m}(b{\text{'}}_{i,j}{\varphi }_{j,k})\end{array}$$
(4)

where \({c}_{i,k}\) denotes its ambient level in the \({i}^{{th}}\) observation, and \({{b\text{'}}}_{i,j}\) is the scaled source contribution to PPM2.5 from Eq. 3. Step 4 differs from Step 3 in two aspects: first, contrary to the site-dependent scaling factors (\({\beta }_{j}\)), we assume that the mass fraction (\({\varphi }_{j,k}\)) of a species from the same source sector is uniform within the region; and second, the penalty term is no longer applicable, whereas upper and lower bounds of \({\varphi }_{j,k}\) are set to [0,1] in Eq. 4. The objective function in Eq. 5 is minimized to determine the optimal set of \({\varphi }_{j,k}\) values for one species at a time. Similar to Step 3, Monte Carlo simulation is carried out to estimate the uncertainties of deduced mass factions. The arithmetic means (\({\bar{\varphi }}_{j,k}\)) from simulations is used in Step 5.

$$\begin{array}{c}{Q}_{k}=\mathop{\sum }\limits_{i=1}^{n}{(\mathrm{ln}{c}_{i,k}-\mathrm{ln}\mathop{\sum }\limits_{j=1}^{m}(b{\text{'}}_{i,j}{\varphi }_{j,k}))}^{2}\end{array}$$
(5)

Step 5. The product (\({{SC}}_{i,j,k}\)) in Eq. 6 is deemed as the source contribution from the \({j}^{{th}}\) source sector to a primary pollutant (\(k\)) in the \({i}^{{th}}\) observation.

$${{SC}}_{i,j,k}={b{\prime} }_{i,j}{\bar{\varphi }}_{j,k}$$
(6)

By applying the PPM2.5 scaling factors derived in Step 3 and source profiles of primary species in Step 4, regional concentrations of the metal species and their source contributions in the modeling domain can be determined. Since the PPM2.5 scaling factors determined in Step 3 are location dependent, an inverse distance weighting interpolation scheme (Eq. 7), is used to estimate the scaling factors at all grid cells in the entire domain68,69.

$$\begin{array}{c}{\beta }_{j}(\overrightarrow{x})=\mathop{\sum }\limits_{l=1}^{10}{\bar{\beta }}_{j,l}{w}_{l},\,{w}_{l}=\frac{1/|{\overrightarrow{x}}_{l}-\overrightarrow{x}|}{{\sum }_{l=1}^{10}1/|{\overrightarrow{x}}_{l}-\overrightarrow{x}|}\end{array}$$
(7)

where \(\vec{x}\) and \({\vec{x}}_{l}\) are the position vector of a grid cell and the \({l}^{{th}}\) monitor site, respectively. \({\beta }_{j}\) is the interpolated scaling factor for the \({j}^{{th}}\) PPM2.5 source.

10-fold cross validation

10-fold cross validation is conducted to examine the prediction accuracy of this hybrid model (Fig. 6b). Since daily observation data are measured at precisely 10 monitoring stations, each trial uses data at nine sites as the training set and the remaining one as the testing set. That is, CMAQ-modeled PPM2.5 contributions and ambient measurement data in the training set are utilized as model inputs to generate location-dependent scaling factors as well as regional source profiles. These scaling factors are later extrapolated to the testing site using Kriging (Eq. 7). With adjusted CMAQ PPM2.5 source contributions and source profiles, predictions of primary species are made possible for testing set, and hereafter compared to daily measurements.