Data assimilation of ambient concentrations of multiple air pollutants using an emission-concentration response modeling framework

Data assimilation for multiple air pollutant concentrations has become an important need for modeling air quality attainment, human exposure and related health impacts, especially in China that experiences both PM2.5 and O3 pollution. Traditional data assimilation or fusion methods are mainly focused on individual pollutants, and thus cannot support simultaneous assimilation for both PM2.5 and O3. To fill the gap, this study proposed a novel multipollutant assimilation method by using an emission-concentration response model (noted as RSM-assimilation). The new method was successfully applied to assimilate precursors for PM2.5 and O3 in the 28 cities of the North China Plain (NCP). By adjusting emissions of five pollutants (i.e., NOx, SO2, NH3, VOC and primary PM2.5) in the 28 cities through RSM-assimilation, the RMSEs (root mean square errors) of O3 and PM2.5 were reduced by about 35% and 58% from the original simulations. The RSM-assimilation results small sensitivity to the number of observation sites due to the use of prior knowledge of the spatial distribution of emissions; however, the ability to assimilate concentrations at the edge of the control region is limited. The emission ratios of five pollutants were simultaneously adjusted during the RSM-assimilation, indicating that the emission inventory may underestimate NO2 in January, April and October, and SO2 in April, but overestimate NH3 in April and VOC in January and October. Primary PM2.5 emissions are also significantly underestimated, particularly in April (dust season in NCP). Future work should focus on expanding the control area and including NH3 observations to improve the RSM-assimilation performance and emission inventories.


Introduction
Human exposure to air pollutants such as ozone (O 3 ) and fine particulate matter (PM 2.5 ) has been associated with considerable adverse health effects. In 2017, 2.9 million premature mortalities were attributed to PM 2.5 exposure globally and about a half-million mortalities were attributed to O 3 exposure [1][2]. Accurate estimation of air pollutant concentrations and their exposures is critical for assessing health impacts and developing emission control strategies. Previous studies have demonstrated that the assimilation of chemical transport model (CTM) simulations with monitor observations can provide spatiotemporally continuous estimates of air pollutant concentrations and corresponding exposure while incorporating the accuracy of in-situ monitoring data and the spatiotemporal continuity of CTM modeling. Traditional data fusion methods such as Voronoi Neighbor Averaging (VNA) [3], enhanced Voronoi Neighbor Averaging (eVNA) [4], and Downscaler (DS) [5][6] have been applied in many studies to estimate air pollutant concentrations, human exposure and related health impacts [7]. Concentration estimates based on fusing monitor and CTM data can be helpful in characterizing the effects of control strategies for air quality attainment [8].
However, traditional data fusion is mostly based on statistical interpolation or regression methods that are designed to predict each pollutant individually. The need for simultaneous assimilation of multiple pollutants is evident with the deterioration of O 3 pollution in China corresponding to PM 2.5 improvements [9]. The challenge for multipollutant assimilation is to wisely design certain constraints to maintain the natural linkages between pollutants during the assimilation process. Specifically, the natural linkage between O 3 and PM 2.5 is that both pollutants have contributions from common precursors (NO x and VOC), similar atmospheric diffusion/advection transport, and chemical oxidation reactions. Therefore the modulation of one pollutant concentration should exert corresponding influence on the other, and these connections should be represented during the assimilation. Traditional data fusion methods like eVNA can only fuse pollutant concentrations separately without considering multipollutant linkages. However, assimilating O 3 and PM 2.5 separately will result in different NO x and VOC adjustment ratios since the optimization is conducted individually for one pollutant instead of both. The simultaneous assimilation for both pollutants can result in a consistent adjustment of NO x and VOC emissions and also provide additional constraints to ensure the inversion problem is well posed [10]. Therefore, the development of an advanced data assimilation method is necessary to support simultaneous data fusion for multiple pollutants to accurately represent current air quality and the effectiveness of future control policies.
Discrepancies between predictions and observations are associated with uncertainties in many factors such as model emissions, resolution, chemical mechanisms, and process parameterizations as well as measurement errors. Among these factors, uncertainties in emission inventories are regarded as one of the largest contributors to the biases in CTM predictions. Unfortunately, the interpolation-based fusion methods like eVNA do not provide information on the contributors to model errors. Inversion modeling studies have used advanced assimilation techniques such as ensemble Kalman filtering to correct the emissions simultaneously during the assimilation process [11]. The revised emissions are also useful for improving emission inventories [12][13]. Ideally, the predictions developed by combining CTM simulations and observations would provide not only accurate and spatiotemporally continuous concentrations of multiple pollutants, but also corrections to emission inventories, which are one of the largest contributors to model biases. To date, however, most inversion studies only focused on individual pollutants that can be measured directly. Therefore, there remains a lack of inverse emissions estimates for multiple pollutants including those that cannot be directly observed, such as primary fine particulate matter (noted as pPM 2.5 ).
The recently developed response surface model (RSM) provides real-time prediction of both PM 2.5 and O 3 using emission-concentration relationships. The RSM can identify the emission control factors needed to meet air quality targets, and thus provides information on the changes in emissions of multiple pollutants needed to improve air quality predictions against monitoring data [14][15]. Advanced machine learning techniques enables its fast application across any time period and spatial location [16]. Different from inversion modeling, the RSM modifies anthropogenic emissions of five pollutants at the regionally aggregated level (by city in this study) based on an assumption that the spatial distribution of emissions is relatively accurate compared with emission magnitudes. In combination with surface observations from a few monitoring sites, the RSM has been successfully applied to investigate the emission changes during the COVID-19 period in North China Plain (NCP) [17]. The RSM-based assimilation can well maintain the inner linkages of PM 2.5 and O 3 , as the RSM prediction can be considered as one CTM simulation under a specific emission scenario. The emission adjustment ratios from the RSM-assimilation also address the question of how emissions should be modified to achieve a certain level of agreement between predictions and observations. In this study, an emission-concentration response modeling framework is established based on the RSM (noted as RSM-assimilation). Then its performance in the data assimilation of ambient concentrations of multiple air pollutants is tested in the case study of NCP. The new RSM-assimilation method is described in Section 2. The performance of the RSMassimilation approach as well as the implications for improving the emission inventory is discussed in Section 3. Advantages, limitations and future work on RSM-assimilation are summarized in Section 4.

Simulation and observation data
The principle of RSM-assimilation is to nudge the CTM simulation toward available observations by adjusting the emissions. The CTM used in this study is the Community Multiscale Air Quality (CMAQ, version 5.2.1, www.epa.gov/cmaq) model, and the meteorological fields are based on a simulation with the Weather Research and Forecasting (WRF, version 3.8) model. The same configuration of WRF-CMAQ was applied as in our previous studies [18]. The Morrison double-moment microphysics scheme [19], the Rapid Radiative Transfer Model [20], Kain-Fritsch cumulus cloud parameterization [21], Pleim-Xiu land-surface physics scheme [22], and the Asymmetric Convective Model for the planetary boundary layer (PBL) physics scheme [23] were used in WRF simulation. We used the Carbon Bond 6 [24] and the AERO6 aerosol module [25] for gas-phase and particulate matter chemical mechanisms, respectively.
The simulation domain covers the 28 key cities in NCP, as shown in Figure 1. The simulations of the 27 km × 27km China domain provide the boundary conditions for the simulation of the nested domain of 9km × 9km over NCP. The anthropogenic emissions of 28 cities are based on the Multi-resolution Emission Inventory for China (MEIC) dataset for the year of 2016 [26], because the data for 2017 was not available when we initiated this work. The emissions include five major sectors, i.e., industry, power, residential, transportation, and agriculture. The gridded emissions of five air pollutants over 28 cities in NCP are shown in Figure S1. Biogenic emissions were generated by the Model for Emissions of Gases and Aerosols from Nature (MEGAN) version 2.0 [27]. The inline dust model was used to estimate the wind-blown dust emissions in the 27km × 27km China domain. Considering the wind-blown dust mostly comes from outside of NCP, we turned off the inline dust model for the simulation of 9km × 9km over NCP to reduce the computational burden associated with the RSM model development. The performance of the WRF-CMAQ model in simulating meteorological variables and pollutant concentrations was evaluated by comparing with surface observations [28]. The WRF-CMAQ model well reproduces the observed meteorology, with mean biases within ±0.5° for 2-meter temperature, ±1g/kg for 2-meter humidity, ±0.5m/s for 10-meter wind speed, and 10° for wind direction. The model also exhibits acceptable performance in simulating PM 2.5 and O 3 concentrations, with slight low-biases in PM 2.5 and high-biases in O 3 . Such biases can be reduced through the data assimilation with RSM, as discussed in the Section 3.1.
The RSM model was developed based on 21 emission-control scenario simulations with the WRF-CMAQ model by implementing polynomial functions to represent the response of O 3 and PM 2.5 to emission changes of five pollutants including SO 2 , NOx, VOC, NH 3 and pPM 2.5 in 28 cities in NCP [14][15]. The polynomial functions of O 3 and PM 2.5 response to emissions are shown as E1-2. where Δ O 3 and Δ P M 2.5 is the response of O 3 and PM 2.5 concentrations (i.e., change to the baseline concentration), respectively, at each simulated grid cell; E NOx , E SO2 , E NH3 , E VOCs , and E pPM2.5 is the change ratio of NO x , SO 2 , NH 3 , VOCs, and pPM 2.5 emissions, respectively, relative to baseline (i.e., baseline = 0); a i is the coefficient of term i which is determined by fitting with 21 emission-control scenario simulations.
The RSM can predict O 3 and PM 2.5 responses to emission changes in good agreement with CMAQ predictions, as the out-of-sample validation of RSM predictions against CMAQ simulations yields NMBs (Normalized Mean Biases) within ±1% [28].
We gathered the observed NO 2 , SO 2 , O 3 and PM 2.5 measurements from monitoring sites in the 28 NCP cities from the China National Environmental Monitoring Centre (http:// www.cnemc.cn/en/) ( Figure 1). The daily maximum 8-hour O 3 and daily averaged PM 2.5 concentrations were assimilated for the modeling domain with 9-km horizontal grid spacing. The simulation period is January, April, July and October in 2017 to represent winter, spring, summer and fall, respectively. Assimilation performance was evaluated using the Root Mean Square Error (RMSE), Normalized Mean Bias (NMB) and R-squared (R 2 ).
In the baseline case of RSM-assimilation, we included all observation sites during the assimilation. To compare with the simulation, we averaged observations from sites within the same model grid cell (9km ×9km). This resulted in about 85 sites (the number is smaller than that shown in Figure 1) across 28 cities to be used in data assimilation. The monitoring sites are mostly located at the urban center of each city, were both local emission sources and regional contributions influence the ambient PM 2.5 and O 3 concentrations [29][30]. The RSM-assimilation can account for both local formation and regional transport since it was original built from CMAQ model simulations. Details about the numbers of sites used for nudging in each city are summarized in Table S1. Additionally, we designed single-site cases for each of the 28 cities in which only a single randomly-selected site was assimilated for each city. These cases were used to examine the sensitivity of the RSM-assimilation method to the number of observation sites. Figure 2 presents the framework for the newly developed RSM-assimilation method. As in our previous study [17], we first estimate the adjusted emission ratios of NO x and SO 2 based on the comparison of simulated and observed NO 2 and SO 2 concentrations. Next, an optimization process is implemented that samples VOC emission ratios as inputs for the RSM-O 3 model. This process results in optimized VOC emission ratios for the 28 cities that yield the best agreement between simulated and observed O 3 concentrations at the monitoring sites (determined by the average RMSE over all sites). A similar optimization process is then applied for PM 2.5 , where the adjusted NO x , SO 2 and VOC emission ratios are introduced into the RSM-PM 2.5 model and emission ratios are sampled for NH 3 and pPM 2.5 in the 28 cities. Since direct observations of surface NH 3 concentrations are unavailable, the optimized pPM 2.5 and NH 3 emission ratios correspond to the assimilated PM 2.5 concentrations that agree most closely with observed PM 2.5 concentrations among all possible combinations of emission ratios. The optimization was conducted through a multiple looping process to select the best combination of emission ratios (with a small step of 0.05) for different pollutants and cities to achieve the best overall agreement. As currently designed, the RSM is suitable for emission ratios of gas pollutants that range from 0 (fully controlled emissions) to 2 (doubled emissions). The response functions of O 3 and PM 2.5 to emission changes of gaseous precursors were fitted through the regression of multiple emission-control scenario simulations (emission ratios from 0-2); thus, the uncertainties of the response functions will be considerably larger when the emissions vary outside of the 0-2 emission ratio range. Here, we limit the adjusted emission ratios of NO 2 , SO 2 , VOC, and NH 3 to range from 0.1 (90% reduction) to 2.0 (100% increase). We allow a wider range for pPM 2.5 emission ratios due to the large uncertainties in pPM 2.5 emissions, particularly for fugitive emission sources like wind-blown dust. Unlike the gas pollutants, the RSM has no limits on the adjusted emission ratios for pPM 2.5 , since impacts of these emissions are represented linearly in the RSM [15].

Performance of RSM-assimilation
In Figure 3 and 4, the assimilated O 3 and PM 2.5 concentrations in the baseline RSMassimilation case are compared with the original CMAQ simulations in terms of the spatial pattern of monthly averaged concentrations and with observations in terms of daily paired values for sites in the 28 cities. The assimilated spatial fields generally maintain the spatial distribution of the original CMAQ simulation, but have slightly modulated concentrations in areas surrounding the observation sites that better reflect observed values. The accuracy of both O 3 and PM 2.5 concentrations is enhanced through the RSM-assimilation that reduces the RMSE from 12.7-24.4 ppb (pre-assimilation) to 9.9-18.1 ppb (post-assimilation) for O 3 , and from 23.7-85.3 μg m −3 (pre-assimilation) to 11.8-46.5 μg m −3 (post-assimilation) for PM 2.5 .
The high biases in simulated O 3 are greatly reduced through RSM-assimilation, as the NMBs decrease from up to 50% in CMAQ to within 20% in RSM-assimilation. The underestimation of PM 2.5 is also mitigated through RSM-assimilation, as the NMBs for PM 2.5 are reduced to 0%. The comparison of daily paired predictions and observations indicates large improvements for R 2 in RSM-assimilation: e.g., the R 2 for PM 2.5 predictions increased from 0.2-0.5 (pre-assimilation) to 0.7-0.9 (post-assimilation). However, the RSM-assimilation is much more effective for PM 2.5 than for O 3 . This behavior might be associated with the large contributions to O 3 from sources that cannot be adjusted through RSM assimilation (e.g., biogenic sources) and the effectiveness of primary PM 2.5 emissions for modulating PM 2.5 concentrations. We note that the observations displayed in Figure 4 are distributed in discrete locations across the domain, and do not necessarily match the location of the simulated and assimilated concentrations (see Figure S2 for the matched location comparisons).
Time-series comparisons for O 3 and PM 2.5 in the 28 cities ( Figure S3 and S4, respectively) suggest similar levels of improvement, as the RMSE was reduced after the four-month assimilation across the 28 cities.
Although the RSM-assimilation improves the simulation accuracy, large discrepancies in the performance improvements are evident among the 28 cities. To further investigate the variation of RSM-assimilation performance across all observation sites, we compared the RMSEs of CMAQ-simulated and RSM-assimilated O 3 and PM 2.5 in the 28 cities at the fourmonth averaged level, as shown in Figure 5.
In general, the RSM-assimilation is less effective in reducing the RMSE for sites in cities at the edge of the control region (i.e., the full 28-city area). The smallest O 3 improvements occurred in cities such as ZB on the eastern edge of the control region and YQ on the western edge of the control region. These cities had relatively large RMSEs after the assimilation, with RMSE reductions of only 6% (ZB) and −3% (YQ) (slightly worse performance) relative to the CMAQ simulation. For the other cities, the O 3 improvements are much greater, with RMSE reductions of at least 16%. For PM 2.5 , YQ also has the smallest improvement, with a 24% reduction in RMSE (compared to a 50-80% for the other cities). Such results indicate that the RSM-assimilation has limited ability to improve the accuracy of concentrations at the edge of the control region, where the influence of emissions from outside of the control region is large. Enlarging the control area or combining with an RSM model based on the larger domain is recommended to improve the ability of RSM-assimilation for those cities. Meanwhile, discrepancies also exist within a city-e.g., RMSE was reduced in eastern Tianjin but increased in western Tianjin in Jan. This behavior occurs because the RSM-assimilation adjusts emissions at the city averaged level and maintains the spatial distribution of emissions within each city at the a priori estimate. Future improvement of the spatial distribution of the emission within the city is also recommended by adopting additional observations like satellites and advanced technologies like machine-learning to address such limits.

Sensitivity of RSM-assimilation to the site number
For traditional model-observation fusion methods, the abundance of observations has significant impacts on model performance [8]. However, in RSM-assimilation, the adjustment of emissions is done at the city-level, and therefore decreasing the number of observations within each city should have relatively less influence on performance compared to regression-based methods. To investigate the sensitivity of RSM performance to the number of sites used in assimilation, we examined performance for cases based on different numbers of observation sites, as shown in Figure 6.
In assimilation for the baseline case (red hollow bar), all observation sites were used. For the other cases (C1 to C3, red symbols), a single site in each city was used based on three random site selections. Compared to the baseline RSM, the performance does not decrease considerably for the single-site RSM-assimilation cases, C1-C3, especially for PM 2.5 . When the number of observation sites are decreased from 85 to 28, the RMSE in O 3 predictions increases slightly (by 13%, from 11.5 to 13.3 ppb) and RMSE in PM 2.5 increases slightly (by 42%, from 15.7 to 22.4 μg m −3 ) but still decreased by 40% from that in CMAQ (37.3 μg m −3 ) based on the average of all 28 cities. We note that the evaluation for C1-C3 was performed while withholding observations from the sites used for assimilation, thus implying that such improvement through assimilation also applies to the locations where air pollution is not monitored.
RSM-assimilation has advantages in cases where few observation data are available. However, since emissions are adjusted at the city level, the RSM-assimilation method has limited flexibility for adjusting spatial patterns of O 3 and PM 2.5 concentrations. The spatial distribution of emissions, which are assumed to be accurate in RSM-assimilation, might also have uncertainties that could influence the performance of assimilation in terms of representing concentration gradients. In that case, adjustment of total emissions cannot reduce the biases associated with the spatial concentration gradients, which may limit the overall performance of RSM-assimilation.

Implication of uncertainties in anthropogenic emissions
In addition to reducing the RMSE of model predictions, RSM-assimilation provides the emission ratios for five pollutants that are adjusted simultaneously during assimilation. The adjusted emission ratios provide information about potential uncertainties in anthropogenic emissions.
As shown in Figure 7, for average month values, NO x emissions appear to be underestimated in the baseline case based on emission adjustment ratios >1 in January, April, and October. VOC emissions appear to be overestimated (adjustment ratios <1) in January and October, and SO 2 emissions are underestimated in April. The emission adjustment ratio for pPM 2.5 is the largest among all pollutants, in part due to the wider range of possible changes allowed in the RSM-assimilation. The pPM 2.5 ratio is much greater than 1 in all cases and indicates a significant underestimation, particularly in April during the dust season. The biases in assimilated concentrations generally do not reach to zero due to the limited range of emission adjustment available in RSM-assimilation. For cases that have large uncertainties in the prior emissions, the simulated concentrations cannot fully be assimilated such that predictions match the observation.
An underestimation of NO x and SO 2 emissions is evident on both clean and polluted days. On clean days, pPM 2.5 emissions are significantly underestimated in April. On polluted days, NO x emissions are underestimated during all seasons except summer. For all days, VOC emissions are overestimated in January and October, and SO 2 emissions are underestimated in January and April but overestimated in July and October. The adjusted emission ratio for pPM 2.5 is greater than 1 across all four months suggesting a broad underestimation of primary PM 2.5 emissions. The uncertainties of pPM 2.5 emissions may be due to the lack of inline dust simulation over NCP domain and underestimation of windblown dust emissions outside of NCP [31], as such underestimation of PM 2.5 is more pronounced in April during the dust season. Also, the underestimation of primary organic aerosol and intermediate VOC emissions appears to have resulted in relatively larger lowbiases of simulated organic aerosols than other components in April (see Figure S5). Although increasing the pPM 2.5 emissions is an efficient way to resolve differences between modeled and observed PM 2.5 concentrations, the emission adjustment of pPM 2.5 also likely corrects for other model limitations. The large adjustment of pPM 2.5 (by over two) suggests that other model uncertainties (e.g., aerosol-PBL dynamic interactions, chemical reaction rates, and potential missing chemical reaction pathways) could also play an important role in contributing to low biases. This is also suggested from the evaluation of PM 2.5 component concentrations which imply that potential missing chemical reaction pathways for the transition of S(IV) to S(VI) [32,33] might contribute to the low-biases of sulfate aerosols in January and July (i.e., SO 2 concentration was overestimated but the percentage of sulfate aerosols in total PM 2.5 was underestimated, see Figure S5). Further decoupling the influence of these processes can be done using advanced machine learning technology with the inclusion of certain feature data such as meteorological variables.

Summary and Conclusions
In this study, we developed a new assimilation method (RSM-assimilation) that uses an emission-concentration response model for assimilating PM 2.5 and O 3 observations simultaneously. The successful application of RSM-assimilation indicates that significant improvements in the agreement of predictions and observations can be achieved solely by adjusting the emissions. For instance, the RMSE for O 3 and PM 2.5 predictions decreased by about 35% and 58%, respectively, demonstrating the effectiveness of the new assimilation method based on the emission-concentration response model. We found that the RSMassimilation has limited ability for assimilating concentrations at the edge of the control region due to the influence of emissions from surrounding regions. Compared to PM 2.5 , O 3 concentrations are harder to assimilate with the new method due in part to the larger contributions from background and natural sources. An advantage of RSM-assimilation is that it requires very little observational data, since it uses prior knowledge of the spatial distribution of emissions. Therefore the approach is most suitable for cases where few observation are available.
The RSM-assimilation also provides useful information to improve understanding of uncertainties in the emission inventory. Based on the adjusted emission ratios, we found that the NO x emissions are likely underestimated in January, April, and October; SO 2 emissions are likely underestimated in April; NH 3 emissions are likely overestimated in April; and VOC emissions are likely overestimated in January and October. Also, pPM 2.5 emissions appear to be significantly underestimated in April during the dust season. Despite the success of the RSM-assimilation application, some limitations were identified that require future improvement. For example, due to the lack of NH 3 observations, we adjusted the NH 3 emissions simultaneously with pPM 2.5 emissions in this study. Future work can be conducted by implementing both surface measurements and satellite retrievals (e.g., NO 2 , SO 2 , NH 3 , and HCHO) to optimize the emission accuracy across the whole domain. The inclusion of NH 3 observations can improve the method to better constrain the adjustment for PM 2.5 . Further improvement can be also done to separate the emissions by sectors with additional correction functions such as the spatial pattern of each emission category based on observations like satellites. In addition, we assumed that the emissions changes have immediate impact on PM 2.5 and O 3 concentrations, but incorporating an assimilation time window into the method is necessary to account for the time needed for pollution transport and chemical formation upwind of the monitors.   The emission-concentration-based response modeling framework. (In assimilating observational data, NO x and SO 2 emissions are adjusted first, followed by optimization for VOC and then pPM 2.5 and NH 3 emissions) The RMSE (left: grey-CMAQ; blue-RSM) and Adjusted emission ratio for the air quality assimilation (right, baseline=1. lightblue-NO x ; cyan-SO 2 , green-NH 3 ; pink-VOC; grey-pPM 2.5 ) at different polluted levels Xing et al. Page 20