Machine learning approaches reveal highly heterogeneous air quality co-benefits of the energy transition

Summary Estimating health benefits of reducing fossil fuel use from improved air quality provides important rationales for carbon emissions abatement. Simulating pollution concentration is a crucial step of the estimation, but traditional approaches often rely on complicated chemical transport models that require extensive expertise and computational resources. In this study, we develop a machine learning framework that is able to provide precise and robust annual average fine particle (PM2.5) concentration estimations directly from a high-resolution fossil energy use dataset. Applications of the framework with Chinese data reveal highly heterogeneous health benefits of avoiding premature mortality by reducing fossil fuel use in different sectors and regions in China with a mean of $19/tCO2 and a standard deviation of $38/tCO2. Reducing rural and residential coal use offers the highest co-benefits with a mean of $151/tCO2. Our findings prompt careful policy designs to maximize cost-effectiveness in the transition toward a carbon-neutral energy system.

Although reduced-form models provide satisfying approximations for full-scale models, several important limitations restrict their use in policy analysis.First, inputs for the reduced-form model need to be developed from a full-scale model (usually for a specific simulation period) and are still resource-intensive, hence updating the meteorological conditions that dictate the simulation results requires substantial domain knowledge and efforts.Second, for the atmospheric chemistry processes that are less well-understood, the biases embodied in

Structure and performance of the machine-learning framework
Our novel machine-learning framework applies modified convolutional neural network architecture (ResCNN) to simulate the energy-related annual average fine particle (PM 2.5 ) concentration observed by China's national-level air-quality monitoring stations in 2015.The year 2015 is chosen because this is the latest year available for the high-resolution fossil energy CO 2 emissions dataset 29 used as key inputs in our framework.The architecture of our framework includes a linear component and a CNN component.A residual connection is added between the linear regression results and the final output for better causal interpretability and model stability.Detailed architecture is shown in Figure S1.We collect 1,497 stations with available data in 2015 from China's national ambient air quality monitoring network.Some monitoring stations are overlapped in the same grid cell at the geographic resolution we use (10 km 3 10 km).This resolution (100 km 2 ) is finer than most regional reduced-formed models (e.g., 1,642 km 2 (median) for AP2, 17 1,296 km 2 (median) for EASIUR, 18 293 km 2 (population-weighted mean) for InMAP 19 ), and comparable to some comprehensive regional CTMs, for example, 144 km 2 for many CTMs for the United States. 15We use the most recent year with available data (the year 2015) for the analysis in this paper, but the framework could be easily extended to future years and other regions.Specifically, the model output is the annual average value of hourly PM 2.5 concentration for each monitoring station collected from China's National Air Quality Real-time Disclosure Portal. 30The portal publishes hourly readings of PM 2.5 concentration measured by stations in China's national ambient air quality monitoring network, which has been developed to generate representative PM 2.5 concentration levels in different regions in China.We use the annual average value because the relationship between PM 2.5 exposure and premature mortality has been established on a yearly basis by mainstream epidemiology literature. 31,32For stations co-located within the same grid cell, we group them using their average PM 2.5 concentration as the output variable, leaving 943 data points in the final dataset for our study.Since some sources are not related to energy consumption, including the dust, residential and open burning of biomass.We modify the labels to exclude the influence of these sources based on government guidelines and the results of previous studies, 33,34 as detailed in the Supplemental information.
The model inputs for predicting the annual average PM 2.5 concentration of a specific monitoring station include 2D tensors that contain the geographical distribution of CO 2 emissions from fossil energy use (by sector and energy type) as well as altitude, temperature, and precipitation information, with a granular resolution (10 km 3 10 km) in a large surrounding area centered by the station (610 km 3 610 km, comparable to the domain size used for regional air quality modeling. 35Therefore, these 2D tensors are essentially 61361 matrices.We normalize energy use data when constructing 2D tensors to facilitate the training process, see details in STAR Methods.To capture absolute energy consumption levels and agriculture activity levels (proxied by fertilizer use, livestock, and poultry production per unit area) in the surrounding area (610 km 3 610 km) of a station, we include these additional variables shown as the linear architecture (1311 vector) in Figure S1.Further details about data collection and description are available in in Supplemental information.
A sequential training procedure is adopted where the coefficient for the hyperparameters for the CNN and the coefficient for the linear component are selected sequentially.The training, validation, and testing set is formed based on stratified sampling by the stations' geographic regions, and the representativeness of the samples is shown in Table S1.We show the PM 2.5 prediction performance of our model by comparing the observed and model-predicted concentrations at 943 stations in mainland China for the year 2015 (see Figure 1).The training set shows the best fit, but the validation and test set are also reasonably fitted, suggesting over-fitting is not a major issue.For the test set, mean fractional bias (MFB), mean fractional error (MFE), mean proportional error (MPE), correlation coefficient (r), and R square (R 2 ) are À0.03, 0.13, 0.14, 0.9, and 0.81, respectively, significantly improved compared to the fit of some published reduced-form air quality models (MFB: À0.06, MFE: 0.36, r: 0.74, R 2 : 0.13 in Goodkind et al. 15 for the contiguous US in 2011; MPE: 0.37, r: 0.62 in Muller 17 for the contiguous US in 2005).This prediction performance is comparable to that of traditional CTMs (MFB: À0.06, MFE: 0.36, R 2 : 0.13 in Tessum et al. 36 for the contiguous US in 2005; MPB: 0.37, R 2 : 0.13 in Zhang et al. 37 for mainland China in 2015), see Table S2.The strong predictive power of our model offers us a considerable advantage over conducting the ensuing air quality co-benefits estimation.
The high predictive performance can be attributed to the innovative design of our ResCNN model, which uses a ðl; 1 À lÞ weighting to combine the benchmark linear regression and the CNN model.By varying the value of l, we explore the balance between the linear and nonlinear components until identifying the optimum ratio as l Ã z0:2.It implies that only around 20% of the optimum ResCNN structure resembles the classical linear regression, and around 80% of the ResCNN stems from the CNN architecture.The PM 2.5 pollution is near-linear to the primary PM emission and non-linear to the emissions of gaseous precursors. 38We use the combination of the linear part and the CNN based on this prior knowledge to reduce the required training data. 39The linear component stabilizes the whole ResCNN system, and the CNN components boost the predictive performance beyond a simple linear benchmark.Compared to the prediction accuracy of a pure linear regression that usually serves as a benchmark of non-machine learning methods (R 2 : 0.56), a significantly higher accuracy (R 2 : 0.81) can be achieved by our ResCNN model.
In addition, the linear and non-linear components in the ResCNN architecture take a simple additive form.As a result, the marginal effects of the ResCNN model can be interpreted as causal when the standard conditional independence (CI) assumption holds.The STAR Methods section and the Supplemental information document in detail why the ResCNN outperforms both the reduced-form linear model and the data-driven CNN model and when it can be interpreted as causal.When the standard CI holds, the counterfactual policy analysis can be interpreted as causal.However, as with any causal interpretation using machine learning approaches and observational data only, the causal relationships need to be taken with a grain of salt due to possible confounding variables.Nevertheless, the results from policy scenarios described in the following text at least provide valuable predictive insights into the marginal effects of energy consumption changes.
We simulate annual average PM 2.5 concentration changes under different policy scenarios with reduced fossil energy use.In this study, we focus on industrial coal use (including coal use in the power sector), road transportation oil use, and rural and residential coal use (mainly for heating and cooking), as they are much discussed in China's energy and environmental policy making. 37For illustration purposes, we first simulate scenarios that curtail fossil energy use in the previous sectors by a certain percentage (from 2% to 20% with a step of 2%), respectively.Note that all emissions reduction simulations are conditional on other variables z.We then estimate population-weighted PM 2.5 concentration changes and calculate avoided premature deaths in each scenario based on the most recent concentration-response functions.Here, only premature mortality is considered in our health co-benefit estimation as the exposure-mortality relationship is well-established in the epidemiology literature.We do not consider reduced morbidity or labor productivity loss from air quality improvement.We do not include the benefits of avoiding other dimensions of ''social costs of carbon'' 40 either.
We find that although the amount of coal use in the industrial sector is about 40 times higher than rural and residential coal use, the deaths avoided by reducing industrial coal use are only about ten times higher, suggesting unit pollutant emissions and marginal air-quality-related health damage (hereafter as health damage) of rural and residential coal use are substantially higher.This result is consistent with the recent finding by Yun et al. 41 that China's residential sector contributed only 7.5% of energy consumption but contributed 27% of primary PM 2.5 emissions and 23% of the outdoor PM 2.5 concentrations, respectively.
To showcase the computational advantage of our framework, we further downscale the health benefit calculation by estimating marginal health damages of fossil fuel use (per ton of coal equivalent) in these sectors by each grid cell (10 km 3 10 km).As expected, there is a wide spread of marginal health damage of fossil fuel consumption across different sectors and within each sector spatially.We find that reducing rural and residential coal use offers the highest health benefits per ton of use reduction.Further information about the framework is provided in STAR Methods.

Health co-benefits estimations
We first illustrate how our model could be applied in conventional scenario analyses to estimate air quality and health benefits changes with policies aiming to reduce fossil fuel use.We select three key combustion sources in China as our targets, i.e., industrial coal use, rural and residential coal use, and road transportation oil use.Figure 2 shows the national population-weighted PM 2.5 concentration, and corresponding avoided premature deaths if fossil energy use in one aforementioned sector is curtailed by a certain percentage (from 2% to 20% with a step of 2%), holding other emissions sources and meteorological conditions unchanged.We find the PM 2.5 concentration decreases, and corresponding avoided deaths increase almost linearly with the size of fossil energy curtailment.Annual avoided deaths from reducing 20% of fossil fuels from these three sectors are 37 thousand, 6 thousand, and 0.2 thousand for industrial coal, rural and residential coal, and road transportation oil, respectively.Avoided deaths from reducing industrial coal are about six times higher than from reducing rural and residential coal use, although CO 2 emissions (and energy consumption) from industrial coal use are much higher than emissions from the other two sectors (7.4 gigatons for industrial coal, 0.2 gigatons for rural and residential coal, and 1.0 gigaton for road transportation oil).This result reflects the fact that the emissions factor of industrial coal use is significantly lower thanks to strict end-of-pipe measures implemented since the promulgation of the toughest-ever Air Pollution Prevention and Control Action Plan in 2013, 37 and that locations of emissions sources might be further away from densely populated areas.While most effective air quality control measures promoted by the Action Plan have been targeted at the industrial sector, 37 we show that there are enormous cost-effective potentials to further harvest health benefits from reducing rural and residential coal use.
We then demonstrate how our model could offer estimated spatial distribution of marginal health damage of fossil energy use in different sectors, a task that will require an unacceptably long simulation time using conventional CTMs.We show marginal health damages attributable to an additional ton of CO 2 emissions from a certain type of fossil energy use at every source location in mainland China in Figure 3, with maps for five major air-polluting emissions sources considered in our model (i.e., industrial coal use, rural and residential coal use, coal use in the service sector, road transportation oil use, and industrial oil use).We adopt the value of statistical life (VSL) estimate as 1.8 million in US$(2015), see STAR Methods for details.Average marginal health damage per unit ton of CO 2 emissions (weighted by CO 2 emissions levels of all the grid cells) ranges from 5 to 151 dollars (US$ 2015 price) for these sectors, with the highest health damages for emissions from rural and residential coal use (151 $/ton) followed by coal use in the service sector (128 $/ton).Health damages for emissions from industrial coal use, road transportation oil, and industrial oil use are one order of magnitude lower (14 $/ton for industrial coal use, 9 $/ton for industrial oil use, and 5 $/ton for road transportation oil use).The mean and standard deviation of marginal health damages attributable to an additional ton of CO 2 emissions in China are 19 $/ton and 38 $/ton, respectively.The mean value has the same order of magnitude compared to the  9 although the marginal health damage distribution has a long tail with the 95th percentile value reaching 150 $/ton.Similar to the pattern found in Goodkind et al., 15 all the sector-level distributions are positively skewed, indicating high marginal health damages for some hot spots, especially for rural and residential coal use, with its 95th percentile value reaching 281 $/ton (see Figure S2).The large spread across energy use sources and locations reflects substantial differences in coal quality, combustion condition, and end-of-pipe treatment.Overall, the mean marginal health damage of CO 2 emissions from coal and oil use in China is 21$ and 9$/ton, respectively.
Combining estimated marginal health damages and fossil energy use data with sectoral and regional source information, we can calculate total health damages by sector and by region.We follow the definition of total health damages recommended by Muller et al., 42 where the value of pollution damage is the marginal value of pollution times the quantity of pollution.This definition is consistent with the standard conventions of national accounting and has been adopted in multiple empirical studies 15,16,43 to facilitate cross-sector and cross-region comparisons of health damages.For illustration purposes, we aggregate all the mainland Chinese cities into seven regions following the literature 37,44 and China's official document 45 : Bejing-Tianjin-Hebei (JJJ representing the first characters of three provinces' short names) and surrounding cities (some cities in Shanxi, Shandong, and Henan province included), Yangtze River Delta provinces (YRD), Pearl River Delta provinces (PRD), other East, other Central, West, and Northeast (the first three regions are China's major air-pollution control regions).Figure 4A shows total health damages by sector.Industrial coal use has the largest total health damage (around $110 billion), while the total health damage of rural and residential coal use is about one-third of the industrial coal's total health damage.The other types of fossil fuel use have one order of magnitude smaller total health damages.Figure 4B shows total health damages with sectoral breakouts by region.As one of the most populous and polluted regions, the Bejing-Tianjin-Hebei and surrounding cities bear the highest total health damage from fossil fuel emissions (around $40 billion).Its rural and residential coal use imposes substantially higher total health damages than other regions because winter heating contributes much more to the total PM 2.5 pollution compared to other major air-pollution control regions with less heating demand, e.g., Yangtze River Delta where the total health damage is the second highest (around $30 billion).Some regions show high health damages from a specific type of fossil energy use, e.g., service sector coal use in the Northeast, reflecting the relatively concentrated energy use of this type or associated high marginal health damage in some hot spots in these regions.
The selection of VSL can strongly influence the calculation of the marginal health damage.While we use the estimation of 1.8 million in US$(2015) as an international reference, we also estimate the VSL based on the annual earnings in China.According to Aunan et al., 46 the mean VSL is 100 times of annual earnings in China (95% confidence interval, 50 to 150) with a median of 0.8 million in US$(2015).In this case, the marginal health damage would be 56% lower than the aforementioned values.The mean marginal health damage of CO 2 emissions would be 8$ and 67$/ton for all fossil fuel use and residential coal use, respectively.

Robustness analysis
We examine the robustness of the emissions-concentration relationship derived from our approach by re-calculating the average marginal health damage using models trained by different sets of hyper-parameters.We select four additional sets of hyper-parameters that produce the smallest weighted least square errors on the validation datasets after the set of hyper-parameters chosen for our base-case model.The mean and standard deviation of marginal health damage of CO 2 is 20-35 $/ton and 22-48 $/ton, which are close to the results from our basecase model (19 $/ton and 38 $/ton).
We then evaluate if the selection of surrounding area size for the PM 2.5 monitoring station could affect the prediction accuracy of our framework.Ideally, we should include an area as large as possible to incorporate the impacts of possible long-distance pollutant transportation.In practice, certain thresholds are usually chosen to avoid data availability and computational issues.Besides our base-case model with input tensors representing 610 km3610 km area centered by a monitor station, we report the prediction accuracy of two alternative models with smaller input tensors (210 km3210 km and 410 km 3 410 km) but identical training processes.Surprisingly, models with smaller input tensors can achieve almost similar accuracy compared to our base-case model on the training dataset.Our base-case model does not show any significant supremacy on the test dataset either, see Figure S3.This result suggests that emissions from a closer surrounding area (210 km 3 210 km) dictate the concentration prediction results, and our modeling framework could achieve satisfactory accuracy as long as most relevant input data (fossil energy consumption within a certain distance from the station) are included.
To quantify the spatial distribution of health damages caused by pollution sources, we calculate the total health damage within different sizes of a surrounding area centered by monitor stations using our base-case model.Figure 5 shows the total health damage caused by emissions within a certain size of squares (from 110 km3110 km to 610 km 3 610 km) centered by monitor stations.The total health damage increases linearly with regards to the edge length of the surrounding area, indicating a concave relationship with the area size (square of the edge length).Emissions from a faraway source have a smaller impact on the PM 2.5 concentration of the destination.More than half of the total health damage caused by emissions from the full 610 km3610 km area occurs within the centering 310 km3310 km area (one-quarter of the full area).However, compared to the results found by Goodkind et al. 15 (half of the total PM 2.5 health damages of a pollution source within a 4,096 km radius circle centered by the source are incurred by people living within 32 km of a source), the total health damage curve in China is less concave, suggesting health damages could be more substantially caused by faraway sources.This different pattern emerges possibly because different distributions of pollution sources and population centers in China allow the atmospheric transmission to form local pollution from more faraway sources.

DISCUSSION
It is well-recognized that marginal health damage of emissions and corresponding health benefits of abatement vary widely by place and pollution type.Traditional complex CTMs are not suitable for economic and policy analyses that require many model iterations to explore the variation of marginal health damages.Recent developments of reduced-form models with satisfactory approximations for full-scale models have allowed researchers to quantify the heterogeneity of the marginal health damage but still place some limitations on broader applications.In this study, we develop a machine learning-based framework that can produce more accurate, accessible, and easy-to-update simulations for air quality, or more specifically, PM 2.5 concentrations.Applying this framework to estimate the marginal health damage of CO 2 emissions by fossil fuel type and source location in China provides us with important insights for China's future policy designs that aim to achieve ambitious air quality and climate targets.
We use the ambient PM 2.5 concentration as labels to build the model and estimate the influence of each sector.However, residential fossil fuels not only contribute to ambient air pollution but also significantly affect indoor air pollution.Previous studies 33 estimated that household solid fuels contributed 43% to the total PM 2.5 -related mortality considering both ambient and indoor air pollution in China in 2015.Therefore, the marginal health damage could be significantly higher than current results when indoor air pollution is considered.
We find that China's current carbon pricing stringency does not match the magnitude of marginal health damage.Although many command-and-control climate policies are in place in China, limited market-based instruments have been applied.By far, China's recently launched national CO 2 ETS only covers the power generation sector, with a mild carbon price at the magnitude of 8 $/ton, substantially less than the average marginal health damage (19 $/ton) estimated by this study.Since this health damage only accounts for premature mortality, this policy gap would be more salient if co-benefits from reduced morbidity or labor productivity loss from air quality improvement and from avoiding other dimensions of ''social costs of carbon'' were considered.We urge a more accelerated development of China's national ETS to form an appropriate carbon price that can better internalize the public health externality of CO 2 emissions.
The highly heterogeneous marginal health damage by fuel type and location also has important implications.By far, only CO 2 emissions from industrial fossil fuel use are planned to be covered by China's national ETS 47 Our analysis shows that, however, CO 2 emissions from industrial fossil energy use have much lower marginal health damages compared to the emissions from coal use in the rural and residential as well as service sectors.Therefore, complementary policies that could encourage fuel switching (e.g., coal to natural gas for small boilers or electrification of residential heating) in these sectors are essential.Placing an effective carbon price to the sectoral-average level of marginal health damage could be one option, then reductions in fossil fuel use in these sectors would be determined by these sectoral-specific carbon prices and heterogeneous marginal abatement cost curves.However, fuel cost increases caused by carbon pricing would usually impose the highest burdens on low-income households. 48To ensure environmental justice while promoting the clean energy transition and addressing climate change 14,[49][50][51] policymakers may consider more progressive policy instruments, such as clean fuel subsidies or clean infrastructure investment.We also note that the marginal health damage of CO 2 emissions in more populous eastern China, particularly in the North China Plain and Yangtze River Delta region, is much higher than that of emissions in the rest of the country, suggesting a trading ratio 52 could be considered in the future design of China's national ETS.
Applying machine learning techniques in the environmental integrated assessment motivates fruitful future work.Although our framework provides satisfactory results for the annual PM 2.5 concentration estimation, prediction accuracy for pollutants other than particular matters needs further improvement.Figure 6 shows the prediction accuracy for PM 10 , SO 2 , NO 2 , CO, and O 3 by applying the same framework and training process to these pollutants.The accuracy of PM 10 is similar to that of PM 2.5 .For SO 2 , NO 2 , and CO, the framework can achieve acceptable fitting results on the training dataset but fail to achieve consistent accuracy on the validation and test datasets; the fitness is low even on the training dataset for ozone.Refining the framework architecture and training process and incorporating relevant atmospheric condition data for pollutants other than particular matters remains an interesting direction for future research.
In addition to updating the analysis when input data are available for more recent years, we believe it is also important to extend the cobenefits analysis implemented in this study to more developing countries with imperative air pollution control and climate mitigation policy development.High-resolution marginal health damage estimations that are very relevant for policy making have been almost exclusively conducted in the United States, [10][11][12][13][14][15] suggesting the complexity of the methodology even with elaborately designed reduced-form CTMs.Our machine learning framework can supply a more accessible tool for researchers and policymakers to implement more comprehensive and timely analyses in developing countries that are constrained by conventional emissions inventory and atmospheric measurement data, as new high-resolved energy consumption datasets 53,54 become more readily available.

Limitations of the study
Several caveats should be noted.(i) We acknowledge that even though we have included as many related variables as possible in the model and deducted the contribution of non-energy-related sources from the labels, the CI assumption may still not be strong enough to hold as there could be additional confounding factors and unobserved confounders that we cannot take into consideration.With increasingly available data, the causal interpretability and model stability can be improved in future studies.Nevertheless, there could still be some spatially varying unobserved confounders that would undermine the CI assumption and the claim of causality as our model learns the mapping from emissions to concentrations based on their spatial variation.(ii) The end-of-pipe control measures, which can strongly affect the marginal damages of the power and industrial sectors, are not explicitly represented in the model.Future studies can incorporate the influence of control measures in the modeling framework by re-training it with multiple-year data.(iii) The annual average meteorological parameters considered in the current framework may not be enough to fully capture the correlation (potentially spurious) between energy consumption and PM 2.5 pollution, which are both affected by atmospheric conditions.For instance, the effect of high precipitations when emissions are high may be masked by low precipitations when emissions are low.We tried to integrate monthly or seasonal meteorological data into our model without adjusting its main structure, however, the prediction power did not improve.We believe incorporating more detailed As visualized in Figure S1, our model architecture consists of two parts.The first part is a standard AlexNet, 56 which uses the normalized 3D tensor x ijkn as input for each station n.It starts with repeated blocks of convolutional layers and max pooling layers, and ends with several fully connected layers.The second part has a linear specification and average value of each 1D tensor x kn as input.These two parts provide complementary information: AlexNet absorbs the nonlinear structural spatial information of each 3D tensor, while the linear part absorbs the magnitude information of each 3D tensor, which is lost in AlexNet due to normalization.Intuitively, PM 2.5 concentration varies with the average amount of the energy use (e.g., higher fossil energy use usually causes higher pollution concentration), and it also depends on some spatial structural information that is missing in the average values (e.g., fossil energy use that is closer to the station will have a larger effect on the concentration, and the spatial distribution of one type of fossil energy use may have a nonlinear interaction with the spatial distribution of another type of fossil energy use).Different from the traditional process of explicitly specifying the nonlinear relationship, our model automatically learns the nonlinear relationship between fossil energy consumption, atmospheric conditions, geographic information, and the resulting annual average concentration level of PM 2.5 .The synthesis of the AlexNet and the linear parts is similar to ResNet, 57 which is also a linear combination of linear and nonlinear feature maps.We therefore name the model architecture as ResCNN because it combines linear regression to capture the magnitude effect and CNN to capture the residual nonlinear relationship.
The ResCNN model is more general than ResNet, 57 since it is a weighted average of the linear and the CNN components: 3) Equation 3 consists of the linear component b 0 ½z n ; x kn and the nonlinear (CNN) component f c ðx ijkn Þ, weighted by the hyper-parameter ðl; 1 À lÞ.The b parameters represent the coefficients of the explanatory variables in the linear model to improve interpretability and stability.The l term is a hyper-parameter that adjusts the ratio of the linear and nonlinear components in the ResCNN.As l ˛½0; 1, the l value can measure the closeness of the ResCNN to a linear model.When l/1, the ResCNN resembles a linear regression.When l/0, the ResCNN resembles the CNN model.The two boundary cases of the ResCNN model are the reduced-form linear model and the CNN model, corresponding to the cases in which l = 1 and l = 0. Since 1 À l shrinks the magnitude of the CNN, it resembles the nature of the L2 regularization (Ridge regression), thus mitigating the overfitting of the CNN component.In Equation 3, x represents the treatment variables (CO 2 emissions from fossil energy use); z represents the control variables; The e n term is commonly assumed to be the Gaussian noise, as shown in the training process with mean squared error in the objective function.Overall, the ResCNN is a general model family with effective regularization controls, so it can achieve higher predictive performance than both the reduced-form linear model and the standard CNN model.The superior predictive performance of the ResCNN can be found in Section 6.

Causal interpretation
The ResCNN is causal when the conditional independence (CI) assumption is satisfied.Although the CNN architecture f c is highly nonlinear, causality exists in the ResCNN model because the three major components of the control variables z n , treatment variables x n , and random noise e n take an additive form in Equation 3. Combining the additive structure of ResCNN and the CI assumption, the marginal effect of the ResCNN has a causal interpretation.For further details, please refer to Section 4 in Supplemental information.
The key assumption for the causality of the ResCNN is the independence between the noise term e n and the treatment x n conditioning on z n .This CI assumption is standard for the causal interpretation in linear regression. 58,59In other words, it does not require more assumptions than a linear model for the ResCNN model to obtain causal interpretation.The ResCNN model differs from the linear regression only in the method of estimating the average treatment effect.The ResCNN model uses the nonlinear marginal effect, rather than the simple linear parameters. 58mpirically, the CI assumption x n te n jz n is still challenging to hold.It is because this assumption implies that the major contributors to and the potential confounding factors of PM 2.5 are all incorporated into our model.We have followed the tradition of air quality modeling and include the variables that can approximate the major primary and secondary sources for PM 2.5 formation, e.g., fossil energy consumption, agricultural activities, and geographic information.Some omitted factors may still exist, such as volatile organic compounds emissions.However, the dominant factors that affect annual PM 2.5 concentration levels have already been included in the model to ensure a consistent estimation of the treatment effect.

Evaluation metrics
To evaluate the performance of the ResCNN, we have adopted some common evaluation metrics for the performance of the CTMs, for example, coefficient of determination (R 2 ), normalized mean bias (NMB), and normalized mean error/mean absolute percentage error (NME/MAPE), from the literature. 17,19,20,60,61We add an important modification to all of the evaluation metrics by introducing normalized weights w n that are proportional to the population that a specific station n corresponds to.For example, if there is only one station that enters our sample for a specific city, we assume the station generates the concentration reading that represents the exposure of the city's whole population; however, if multiple stations are present in our sample for a city, we assume each station generates the concentration reading that represents an equal share of the city's population.With this modification, the performance measure gives more weight to the prediction accuracy of stations corresponding to a larger population because the marginal health damage we estimate later is proportional to the population.We illustrate all the metrics we compute in Table S2.Our main model uses the weighted least square errors on the validation set in hyper-parameter searching, which we discuss in the section below.

Hyper-parameter searching and training
There are two sets of hyper-parameters: the set of hyper-parameters of CNN and the coefficient of the linear part l.Since the size of the CNN network is huge compared to the single parameter l.A sequential training is adopted instead of joint training: a hyper-parameter set is chosen for the CNN part (setting l = 0 and then with the CNN hyperparameters fixed, a search over l values are conducted for the final model specification. To address the challenge of specifying the appropriate CNN hyper-parameters, this study combines random search and grid search.First, random searches (400 trials) in a pre-specified, larger hyper-parameter space are done.Conditioning on each hyper-parameter and while keeping the others random, hyper-parameter values that yield consistently lower performance or higher variance in performance are pruned.Then a complete grid search was done to the remaining hyper-parameter values (384 combinations).4][65] The full hyper-parameter space is shown in Table S3, and the pruned hyper-parameter space is shown in Table S4.Among the hyper-parameters, some are model specific and require some searching in each specific application.For instance, dropout and batch normalization were found as effective regularization in several studies. 66,67Data augmentation 68 assumes the invariance property: when images are rotated or flipped, the new images should not change the predicted values b y .On the other hand, some hyper-parameters are set to follow the common practice.For instance, Rectified Linear Unit (ReLU) is used as activation functions for each neuron; He initialization 69 is used to address the problem of vanishing and exploding gradients; Adam optimizer 70 is used for gradient descent optimization.
The ResCNN model is trained by empirical risk minimization (ERM).Formally, 4) in which W represents parameters and W h represents hyper-parameters, while w n represents the weight of each observation as we describe in the above section.Note that the training in Equation 4 is conditioning on the specific choice of hyper-parameters W h .Denote W Ã = argmin W EðW; W h Þ, the optimum hyper-parameter W h is chosen by random searching: W ðSÞ h represents a random sample from the hyper-parameter space.The best hyper-parameter W Ã h is chosen out of S = 100 training.To train model and choose hyper-parameters, the full dataset is split into training, validation, and testing sets with a ratio equal to 3 : 1 : 1 with a stratified sampling approach.The training set is used to train ResCNN model as in (4); the validation set is used for the selection of hyper-parameter as in Equation 5; the testing set is used for model evaluation and comparison.Further details of the experiment design can be found in Supplemental information Section 5.

Health impacts and valuation
We apply the up-to-date GEMM NCD+LRI method 32 to estimate avoided premature death related to reductions in chronic exposure to outdoor fine particulate matter (PM 2.5 ) under different scenarios.The GEMM NCD+LRI method is considered as a major update to the widelyused IER 5-COD approaches 31,71 and has been adopted in some recent cost-benefit analysis. 11,37,44Compared to the IER approach, the GEMM NCD+LRI method incorporates findings from recent cohort studies that quantify the relationship between avoided premature death and ambient PM 2.5 concentration (C-R relationship) in regions with high PM 2.5 concentration levels, making it more appropriate to be applied in relatively more polluted countries like China.In addition, the GEMM NCD+LRI method associates PM 2.5 -related death with non-accidental deaths caused by non-communicable diseases and lower respiratory infections, a more comprehensive range than the five specific causes of death considered in the IER 5-COD approaches.
The GEMM NCD+LRI quantifies the relationship between the hazard ratio (RR) of NCD+LRI and ambient PM 2.5 exposure (c, equal to PM 2.5 concentration for each grid cell) with the following equation: 6) where q, a, m, n and c f are all shape parameters that define the C-R relationship.Since the baseline mortality rate is different for adults of different ages, we follow the convention to divide the population in a specific grid cell n into 12 subgroups (adults with ages from 25 to 85 and above in five-year intervals).Consider two scenarios (scenario 0 and scenario 1) with different ambient PM 2.5 concentration levels (c 0 , c 1 ), the avoided death under scenario 1 compared to scenario 0 in grid cell n is calculated by summing up avoided deaths of all m age groups: DM n = X m M B m 3 pop m;n 3 1 RRðc 1 ; nÞ À 1 RRðc 0 ; nÞ (Equation 7) where M B m is the baseline mortality rate for age group m in China, retrieved from the Global Health Data Exchange, and pop m;n is the population of age group m in grid cell n.
We adopt the assumption that q has a normal distribution with mean m q and standard deviation s q ð50Þ. 31We can then sample 1,000 points from the normal distribution and calculate the mean and 95% confidence interval of avoided death using Equation 7.An alternative, faster approach to obtain the mean avoided death DM n is to use the following equation to first calculate the mean of 1=RRðc n Þ, which has a lognormal distribution, 8) and then calculate the mean avoided death DM n .
DM n = X m M B m 3 pop m;n 3 ð1 RRðc 1;n Þ À 1 RRðc 0;n ÞÞ (Equation 9) In our scenario analysis, where sector k 0 emissions are curtailed by p (p ranges from 2% to 20%), c 0;n is taken to be the predicted baseline PM 2.5 concentration, and c 1;n is obtained by reducing all input emissions in sector k 0 by p: 10) The marginal change of PM 2.5 due to the marginal change of emissions in each sector in each 10 km310 km cell can be obtained from the gradients of PM 2.5 with respect to the input emissions from each sector and each grid cell ( vy vx ijk ).Since back-propagation is used for training the neural network, the gradients can be directly exported from the models.For each unit of emissions increase in sector k 0 in cell ði;jÞ, the mean avoided death for grid cell n is:  11) where again c 0;n is taken to be the predicted baseline PM 2.5 concentration.
We close the estimate of marginal monetary health damages related to increased premature mortality due to an additional ton of CO 2 emissions in each sector in each cell by using an inferred value of statistical life (VSL) for China based on the U.S. EPA recommended VSL of 8.7 million in US$(2015). 72We adopt the income elasticity for high-income countries recommended by a recent meta-study 73 for extrapolating from the U.S. VSL (VSL base ) to the VSL for China (VSL): VSL = VSL base 3 À pcGDP China pcGDP U:S: Á 0:8 (Equation 12) where pcGDP China and pcGDP U:S: represent GDP per capita in 2015 for China and the U.S., respectively.Equation 12gives the VSL estimate as 1.8 million in US$(2015) for China.
Combining the considerations of premature mortality and the inferred value of statistical life, the marginal monetary health damages due to an additional ton of CO 2 emissions in sector k 0 in cell ði; jÞ is MD ijk 0 = À X n VSL 3 DM ijk 0 n (Equation 13) The total monetary health damages incurred by emissions from sector k 0 are the sum of the production of marginal monetary health damages and emissions in each cell: TD k 0 = X i;j MD ijk 0 3 x ijk 0 (Equation 14)

QUANTIFICATION AND STATISTICAL ANALYSIS
Statistical analyses were performed using Python.

Figure 1 .
Figure 1.Comparison of model-predicted and observed PM 2.5 concentrations at 943 stations in mainland China for year 2015 by training, validation, and test datasets

Figure 2 .
Figure 2. Population-weighted PM 2.5 concentrations and corresponding avoided deaths (shaded areas represent 90% confidence intervals) in 2015 in mainland China if fossil energy use in one polluting sector were curtailed by a certain percentage (from 2% to 20%)

Figure 3 .
Figure 3. Marginal health damages measured in dollars attributable to an additional ton of CO 2 emissions from (A) rural and residential coal use, (B) coal use in the industry sector, (C) oil use in the industry sector, (D) coal use in the service sector, and (E) oil use in the transportation sector

Figure 4 .
Figure 4. Total health damages from fossil energy use in mainland China (A) by sector and (B) by region with sectoral breakouts RRC: rural and residential coal use; IDC: industrial coal use; IDO: industrial oil use; SVC: coal use in service industry; TRN: oil use in road transportation; JJJ: Bejing-Tianjin-Hebei; YRD: Yangtze River Delta provinces; PRD: Pearl River Delta provinces.

Figure 5 .
Figure 5.Total health damages by distance (measured by the edge length of the square centered at a monitor station) and emissions source type RRC: rural and residential coal use; IDC: industrial coal use; IDO: industrial oil use; SVC: coal use in service industry; TRN: oil use in road transportation.

Figure 6 .
Figure 6.Prediction accuracy (measured by weighted R 2 on the train, validation, and test datasets) by model with output variables for different pollutant concentrations