Optimal Inversion of Conversion Parameters from Satellite AOD to Ground Aerosol Extinction Coefficient Using Automatic Differentiation

Satellite aerosol optical depth (AOD) plays an important role for high spatiotemporal-resolution estimation of fine particulate matter with diameters ≤2.5 μm (PM2.5). However, the MODIS sensors aboard the Terra and Aqua satellites mainly measure column (integrated) AOD using the aerosol (extinction) coefficient integrated over all altitudes in the atmosphere, and column AOD is less related to PM2.5 than low-level or ground-based aerosol (extinction) coefficient (GAC). With recent development of automatic differentiation (AD) that has been widely applied in deep learning, a method using AD to find optimal solution of conversion parameters from column AOD to the simulated GAC is presented. Based on the computational graph, AD has considerably improved the efficiency in applying gradient descent to find the optimal solution for complex problems involving multiple parameters and spatiotemporal factors. In a case study of the Jing-Jin-Ji region of China for the estimation of PM2.5 in 2015 using the Multiangle Implementation of Atmospheric Correction AOD, the optimal solution of the conversion parameters was obtained using AD and the loss function of mean square error. This solution fairly modestly improved the Pearson’s correlation between simulated GAC and PM2.5 up to 0.58 (test R2: 0.33), in comparison with three existing methods. In the downstream validation, the simulated GACs were used to reliably estimate PM2.5, considerably improving test R2 up to 0.90 and achieving consistent match for GAC and PM2.5 in their spatial distribution and seasonal variations. With the availability of the AD tool, the proposed method can be generalized to the inversion of other similar conversion parameters in remote sensing.


Introduction
As one of the criteria air pollutants [1], particular matter (PM) refers to a mixture of solid and liquid particles suspended in the air, including tiny inhalable particles that may penetrate the thoracic region of the respiratory system. Studies [2,3] show PMs with a diameter of less than 10 µm (PM 10 ) and with a diameter of less than 2.5 µm (PM 2.5 ) are closely associated with short-(e.g., asthma, respiratory) and long-term (e.g., lung cancer, cardiopulmonary mortality) health effects. Recently, several studies [4][5][6] have also shown a significant association between PM 2.5 and diabetes/neurological disorders. Although PM can be emitted directly (primary PM) or formed in the atmosphere (secondary PM), owing to increasing fleets of vehicles and strict emission regulations for industry and coal, traffic emissions have recently contributed more to PM 2.5 concentrations, which might also result in high exposure for the population since more persons live along major traffic routes. Given acute and long-term adverse health effects of PM 2.5 , its monitoring and accurate estimation are important for the studies of its health spatial resolution of 2 • (latitude) × 5 • (longitude) [27]. However, for the purpose of applications, a formula of conversion from satellite AOD of MODIS to ground extinction coefficient is more practical than using LIDAR sensors to obtain the vertical profile that is not always available. Wang et al. (2010) [15] suggested a simplified formula for the conversion from column total AOD to ground aerosol extinction coefficient that substantially improved PM 2.5 estimation.
In this paper, in comparison with the existing methods, a more flexible conversion method is proposed. Considering multiple complex atmospheric and environmental factors (e.g., emission sources [8,10,32], meteorology [11,12], and elevation [12,13]) involved in the aerosol vertical profile, this method introduces a scaling factor (slope) and a shift factor (intercept) for both planetary boundary layer height (PBLH) and relative humidity (RH) to capture the influence of potential other confounders (e.g., atmospheric chemical processes and other meteorological factors) or random factors based on the simplified conversion formula in [15]. For the optimization of multiple parameters in the proposed conversion formula, automatic differentiation (AD) was used in gradient descent to improve learning efficiency. Automatic differentiation is essential in deep learning, and with the availability of deep learning software such as Tensorflow, PyTorch, or Caffe, it can provide an effective way of finding an optimal solution in many practical applications [33]. Additionally, the linear and polynomial scaling and shift factors were also introduced into the simulated ground aerosol coefficient as proxy to the aerosol extinction coefficient to construct a suitable loss function of mean square error (MSE) for training of the models. With the 2015 measurement data of the Jing-Jin-Ji metropolitan study area, the proposed method was used to reliably convert column (satellite) AOD to ground aerosol extinction coefficient and the results were compared with the empirical conversion methods.

Study Region
Covering most of the Jing-Jing-Ji metropolitan region of northern China, the study area ( Figure 1) is located between the latitudes of 38 • 05'N and 41 • 04'N and the longitudes of 115 • 14 E and 118 • 54 E. The Jing-Jin-Ji is also known as Beijing-Tianjing-Hebei as the biggest urbanized megalopolis region in northern China. It is the area surrounding the municipalities of Beijing and Tanjing, and 13 smaller cities along the coast of Bohai Sea. The study area has an area of approximately 64,000 km 2 with a total population of approximately 94 million. This area is the core of smog in northern China, one of the most polluted areas in China. It has multiple emission sources including industrial pollutions, bulk coal combustion, and motor vehicle exhausts [34]. This study area has a lower elevation than that of surrounding high Mount Taihang-Yanshan, which can result in heat island effects and hence concentrations of air pollutants with local not-conducive-to-disperse meteorology of high temperature and low air pressure [35,36]. Compared with the other seasons, winter in the study area has much higher pollution with up to several times the emissions of PM 2.5 , organic carbon, black carbon and other pollutants due to heating, and adverse meteorology of lower wind speed and higher atmospheric stability.

Satellite AOD
The Multiangle Implementation of Atmospheric Correction (MAIAC) has been recently used to retrieve satellite AOD from the MODIS sensors aboard the Terra and Aqua satellites at a high spatial (1 km) and temporal (daily) resolution with better atmospheric correction over both dark and moderately bright surfaces than the previous Dark Target and Deep Blue retrieval algorithms. In this study, the 2015 MAIAC AOD images covering the study area were collected from the website (https://lpdaac.usgs.gov/news/release-of-modis-version-6-maiac-data-products) of the U.S. Geological Survey. The 1954 Beijing coordinate system was used as the target kilometer projection in this study Remote Sens. 2020, 12, 492 4 of 20 and the projection transformation to the target system was conducted for these images. The quality assurance (QA) data provided in the MAIAC dataset were used to filter out invalid pixels due to closeness to cloud or snow or due to high surface reflectance. A non-linear generalized additive model (GAM) was used to fuse Aqua and Terra daily images to increase spatial coverage of available AOD when one AOD was missing and the other was available. Specifically, the fusion was conducted using the averages of both Aqua and Terra (observed/estimated) AODs when both observed AODs were available, or when one observed AOD was available and the other AOD is missing. For the latter, the missing AOD was estimated by the GAM trained using the samples of both observed AOD available. Similar methods were used in [37,38].

Satellite AOD
The Multiangle Implementation of Atmospheric Correction (MAIAC) has been recently used to retrieve satellite AOD from the MODIS sensors aboard the Terra and Aqua satellites at a high spatial (1 km) and temporal (daily) resolution with better atmospheric correction over both dark and moderately bright surfaces than the previous Dark Target and Deep Blue retrieval algorithms. In this study, the 2015 MAIAC AOD images covering the study area were collected from the website (https://lpdaac.usgs.gov/news/release-of-modis-version-6-maiac-data-products) of the U.S. Geological Survey. The 1954 Beijing coordinate system was used as the target kilometer projection in this study and the projection transformation to the target system was conducted for these images. The quality assurance (QA) data provided in the MAIAC dataset were used to filter out invalid pixels due to closeness to cloud or snow or due to high surface reflectance. A non-linear generalized additive model (GAM) was used to fuse Aqua and Terra daily images to increase spatial coverage of available AOD when one AOD was missing and the other was available. Specifically, the fusion was conducted using the averages of both Aqua and Terra (observed/estimated) AODs when both observed AODs were available, or when one observed AOD was available and the other AOD is missing. For the latter, the missing AOD was estimated by the GAM trained using the samples of both observed AOD available. Similar methods were used in [

PM 2.5 Measurement
The hourly PM 2.5 measurement (unit: µg/m 3 ) data covering the study area were from 102 monitoring stations operated by China's governmental agency (http://www.pm25.in) (see Figure 1 for spatial distribution of these stations and their 2015 mean PM 2.5 concentrations). The monitoring data of PM 2.5 were measured using a tapered element oscillating microbalance (TEOM) with the Thermo Fisher 1405F [39]. Specifically, the ambient air was first pumped in by the TEOM and heated up to 40 • C to remove the influence of water vapor. Then, the particles with a smaller size than 2.5 µm were selected using a special filter and weighted. Finally, the dry mass of PM 2.5 within a unit volume of ambient air was acquired and recorded. Therefore, relative humidity played an important role in the conversion under moisture conditions to dry particle mass. Using a 75% completeness criterion, the hourly measurements were averaged into daily PM 2.5 concentration.

PBLH
The 2015 PBLH (unit: meter) data were gathered from the reanalysis data source of MERRA2 (https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2). The reanalysis data have a coarse spatial resolution of 0.25°(latitude) × 0.3125°(longitude) and a high temporal resolution of 3 h. Daily means were Remote Sens. 2020, 12, 492 5 of 20 obtained by averaging over 3 h reanalysis data. Re-sampling was conducted to convert the coarse spatial-resolution PBLH data to 1-km spatial resolution data to match the spatial resolution of MAIAC AOD.

Relative Humidity
The measurement data of RH (unit: %) were gathered from the dataset of daily ground observation values in China Mainland (Version 3.0) of the China Meteorological Data Service (http://data.cma.cn). In total, the measurement data came from 824 national meteorological stations, and spatial interpolation was conducted using these measurement data with other factors. The method of a residual deep network was used to obtain the daily grid images with a spatial resolution of 1 km (cross-validation R-squared (R 2 ) = 0.85 for RH) [40]. The study area's grids of RH were extracted from the interpolated dataset.

Simulation of Ground Aerosol Extinction Coefficient
As a metric of total column integrated aerosol extinction, satellite AOD is determined by the aerosol scale height and the corresponding aerosol extinction coefficient (Figure 2a): where τ a is column AOD, k a (λ, z) is the aerosol extinction coefficient (k a ) at the altitude of z and the spectral wavelength λ (for satellite AOD, λ = 0.55 µm), TOA represents the top of the atmosphere [41].

PBLH
The 2015 PBLH (unit: meter) data were gathered from the reanalysis data source of MERRA2 (https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2). The reanalysis data have a coarse spatial resolution of 0.25˚ (latitude) × 0.3125˚ (longitude) and a high temporal resolution of 3 h. Daily means were obtained by averaging over 3 h reanalysis data. Re-sampling was conducted to convert the coarse spatial-resolution PBLH data to 1-km spatial resolution data to match the spatial resolution of MAIAC AOD.

Relative Humidity
The measurement data of RH (unit: %) were gathered from the dataset of daily ground observation values in China Mainland (Version 3.0) of the China Meteorological Data Service (http://data.cma.cn). In total, the measurement data came from 824 national meteorological stations, and spatial interpolation was conducted using these measurement data with other factors. The method of a residual deep network was used to obtain the daily grid images with a spatial resolution of 1 km (cross-validation R-squared (R 2 ) = 0.85 for RH) [40]. The study area's grids of RH were extracted from the interpolated dataset.

Simulation of Ground Aerosol Extinction Coefficient
As a metric of total column integrated aerosol extinction, satellite AOD is determined by the aerosol scale height and the corresponding aerosol extinction coefficient ( Figure 2a where τa is column AOD, ka(λ, z) is the aerosol extinction coefficient (ka) at the altitude of z and the spectral wavelength λ (for satellite AOD, λ = 0.55 μm), TOA represents the top of the atmosphere [41]. Although the vertical profile of aerosol extinction coefficient does not directly follow a simple exponential decay with altitude, the scale height simulated by an exponential distribution, although Although the vertical profile of aerosol extinction coefficient does not directly follow a simple exponential decay with altitude, the scale height simulated by an exponential distribution, although not perfect, is a good approximation for the aerosol vertical extent ( Figure 2b) with a limited degree of uncertainty and one feature of aerosol vertical mixing in the low-level troposphere [42]: where k a,0 (λ) represents the ground-level aerosol extinction coefficient at the wavelength of λ, and H A is the scale height of aerosol, approximately represented by PBLH [43]. Introducing Equation (2) into Equation (1), the following formula can be obtained: where it stands for the intercept, −H A · exp(− z TOA H A ), which is often negligible due to a high value of z TOA in many studies [15,17].
In addition, studies [44,45] show that the association between k a· and PM concentration varies with the chemical components of particles and the RH of the ambient air. Thus, the factor of RH was introduced into the adjustment of ground aerosol extinction coefficient in Wang et al. (2010) [15]: where k a,Dry (λ) represents the "dry mass" of the particles, closely associated with PM 2.5 concentration. f(RH) stands for the factor of relative humidity: where g is an empirical fit coefficient. Introducing Equations (3) and (5) into Equation (4), the following formula can be obtained: To have flexibility in the solution of k a,Dry (λ), a shift factor and scaling factor were introduced to PBLH and RH, respectively, in Equation (6) to account for the influence of the uncertainty arising from the measurement error, or other confounders (e.g., the intercept it in Equation (3)). Thus, the following formula is obtained: where s RH and i RH , respectively, represent a scaling factor and a shift factor for RH, and s H A and i H A , respectively, represent a scaling factor and a shift factor (similar to it in Equation (6)) for PBLH.

Solution by Automatic Differentiation
To solve five parameters θ(g, s RH , i RH , s H A , i H A ) in the flexible model of Equation (7), it is difficult to use a traditional method such as the analytical or least squares solution. Instead, machine learning of gradient descent may be used to find a locally optimal solution. Automatic differentiation [46], also called algorithmic differentiation, provides an efficient method with which to implement gradient descent to find an optimal solution for the complex problem of multiple parameters to be solved. For differentiation used in gradient descent, a traditional hand-crafted method needs non-trivial formulas and laborious work that is impractical for complex problems; numerical differentiation may have the issue of plague of truncation and round-off errors; symbolic differentiation predefines set of operations such as chain-rule for products but it may have a problem of expression swell that refers to a much larger representation of the derivative as opposed to that of the original function [47]. Different from these previous methods, AD is not limited by the nature of function and can go across complicated control flows, and makes gradient descent easy to apply in many practical optimization problems. It consists of a series of operations in a computational graph that is based on dynamic programming. A computational graph is a directed graph of basic operations (e.g., add, multiply, divide, trigonometric function) and the final value can be obtained by propagation through its corresponding path in the graph. Specifically, AD first generates a computational graph of domain variables and then continues to replace the domain of the variables to incorporate derivatives per the chain rule of differential calculus. In essence, AD applies symbolic differentiation at the elementary level and keeps intermediate numerical results with the evaluation of the main function. By the chain rule, the errors can be backpropagated through gradients to update the parameters until convergence is attained or an optimal solution is obtained.
The efficiency of automatic differentiation and convenience of its use have led to considerable developments of deep learning and are essential for searching for the optimal solution for general problems [33]. In this study, Tensorflow (https://www.tensorflow.org/) (Version 1.8) was used as the AD tool to solve the parameters in Equation (7).
For a typical optimization problem using gradient descent, a loss function is first needed to quantify how well the trained model fits the training input X. Then, AD is used in gradient descent to solve the minimum of the following loss function: where X denotes the input data including AOD (τ a ), PBLH (H A ), and relative humidity (RH), Y represents the target variable of ground aerosol extinction coefficient to be converted to PM 2.5 , and θ refers to the parameters to be optimized and estimated in learning.
The final target of Equation (8) is to improve the correlation between ground aerosol extinction coefficient and PM 2.5 . Therefore, the following loss function may be used to make the correlation differentiable: where c denotes the Pearson's correlation between k a,Dry (λ) in Equation (7) and ground monitoring PM 2.5 .
In order to achieve efficiency in learning, a suitable loss function should be selected to avoid the ill-conditioned issue and have an efficient convergence in optimization. However, the correlation coefficient cannot be used in a loss function due to poor convergence and the ill-conditioned issue; the sensitivity test also demonstrated this. For regression, as the most common loss function, the MSE has good properties of efficient convergence and no outlier predictions with huge errors [33] and can be used instead. Since the ground aerosol extinction coefficient, k a,Dry (λ) is finally used to estimate PM 2.5 in the downstream procedure, we can define a conversion function from it to PM 2.5 using the following non-linear polynomial formula: a,Dry (λ) + s k a,Dry k a,Dry (λ) + i k a,Dry whereŷ denotes PM 2.5 concentration to be estimated, k 2 a,Dry (λ) is the square term of k a,Dry (λ), s k 2 a,Dry is its scaling (slope) factor, and s k a,Dry and i k a,Dry represent the scaling (slope) and shift (intercept) factors, respectively, for k a,Dry (λ). With Equation (10) defined as the estimated PM 2.5 , we can get the MSE loss function: where y = y i is the vector of the observed PM 2.5 , andŷ = ŷ i is the vector of the estimated PM 2.5 by k a,Dry (λ) and the other covariates (PBLH and relative humidity) from Equation (10). If the minimum loss (MSE) is attained, the correlation of ground aerosol extinction coefficient with PM 2.5 is reasonably assumed to be improved correspondingly.
In addition, gradient descent generally finds a local optimal solution. Owing to the introduction of scaling and shift factors, such a solution cannot make the estimated k a,Dry (λ) directly equal to the ground-level aerosol extinction coefficient, but the solution can be regarded as a proxy variable to the Remote Sens. 2020, 12, 492 8 of 20 ground true low-level aerosol extinction coefficient. The proxy variable can be used as an alternative to improve the estimation of PM 2.5 . For simplicity, the proxy variable is named the ground aerosol coefficient (GAC) throughout this paper.
The computational graph of AD represents the model to be trained with the nodes used, including input variables, target variable, the parameters to be solved, and interconnections (forward and backpropagation) between them (Figure 3). In the computational graph, a composite computation may consist of multiple basic and/or composite computations. For gradient descent, the target loss function (Equation (11)) was a starting point for decomposition and complex computations were progressively decomposed until simple and basic operations were obtained. Figure 3 shows the composite and basic nodes in the decomposition, and the connections between them with a line arrow indicating the forward or reverse mode of a gradient operation (black line: forward mode; red dot line: reverse mode). Since there were multiple parameters to be solved, reverse mode or backpropagation was the best method for efficiency in computing [46]. Table 1 shows each node's name, definition, related formula, derivatives in backpropagation, and derivatives for the parameters to be solved, if any. For backpropagation, the starting point was the last row of the loss target function (Equation (11); Table 1), Column 4 shows the derivatives of the previous node against the current node, and Column 5 shows the derivatives against a target parameter, if any. In the computational graph, the target loss function (Equation (11)), simulated PM 2.5 (Equation (10)), and the composite formula for GAC (Equation (7)) were involved (   As a local optimization method, gradient descent needs to select the optimizer for training. In this study, the adaptive Ada optimizer was used with a mini-batch size of 1024 to obtain an optimal solution. As an advanced optimizer, Ada [48] is an optimization algorithm with adaptive learning rate that is dynamically adjusted based on the squared gradient and momentum.  ∂U 10 /∂U 9 · ∂U 9 /∂U 8 + ∂U 12 /∂U 11 · ∂U 11 /∂U 8 = −2U 14 · (s k α,Dry + 2U 8 · s k 2 α,Dry ) U 9 (∂U 9 /∂s k a,Dry ) = −2U 14 · U 8 U 9 s k a,Dry U 8 s k a,Dry k a,Dry U 10 (∂U 10 /∂U 9 ) = −2U 14 U 10 (∂U 10 /∂i k a,Dry ) = −2U 14 U 10 U 9 + i k a,Dry s k a,Dry k a,Dry + i k a,Dry U 13 (∂U 13 /∂U 10 ) = −2U 14

Validation and Comparison
A total of 70% of the 53,040 data samples of available AOD and PM 2.5 measurements for the study area were randomly selected to train the model, and the remaining 30% of the data samples were used to test the trained models. This procedure was repeated five times with different training and testing samples selected, and the final result was obtained by averaging over the results of the five procedures.
The proxy variables for ground aerosol extinction coefficient generated by three methods were generated and compared. These variables include original AOD and GACs simulated just using PBLH, using the empirical formula proposed in [15], and using the AD method proposed in this paper. The test Pearson's correlation of AOD or simulated GAC with measured PM 2.5 , R 2 , and root mean square error (RMSE) in the linear regressions of PM 2.5 estimation using these simulated values was reported and compared. The grid surfaces of AOD and simulated GAC were also presented and compared.  Table 2 shows the annual and seasonal statistics of MAIAC AOD, PBLH, RH, and PM 2.5 for the 2015 dataset.

Learning and Validation
For the empirical conversion formula (Equation (6) with it = 0), different values of the exponential parameter g were used from a small value (−0.3) to a large value (0.8) to test Pearson's correlation of satellite AOD with PM 2.5 . The result shows that the optimal correlation (0.54) was attained when g = 0.208 (Figure 4).   In machine learning, an epoch is one complete presentation of the data samples to be learned to a model [49]. Usually, a machine learner needs multiple epochs to approach the optimal solution. To obtain the conversion parameters (five in total), 20,000 epochs of training were conducted for this study. The learning curves ( Figure 5) show that the loss, RMSE and Pearson's correlation with PM 2.5 in training and testing, and the exponential coefficient of RH (g in Equation (6)) reached a plateau with increasing number of training epochs. The final optimal solution is g = 0.51, s RH = 164, i RH = 65, s H A = 0.40, and i H A = 33 with a significant Pearson's correlation of 0.58 with PM 2.5 for the test dataset. Supplementary Materials Figure S1 shows the trend of the other parameters (scaling and shift factors for RH and PBLH) during learning. As an effective validation for our method, the independent test results show little or very small difference (Pearson's correlation with PM 2.5 : 0.59 vs. 0.58) from the results of the training samples, indicating little over-fitting for our method. Figure S1 shows the trend of the other parameters (scaling and shift factors for RH and PBLH) during learning. As an effective validation for our method, the independent test results show little or very small difference (Pearson's correlation with PM2.5: 0.59 vs. 0.58) from the results of the training samples, indicating little over-fitting for our method.

Comparison of the Methods
The test results (Table 3) show that original column (satellite) AOD without any adjustment had the lowest correlation (0.38) with PM 2.5 and R 2 (0.15) and the highest RMSE (65 µg/m 3 ) for linear regression; column AOD adjusted by PBLH improved the correlation from 0.38 to 0.53, an increase of 15% (R 2 increased by 0.13 and RMSE decreased by 5 µg/m 3 ). Simulated GAC by the empirical method ( Figure 4) further improved the correlation to 0.54 (R 2 increased by 0.01 and RMSE decreased by 1 µg/m 3 ). The optimal simulated GAC by automatic differentiation achieved the highest correlation (0.58) with PM 2.5 (the highest R 2 being 0.33, and the lowest RMSE being 56 µg/m 3 ). The histograms (Figure 6) show the value distributions of original AOD and GACs simulated (see Supplementary Materials Figure S2 for the scatter plots between observed vs. predicted PM 2.5 using original AOD and simulated GAC). The results show a better match of optimal GAC by AD with PM 2.5 than that by the other methods. The histograms (Figure 6) show the value distributions of original AOD and GACs simulated (see Supplementary Materials Figure S2 for the scatter plots between observed vs. predicted PM2.5 using original AOD and simulated GAC). The results show a better match of optimal GAC by AD with PM2.5 than that by the other methods. The time series (Figure 7) of total means of PM2.5, AOD, and GAC show a seasonal variation of simulated GAC using AD, quite similar to that of PM2.5 (low in summer and high in winter) for the study area, compared with original satellite AOD (high in summer and low in winter). The time series (Figure 7) of total means of PM 2.5 , AOD, and GAC show a seasonal variation of simulated GAC using AD, quite similar to that of PM 2.5 (low in summer and high in winter) for the study area, compared with original satellite AOD (high in summer and low in winter).

Spatial Distributions of Simulated Ground Aerosol Coefficient and Predicted PM 2.5
With the optimal simulated GAC using AD, the grid surfaces of the simulated ground aerosol coefficient and predicted PM 2.5 were made. A residual deep network was used as the regression model [50]. Compared with the models trained just using the original MAIAC AOD with the other covariates, the models trained using the optimal GAC improved the test R 2 from 0.78 to 0.90 (improvement of 12%). Figure 8 presents the grid surfaces of a typical Summer day (a, b, and c for 07/29/2015) and a typical Winter day (d, e, and f for 12/31/2015): a and c for original satellite AOD, b and d for optimal GAC simulated, and c and e for predicted PM 2.5 using the optimal GAC.   As shown in the spatial distributions ( Figure 8) of simulated AOD/GAC and estimated PM 2.5 in summer and winter, the optimal GAC (Figure 8b,e) converted by AD better matched seasonal distribution of PM 2.5 (higher concentration in winter than in summer) ( Figure 7); otherwise, the original AOD (Figure 8a,d) had higher values in summer than in winter, opposite of the seasonal variation of PM 2.5 . In addition, spatial distribution of AOD/GAC and PM 2.5 consistently showed higher values in the central and southern regions than the other regions within the study area. As aforementioned, the study area's primary emission sources include industrial pollutions, coal, and vehicle exhausts, and these sources have been aggregated in the populated central and southern regions, thus may be resulting in heavier concentration of PM 2.5 . In winter for northern China, with the aggravating factors of low elevation, low wind speed, high atmospheric stability, and less rainfall, much more coal combustion due to heating in these populated regions has resulted in much more pollution than in the other seasons and regions. In summer, the difference in PM 2.5 concentration between the regions was small due to less coal emission and more favorable weather conditions for pollutant dispersion than in winter. For Beijing, China's capital, more heavy industry factories had been relocated and so, on average, it had less pollution than the other cities and regions, shown as Figure 8c,f.

Discussion
Although estimation of PM 2.5 concentration at high spatial (e.g., 1 km) and temporal (e.g., daily) resolution is very helpful for evaluation of its emissions and health effects [51,52], it is very challenging given the limited amount of monitoring data sources and the influence of multiple factors (diverse emission source, meteorology, atmospheric process, and elevation) and their interactions on spatiotemporal variability of PM 2.5 [53]. Satellite AOD, such as recent high spatiotemporal-resolution MAIAC AOD, has been used as a predictor of primary interest for spatiotemporal estimation of PM 2.5 due to its global spatial and temporal coverage. It can compensate unavailability or limited availability of the other predictors, e.g., emission sources and inventory at a fine resolution [38]. However, for satellite AOD, the potential difficulty is the significant inconsistency or a difference between its seasonal patterns and ground aerosol mass including PM 10 and PM 2.5 [15,17] given that AOD measures a column integrated aerosol extinction coefficient, different from ground-level aerosol mass. Previously, a simplified conversion [15] from satellite AOD to ground aerosol coefficient was given to improve the representative and prediction power of column AOD for PM 2.5 . However, this approach of a single-parameter (just an exponent parameter of RH to be solved) formula, similar to a fixed-effect model, did not consider the influence of uncertainty deriving from the measurement errors and/or other confounders and thus might not capture the influence from such uncertainty on the results. The other empirical conversion methods considered more influential factors such as particle size [23] and visibility [25] although such covariates are not available for the case study. Many of these existing methods used a conversion formula of a fixed format and used empirical values to fill in several parameters to finish the conversion. The empirical values of these parameters may not fit in with the specific context and data samples, thus are potentially suboptimal.
Compared with the existing methods [15,[20][21][22][23][24][25][26]30,31], although just having a modest improvement (4% for the empirical method) for the test correlation between the simulated GAC and PM 2.5 , this proposed method provides a more flexible framework to fuse the influence of multiple scaling, shift, and other potential factors on the conversion that involves complex atmospheric and chemical processes of aerosols; this method is convenient to train and use in support of the AD tool, and makes the conversion easily adjusted or improved using the data of additional influential factors if available. The polynomial conversion from proxy GAC to PM 2.5 was conducted to employ the stable RSE loss function to obtain a slightly better correlation than linear conversion (0.58 vs. 0.56). These factors worked similarly as random effect factors that aimed to account for more variance and the influence of uncertainty from the other potential confounders. In addition, the proposed method can conveniently incorporate more polynomial terms of the parameters, and/or the other available covariates to obtain an optimal solution of multiple parameters if these terms or covariates can significantly affect the conversion from satellite AOD to GAC. A potential extension of the proposed method is to infer the optimal parameters for the probabilistic distribution of the vertical profile of aerosol extinction coefficient if vertical data of multiple layers are available to simulate such a distribution. Automatic differentiation provides a convenient and reliable tool with which to be able to employ gradient descent to find a local optimal combinational solution for multiple parameters. To use automatic differentiation, the scaling and shift factors were also introduced for the proxy variable simulated for the ground aerosol extinction coefficient to use the MSE loss function, which can work better than the loss of Pearson's correlation coefficient in training [33]. The results show that the proposed conversion method achieved the best performance-the highest test correlation (0.58) with PM 2.5 and highest test R 2 (0.33) and lowest test RMSE (56 µg/m 3 ) for regression of PM 2.5 -compared with the other methods. The ground aerosol coefficient simulated using AD had a distribution and seasonal variation quite similar to those of PM 2.5 . Compared with original column (MAIAC) AOD, the simulated GAC improved Pearson's correlation with a ground true PM 2.5 from 0.38 to 0.58.
Use of gradient descent in AD made the solution locally optimal, not globally optimal. Thus, the output simulated in Equation (7) could not be completely equal to the ground aerosol extinction coefficient, but rather it would be called a proxy variable for the ground aerosol extinction coefficient. Direct estimation of the ground aerosol extinction coefficient is difficult since just very limited vertical profile data of the aerosol extinction coefficient have been available for training a conversion function [15,17].
For estimation of PM 2.5 , given the simulated GAC available, the other factors, including meteorology (wind speed, relative humidity, air temperature, precipitation, etc.), elevation, and traffic density, should be considered in the models since these factors affect spatiotemporal variability of PM 2.5 [53,54]. With fusion of these data within the model of a residual deep network [50], the test R 2 of PM 2.5 estimations reached 0.90. Compared with original column (MAIAC) AOD, the ground aerosol coefficient simulated contributed to a 12% improvement in R 2 . The simulated GAC was quite similar to PM 2.5 in terms of probabilistic distribution ( Figure 6) and spatial distribution (Figure 8) and had a greater contribution than original AOD to spatiotemporal estimation of PM 2.5 . The results show a better match between optimal simulated GAC and PM 2.5 than original AOD. Although the correlation between PM 2.5 and simulated GAC was moderately high, such a correlation was just linear. Given multiple different emission sources and complex atmospheric and chemical processes in generation of secondary PM 2.5 , the practical relationship is complex and non-linear. As aforementioned, the spatiotemporal estimation model of PM 2.5 should consider the influence of multiple factors and their non-linear interactions with GAC. The test showed that without GAC as a predictor, the trained model just had a test R 2 of 0.60 in estimation of PM 2.5 , compared to the test R 2 of 0.90 achieved by the models using GAC. Thus, simulated GAC made a contribution of about 30% in R 2 over the model not using AOD.
The proposed method has three limitations. First, the solution of the presented method is not globally but locally optimal. The flexible model proposed may make a difference in predicted parameters (e.g., scaling and shift factors) between different training cycles. However, results of sensitivity tests show a small standard deviation for the final simulated GAC, illustrating the reliability of the proposed method. Second, the proposed method did not incorporate specific multiple layers in a vertical profile such as those provided by the active LIDAR satellite due to data unavailability. Such vertical information may be helpful for the conversion. However, in practical applications, such multilayer information at a sufficient spatial resolution and spatiotemporal coverage is often unavailable, and many existing studies for PM 2.5 estimation did not incorporate it in using satellite AOD for PM 2.5 estimation [15,55]. This paper used an exponential distribution to simplify and simulate such a vertical profile for practical applications. Based on a flexible model and automatic differentiation, although the simulation of the vertical profile is not perfect, the proposed method achieved the state-of-the-art performance in the conversion and downstream application of PM 2.5 estimation. Third, all the discussed conversion methods work best in the areas dominated by regional aerosols residing in the boundary layer, but work less well in the areas dominated by long-range aerosol transport at higher altitudes [56,57].
With the availability of deep-learning software, it is convenient to use AD extensively to solve the complex formula (e.g., involving exponential, scaling, and shift parameters) for inversion of conversion parameters for satellite AOD and other environmental surface variables of remote sensing.

Conclusions
For inconsistency between satellite column AOD (e.g., MAIAC AOD) and ground aerosol extinction coefficient/PM 2.5 , in this paper a novel method of automatic differentiation for convenient conversion from satellite AOD to ground aerosol extinction coefficient is proposed. The proposed method introduced scaling and shift factors for PBLH and RH to account for the uncertainty deriving from the measurement and other confounders. Based on the computational graph, AD efficiently automates the differentiation, which has driven extensive applications of gradient descent in machine and deep learning. As the base of the proposed method, AD provides a convenient tool with which to obtain an optimal combinational solution for multiple parameters of the proposed method. In order to use the stable MSE loss function, a polynomial conversion from GAC to PM 2.5 was suggested to achieve a better correlation of GAC with PM 2.5 than a linear conversion. In a case study of the Jing-Jin-Ji metropolitan area, the proposed method achieved the highest test correlation (0.58) of the ground aerosol coefficient simulated with PM 2.5 and the highest test R 2 (0.33) in linear regression, compared with original AOD and GAC obtained using the single-parameter method. In the extensive downstream validation of PM 2.5 estimation, our method improved the test R 2 from 0.78 to 0.90, and achieved a better match of GAC with PM 2.5 in their spatial distributions and seasonal variation. Similar methods can be also applied to solve the inversion parameters for other surface grid variables in remote sensing.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/12/3/492/s1, Figure S1: Learning curves of scaling (slope) and shift (intercept) factors for relative humidity and PBLH, Figure  S2: Scatter plots of the observed vs. predicted PM 2.5 in univariate linear regression, illustrating improvement of the optimal ACG by automatic differentiation.