retrieval: embedding machine learning to simulate complex physical parameters

Satellite remote sensing of PM2.5 mass concentration has become one of the most popular topics in atmospheric research, resulting in the development of different models. Among them, the semi-empirical physical approach constructs the transformation relationship between the aerosol optical depth (AOD) and PM2.5 based on the optical properties of particles, which has strong physical significance. It also performs the PM2.5 retrieval independently of ground stations. However, due to the complex physical relationship, the physical parameters in the semi-empirical approach are difficult to calculate accurately, resulting in relatively limited accuracy. To optimize the approach, this study proposes a method of embedding machine learning into a semi-physical empirical model (RF-PMRS). Specifically, based on the theory of the physical PM2.5 remote sensing approach (PMRS), the complex parameter (VEf, a columnar volume-to-extinction ratio of fine particles) is simulated by the random forest model (RF). Also, a fine mode fraction product with higher quality is applied to make up for the insufficient coverage of satellite products. Experiments in North China show that the surface PM2.5 concentration derived by RF-PMRS has an average annual value of 57.92 μg/m³ versus the ground value of 60.23 μg/m³. Compared with the original

https://doi.org/10.5194/egusphere-2022-946 Preprint. Discussion started: 27 October 2022 © Author(s) 2022. CC BY 4.0 License.


Introduction
Epidemiological studies have indicated that PM2.5 (fine particulate matter with an aerodynamic equivalent diameter no greater than 2.5 μm) can adversely affect human health, for example by increasing the risk of diabetes and respiratory diseases (Bowe et al., 2018; Pope III et al., 2002; Xu et al., 2013), and accurate surface PM2.5 concentration is the basis of air pollution-health related research. Satellite remote sensing has the advantages of high resolution and global coverage (Ma et al., 2014; Wu et al., 2020), and it provides variables strongly associated with PM2.5, such as aerosol optical depth (AOD). Therefore, it has become a mainstream method for fine-particle estimation (Zhang et al., 2021).
There are mainly three satellite-based ways of retrieving PM2.5.
1) Chemical transport model-based method. It calculates a scaling factor η between AOD and PM2.5 simulated by atmospheric chemical transport models (CTM) (Lyu et al., 2022) and then transfers the proportional relationship to satellite AOD data when calculating surface PM2.5 concentration (Geng et al., 2015; Van Donkelaar et al., 2006). However, the assumption of a constant factor between simulated and observed values has large spatiotemporal limitations.
2) Univariate/multivariate regression. This kind of method establishes a statistical model between AOD, auxiliary variables, and ground PM2.5 observations. Machine learning is a common tool for such data-driven methods due to its powerful nonlinear fitting ability between multiple variables (Irrgang et al., 2021). However, the regression is affected by the distribution and density of ground stations (Gupta and Christopher, 2009; Li et al., 2017).
3) Semi-empirical physical approach. Taking the physical theory as the basis, surface PM2.5 is derived through an empirical formula constructed from AOD and some PM-related key parameters, including an important empirical parameter related to the optical properties (S). The process steps are explicit and independent of ground station observations. Meanwhile, this approach has stronger physical interpretability than the previous two methods and leaves considerable room for optimization.
Due to the complexity of the physical parameters, many studies have optimized the semi-empirical physical approach. Raut and Chazette (2009) introduced a specific extinction cross-section to simplify the expression of S, and PM2.5 concentration was estimated based on 355 nm-band radar observations. Kokhanovsky et al. (2009) constructed a particle effective radius model, which can obtain the particle concentrations throughout the atmospheric column. Furthermore, Zhang and Li (2015) proposed the physical PM2.5 remote sensing method (PMRS). It replaced S by defining a volume-to-extinction ratio of fine particles (VEf) and used a quadratic polynomial of fine mode fraction (FMF) to simulate VEf, showing certain advantages (Li et al., 2016; Zhang et al., 2020). However, the above semi-physical empirical models have some shortcomings. Firstly, the satellite data used in the models are blocked by clouds and fog in some areas, so high-coverage and high-precision products need to be identified and applied. Secondly, there are still large uncertainties in estimating physical parameters (such as a simple polynomial fit to S in the PMRS method), and their expressions need to be improved.
In recent years, machine learning (ML) has developed rapidly. It can detect complex nonlinear relationships among multiple data sources and model their interaction (Yuan et al., 2020; Lee et al., 2022), which provides an idea for improving the accuracy of physical parameter acquisition and thereby estimating high-precision PM2.5 through semi-physical empirical models.
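According to this idea, our study proposes an optimized semi-empirical physical model (RF-PMRS) based on the PMRS theory, which explores the possibility of combining physical models and ML. Specifically, we embed ML (the random forest model) into the PMRS method to simulate the physical parameter (i.e., VEf) derived from FMF and related variables, thus optimizing the previous polynomial expression. Besides, to further improve the PM2.5 retrieval accuracy, the physical-deep learning FMF (Phy-DL FMF) dataset generated by a hybrid retrieval algorithm of ML and physical mechanisms is introduced. Ultimately, we comprehensively validate the performance of the PM2.5 obtained by our optimized approach.
The remainder of this article is organized as follows. Section 2 illustrates the specific derivation process of the proposed method. Section 3 describes the experimental datasets and analyzes the evaluation results. Section 4 discusses some supporting experiments. The final part provides the conclusion.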

Methods
Based on the basic physical properties of atmospheric aerosols, the semi-physical empirical approach starts from the integration of PM mass concentration and AOD. Then it combines several key factors related to PM2.5 to derive the in situ PM2.5 concentration through multiple remote sensing variables (Koelemeijer et al., 2006). The overall empirical relationship can be represented as equation (1), where ρ denotes the particle density and H denotes the atmospheric boundary layer height. f(RH) represents the hygroscopic growth factor related to relative humidity (RH). S is an optical characteristic parameter that should be simulated.

The expression of VEf
To illustrate S more precisely, PMRS defines the columnar volume-to-extinction ratio of fine particles (i.e., VEf), which can be regarded as the basis of our optimization method. Equation (1) is thus transformed into equation (2). Here, AODf is the fine particle AOD and FMF is the fine mode fraction. Vf,column can be expressed by the vertical integral of particle volume size distributions (PVSD) within a certain aerodynamic diameter range. The cutting diameter of this range is set to the empirical value of 2.0 μm based on previous literature (Hand and Kreidenweis, 2002; Hänel and Thudium, 1977), and V(Dp) represents the PVSD corresponding to the geometric equivalent diameter (Dp).
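As an illustration of these definitions, the following minimal sketch integrates a size distribution up to the 2.0 μm cutting diameter and forms a volume-to-extinction ratio. It assumes that the PVSD is given as dV/dlnD on discrete diameter bins, that AODf = AOD × FMF, and that VEf is the ratio of Vf,column to AODf, as the name of the parameter suggests. The function names and the trapezoidal rule are illustrative choices, not the authors' implementation.

```python
import math

def fine_column_volume(diameters_um, dv_dlnd, cut_diameter_um=2.0):
    """Trapezoidal integral of dV/dlnD over ln(D), using bins with D <= cut diameter.

    Assumption: diameters are sorted in ascending order and given in micrometers.
    """
    pairs = [(d, v) for d, v in zip(diameters_um, dv_dlnd) if d <= cut_diameter_um]
    total = 0.0
    for (d0, v0), (d1, v1) in zip(pairs, pairs[1:]):
        total += 0.5 * (v0 + v1) * (math.log(d1) - math.log(d0))
    return total

def volume_to_extinction_ratio(v_f_column, aod, fmf):
    """VEf as columnar fine volume divided by fine-mode AOD (assumed AOD * FMF)."""
    aod_f = aod * fmf
    return v_f_column / aod_f

# Hypothetical four-bin distribution; the 10 um bin lies above the cutting diameter.
v_col = fine_column_volume([0.1, 1.0, 2.0, 10.0], [1.0, 1.0, 1.0, 1.0])
ve_f = volume_to_extinction_ratio(0.12, aod=0.5, fmf=0.6)
```

Bins above the cutting diameter are simply dropped here; a finer treatment would interpolate the distribution at exactly 2.0 μm.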

Specific process and limitations
The PMRS method is developed from equation (2). Based on satellite AOD, the near-surface PM2.5 can be obtained through a multi-step transformation. Fig. 1(a) shows its specific process, in which each arrow refers to a step. The steps include size cutting, bottom isolation (output: Vf, the fine particle volume near the ground), particle drying (output: Vf,dry, the dry Vf), and PM2.5 weighting.
The overall expression is given as equation (6), where FMF denotes the fine mode fraction and ρf,dry denotes the dry mass density of fine particles.
PMRS has strong physical significance, and its calculation steps are well-defined and site-independent. Zhang and Li (2015) tested the performance of PMRS on 15 stations, and the validation results had an uncertainty of 34%. Compared with the ground value of Jinhua city in China, a 31.3% relative error was generated in Li et al. (2016). Besides, Zhang et al. (2020) applied it to the PM2.5 change analysis and prediction experiments in China over 20 years. However, the relationship between VEf and FMF may be a more complex nonlinear one than a simple quadratic formula. Since VEf is related to the aerosol type, adding other spatiotemporal variables may optimize the fitting process.
Additionally, high-quality FMF data is the basic guarantee for the estimated PM2.5 quality. In short, to further improve the physical method, a better nonlinear model between VEf and related variables from reliable datasets needs to be explored.
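The chain of steps described in this section can be sketched as a sequence of simple operations. The multiplicative form below is an assumption inferred from the step names (size cutting, conversion to columnar volume through VEf, bottom isolation by the boundary layer height, drying by the hygroscopic growth factor, and weighting by the dry density); it is not copied from equation (6). All inputs are assumed to be in mutually consistent units, and the quadratic coefficients a, b, c are placeholders because their values are not given in this text.

```python
def quadratic_ve_f(fmf, a, b, c):
    """Quadratic polynomial of FMF, the form PMRS uses for VEf (coefficients are placeholders)."""
    return a * fmf ** 2 + b * fmf + c

def pmrs_pm25(aod, fmf, ve_f, rho_f_dry, pblh, f_rh):
    """Assumed PMRS-style chain from columnar AOD to near-surface dry fine mass."""
    aod_f = aod * fmf               # size cutting: fine-mode AOD
    v_f_column = aod_f * ve_f       # columnar fine particle volume
    v_f = v_f_column / pblh         # bottom isolation: assumes mixing within the boundary layer
    v_f_dry = v_f / f_rh            # particle drying: remove hygroscopic growth
    return v_f_dry * rho_f_dry      # PM2.5 weighting by dry mass density

# Hypothetical inputs in consistent units, for illustration only.
pm25 = pmrs_pm25(aod=0.5, fmf=0.6, ve_f=0.3, rho_f_dry=1.5, pblh=1000.0, f_rh=1.2)
```

In RF-PMRS only the source of `ve_f` changes: it would come from a trained regression model instead of `quadratic_ve_f`.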

Optimization method: RF-PMRS
Therefore, to overcome the above disadvantages, an optimized method called RF-PMRS is proposed.

1) FMF dataset selection
We introduce the Phy-DL FMF dataset into the PMRS method to improve the accuracy of size-cutting results. In the comparison experiment against Aerosol Robotic Network (AERONET) FMF, Phy-DL FMF shows a higher accuracy (R = 0.78, RMSE = 0.100) than Moderate-resolution Imaging Spectroradiometer (MODIS) FMF (R = 0.37, RMSE = 0.282) (Yan et al., 2022). It also shows better spatiotemporal continuity.

2) VEf simulation by the RF model
The main idea is to establish an ML model between the VEf truth obtained from multiple AERONET sites and related variables, thus improving the subsequent VEf simulation accuracy (Fig. 2).
Step 1 VEf calculation
The VEf true values are calculated according to equations (3)-(5). A total of 9 AERONET sites corresponding to four typical aerosol types participate in the training. Table 1 shows the specific information.
We optimize the VEf expression based on random forest (RF). RF is made up of multiple decision trees that can build high-accuracy models based on fewer variables (Yang et al., 2020). This ensemble supervised learning method randomly samples the original dataset into multiple sets and considers random subsets of features in node splitting, which reduces correlation and the sensitivity to noise (Belgiu and Drăguţ, 2016). Note that the station FMF values (S-FMF) are used when training.
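A minimal sketch of such a regression, using scikit-learn's RandomForestRegressor, is given below. The feature set (FMF, LAT, LON, month, day) follows the variables named in this paper, but the data are synthetic, the target function is invented for illustration, and the hyperparameters are arbitrary rather than the tuned values from Appendix A3.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for station records (not AERONET data).
fmf = rng.uniform(0.1, 1.0, n)
lat = rng.uniform(-40.0, 50.0, n)
lon = rng.uniform(-120.0, 130.0, n)
month = rng.integers(1, 13, n)
day = rng.integers(1, 29, n)
X = np.column_stack([fmf, lat, lon, month, day])

# Invented target: a smooth function of FMF plus noise, standing in for the VEf truth.
y = 0.3 * fmf ** 2 - 0.5 * fmf + 0.4 + rng.normal(0.0, 0.01, n)

# Bootstrap sampling and random feature subsets at each split are the two sources
# of randomness described in the text; the settings here are arbitrary.
model = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X, y)

ve_f_pred = model.predict(X[:5])
```

To extend the result from point to surface, the same fitted model would be applied to gridded features, with a gridded FMF product replacing the station FMF.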

Step 4 Accuracy validation
The VEf estimation is also based on equation (9), where f is the optimal relationship after RF parameter adjustment, and Phy-DL FMF is applied to extend the model results from point to surface. 10-fold cross-validation (Rodriguez et al., 2009) and isolated-validation are used to evaluate model performance (see Appendix A1).

3) PM2.5 value estimation and evaluation
Then, PM2.5 is calculated according to the corresponding process (equation (6)). The statistical indicators used in the evaluation include correlation coefficient (R), mean bias (MB), relative mean bias (RMB), root mean square error (RMSE), and mean absolute error (MAE). In addition, relative predictive error (RPE) is added to validate the accuracy of the RF-based VEf model. See Appendix A2 for the specific information on these indicators.

(MAIAC) algorithm, which can improve the accuracy in cloud detection and aerosol retrieval (Lyapustin et al., 2011). Besides, this advanced algorithm combines MODIS Terra and Aqua into a single sensor (Lyapustin et al., 2014). The product is produced daily with a 1 km resolution, including aerosol parameters such as 470 nm/550 nm AOD, quality assurance (QA), and uncertainty factors.

The processing of MCD19A2 data (HDF format) is mainly divided into five steps: AOD/QA band extraction, best quality AOD selection, Terra/Aqua data synthesis, missing information reconstruction, and mosaic.Finally, the daily AOD distribution in GeoTiff format is obtained.
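Two of these steps, best quality AOD selection and Terra/Aqua data synthesis, can be sketched with NumPy as follows. The sketch assumes that the AOD and QA bands have already been extracted and that a boolean best-quality mask has been decoded from the QA band for each overpass. Combining overpasses by a per-pixel mean is an illustrative choice, since the synthesis rule is not specified in this text.

```python
import numpy as np

def synthesize_daily_aod(aod_layers, best_quality_masks):
    """Average best-quality AOD over overpasses; pixels with no valid value become NaN."""
    stack = np.stack([
        np.where(mask, aod, np.nan)                 # keep only best-quality retrievals
        for aod, mask in zip(aod_layers, best_quality_masks)
    ])
    valid = ~np.isnan(stack)
    count = valid.sum(axis=0)                       # number of valid overpasses per pixel
    total = np.where(valid, stack, 0.0).sum(axis=0)
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)

# Hypothetical 1 x 2 tiles for two overpasses.
terra = np.array([[0.2, 0.4]])
aqua = np.array([[0.4, 0.8]])
terra_ok = np.array([[True, False]])
aqua_ok = np.array([[True, False]])
daily = synthesize_daily_aod([terra, aqua], [terra_ok, aqua_ok])
```

Pixels left as NaN are the ones that the subsequent missing information reconstruction step would have to fill.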

Phy-DL FMF dataset
To enhance the reliability of the global land FMF product, Yan et al. (2022) released a daily satellite-based dataset called Phy-DL FMF, which integrates physical and deep learning methods. The product has a spatial resolution of 1° and covers 2001 to 2020. In terms of performance, it exhibits higher accuracy and wider space-time coverage than satellite products (Yan, 2021).

Meteorological data
The values of PBLH and RH are obtained from the ERA5 dataset. As the fifth-generation reanalysis product released by the European Centre for Medium-Range Weather Forecasts (ECMWF), ERA5 provides hourly atmospheric data at 0.25° based on the data assimilation principle (Hersbach et al., 2018). It should be noted that RH is not archived directly in ERA5 and thus should be calculated from the 2 m temperature T and dew point temperature Td (referred to ERA-Interim: documentation). Here, e_s(t) represents the saturation vapor pressure related to a Celsius temperature t (Simmons et al., 1999):

$$ e_s(t) = 6.112 \exp\left(\frac{17.67\,t}{t + 243.5}\right) $$
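The RH calculation can be sketched as follows, using the saturation vapor pressure formula with the constants 6.112, 17.67, and 243.5 that appear in this section. The sketch assumes that RH is taken as the ratio of the saturation vapor pressure at the dew point to that at the air temperature, which is the usual definition, and that both ERA5 temperatures are supplied in kelvin. The function names are illustrative.

```python
import math

def saturation_vapor_pressure_hpa(t_celsius):
    """Saturation vapor pressure (hPa) for a temperature in degrees Celsius."""
    return 6.112 * math.exp(17.67 * t_celsius / (t_celsius + 243.5))

def relative_humidity_percent(t2m_kelvin, d2m_kelvin):
    """RH (%) from 2 m temperature and 2 m dew point temperature, both in kelvin."""
    t = t2m_kelvin - 273.15
    td = d2m_kelvin - 273.15
    return 100.0 * saturation_vapor_pressure_hpa(td) / saturation_vapor_pressure_hpa(t)

rh_saturated = relative_humidity_percent(293.15, 293.15)  # dew point equals temperature
rh_dry = relative_humidity_percent(300.0, 290.0)          # dew point 10 K below temperature
```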

AERONET data
The Aerosol Robotic Network (AERONET) is a federation of ground-based sun-sky radiometer networks, providing worldwide remote sensing aerosol data for more than 25 years (Holben et al., 1998). To date, the Version 3 dataset has been released (Giles et al., 2017). Due to its high quality, the data from AERONET have been regarded as theoretical true values to evaluate satellite-based products in related studies (Chen et al., 2020; Gao et al., 2016; Wang et al., 2019). The Level 2.0 (quality-assured) AOD, FMF, and Volume Size Distribution products are used in this study.

Ground PM2.5 measurements
The above variables are spatially matched to ground sites at their respective resolutions. Based on UTC, the experiment is conducted on a daily scale in 2017. Note that we select the measured empirical value of ρf,dry (i.e., 1.5 g/cm³) for the NC region from Gao et al. (2007).

RF model performance for training VEf
The simulation model of VEf is trained based on the data in Table 1; see Appendix A3 for the adjustment of the model parameters.
the RF model for extracting information, that is, the relationship of multi-source data to VEf. Meanwhile, the statistical results in the CV and IV experiments are similar, indicating that the RF model shows no obvious overfitting.

Accuracy evaluation of PMRS/RF-PMRS at AERONET stations
After applying the Phy-DL FMF data to the calculation process, the experiment compares PM2.5 results of PMRS and RF-PMRS at Beijing (BJ) and Beijing-CAMS (BC) AERONET sites in 2017.Here, RF-PMRS simulates VEf based on RF, replacing the polynomial of the PMRS method.Note that the results of the two sites are compared with their respective nearest ground PM2.5 stations (distances of 3.64 km and 3.91 km, respectively, in line with the representative range of ground stations in previous studies (Shi et al., 2018)).

Generalization performance of RF-PMRS
Then, we estimate PM2.5 based on PMRS and RF-PMRS within North China (Fig. 3 exhibits the distribution pattern of the validation stations). Table 3 shows the accuracy statistics. It can be seen that RF-PMRS greatly reduces the bias (by about 44.87%), with an MB of about 2.31 μg/m³. Similar to the results at the sites, the RF-PMRS method can derive PM2.5 concentration with practically no overestimation or underestimation. Although there is not much difference in the R values of the two models (R of RF-PMRS is only improved by 0.01), the RMSE and MAE decrease by about 39.96 μg/m³ and 18.86 μg/m³, respectively, indicating that the optimized method performs well.
Meanwhile, PM2.5 scatterplots are presented in Fig. 7. There are sufficient estimated samples (28305) in the NC region, which supports the credibility of our validation results. In general, the RF-PMRS PM2.5 values are distributed evenly around the true values, with a slightly higher R of 0.70 compared to that of the original method. The slope of the linear fit reaches 0.82, which indicates that the proposed method greatly reduces the overestimation of PMRS, whose linear slope is 1.46. Although the overall performance of the RF-PMRS estimations remains at a high level, defects do remain. Specifically, in areas with high PM2.5 concentration (especially greater than 150 μg/m³), RF-PMRS results show a slight underestimation. This may be caused by the relatively small number of high-value points (only 1319 out of 28305), which makes it difficult to adequately reflect the fitting effect of the method.
Since RF-PMRS reduces the deviation to a large extent, the probability density functions of the bias of PMRS and RF-PMRS are further drawn. Fig. 8 visualizes the probability densities within different bias ranges. In terms of distribution characteristics, the overall bias of RF-PMRS deviates little from zero (black solid line). The curve is high and narrow, indicating that the bias has a lower standard deviation (STD) and tends to cluster around the mean. In contrast, PMRS shows a more dispersed distribution pattern, with many outliers beyond 600 μg/m³. Meanwhile, as can be concluded from the three boxes, within the bias ranges of ±20 μg/m³ and ±40 μg/m³, the number of RF-PMRS results increases by 8.32% and 12.81%, respectively. Outside the range of ±100 μg/m³, the number decreases by 9.10%. Therefore, as far as accuracy is concerned, RF-PMRS results have lower bias and better stability.
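The range statistics quoted above are shares of samples whose bias falls inside or outside a symmetric interval. A minimal sketch of that bookkeeping is shown below, with bias taken as estimate minus observation (an assumed sign convention) and with made-up numbers instead of the study's data.

```python
def share_within(biases, limit):
    """Fraction of samples with |bias| <= limit."""
    return sum(1 for b in biases if abs(b) <= limit) / len(biases)

def share_outside(biases, limit):
    """Fraction of samples with |bias| > limit."""
    return 1.0 - share_within(biases, limit)

# Made-up biases in ug/m3, for illustration only.
example_biases = [-5.0, 15.0, 30.0, -60.0, 120.0]
within_20 = share_within(example_biases, 20.0)
within_40 = share_within(example_biases, 40.0)
outside_100 = share_outside(example_biases, 100.0)
```

Comparing these shares between two methods on the same samples gives the kind of percentage differences reported for the three boxes.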
In a word, the above analysis demonstrates that compared with the simple quadratic

Conclusion
Among various satellite remote sensing methods for PM2.5 retrieval, the semi-empirical physical approach has strong physical significance and clear calculation steps, and derives the PM2.5 mass concentration independently of in situ observations. However, the parameters that describe optical properties are difficult to express and need to be optimized. Hence, the study proposes a method (RF-PMRS) that embeds machine learning in a physical model to obtain surface PM2.5: 1) based on the PMRS method, the Phy-DL FMF product with a combined mechanism is selected; 2) the RF model is used to fit the parameter VEf, rather than a simple quadratic polynomial.
In the point-to-surface validation, RF-PMRS shows clearly improved performance. Experiments at two AERONET sites show that R reaches up to 0.8. In North China, RMSE decreases by 39.95 μg/m³ with a 44.87% reduction in relative deviation. In the future, we will further explore the combination of atmospheric mechanisms and machine learning, and then research PM2.5 retrieval methods with physical meaning and higher accuracy.

At the same time, when verifying the RF-based VEf model, the dataset in the time period that did not participate in the training in Table 1 is used for isolated-validation.

A2. Statistical indicators
where m is the total number of observations, i is the index of the observation, yi is the i-th observation, and fi is the corresponding estimation result. ȳ and f̄ are the averages of all observations and estimates, respectively.
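Using the notation above (yi for observations, fi for estimates), four of the indicators can be sketched with their standard textbook definitions. The sign convention for MB (estimate minus observation) is an assumption. RMB and RPE are left out because their exact formulas are not recoverable from this text.

```python
import math

def evaluation_metrics(y, f):
    """Return R, MB, RMSE, and MAE for observations y and estimates f."""
    m = len(y)
    y_mean = sum(y) / m
    f_mean = sum(f) / m
    cov = sum((yi - y_mean) * (fi - f_mean) for yi, fi in zip(y, f))
    var_y = sum((yi - y_mean) ** 2 for yi in y)
    var_f = sum((fi - f_mean) ** 2 for fi in f)
    return {
        "R": cov / math.sqrt(var_y * var_f),                          # Pearson correlation
        "MB": sum(fi - yi for yi, fi in zip(y, f)) / m,               # mean bias (assumed f - y)
        "RMSE": math.sqrt(sum((fi - yi) ** 2 for yi, fi in zip(y, f)) / m),
        "MAE": sum(abs(fi - yi) for yi, fi in zip(y, f)) / m,
    }

# Toy example: every estimate is exactly 1 unit above its observation.
metrics = evaluation_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```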

A3. Parameter adjustments of the RF model
The four parameters of RF are adjusted, that is the correlation coefficient r changes

providing the MODIS products, the meteorological data, the ground aerosol data, and the surface PM2.5 concentration. We also thank other institutions which provide related data in this work.

Financial support
This research was funded in part by the National Natural Science Foundation of China (41922008) and the Hubei Science Foundation for Distinguished Young Scholars (2020CFA051).
VEf) derived from FMF and related variables, thus optimizing the previous polynomial expression. Besides, to further improve the PM2.5 retrieval accuracy, the physical-deep learning FMF (Phy-DL FMF) dataset generated by a hybrid retrieval algorithm of ML and physical mechanisms is introduced. Ultimately, we comprehensively validate the performance of the PM2.5 obtained by our optimized approach.
Fig. 1(a)  shows its specific process.Each arrow refers to a step, respectively: size cutting (output:

Fig. 1. Surface PM2.5 estimation flow of RF-PMRS. a) The five steps of the PMRS method. Gray boxes are the intermediate outputs, blue boxes are the input data, and orange ones denote the variables to be optimized. b) The specific optimization of RF-PMRS: FMF dataset replacement and VEf simulation by RF model.

Fig. 2. Specific steps for simulating VEf based on ML in our RF-PMRS method. The map used in step 1 is from NASA Visible Earth (https://visibleearth.nasa.gov/images/57752/blue-marbleland-surface-shallow-water-and-shaded-topography). The red points in step 1 represent the distribution of the 9 AERONET sites.

T and dew point temperature Td (referred to ERA-Interim: documentation).

Fig. 4 displays the PM2.5 value trends of different models at two sites. The blue line fits the red line better than the gray one, confirming that the PM2.5 results of RF-PMRS are closer to the true values. Within the range of the black circles at positions 1 and 2, the variation trend of RF-PMRS results is more consistent with the ground truth, while the PMRS results show dislocation and excessive growth. The overall performance of the RF-PMRS estimations indicates the effectiveness of our proposed method framework. As observed in the red boxes at positions 3 and 4, both models have a certain degree of deviation, which is found to coincide in time with the high AOD values. It is worth noting that our method has effectively mitigated the apparent overestimation of the original model (PMRS) in the case of above-normal aerosol loadings. Furthermore, the average PM2.5 values from ground stations, PMRS, and RF-PMRS are compared. For the two sites, the RF-PMRS results are satisfactory. As depicted in Fig. 5, the RF-PMRS and station mean values are close, with a difference
Fig. 7. Validation scatterplots of PM2.5 results from PMRS (left) and RF-PMRS (right).Red dashed lines are 1:1 reference lines, and blue solid lines stand for the linear fits.The right legends show the point densities (frequency) represented by different colors.
Accuracy comparison of PMRS using MODIS/Phy-DL FMF
To confirm the superiority of the Phy-DL FMF data adopted in our method framework, taking the BJ and BC sites as examples, the experiment compares the PM2.5 accuracy and the number of effective days calculated by PMRS based on different FMF.
A1. 10-fold cross-validation and isolated-validation
The sample-based 10-fold cross-validation method is applied to test the fitting and predictive ability of our model. The original dataset is randomly divided into ten parts, nine of which are used as the training set for model fitting, and the remaining one is used for prediction. The cross-validation process is then repeated for ten rounds until each sample has been used in the test set.
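The sample-based splitting described here can be sketched in a few lines of standard-library Python. The function name, the fixed seed, and the interleaved assignment of shuffled indices to folds are illustrative choices; any random partition into ten parts of nearly equal size would serve the same purpose.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random division of the dataset
    folds = [indices[i::k] for i in range(k)]     # k parts of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(25, k=10))
```

Isolated-validation differs in that the held-out data come from a time period excluded from training altogether, so it cannot be produced by shuffling the training samples.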

Table 1. Data information of 9 AERONET sites classified by aerosol types. Location indicates the latitude and longitude, where '-' means the south latitude and west longitude. Two sites in bold fonts participate in the PM2.5 validation experiment.
Considering the spatiotemporal heterogeneity of VEf, the latitude, longitude (LAT, LON), and data time (month, day) of each site are added to the training.

Table 2. Performance statistics of the RF model for training VEf. N represents the number of data, and VEf has no unit.

Table 3. Validation results of PMRS and RF-PMRS PM2.5 in North China.

Table 4 presents the overall day-level results. As can be seen, after the FMF replacement, the number of valid days (valid DOY) increases by 113, which means that the number of effective PM2.5 concentration values has gone up by about 5 times. Moreover, the accuracy has been significantly enhanced, with R increased by about 0.30 and RMSE and MAE decreased by 26.14% and 16.47%, respectively. On the whole, Phy-DL FMF contributes to the improvement of PMRS results, signifying that the first optimization step of the proposed RF-PMRS method is effective.

Table 4. Validation results of the PMRS method using different FMF data. The valid DOY refers to the number of days on which the AOD, FMF, and other data are not missing when calculating PM2.5. Note that since the valid days of the two schemes are different, the MB and RMB are not compared.

The results of training VEf based on the above three DT models are presented in Table 5 and Table 6. By contrast, RF performs best in the CV and IV experiments, as indicated by the multiple accuracy indicators. Although the ERT and GBDT models are comparable to RF in some indicators, both show a certain degree of overfitting, which is manifested in IV results that are clearly worse than their respective CV ones. Thus, the RF model is applied in our study.

Table 5. Cross-validation results in comparison of the decision tree models for training VEf. N represents the number of data, and VEf has no unit.

Table 6. Isolated-validation results in comparison of the decision tree models for training VEf. The indicators are the same as those in Table 5.