Random forest models for PM2.5 speciation concentrations using MISR fractional AODs

It is increasingly recognized that various chemical components of PM2.5 might have differential toxicities to human health, although such studies are hindered by the sparse or non-existent coverage of ground PM2.5 speciation monitors. The Multi-angle Imaging SpectroRadiometer (MISR) onboard the Terra satellite has an innovative design to provide information about aerosol shape, size and extinction that are more related to PM2.5 speciation concentrations. In this study, we developed random forest models that incorporated ground measurements of PM2.5 species, MISR fractional AODs, simulated PM2.5 speciation concentrations from a chemical transport model (CTM), land use variables and meteorological fields, to predict ground-level daily PM2.5 sulfate, nitrate, organic carbon (OC) and elemental carbon (EC) concentrations in California between 2005 and 2014. Our models had out-of-bag R2 of 0.72, 0.70, 0.68 and 0.70 for sulfate, nitrate, OC and EC, respectively. We also conducted sensitivity tests to explore the influence of variable selection on model performance. Results show that if there are sufficient ground measurements and predictor data to support the most sophisticated model structure, fractional AODs and total AOD have similar predicting power in estimating PM2.5 species. Otherwise, models using fractional AODs outperform those with total AOD. PM2.5 speciation concentrations are more sensitive to land use variables than other supporting data (e.g., CTM simulations and meteorological information).


Introduction
Numerous studies have reported the associations between adverse health effects and exposure to ambient fine particulate matter (particles with a diameter less than 2.5 μm, PM 2.5 ) [1][2][3]. It is also increasingly recognized that various chemical components of PM 2.5 might have differential toxicities to human health [4][5][6]. For example, combustion-related species such as elemental carbon (EC) and Potassium (K) were found to have a stronger association with mortality than other species [4]. However, information regarding the health effects of PM 2.5 components is limited, mainly hindered by the sparse or non-existent coverage of ground PM 2.5 speciation monitors. Meanwhile, the changes in PM 2.5 mass concentrations are driven by the variations of its chemical composition that are related directly to precursor emissions [7,8], thus, understanding the spatiotemporal changes of PM 2.5 speciation also helps designing more effective air pollution mitigation measures.
In recent years, many sophisticated models have been reported in the literature to incorporate information from ground monitors, satellite remote sensing and chemical transport models (CTMs) in order to obtain improved estimates of PM 2.5 mass concentrations [9][10][11]. However, attempts to estimate highresolution PM 2.5 speciation concentrations have been few. Some works [7,12,13] used satellite-based total aerosol optical depth (AOD) and composition information from the CTMs to estimate PM 2. 5  Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. composition were inferred by combining AOD with conversion factors between AOD and PM 2.5 species provided by the GEOS-Chem model [12]. Such a method does not rely on ground observation data; however, without calibration against observation data, the estimated results had relatively low accuracy (e.g., r between 0.65 and 0.75 for different species in Geng et al [7]). Some studies [14,15] utilized ground observations, meteorological fields and land use variables to calibrate and downscale the CTM-simulated results and obtained high-resolution PM 2.5 speciation data. For instance, Di et al [14] used a backward propagation neural network to calibrate GEOS-Chem simulated PM 2.5 speciation concentrations and improved their spatial resolution from 0.50°× 0.66°to 1 km by incorporating land use variables. However, temporal variabilities in PM 2.5 species concentrations cannot be readily captured by land use terms, indicated by significantly lower temporal R 2 (by 0.02-0.26) compared to total R 2 in Di et al [14].
The Multi-angle Imaging SpectroRadiometer (MISR) onboard the Terra satellite has an innovative design of viewing the Earth simultaneously at nine widely spaced angles [16]. The change in reflection at different view angles affords the means to distinguish different types of atmospheric particles that are more related to PM 2.5 composition [17,18]. MISR provides fractional AODs defined by a size distribution, shape, complex index of refraction and scale height. Such data has been used in statistical models to predict PM 2.5 speciation and was proved to have better predicting power than the total AOD [19,20]. However, those studies either used MISR AOD at a coarse resolution of 17.6 km [19] or used experimental data products that have limited spatial coverage [20][21][22]. Meanwhile, the model structures developed for PM 2.5 speciation concentrations before were simple and only involved a few predictors [19,20]. For example, Meng et al [20] developed generalized additive models (GAMs) using MISR 4.4 km fractional AODs and some meteorological and geographical indicators to predict the concentrations of four PM 2.5 species. The leave-one-out-cross-validation (LOOCV) R 2 for different species varied from 0.49 to 0.62. The recently released operational MISR products at 4.4 km resolution make it possible to develop a high-resolution model with inputs from multiple sources and conduct comprehensive analyses to reveal the importance of fractional AODs in predicting PM 2.5 speciation concentrations.
Moreover, the availability of input predictors for the PM 2.5 speciation models varies among regions. In data-rich regions like the United States, there are established networks of PM 2.5 speciation monitors such as the Chemical Speciation Network (CSN) and the Interagency Monitoring of Protected Visual Environments network (IMPROVE) and high-quality supporting information for the model development. While in most other regions in the world, lack of access to and poor quality of input data might not support sophisticated model structures. Therefore, it is crucial to understand the impact of variable selection on developing PM 2.5 speciation models. The influence of different variable choices on model performance has been studied for PM 2.5 mass concentrations before [23], and results showed that models without land use variables lacked fine spatial details in predictions, but could still capture the temporal variability of PM 2.5 . However, the results might be different for PM 2.5 species due to their different spatial and temporal characteristics.
In this work, we selected the state of California as a case to develop PM 2.5 speciation models using the newly released MISR Level 2 AOD product, CTM simulations and other supporting information. Then we conducted a series of sensitivity scenarios to examine the impacts of different variable choices on model's performance to provide guidance for model development in data-poor regions. Finally, suggestions are made based on our results for other regions to build PM 2.5 speciation concentrations models.

Materials and methods
Our study domain covers the state of California with an 80 km buffer along its inland border, as shown in figure 1. We constructed a 1 km × 1 km resolution modeling grid for data integration. The study period is from 2005 to 2014. colostate.edu/fed). There are 55 PM 2.5 speciation monitors located in our domain and their spatial distribution is shown in figure 1. The sampling frequency is 1-in-3 or 1-in-6 d for CSN and 1-in-3 d for IMPROVE, and the total numbers of observations in each locations are shown in figure S1 (available online at stacks.iop.org/ ERL/15/034056/mmedia). CSN and IMPROVE have different samplers and analytical methods, which caused systematic differences in the measurements of carbonaceous components in PM 2.5 . Following our previous works [15,20], we adjusted OC and EC data from CSN for compatibility with IMPROVE data. More details of the adjustment could be found in Meng et al [15]. The monitors were assigned to 1 km grid cells where they fell.

MISR 4.4 km aerosol product
MISR views the Earth at nine separate angles simultaneously, which provides information to distinguish different types of aerosol particles. MISR has a global coverage of every nine days, with repeat coverage between 2 and 9 d based on the latitude. A new version (V23) of the MISR operational aerosol product has been released recently, which offers substantial improvements compared to the previous version (V22), including a higher spatial resolution of 4.4 km, improved performance evaluated against ground measurements [24], improved cloud screening and additional diagnostic fields. We downloaded the V23 Level 2 aerosol data for 2005-2014 from NASA Earthdata portal (https://search.earthdata.nasa.gov/), which contain the same 74 aerosol mixtures as in previous versions [25]. The eight fractional AOD components used to construct a total of 74 mixtures were utilized in this study and details of their descriptions could be found elsewhere [15,18,26,27]. These eight fractional AOD components (i.e., component 1, 2, 3, 6, 8, 14, 19 and 21) represent different particle shape, scattering property and effective radius for a number-weighted log-normal distribution (table S1). The fractional AOD components could be obtained from any MISR aerosol observation using the following equation All the AOD data were interpolated to the centroid of the 1 km grid cells using the inverse distance weighting (IDW) approach. Since AOD has missing values, we only kept data in those 1 km grid cells that fell inside the 4.4 km MISR pixels where AOD data was available.

CTM simulations
The CTM simulations used in this study were conducted by Community Multiscale Air Quality (CMAQ) model version 5.0.2 [28] driven by the Weather Research and Forecasting (WRF) model version v3.4. Details of the model configurations for WRF and CMAQ could be found in Zhang et al [29]. Briefly, the model was conducted over the continental United States from 2005 to 2014 at 12 km spatial resolution with 35 vertical layers from the surface to the top of the free troposphere. National Land Cover Database (NLCD) were used as inputs for the WRF model, and the simulated meteorological parameters were processed using the Meteorology-Chemistry Interface Processor version 4.1.3 [30] to create inputs for the CMAQ model. The chemical boundary conditions for the CMAQ model were taken from yearspecific simulations of the global GEOS-Chem model [31]. The anthropogenic emission inputs were based on 2005, 2008 and 2011 National Emission Inventory data. The biogenic emissions were provided by the Biogenic Emissions Inventory System version 3.14 [32].
Daily mean concentrations of sulfate, nitrate, OC and EC at the surface level were extracted to represent the ambient PM 2.5 speciation concentrations. All the concentration data were interpolated to the 1 km grid using IDW approach.

Meteorological fields
Temperature, wind speed and specific humidity data at ∼13 km spatial resolution were obtained from the hourly North American Land Data Assimilation Systems phase 2 (NLDAS-2, http://ldas.gsfc.nasa. gov/nldas/) datasets. Planetary boundary layer height at 32 km spatial resolution was taken from the 3 hourly North American Regional Reanalysis (NARR, http:// emc.ncep.noaa.gov/mmb/rreanl/) products. All meteorological data were averaged between 9 am and 12 pm over satellite overpass time and were also interpolated to the 1 km grid using IDW approach.

Land use variables
Elevation data at 30 m spatial resolution were extracted from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Map (GDEM) (https://asterweb.jpl. nasa.gov/gdem.asp) version 2. Road networks were obtained from ESRI StreetMap USA (Environmental Systems Research Institute, Inc., Redlands, CA). Impervious surface, forest cover, shrub cover and cultivated land cover information at 30 m spatial resolution were taken from NLCD (https://mrlc.gov) 2006 and 2011. Population data at 1 km spatial resolution were extracted from the LandScan Global Population Database (https://landscan.ornl.gov/). Elevation, impervious surface, forest cover, shrub cover and cultivated land cover were averaged while population and road lengths were summed within the 1 km grid.

Modeling
Random forest models were built separately for sulfate, nitrate, OC and EC over 2005-2014 using MISR fractional AODs, CMAQ simulations, meteorological fields and land use variables. Initially proposed by Breiman [33], random forest algorithm is an ensemble learning method based on decision trees, which has the advantage of allowing both continuous and categorical input variables and are quite robust to outliers. It also provides variable importance rankings as well as out-of-bag errors for variable selection and model evaluation. It has been used in previous studies to estimate air pollutant concentrations and shown excellent performances [10,15,34].
In this study, four out-of-sample cross-validation (CV) experiments were considered to evaluate the prediction performance of our models. First, R 2 values and root mean square errors (RMSE) between out-ofbag predictions and observations were calculated. Second, a temporal CV was conducted to evaluate the model's ability to perform temporal prediction when certain days were missing. After all the observation days were randomly divided into 10 subsets, we used data from nine subsets of the observation days to train the model and predicted PM 2.5 speciation data in the remaining subset. This process was repeated for 10 times after all the observation days were tested. We also did a spatial CV to evaluate the model's ability to make predictions at locations without monitors. Similarly, all the ground monitors were randomly divided into 10 subsets. The model was trained using data from nine subsets of the monitors, and then used to make predictions on the remaining subset. This process was also repeated for 10 times. Finally, a spatialclustered CV was performed following Murray et al [35] to test the model's robustness when a large spatial group of data were missing instead of a single location. A hierarchical clustering method by proximity of monitoring locations was used to define 10 clusters of all the monitors. Then data from nine clusters was used to predict PM 2.5 speciation data in the remaining cluster and this process was also repeated for 10 times.

Sensitivity tests of variable selection
In data-rich regions, there are sufficient input dataset to support the most sophisticated model structure (e.g., the full model in our study). While for data-poor regions, either the PM 2.5 speciation monitors or other supporting data are limited, which could only allow for a reduced form of the full model developed in the data-rich regions. As indicated in previous studies, different predictors affect the model's performance in predicting PM 2.5 mass concentrations [23]. In this study, we considered our models' robustness to different choices of input parameters when estimating PM 2.5 speciation concentrations. Previous works [20] compared model performance when using fractional AODs and total AOD, and found that models with fractional AODs outperformed models with total AOD when other predictors were the same. Meanwhile, CMAQ simulations, meteorological fields and land use variables are also important predictors because they could provide insights into local-scale spatial/temporal variability of PM 2.5 speciation concentrations. In this study, we designed different model structures to compare the predicting power of fractional AODs, total AOD and no AOD involved, and also test different combinations of other predictors, as shown in   Table 1 and figure S3 show the CV results of our random forest models at the daily level for the entire study period of the four PM 2.5 species, including the values of R 2 and RMSE between the CV estimates and observations. The out-of-bag predictions from the random forest models showed relatively good agreements with the observations, with out-of-bag R 2 of 0.72, 0.70, 0.68 and 0.70 for sulfate, nitrate, OC and EC, respectively. The model for sulfate had the best performance among all species, with RMSE of 0.54 μg m −3 , while the model for OC had a lower R 2 and a RMSE of 1.30 μg m −3 . Results also showed that our models had comparable performance in different seasons ( Table S3).

Model evaluation
The temporal CV R 2 values of the PM 2.5 speciation models were generally comparable to their out-of-bag R 2 , with slight decreases by 0.01-0.02 for different species, indicating that our models were stable and efficient in capturing the temporal variability of PM 2.5 speciation concentrations. The spatial CV R 2 values, however, decreased by 0.08-0.14 for the four species when compared to the out-of-bag results, with values for sulfate and nitrate higher than those for OC and EC. This has also been found in previous studies, and the reason is that the spatial distribution of sulfate is much smoother than OC and EC, which is easier for the models to predict at locations where ground monitors are absent. The models' prediction performance further decreased from spatial CV to spatial-clustered CV, because information from nearby monitors to aid spatial interpolation are missing. Figure 2 shows the importance rankings of all predictors used in model development for sulfate, nitrate, OC and EC. CMAQ simulations played important roles in all four models with high importance rankings. Meteorological fields, such as air temperature, was among the highest ranking predictors for the sulfate model, because it could affect the formation of sulfate via chemical reactions. The fractional AOD components, AOD2 and AOD6, were also important predictors for the estimation of sulfate concentrations. These two components both represent spherical and non-absorbing particles that are related to sulfate concentrations. For the nitrate model, top ranking variables included population and impervious surface, which are both related to anthropogenic fossil fuel combustion that could release NO x , precursor gases of nitrate. Shrub cover, which reflects the emission sources of soil NO x [36,37], also ranked high. Variables of high importance rankings were similar for OC and EC, which were mostly land use variables that could represent the spatial patterns of local-scale emissions sources. For example, impervious surface and population density are significant indicators of anthropogenic emissions. Length of major road could reflect the emissions from on-road transportation on OC and EC concentrations. Notably, shrub cover and forest cover also ranked high for OC and EC, which might represent the contribution of wildfires to OC and EC levels in California [38,39]. Figure 3 shows the spatial distributions of the pre-  major NO x sources. The spatial distributions of the carbonaceous species were more heterogeneous than sulfate, with hotspots located in urban centers. OC and EC mainly come from primary emission sources and the spatial patterns were highly related to human activities. Meanwhile, there were also some high values in the northern part of California, which might be contributed by wildfires. During 2005-2014, the concentrations of the four PM 2.5 species showed a decreasing trend, especially after the year 2009. Since that MISR AOD had a lot of missingness in its spatial and temporal coverage, which might affect the trends presented in our study, the temporal trends are not further discussed.  (REF and M1), which were slightly better than M2 without AOD data. This indicates that CMAQ simulations, land use variables and meteorological fields together could explain the majority of the variability in PM 2.5 speciation concentrations. When developing PM 2.5 speciation models in data-rich regions, using total AOD or even no AOD could obtain similar model performance as fractional AODs. Moreover, total AOD or no AOD have the advantage of greater spatial/ temporal coverage, since total AOD could be taken from MODIS and full coverage were achieved with no AOD involved. When the model structure was simplified and only incorporated some of the supporting data, fractional AODs usually had better performance than total AOD, which is consistent with the findings in previous studies [19,20]. For example, M6 with fractional AODs and land use variables within the model had an out-of-bag R 2 of 0.58 for nitrate, much higher than the R 2 of 0.39 from M7. When developing PM 2.5 speciation models at data-poor regions where only some of the supporting data are available, using fractional AODs would be a better choice to more accurately estimate PM 2.5 speciation concentrations, although MISR AOD have relatively low sampling frequency. Table 2 also compares the model's performance when using different combinations of supporting information, which could provide guidance for the data-poor regions about variable selections when developing their own models. Unlike the United States, many regions around the world do not have enough PM 2.5 speciation monitors that could support complex model structures. A simplified model is necessary and it is crucial to select the most useful parameters to be included in the model. In addition, some developing regions also lack access to high quality air quality model simulations since they do not have local emission inventories or sufficient computing resources. Therefore, understanding the influences from different kinds of supporting data on model performance are important. Results of our sensitivity scenarios showed that, in general, models with land use variables had higher R 2 than other models, which is consistent with the importance ranking results mentioned above. For example, M6 had an out-of-bag R 2 of 0.62 for sulfate, higher than 0.55 from M3, 0.52 from M9 and 0.33 from M12. Models with only CMAQ simulations (e.g., M3) had good predicting power for nitrate, with R 2 higher than M6 with land use variables and M9 with meteorological fields. When choosing supporting data for a simplified model, the prior choice should be land use data, followed by CTM simulations.

Sensitivity to variable selection
We further compare the relative differences between the predicted concentrations from M3, M6, M9, M12 and the REF case at different impervious surface levels, which were models using fractional AODs and different combinations of supporting predictors ( figure 4). For sulfate, M3 and M6 had very similar predicted results compared to the REF case, while the relative differences between M9, M12 and the REF case were generally within 25%. This confirms the findings in previous studies that sulfate is usually the best model among the four species. For nitrate, OC and EC, predictions from the models with land use variables (i.e., M6) are closest to those from the REF case, followed by the models with CMAQ simulations (i.e., M3). With these two kinds of predictors, the models could capture the majority of the spatial patterns of the PM 2.5 species, especially in urban areas with high percentages of impervious surface. When land use data and CMAQ simulations are absent, predictions from M9 and M12 had overestimated nitrate by ∼150%, OC by ∼100% and EC by ∼200%.

Comparison with previous studies
Our random forest models outperformed the GAMs developed in Meng et al [20] using an experimental version of MISR 4.4 km fractional AODs. Their LOOCV R 2 values were 0.62, 0.56, 0.49 and 0.53 for sulfate, nitrate, OC and EC, respectively, compared to the out-of-bag R 2 between 0.68 and 0.72 of our models. Our models included more predictors (~20 in total) than the GAMs, including CMAQ simulations and several land use variables. As indicated in section 3.5, CMAQ and land use data had great predicting powers in the estimation of PM 2.5 speciation concentrations. Besides, random forest models can better resolve the complex relationship between ground observations and multiple predictors [34] than the statistical models.
Compared to Di et al [14] and Meng et al [15], which were models without any inputs from satellite AOD, our models had relatively lower out-of-bag R 2 . Di et al [14] used a backward propagation neural network to calibrate the GEOS-Chem simulations and their mean total R 2 for sulfate, nitrate, OC and EC were 0.81, 0.83, 0.69 and 0.71, respectively. Meng et al [15] utilized random forest models and their out-ofbag R 2 varied between 0.71 and 0.86. Based on our sensitivity test results, when excluding AOD data from our full models, R 2 values decreased slightly. However, the number of paired data records in the training dataset would increase significantly if no AOD data were involved, which would improve model training results. Also, their studies had larger domain with greater contrasts in PM 2.5 species that could lead to higher R 2 . In addition, these two works used spatially and temporally lagged terms in their models to take advantage of spatial and temporal auto-correlation. By incorporating information from nearby monitors and adjacent days, the models' performance were highly improved. However, this might not work for datapoor regions where the monitors are too sparse or the observations are made far apart in time. Therefore, the spatial-temporal lagged terms were not considered as input variables in our study.

Limitation
Our work was subject to some limitations. MISR has a low sampling frequency, resulting in a lot of missingness in the data that might affect the spatial and temporal trends presented in our study. However, our work presented a feasible method to use fractional AODs, CMAQ simulations, meteorological fields and land use variables to estimate PM 2.5 speciation concentrations. Future improvements in the satellite instrument could bring more data and also improve the performance of our model. For example, the Multi-Angle Imager for Aerosols (MAIA) instrument [40], currently in development, has a legacy of multiangle imaging from MISR but includes 14 bands in the ultraviolet to shortwave infrared including three polarimetric bands to obtain higher sensitivity to aerosol composition [40,41]. MAIA will also make its optical and PM type retrievals at a higher spatial resolution and sampling frequency. The MAIA AOD retrievals are expected to have better ability in estimating PM 2.5 speciation concentrations than MISR.
In the sensitivity analyses, we compared the model performance between fractional AODs and total AOD based on an assumption that they have the same spatiotemporal coverage. However, total AOD from other instrument such as MODIS might have much higher data coverage than MISR, which could improve the model's performance as the number of input data are larger. The findings in our study are applicable in regions similar to California, however, for regions with different emission sources or patterns, results remain unknown. Therefore, more efforts are needed to further understand the influence of data from different sources in different types of places.