Estimation of ambient PM2.5 in Iraq and Kuwait from 2001 to 2018 using machine learning and remote sensing

Iraq and Kuwait are in a region of the world known to be impacted by high levels of fine particulate matter (PM2.5) attributable to sources that include desert dust and ambient pollution, but historically have had limited pollution monitoring networks. The inability to assess PM2.5 concentrations have limited the assessment of the health impact of these exposures, both in the native populations and previously deployed military personnel. As part of a Department of Veterans Affairs Cooperative Studies Program health study of land-based U.S. military personnel who were previously deployed to these countries, we developed a novel approach to estimate spatially and temporarily resolved daily PM2.5 exposures 2001–2018. Since visibility is proportional to ground-level particulate matter concentrations, we were able to take advantage of extensive airport visibility data that became available as a result of regional military operations over this time period. First, we combined a random forest machine learning and a generalized additive mixed model to estimate daily high resolution (1 km × 1 km) visibility over the region using satellite-based aerosol optical depth (AOD) and airport visibility data. The spatially and temporarily resolved visibility data were then used to estimate PM2.5 concentrations from 2001 to 2018 by converting visibility to PM2.5 using empirical relationships derived from available regional PM2.5 monitoring stations. We adjusted for spatially resolved meteorological parameters, land use variables, including the Normalized Difference Vegetation Index, and satellite-derived estimates of surface dust as a measure of sandstorm activity. Cross validation indicated good model predictive ability (R2 = 0.71), and there were considerable spatial and temporal differences in PM2.5 across the region. Annual average PM2.5 predictions for Iraq and Kuwait were 37 and 41 μg/m3, respectively, which are greater than current U.S. and WHO standards. PM2.5 concentrations in many U.S. bases and large cities (e.g. Bagdad, Balad, Kuwait city, Karbala, Najaf, and Diwaniya) had annual average PM2.5 concentrations above 45 μg/m3 with weekly averages as high as 150 μg/m3 depending on calendar year. The highest annual PM2.5 concentration for both Kuwait and Iraq were observed in 2008, followed by 2009, which was associated with extreme drought in these years. The lowest PM2.5 values were observed in 2014. On average, July had the highest concentrations, and November had the lowest values, consistent with seasonal patterns of air pollution in this region. This is the first study that estimates long-term PM2.5 exposures in Iraq and Kuwait at a high resolution based on measurements data that will allow the study of health effects and contribute to the development of regional environmental policies. The novel approach demonstrated may be used in other parts of the world with limited monitoring networks.


Introduction
Since 2001, many U.S. military personnel and coalition military personnel have been deployed in support of operations in the Southwest Asia and experienced with frequent exposure to high concentrations of fine particulate matter (PM 2.5 ) attributable to large arid and semiarid areas, local pollution from vehicles and industry, oil exploration and refining, and refuse-burning, including burn pits (Alolayan et al., 2013;Engelbrecht et al., 2009;Garshick et al., 2019). Epidemiologic studies based on military health encounter data have demonstrated more frequent post-deployment health encounters than non-deployed personnel for respiratory symptoms and for airway disease, predominantly asthma (Abraham et al., 2014;Pugh et al., 2016;Abraham et al., 2012;Sharkey et al., 2016). There is concern that deployers may have experienced reductions in pulmonary function or other health effects attributable to exposure (U.S. Institute of Medicine, 2011; Garshick et al., 2019). Moreover, exposure to air pollution has been directly linked to health effects worldwide (Pappin and Hakami, 2013;Stafoggia et al., 2017;Brauer et al., 2016;Wang et al., 2020;Zhang et al., 2021). Unlike health studies assessing PM-related health effects in the US, Europe, and parts of Asia, the lack of a comprehensive network of regional monitoring stations in countries such as Iraq and Kuwait has hindered the design of health effect studies, both in the native populations and previously deployed military personnel.
Visibility data are routinely collected at airports or meteorology stations throughout the world. The relationship between visibility and PM 2.5 is relatively stable due to the light Li et al. Page 2 Environ Int. Author manuscript; available in PMC 2022 June 01.

VA Author Manuscript
VA Author Manuscript extinction (scattering and absorption) effects of particles with sizes similar to the wavelengths of visible light, particularly in arid areas (Abbey et al., 1995;Zhao et al., 2017;Liu et al., 2017;Vajanapoom et al., 2001). Visibility has been used previously to estimate historical ground-level PM 2.5 exposures in China prior to the establishment of a PM 2.5 monitoring network . Masri et al. (2017) conducted a pilot study that demonstrated the feasibility of using airport visibility data in Southwest Asia and Afghanistan. Visibility-based approach takes advantage of the large database of historical visibility collected by airport or meteorology stations, but the spatial resolution of estimated PM 2.5 is still coarse for epidemiological studies.
Satellite-based aerosol optical depth (AOD), a measure of light attenuation in the column of air from Earth's surface to the top of atmosphere which is affected by meteorology and land conditions (Paciorek and Liu, 2009), has been widely used to estimate spatiotemporally resolved PM 2.5 exposure, given its extensive spatial coverage, high spatial resolution, and reliable repeated daily observations (Liu et al., 2005;Paciorek et al., 2008;Chudnovsky et al., 2012;Kloog et al., 2014;Beloconi et al., 2016;Brokamp et al., 2017;Huang et al., 2018;Di et al., 2019). The satellite-based approach has been applied to different study areas, such as the United States (Kloog et al., 2014;Di et al., 2016), China (Xue et al., 2019), and Europe (Beloconi et al., 2018).
Quantifying a relationship between AOD and PM 2.5 usually involves incorporating groundlevel PM 2.5 measurements, metrological variables, and geographic covariates at high spatial and temporal resolutions. However, this method has not been applied to areas with a paucity of PM 2.5 monitoring sites due to the lack of spatial coverage. There are no monitoring networks in the Middle East designed to provide comprehensive measurements. Instead of using ground-level PM 2.5 measurements, some studies use a Chemical Transport Model (CTM) or observational meteorological values to simulate the relationship between AOD and PM 2.5 globally (van Donkelaar et al., 2015;Lin et al., 2018). However, CTMs rely on accurate emission inventories, which are poorly developed in the Middle East. Therefore, current models that estimate PM 2.5 globally using CTM inventories are unreliable in this region. Until our efforts, there have been no previous studies assessing historical PM 2.5 exposures at high spatial resolution in the Middle East based on measurement data.
As part of a Department of Veterans Affairs Cooperative Studies Program health study of land-based U.S. military personnel, we predicted spatially and temporally resolved PM 2.5 concentrations by a multi-staged approach. First, we used a hybrid machine learning model and incorporated multiple variables (including satellite-based AOD data, land-use terms, and meteorological variables) to predict high spatiotemporal resolution visibility, as visibility data throughout the region were available as result of military airport activity. We then derived relationships between daily PM 2.5 and visibility using existing monitoring stations in the region to convert visibility to PM 2.5 . In the absence of extensive air quality monitoring sites, this approach leverages historical visibility data, satellite AOD, and existing regional PM 2.5 monitoring stations to understand the intensity and spatiotemporal variability in PM 2.5 exposures over Iraq and Kuwait.   Four sections (tiles) of the NASA global AOD database (tiles h22v05, h22v06, h21v05, h21v06) that included approximately 360,000 grid cells for each tile were used. We extracted AOD at 550 nm that met MODIS quality assurance criteria (flag of "good"), and data with extreme values (above 4) were excluded (Di et al., 2016). Due to cloud cover and technical factors such as imaging over some bright surfaces and cloud, AOD has missing approximately 40% (see Fig. S2 in supplement) (Xiao et al., 2017). Therefore, we calculated linear associations between the Aqua and Terra AOD for each day and imputed the missing MAIAC AOD values when values from only one instrument was available . Grid cells over large water bodies were excluded (https:// www.diva-gis.org/datadown) (Fig. S1). The average AOD from Aqua and Terra was used for modeling.

Visibility data-Hourly visibility data for
Kuwait and Iraq were provided by the U.S. Air Force 14th Weather Squadron for all airports and meteorological stations in Kuwait and Iraq. We also obtained additional hourly visibility data from National Oceanic and Atmospheric Administration (NOAA) Integrated Surface Database (ISD, https:// www.ncdc.noaa.gov/isd). Seventy-five percent of the visibility data in this study was from U.S. Air Force and 25% was from NOAA. We calculated average daily visibility from the available hourly data. The distribution of the 263 visibility monitoring stations available in Iraq and Kuwait during the study period is shown in Fig. S3.
2.1.3. PM 2.5 monitoring data-We obtained PM 2.5 monitoring data in the study regions that were available over 3 different time periods and sites. A total of 1942 daily PM 2.5 and concurrent visibility data were used in our model (Table S1). These included daily PM 2.5 concentrations from U.S. Embassy Kuwait air quality monitor for 2017-2018 (https://kw.usembassy.gov/embassy/kuwait-city/air-quality-monitor/), PM 2.5 concentrations we previously collected in three monitoring sites in Kuwait during 2004-2005(Brown et al., 2008, and daily PM 2.5 observations from two monitoring sites from 2017 to 2018 our group collected in Kuwait City.

2.1.4.
Meteorological data-Meteorological data were obtained from the fifth generation European Centre for Medium-Range Weather Forecasts reanalysis (ERA5) (https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset). ERA5 is a widely used global reanalysis, which has been verified by comparisons with observations (Hersbach et al., 2020). We considered 18 metrological variables, including temperature and dew point temperature at 2 m, u-wind (east-west component of the wind) speed at 10 m, v-wind (north-south component of the wind) speed at 10 m, instantaneous 10 m wind gust, total precipitation, surface pressure, downward UV radiation at the surface, cloud cover (high, medium, low, and total) evaporation, planetary boundary layer height, relative humidity at 1,000 hPa, vegetation cover (high and low), and forecast albedo (a measure of solar radiation) 2.1.5. Dust and particle data-Since Iraq and Kuwait were largely covered by the desert and are impacted by intense and frequent dust storms, we assessed a number of surface dust related parameters as predictors of PM 2.5 levels. Dust-related (e.g. "Dust surface Mass Concentration", "Dust Extinction AOD", "Dust Angstrom Parameter") and additional particle parameters (e.g. "Organic Carbon Column Mass Density", "Black Carbon Column Mass Density") (Supplemental Table S2) were obtained from the Modern-Era Retrospective analysis for Research and Applications version 2 (MERRA-2) dataset (https:// disc.gsfc.nasa.gov/datasets/M2TUNXAER_5.12.4/summary). MERRA-2 is an atmospheric reanalysis platform provided by NASA, which is agreed well with the results of other satellite observations and those of ground-based measurements (Ginoux et al., 2001;Rienecker et al., 2011;Colarco et al., 2010). MERRA-2 has a spatial resolution of 0.625°×0.5° (approximately 62.5 km × 50 km) and temporal resolution of one day.

Land use data-Land use parameters including Normalized Difference
Vegetation Index (NDVI), elevation, distance to the nearest industrial area, and traffic density were included as model predictors (Table S2). NDVI data were obtained from the NOAA Climate Data Record (CDR) of Normalized Difference Vegetation Index (NDVI), Version 4 (https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00813). This dataset was acquired by the Advanced Very High Resolution Radiometer (AVHRR) sensor at a spatial resolution of 5 km × 5 km and a temporal resolution of 1 day. Elevation of each grid cell was extracted from ETOPO1 Global Relief Model (https://www.ngdc.noaa.gov/mgg/ global/global.html). ETOPO1 is a 1 arc-minute global relief model of Earth's surface that integrates land topography and ocean bathymetry. Information regarding industrial areas was provided by the U.S. National Geospatial-Intelligence Agency. Since industrial emissions may influence the relationship between AOD and visibility, we calculated the distance from the center of each grid cell to the nearest industrial area for use as a potential adjustment variable. Road data were obtained through OpenStreetMap (https://www.openstreetmap.org) and traffic density in each 1 km2 grid cell was calculated by the total Highway road length in the grid cell.

Model
In order to estimate PM 2.5 concentrations in each 1 km × 1 km grid cell, we developed a 4stage model (Fig. 1).  Di et al., 2019;Brokamp et al., 2017;Hu et al., 2017). The detailed predictors are shown in Supplemental Table S2. A total of 134,265 paired AOD-visibility observations were used for model training. We optimized model input parameters using a grid search approach (Di et al., 2019), with the input parameters listed in Supplemental Table S3. The final trained model was then used to predict daily visibility for each 1 km × 1 km grid cell with AOD.
We performed a 10-fold cross-validation by randomly dividing the visibility stations into 10%−90% splits (10 random groups). For each split, the model was trained with data from 90% of the visibility sites and was used to predict visibility in the 10% of sites not included in the training. This process was repeated 10 times, and the cross-validation R 2 (CV R 2 ) were computed.

2.2.2.
Stage 2-For grid cells with AOD but without visibility measurements, we used the Stage 1 model to estimate daily visibility for each grid.

Stage 3-For grid cells missing both AOD and visibility measurements, a
generalized additive mixed model (GAMM) was used to predict daily visibility based on the output from Stage 2 model (Kloog et al., 2014). This method was previously used in (Kloog et al., 2012). The model contains a smooth function of longitude and latitude, and a random intercept for each grid cell. A separate spatial surface was fit for each two-month period of each year to capture the temporal variations (Kloog et al., 2014). In addition, we included the mean daily visibility and a random cell-specific slope. The semi parametric regression model is shown as Eq. (1): where PredVis ij is the predicted visibility at grid-cell i on a day j from the stage 2 model; MVis j is the mean value of predicted visibility from stage1 on a day j across the study region; α and β are the fixed intercept and fixed slope, respectively;μ i and v i are the random intercept and grid-cell specific random slope, respectively. X is longitude, and Y is latitude; "Smooth" is a thin plate spline function; and k(j) is the two-month period in which day j falls. To estimate the goodness of stage 3 model, we use the same cross-validation validation method as in stage 1.

Stage 4-First,
we fit a linear mixed effects model to describe the relationship between PM 2.5 and visibility Because of the paucity of PM 2.5 data in this region, we used data during 2017-2018 to train the model and used data during 2004-2005 for validation. Inverse of visibility and squared form of relative humidity were used as variables in the model based on the comparison of models using different combinations of variables (Text S1 and Table S5). The model can be expressed as: where PM 2.5i,j is the measurement daily PM 2.5 concentrations for a PM 2.5 site i on a day j; VIS is visibility; RH is relative humidity; μ is the fixed intercept; μ′ m is the month -specific random intercept, that is, the regression intercept in month m; β 1 is the fixed slope for the inverse of visibility; β 1 ′ m is the random slope for the inverse of visibility range in month j; β 2 and β 3 are the slopes for RH and squared form of relative humidity, respectively; ε i,j is the error term at site i on day j. Ψ is the unstructured variance covariance matrix for the random effects.
Finally, we predicted PM 2.5 concentration using Eq.
(2) based on daily visibility for each 1 km × 1 km grid cell during 2001-2018 in Kuwait and Iraq. We used the results from the daily prediction model to calculate monthly and yearly PM 2.5 averages.
All programming was implemented in R software version 3.5.0.

Results
The spatial distribution of yearly average MAIAC AOD downloaded from NASA for the study region is shown in Supplemental Fig. S4. The CV R 2 between fitted and predicted daily visibility was 0.71 (Table S4) (Table S5). The predicted and measured PM 2.5 concentrations for stage 4 model are shown in Fig. S5.
We found considerable spatial and temporal variations over the study region. Fig. 3 shows the spatial distribution of the predicted PM 2.5 concentrations in the study area, averaged over the full study period. The predictions for each year are shown in Fig. 4 and mean values for Iraq and Kuwait were 37 and 41 μg/m 3 , respectively. Annual and monthly median PM 2.5 predictions for Iraq and Kuwait are shown in Fig. 5. Monthly average PM 2.5 concentrations for each year are shown in Fig. S6.
In general, the PM 2.5 concentrations in Kuwait were slightly higher than Iraq during 2001-2018. The annual PM 2.5 concentration for Kuwait ranged from 36 to 49 μg/m 3 , and the values for Iraq ranged from 33 to 44 μg/m 3 . As expected, the predicted PM 2.5 concentrations were higher in urban areas compared to rural areas. Some big cities such as Bagdad, Karbala, Najaf, and Diwaniya have annual average PM 2.5 concentrations above 45 μg/m 3 . In contrast, Al-Anbar, the largest governorate west of Iraq (138,501 km 2 ) which is mostly desert had low average PM 2.5 concentrations, except for large urban areas in the east (Al-Fallujah and Al-Ramadi). PM 2.5 levels in the study region also showed distinct temporal variability. The highest annual PM 2.5 concentration for both Kuwait and Iraq were observed The monthly average predictions for Kuwait ranged from 32 to 49 μg/m 3 , and for Iraq ranged from 29 to 44 μg/m 3 . July had the highest predicted concentrations, and November had the lowest values. This is consistent with seasonal patterns of PM 2.5 observations and indicates that PM 2.5 pollution in the region has been a long-term problem.
As demonstrated by the considerable spatial and temporal variability over the region, deployers in Iraq and Kuwait experienced variability in PM 2.5 exposures. Some bases were located in regions with lower PM 2.5 concentrations such as in the Syrian Desert, which have average PM 2.5 concentrations below 30 μg/m. However, many U.S. bases located in urban area and had very high PM 2.5 concentrations. The weekly average PM 2.5 at Baghdad International Airport, the US base in Balad, and Kuwait City International Airport are shown in Fig. S7. The weekly PM 2.5 concentrations ranged to more than 150 μg/m 3 for the bases in Baghdad and Balad in summer of 2008 and 2009.

Discussion
We developed a model to predict daily PM 2.5 exposures at 1 km × 1 km resolution for Iraq and Kuwait, where the monitoring system is very limited. Our results indicated that this region had very high PM 2.5 levels, which varied by location and year.
The annual mean predicted PM 2.5 concentrations for Kuwait ( Our study has some major findings that are of public health significance. In previous studies, the AOD-PM 2.5 relationship was directly used to examine the spatial distribution of PM 2.5 . However, for many countries, there are no available historical PM 2.5 measurements. Our novel exposure assessment approach indicates it possible to assess air pollution exposures in countries without extensive PM 2.5 monitoring locations, and without extensive PM 2.5 historical data. And the other predictors used in our model are predominantly from global public datasets, thus our method could be applied in other countries. Secondly, the random forest model approach provided the relative importance of each factor associated with historical PM 2.5 concentrations to assess exposures in this region or similar areas. For example, in addition to AOD, we found that surface dust related variables are also associated with PM 2.5 in arid environments (Li et al., 2020). The seasonal and monthly trends of PM 2.5 are consistent with the seasonal occurrence of dust storms in the region, demonstrating the impact of dust storms on the PM 2.5 model predictions. (Al-Hemoud et al., 2017;Al-Hemoud et al., 2018a;Li et al., 2020).
A limitation of our approach is that the stage 4 model regression results was based on PM 2.5 measurement data collected at three sampling sites for two years because of the paucity of PM 2.5 data in this region. We used the PM 2.5 data collected 2017-2018 to train the model and used data collected 2004-2005 for validation, which could have potentially influenced the accuracy of historical predictions. However, the relationship between PM 2.5 and visibility is relatively stable under specific conditions of relative humidity as in Iraq and Kuwait (Zhao et al., 2017;Liu et al., 2017). It has also been shown that PM 2.5 particles in Iraq and Kuwait exhibit similar hygroscopic properties (Abdeen et al., 2014). In support of our approach is that the cross validation of the 2004-2005 PM 2.5 data indicated good model fit. Another limitation is that the AOD data used in this study has nonrandom missing data (such as days with heavy cloud clover) that was accounted for by modeling, and the local crossing time of the AOD data were relatively fixed between 10:30 am and 1:30 pm.
Although these factors could result in systematic bias of the PM 2.5 predictions, the model would still be internally consistent.
To the best of our knowledge, this is the first model to estimate historical PM 2.5 concentrations at a high resolution in Iraq and Kuwait that takes advantage of all available measurement data. This high-resolution model makes it possible to assess the health effects of PM 2.5 exposures both in the native populations and the military personnel. This study is also helpful to the construction of regional environmental policies in areas with different pollution levels.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material. Top 30 most importance variables for the random forest model predicting daily visibility (xaxis: variable importance relative to the importance of AOD, calculated by function "h2o.varimp_plot" of R Package 'h2o').