A Methodological Comparison on Spatiotemporal Prediction of Criteria Air Pollutants

Air pollution monitoring devices are widely used to quantify at-site air pollution. However, such monitoring sites represent pollution of a limited area, and installing multiple devices for a vast area is costly. This limitation of unavailability of data at non-monitoring sites has necessitated the Spatio-temporal analysis of air pollution and its prediction. Few commonly used methods for Spatio-temporal prediction of pollutants include - ‘Averaging’; ‘Best correlation coefficient method’; ‘Inverse distance weighting method’ and ‘Grid interpolation method.’ Apart from these conventional methods, a new methodology, ‘Weighted average method,’ is proposed and compared for air pollution prediction at non-monitoring sites. The weights in this method are calculated based on both on the distance and directional basis. To compare the proposed method with the existing ones, the air pollution levels of NO 2 (Nitrogen dioxide), O 3 (Ozone), PM 10 (Particulate matter of 10 microns or smaller), PM 2.5 (Particulate matter of 2.5 microns or smaller), and SO 2 (Sulphur dioxide) were predicted at the non-monitoring site (test stations) by utilizing the available data at monitoring sites in Delhi, India. Preliminary correlation analysis showed that NO 2 , PM 2.5 , and SO 2 have a directional dependency between different stations. The ‘average’ method performed best with the mode RMSE of 18.85 μg/m 3 and R 2 value 0.7454 when compared with all the methods. The RMSE value of the new proposed method ‘weighted average method’ was 21.25 μg/m 3 , resulting in the second-best prediction for the study area. The inverse distance weighting method and the Grid interpolation method were third and fourth, respectively, while the ‘best correlation coefficient’ was the worst with an RMSE value of 41.60 μg/m 3 . Results also showed that the methods that used dependent stations had performed better when compared to methods that used all station data.


INTRODUCTION
Pollutants, the undesired substances, degrade the ecosystem with their adverse effects and cause pollution.When concerned with the imbalance in the atmosphere, this pollution is called air pollution.It kills an estimated seven million every year, with low-and middle-income countries suffering the most (Osseiron and Lindmeier, 2018;WHO, 2018).A significant relationship between atmospheric pollutants and health hazards is reported, and their exposure increases the risk of cardiovascular diseases, fertility issues, and mental health in particular (Chen et al., 2018;Manan et al., 2018;Merklinger-Gruchala et al., 2017;Curtis et al., 2006;Mortimer et al., 2002;Wong et al., 2001).Particulate matter variation studies in India have shown Delhi as the most polluted city when compared to Kolkata, Mumbai, Hyderabad (Singh et al., 2021), with values of PM 2.5 (Particulate matter of 2.5 microns or smaller) and PM 10 (Particulate matter of 10 microns or smaller) exceeds National Ambient Air Quality Standards (NAAQs) 60 mg/m 3 (Guo et al., 2019).Studies have attributed agricultural fires as one of the reasons for this (Cusworth et al., 2018).Furthermore, the existence of the relationship between the outbreak of COVID-19 and air pollution is also reported (Roy, 2021).Moreover, an increase in levels of NO 2 (Nitrogen dioxide), O 3 (Ozone), and SO 2 (Sulphur dioxide) have potential health concerns (WHO, 2018).
Considering the harmful nature of these pollutants, it is important to monitor these pollutants.However, due to the high cost of placing monitoring instruments everywhere, sometimes the pollutants can be predicted at a non-monitoring site using various available methods.There have been a number of studies around the globe that have predicted these criterion air pollutants using machine learning (Qi et al., 2019;Wen et al., 2019;Yeganeh et al., 2018;Fan et al., 2017;Zou et al., 2015;Papaleonidas and Iliadis, 2013;Rigol et al., 2001) and Regression models (Kerckhoffs et al., 2021;Boaz et al., 2019;Wang and Song, 2018;Alam and McNabola, 2015;Russo and Soares, 2014;Dominick et al., 2012;Crouse et al., 2009).
There have also been some studies, wherin a comparison of various methods to predict these pollutants have been done.The spatial interpolation methods such as geostatistical methods (various kriging methods), local interpolators (Thiessen polygons, IDW, splines), global interpolators (trend surfaces or regression models), and mixed methods were analyzed by Vicente-Serrano et al. (2003).Spatial interpolation methodologies were summarized for urban air pollution modeling based on the application for the greater area of metropolitan Athens, Greece (Deligiorgi and Philippopoulos, 2011).The proposed methodologies include Nearest neighbor method, Triangulated irregular network method, Natural neighbor method, Inverse distance weighting (IDW) method (with linear and squared IDW), Radial basis function (RBF) method, Thin plates splines method, Kriging method, and Artificial neural network (ANN) method.The application of spatial interpolation methods was given to many disciplines (Li and Heap, 2011).A total of 72 spatial interpolation methods were analyzed, and comparative performances were provided in their study by Li and Heap (2014).They included the most frequently used methods as Inverse distance weighting (IDW), ordinary kriging (OK), and ordinary co-kriging (OCK) also.The spatiotemporal prediction at the non-monitoring site was presented by Alimissis et al. (2018) using interpolation methods to determine the air pollution at a new location.
Some previous studies in India focused on the missing data prediction using previous data of pollutants or meteorological data or their combinations as predictors.Singh et al. (2012) provides a comparison of the linear (PLSR) and nonlinear (MPR, MLPN, RBFN, and GRNN) models for urban air quality prediction (RSPM, SO 2 , NO 2 ) using data of temperature, relative humidity, wind speed, SPM, NO 2 , and SO 2 data and shown more remarkable performances of GRNN models than the MLPN and RBFN.Nagendra andKhare (2006, 2005) compared among three choices of input data sets for the prediction of NO 2 firstly, using meteorological and traffic data, secondly, using only meteorological data, and lastly using only traffic data.MLR and PCA + ANN models were evaluated with statistical analysis by Mishra and Goyal (2015) for NO 2 forecasting models at Taj Mahal, Agra using NO 2 , SO 2 , temperature, CO, O 3 , RH, WS, and WDI (wind direction index) as predictors.The performances of PCA approaches were found better than the MLR in their analysis.The performance for the MLP model was found better as compared to RBF and GRNN models for all seasons (Kumar et al., 2017).
Thus, the prediction studies of air pollution have been done for a particular location to evaluate the missing data using the pollution data and meteorological data of that location or a combination of both.The studies have also been done to predict air pollution at a location using pollution data of other sites and other predictors.However, the studies predicting the air quality at a new location (other than monitoring site) using spatial interpolation methods have not been done in India.
In this study, the prediction of air quality parameters at a location of a non-monitoring site is presented over Delhi, India.The air pollution data of the neighboring monitoring sites have been used for the predictions of NO 2 , O 3 , PM 10 , PM 2.5 , and SO 2 .The specific objectives of the present study are: (i) Prediction of criteria air pollutants (NO 2 , O 3 , PM 10 , PM 2.5 , and SO 2 ) at non-monitoring sites in Delhi using five different methods (average method, best correlation coefficient method, weighted average, inverse distance weighting method, and grid interpolation method).(ii) To compare the performance of these five methods on the basis of prediction efficiency.

STUDY AREA AND DATA USED
The present study is conducted for Delhi, India; located between 28.42°N to 29.00°N latitude and 76.86°E to 77.36°E longitude (Fig. 1).Delhi is one of the major cities in India, with rapid growth in population, traffic, industrialization and construction activities.This has led to higher energy consumption.Moreover, the availability of alternate energy sources is limited, thus further increasing the air pollution in Delhi.The air quality observation data for the study area is obtained from CPCB (Central Pollution Control Board of India) at 'https://app.cpcbccr.com/AQI_India/' , and the same is also provided at 'https:// openaq.org/' .Table 1 shows the statistical summary of fifteen minutes scaled dataset of the five pollutants (PM 2.5 , PM 10 , NO 2 , SO 2 , and O 3 ) for the 46 stations over Delhi in the period of January 1, 2018, to October 30, 2020.
Analysis showed that, only 25 of these stations had missing data less than 75%, which were used for prediction and validation.The top six monitoring stations (Table 2) having maximum missing data out of the 25-monitoring station was reserved for validation, while the remaining 19 monitoring station with the lowest missing per-    centage was used for prediction model development purposes (Table 3).

METHODOLOGY
In this study five different methods (Simple Averaging, Best Correlation Coefficient Method, Weighted Average Method, Inverse Distance Weighting method, and Grid Interpolation Method) have been evaluated in terms of prediction efficiency.Theses are the methods which are commonly used in the previous studies (Jin et al., 2011).The correlation analysis was performed to check the dependency among the chosen monitoring stations.The correlation analysis was performed between 19 monitoring stations using the entire data at 15 min intervals at seasonal (Winter: January and February), Summer: March, April, and May, Monsoon: June to September, and Postmonsoon period: October to December) and annual scales.

1 Correlation Analyses
The correlation coefficients between all possible pairs of 19 observation monitoring stations were evaluated for each season and pollutant.The value of the correlation coefficient to be satisfactory depends on the purpose for which it is used and the nature of raw data.The broad classification of correlation coefficients for their correlation strength is given in Asuero et al. (2006).The pair of stations showing average or high correlation was treated to have a positive dependency.A linear model was established among the correlation coefficients between each pair as output variable, distances between two stations, and direction of the line joining the two stations as input variables.
(1) where, r(O i , O j ) is the correlation coefficient between the observation of the i th and j th observation monitoring stations.
If the spatial coordinates of the i th and j th observation monitoring stations are given as is the linear distance between the i th and j th observation monitoring stations whose mathematical expression given by is the direction of the line joining i th and j th observation monitoring stations, whose mathematical expression is given as The constants C 1 , C 2 , and C 3 are determined by multiple linear regression among correlation coefficient, linear distance, and direction of the line joining two observation stations for each season and pollutants.Thus, the possible correlation coefficient between any test station and observation station for any pollutant can be evaluated in any season using the established linear model.
is the linear distance, and θ(S T i , S O j ) is the direction of the line joining between i th test station and j th observation monitoring station.

2. 1 Averaging Method
The model established in the correlation analysis was used to evaluate the dependency of pollutants concentration of observation stations (monitoring stations) on the test station pollutants concentration based on the spatial characteristic of the test station with respect to observation stations.The correlation coefficient was evaluated between test stations and observation stations for each pollutant using the established linear model (Equation 4).
This method gives the average value of the pollutant of nearby monitoring stations, which shows positive dependency (R>0.5) with the site to be predicted.It can be mathematically expressed as where, C(S T i , t 0 ) is the pollutant concentration at i th test station on time t 0 , and C(S O j , t 0 ) is the pollutant concentration at the j th dependent observation station on time t 0 .

2. 2 Best Correlation Coefficient Method
The linear model was established among the correlation coefficients as output variable, distances between two stations, and direction of the line joining the two stations as input variables (Equation 4).Using this linear model, the correlation coefficients between the desired point location and available data point locations are predicted from the distances and directions.
This method states that the data point of the best correlation station with the desired location should be considered the predicted values for the desired location.( , ) max( ( , ), ( , ),......
where, C(S T i , t 0 ) is the pollutant concentration at i th test station on time t 0 , C(S O k , t 0 ) and is the pollutant concentration at k th dependent observation station having maximum correlation coefficient on time t 0 .

2. 3 Weighted Average Method
The correlation coefficient was evaluated between test stations and observation stations for each pollutant in each season using the established linear model (Equation 4).The pollutants concentrations were assumed to be linearly dependent on the nearby dependent observation stations.
. .( , ) where C(S T i , t 0 ) is the pollutant concentration at i th test station on time t 0 , and C(S O j , t 0 ) is the pollutant concentration at the j th dependent observation station on time t 0 .C f is the observation station data, and M D is its dependent station data matrix.B is the coefficients of proportionality between C f and M D .The coefficients of linearity (weights for each dependent observation station) for any observation station are assumed to be linearly dependent on the distance of the observation station and the direction of the line joining that observation station.
Now the coefficients B between test station and observation stations are back-calculated using K 1 and K 2 .
Thus, in this method, mass concentration pollutants at test are given as a sum of multiplication of pollutants concentration data of dependent observation stations with their weights (proportionality coefficients determined earlier for each observation station).

2. 4 Inverse Distance Weighting (IDW) Method
The principle of this method is that the observation station having more distance from the test station will affect the least to test station and vice versa.In this method, the weights are distributed among the dependent observation stations according to their inverse distance from the test station.The dependency of the observation stations is determined from the correlation analysis (Equation 4).The generic equation for the Inverse Distance Weighting (IDW) method (Bartier and Keller, 1996) was given as where C(S T i , t 0 ) is the pollutant concentration at i th test station on time t 0 , and C(S O j , t 0 ) is the pollutant concentration at j th dependent observation station on time t 0 .

2. 5 Grid Interpolation Method
In this method, the grid interpolation (using the nearest neighbor method) has been done for query points (entire grid) over the scattered pollutants data by the fitting surface over the area.The mass concentration data obtained by interpolation at the query point (location at which the pollution concentrations are to be predicted) is considered the predicted data.

a. Grid interpolation with dependent station data
The observation stations dependent on the particular test station (R>0.5) were used as input data for interpolation in a specific season for a specific pollutant.

b. Grid interpolation with all station data
The data of all observation stations were used as input for the interpolation in a particular season for a specific pollutant.

1 Correlation Analyses
The performance analyses in this study were carried using the Root Mean Square Error (RMSE) and Correlation Coefficient (R 2 ).Considering O and P as the observed and predicted concentrations, the formula for RMSE and R 2 are as shown below:  The correlation plots between 19 observation stations for different pollutants are shown in Figs. 2(a) to (e).Lower value of correlation coefficient does not provide sufficient relation, hence, in this study a threshold value of 0.5 was considered.All values having correlation coefficient less than 0.5 were removed in this study.The color and width of the line represent the magnitude of the correlation between the two observation stations.The larger width and red color signifies higher correlation whereas, the smaller width and blue color represent a lower correlation.
It can be seen that the correlations are strongest for PM 10 (Fig. 2(c)) and PM 2.5 (Fig. 2(d)), whereas weakest for SO 2 (Fig. 2(e)).However, a weak directional dependency was also observed for SO 2 along the NW-SE direction, as the strength of the correlation is always lower in this direction.
The number of pairs of stations showing a good correlation with their frequency of occurrence along with five pollutants and five seasons is shown in the histogram (Fig. 3).Dwarka-Sector 8, Delhi -DPCC shows a good correla-tion in 24 out of 25 cases (for five pollutants along five seasons) with Jawaharlal Nehru Stadium, Delhi -DPCC always a good correlation with the Jawaharlal Nehru Stadium monitoring station.The monitoring station pairs having good correlation in 23 out of 25 cases are given in Table 4.

2 Prediction of Pollutant's Concentration
The predictions of the pollutant's concentration at 15 minutes intervals have been made for three years by five various methods in all seasons for five pollutants at six validation stations.For the purpose of inter-comparison between different pollutants, instead of RMSE, the PRMSE (Percentage RMSE) was used to prepare the box plots.These box plots were prepared both for 15-minute interval and daily averages.

2. 1 Nitrogen dioxide (NO 2 )
The box plots showing PRMSE for NO 2 in different seasons and by different methods are shown in Fig. 4. Overall, the weighted-average prediction method shows the best performance having the lowest median PRMSE and least interquartile range for three seasons.However, on the annual scale, even the Average method shows better consistency and accuracy.
Due to the conversion of the 15 minutes interval prediction data into daily data, the interquartile range and median RMSE have been reduced to 12.92 μg/m 3 on average (Fig. 5).It can still be seen that the pattern of performance among the methods of prediction is similar to 15-minute interval case.

2. 2 Ozone (O 3 )
The presence of outliers shows the variations of predictions at different test stations are more in the summer sea-   and 7).All the methods of prediction have almost similar behavior for the prediction of O 3 .The missing box plot for any method shows the unpredictability of that method due to the lack of dependent observation station data (R>0.5) in the respective season.

2. 3 Particulate Matter (PM 10 )
The RMSE of predictions is greater for PM 10 than NO 2 and O 3 .The weighted average method has worse prediction (highest median PRMSE and interquartile range in all seasons) than other methods (Figs. 8 and 9) in all seasons for the prediction of PM 10 .

2. 4 Particulate Matter (PM 2.5 )
The presence of outliers shows the more significant variations of predictions at different test stations in every season for the prediction of PM 2.5 (Figs. 10 and 11).All the predictions methods have almost similar behaviour for the prediction of PM 2.5 except the Weighted average method in some seasons (winter and post-monsoon).The prediction by the Weighted average method is the worst among all the methods for PM 2.5 prediction.

2. 5 Sulphur dioxide (SO 2 )
The missing box plot for any method shows the unpre-  dictability of that method due to the lack of dependent observation station data in the respective season.All the methods of prediction have almost similar behaviour for the prediction of SO 2 (Figs. 12 and 13).

3 Performance of Methodologies
The RMSE and R 2 of predictions were calculated for each prediction method in each season for each pollutant and at each test station.These RMSE and R 2 were grouped for each methodology, and histogram plots  were shown representing the frequencies of occurrence of RMSE with mode, median and average values.
The performance of the Average method was found to be having the best predictions most of the time with mode RMSE of 18.85 (Fig. 14), R 2 of 0.74 (Fig. 15), and the performance of the Best correlation coefficient (BCC) method was found to be having the worst predictions most of the times with mode RMSE of 41.60 (Fig. 14).These results are similar to some of the previous studies (Alam et al., 2015: max R 2 = 0.66; Crouse et al., 2009: max R 2 = 0.8; Qi et al., 2019: max R 2 = 0.72).
The scatter plot of the prediction having least and  The least RMSE has been achieved to predict PM 2.5 by Grid interpolation using all observation station data in monsoon season (Fig. 16(d)) with a high correlation coefficient of 0.88.In contrast, the least RMSE has been achieved to predict SO 2 by Grid interpolation using all observation station data in monsoon season (Fig. 16(e))

SUMMARY
The increasing availability of historical data, the num-ber of monitoring stations, and computing resources have facilitated us to develop more advanced models for air pollution prediction.In this study, a methodological comparison to predict the mass concentration of NO 2 , O 3 , PM 10 , PM 2.5 , and SO 2 at any non-monitoring site was carried out.
A correlation analysis was done to check the interdependency of pollutants concentration data among the different observation stations.The correlation matrix was calculated based on the pollutant's concentration data of three years.Further, a model is established based on the correlation coefficients between any pair, their distance, and the direction of the line joining among the 19 monitoring stations of Delhi.This model predicts the strength of correlation between non-monitoring sites and 19 monitoring stations of Delhi.
Five different methods were adopted to predict mass concentrations at non-monitoring sites using the mass concentration data of those monitoring sites, showing a good correlation with non-monitoring sites.Six additional monitoring stations' mass concentrations data were used to validate predicted mass concentrations at those sites.The percentage errors and RMSE were calculated, and comparison was carried out among the methodologies in different seasons for each pollutant.
The correlation between the two monitoring stations has no relation with the direction of their spatial position (direction of the line joining them).The results showed that a simple average on the dependent station model performed best over other methods.The performance of the proposed method is also consistent both for optimal prediction at all stations and has the second minimum mode RMSE among different prediction methods.These methods can be used in future studies and other regions for air pollutant prediction.

CONCLUSIONS
The conclusions of the present study can be enumerated as: 1) Based on the correlation analysis, we found that the stations showed higher correlations for the Particulate matter (PM 2.5 and PM 10 ) compared to other pollutants.2) Overall, the Grid interpolation method with dependent station was found the best (lowest median RMSE = 5,107.5μg/m 3 ) whereas the Weighted Average was the worst (maximum RMSE = 9,734.9μg/ m 3 ).
3) Averaging is found to be the best prediction method based on mode RMSE( = 18.85 μg/m 3 ), whereas the Best correlation coefficient method was found to be the worst prediction method (mode RMSE 41.60 μg/m 3 ).4) Overall, the prediction for SO 2 was the best among all the pollutants whereas for O 3 , it was the worst.Considering different seasons, it was easier to predict the pollutants in the monsoon season, whereas it was most difficult in the post monsoon season.This can be attributed to corresponding level of pollution in these seasons.5) The methods that used dependent station data (Average, IDW, and GI DS) always have greater mode R 2 , whereas the methods that used all station data (BCC, WA, and GI AS) always have mode R 2 smaller as 0.1 (Fig. 15).

LIMITATIONS
One of the major limitations of this study is the lack of available information about point source pollution.If emission data about these major point sources were available, proper weights could be assigned to these stations based on distance and direction from the source of pollution.Moreover, the distances between the stations is small, therefore the outcomes of the study may be limited only up to the spatial extent of the study area.

Fig. 1 .
Fig. 1.The location of Delhi (the capital) in India showing 19 observation stations and 6 test stations.

Fig. 2 .
Fig. 2. Spatial plot of observation stations showing the strength of correlation coefficient calculated on annual data over three years using colour and weight of the line for NO 2 , O 3 , PM 10 , PM 2.5 and SO 2 .

Fig. 3 .
Fig. 3. Histogram showing the frequency of the combination of observation station pairs with the number of times having good strength of correlation out of 25 cases (for five pollutants in 5 seasons).

Table 4 .
Pair of observation stations showing good strength of correlation 23 times out of 25 cases.son for the prediction of O 3 (Figs.6

Fig. 4 .
Fig. 4. Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for NO 2 .

Fig. 5 .
Fig. 5. Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for NO 2 data (15 minutes data) converted to daily data.

Fig. 6 .
Fig. 6.Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for O 3 .

Fig. 7 .
Fig. 7. Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for O 3 data (15 minutes data) converted to daily data.

Fig. 8 .
Fig. 8. Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for PM 10 .

Fig. 9 .
Fig. 9. Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for PM 10 data (15 minutes data) converted to daily data.

Fig. 10 .
Fig. 10.Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for PM 2.5 .

Fig. 11 .
Fig. 11.Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for PM 2.5 data (15 minutes data) converted to daily data.

Fig. 12 .
Fig. 12. Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for SO 2 .

Fig. 13 .
Fig. 13.Box plot of PRMSE of predictions by various methods at six validation stations in various seasons for SO 2 data (15 minutes data) converted to daily data.

Fig. 14 .
Fig. 14.Histogram plot showing frequencies of RMSE over various pollutants, seasons, and test stations for each prediction method.

Fig. 15 .
Fig. 15.Histogram plot showing frequencies of R 2 over various pollutants, seasons, and test stations for each prediction method.

Table 1 .
Statistical Summary of the five pollutants used in the study area.

Table 2 .
Test stations' names and locations.

Table 3 .
Observation Stations' names and locations.