Modelling PM2.5 for Data-Scarce Zone of Northwestern India using Multi Linear Regression and Random Forest Approaches

ABSTRACT PM2.5 (particulate matter with aerodynamic diameter <2.5 µm) concentrations above the permissible limit degrade air quality and harm human health. Due to the lack of a dense spatial network of ground-based PM monitoring sites and of systematic monitoring, the availability of continuous PM2.5 concentration data at macro and meso scales is restricted. The present research estimated PM2.5 concentrations at high (1 km) resolution over Faridabad, Ghaziabad, Gurugram and Gautam Buddha Nagar, a data-scarce zone of the highly urbanized area of northwestern India, for the year 2019 using Random Forest (RF) and Multi-Linear Regression (MLR) models and a Hybrid model combining RF and MLR. The inputs included Aerosol Optical Depth (AOD), meteorological data and limited in-situ PM2.5 data. For validation, the correlation coefficient (R), Root-Mean-Square Error (RMSE), Mean Absolute Error (MAE) and Relative Prediction Error (RPE) have been utilized. The hybrid model estimated PM2.5 with a greater correlation (R = 0.865) and smaller RPE (22.41%) compared to the standalone MLR and RF models. Despite the inadequate in-situ data, Greater Noida has been found to have a high correlation (R = 0.933) and low RPE (32.13%) in the hybrid model. The most polluted seasons of the year are winter (137.28 µgm−3) and post-monsoon (112.93 µgm−3), whereas the wet monsoon season (44.56 µgm−3) is the cleanest. The highest PM2.5 level was recorded in Noida, followed by Ghaziabad, Greater Noida and Faridabad. The findings of the present research will provide an input dataset for air pollution exposure risk research in parts of northwestern India with sparse monitoring data.


Introduction
Air pollution has become a worldwide concern in the era of modern-age cities. Besides natural sources of air pollution, intense anthropogenic activities play an important role in degrading air quality. Particulate matter (PM) is a mixture of liquid and solid particles of varying sizes, origins and structures suspended in the air and is one of the major air pollutants. Fine particulate matter with a diameter of less than 2.5 µm (PM2.5) affects human health the most because it enters the respiratory system directly and can cause lung-related disorders and cardiovascular diseases (Dockery et al. 1993). Besides its impact on human health, PM adversely affects vegetated surfaces by increasing leaf temperature and reducing the supply of sunlight for photosynthesis (Grantz, Garner, and Johnson 2003; Diener and Mudu 2021). Further, it hampers aquatic life by causing acid rain, which disrupts the functioning of the ecosystem (Wu and Zhang 2018; Upadhyay 2020).
Considering the adverse impact of PM2.5 on living organisms, the abiotic environment and human life, numerous ground-based monitoring stations have been established by the Central Pollution Control Board (CPCB), State Pollution Control Board (SPCB) and other agencies to measure PM2.5 concentration in urban areas. However, the deployment of air quality monitoring instruments has been restricted to the urban areas of a district, although suburban and rural areas of the districts located mainly in the arid and semi-arid zones also suffer from air pollution (Sharma et al. 2021; Ranjan, Sharma, and Ghosh 2022). Therefore, it has become vital to estimate PM2.5 concentration at the mesoscale to assist urban planners and policymakers in making air quality management plans for both the urban and rural areas of a district. The present research also attempts to bridge the gap left by irregular monitoring of PM2.5 at too few monitoring stations, which could be a step towards resolving the issue of missing data in air quality studies.
Satellite-based observation could play a major role and complement the limited existing in-situ data. As satellite imagery cannot directly acquire PM2.5 data, satellite-derived aerosol optical depth (AOD) is routinely utilized to estimate and forecast PM2.5 concentrations, particularly in places where ground-based monitoring stations are scarce or unavailable (Chitranshi, Sharma, and Dey 2015a; Chelani 2019; 2020; Guo et al. 2021). AOD at high spatial resolution with daily worldwide coverage from satellite sensors (Moderate Resolution Imaging Spectroradiometer (MODIS), Multi-angle Imaging SpectroRadiometer (MISR), Visible Infrared Imaging Radiometer Suite (VIIRS), etc.) has been utilized in several studies to estimate PM2.5 concentration (Li, Carlson, and Lacis 2015; Meng et al. 2018; Pal et al. 2018; Yao et al. 2018; Chelani 2019; Chowdhury et al. 2019; 2020). MODIS-based AOD has been retrieved using multiple algorithms (Deep Blue (DB) and Dark Target (DT)) and has successfully predicted ambient PM2.5 concentration at 10 km and 3 km on a regional scale (Bilal, Nichol, and Spak 2017), but not on a local scale. To address this limitation, a more recent MODIS aerosol retrieval algorithm, the Multi-Angle Implementation of Atmospheric Correction (MAIAC), retrieves AOD at high spatial resolution (1 km grid) with global coverage and a daily revisit period for each instrument. Such improved-resolution retrieval of AOD has benefited the estimation of PM2.5 concentration at the micro-level using numerous modelling approaches.
Among such models, a univariate regression model, which fits a linear relationship between satellite-derived AOD and limited in-situ measurements of PM2.5, is widely used. The resulting linear regression equation is used to estimate PM2.5 concentrations (Banerjee, Ghose, and Pradhan 2018; Chelani 2019; Harper et al. 2021). However, earlier research reported that temporal variation of several parameters (relative humidity (RH), temperature, wind speed, land use, etc.) could influence this relationship (Fung and Wu 2014; Chitranshi, Sharma, and Dey 2015b; Chelani 2019; 2020; Guo et al. 2021). Therefore, to improve the accuracy of the estimated PM2.5, other significant elements besides AOD, viz. meteorological variables and land use characteristics, should be incorporated into PM modelling.
Combining AOD and other variables, multivariate linear regression (MLR) has been used in numerous studies to estimate PM2.5 (Chitranshi, Sharma, and Dey 2015a; Li and Wang 2017; Zhao et al. 2018; Chelani 2019). MLR has been used because of its simple form and easy computation and interpretation. However, MLR is not appropriate for identifying the nonlinear links that exist between response variables and driving factors. Consequently, machine learning (ML) approaches, viz. artificial neural network (ANN), support vector machine (SVM) and Random Forest (RF), have gained visibility in the assessment of PM2.5 (Hu et al. 2017; Li et al. 2022). Machine learning approaches have a simple structure, good fitting ability and high estimation accuracy, and they quantify both linear and nonlinear correlations between a response variable and driving factors (Guo et al. 2014). ML-based models can combine many predictors with few prior assumptions and deal with complex AOD-PM2.5 interactions. Compared to the other ML-based approaches mentioned above, the RF model is highly accurate, makes fewer assumptions about data distribution, resists overfitting, deals with data noise and has a low error rate (Hu et al. 2017; Wei et al. 2019; Xu et al. 2020). PM2.5 has been effectively estimated and predicted using RF models worldwide, including in China, the United States and Italy. In contrast to the individual models, hybrid models have been used in only limited research, which demonstrates their effectiveness both in data-scarce zones and in areas with adequate data (Zimmerman et al. 2018; Yuchi et al. 2019). However, in the context of India, limited research is available on the use of the RF model and hybrid models to estimate PM2.5 (Maheshwarkar and Raman 2021; Gulia, Nagendra, and Khare 2017; Kumar, Mishra, and Singh 2020; 2020). Consequently, the present research has tested three models (MLR, RF and Hybrid) in the data-scarce zone of northwestern India.
In northwestern India, most studies have focused on Delhi (Kumar, Mishra, and Singh 2020; Mandal et al. 2020). As Delhi is highly polluted and holds strategic importance, it has a large network of ground stations that facilitates systematic monitoring and data generation for PM2.5. In contrast, the current study area suffers from low ground station coverage, which restricts in-situ data collection. It is important to note that Delhi's pollution also affects the present study area through the interconnection of meteorological variables (wind direction, wind speed, temperature, relative humidity, etc.). Thus, there is an urgent need to study the PM2.5 concentration in these areas.
Despite high pollution levels (annual mean PM2.5 of 85 µgm−3 for Faridabad, 110.2 µgm−3 for Ghaziabad, 97.7 µgm−3 for Noida and 91.3 µgm−3 for Greater Noida) and poor air quality (varying from moderate to poor) (Kumari, Lakhani, and Kumari 2020), no study, to the best of our knowledge, has focused on the current study area independently for prediction of PM2.5 using an advanced statistical or machine learning methodology. Therefore, the present research attempts to estimate PM2.5 by establishing a relationship between PM2.5 concentration, AOD MAIAC and the corresponding meteorological parameters using MLR, RF and a Hybrid model combining RF and MLR. After validating the models, we have used the best model to predict ambient PM2.5 concentrations and to show the spatiotemporal distribution of PM2.5 concentrations, which is crucial for understanding urban and rural air quality.

Study area
The present study region (28.07°N-28.92°N and 76.65°E-78.21°E) covers four districts (Faridabad and Gurugram in Haryana, Ghaziabad and Gautam Buddha Nagar in Uttar Pradesh) in the northwestern part of India, which are highly urbanized and densely populated (Figure 1). Rapid economic and demographic expansion in the region is linked to various anthropogenic activities. At least one ambient air quality observatory is operated by the Central Pollution Control Board (CPCB) and the State Pollution Control Board (SPCB), in the city area only, for all the districts except Gurugram, where station data are not available for the current study period. Due to rapid urbanization, the present study area and its surroundings are dominated by severe industrial pollution, automobile emissions, fossil fuel burning, human activities and other issues (Sharma and Joshi 2016; Horo and Punia 2018; Ghosh et al. 2021; Kumar et al. 2021; Sharma et al. 2021; Ranjan, Sharma, and Ghosh 2022). Ghaziabad is the most polluted city in the South Asian area, followed by Noida (5th), Gurugram (6th), Greater Noida (8th) and Faridabad (11th) (WAQR 2019). Despite its massive development and severe levels of air pollution, scientific study related to particle pollution is still scarce in this region.

Data used and methodology
Descriptions of the satellite-derived AOD, the meteorological data and the locations of the ground-based stations measuring PM2.5 are provided in Table 1.
The spatial resolution of the MAIAC aerosol products is higher than that of operational MODIS aerosol products based on other algorithms. According to several validation experiments, the MAIAC algorithm enhances aerosol retrieval accuracy, notably over bright surfaces such as urban concrete and dry soil. In this study, AOD has been extracted at 550 nm using the combined Terra and Aqua MAIAC product (MCD19A2; https://ladsweb.modaps.eosdis.nasa.gov/) for the morning (10:00 am local time) and afternoon (2:00 pm local time). Only the highest-quality data, as indicated by the cloud mask value 'clear' in the Quality Assurance (QA) layer, have been used. AOD MAIAC has first been validated against AOD AERONET from two AERONET stations (Gual Pahari and Amity University, Gurugram) available in the current study area. To fill the data gaps, multiple imputation by chained equations (MICE) has then been applied. Finally, outliers have been removed by the quartile method, based on the first and third quartiles of the data.
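The quartile-based outlier screening can be illustrated with a short sketch. The text does not state the exact rule, so the common 1.5 × IQR fence is an assumption here, and the function name `iqr_filter` and the sample values are hypothetical.

```python
from statistics import quantiles

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional fence and an assumption; the study only
    says that the first and third quartiles were used.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Hypothetical daily PM2.5 values; 900 is an obvious spike.
pm25 = [42, 55, 61, 48, 73, 66, 58, 900, 51, 64]
cleaned = iqr_filter(pm25)
```

In practice the same filter would be applied per variable (AOD, PM2.5 and each meteorological series) before the matchup.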

In-situ PM 2.5 and meteorological data
Ground-based monitoring stations are sparsely distributed across the study region. Hourly mean PM2.5 data have been collected from four air quality monitoring sites (Figure 2) of the Central Pollution Control Board (CPCB) from 1 January 2018 to 31 December 2019. After excluding anomalous observations, we estimated the daily mean PM2.5 concentrations. Moreover, data on meteorological variables (ambient temperature (Temp), wind speed (WS), wind direction (WD) and relative humidity (RH)) have been collected from the four ground stations for the same period (1 January 2018 to 31 December 2019), filtered and averaged across the 10 am-2 pm period, corresponding to the combined Terra and Aqua satellite overpass times.
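As a rough illustration of the overpass-window averaging described above, the sketch below averages the hourly readings that fall between 10 am and 2 pm. Treating both end hours as inclusive and skipping missing readings are assumptions, and `overpass_mean` and the sample records are hypothetical.

```python
from statistics import fmean

def overpass_mean(hourly, start_hour=10, end_hour=14):
    """Mean of readings with start_hour <= hour <= end_hour; None if none exist.

    `hourly` is a list of (hour, value) pairs; value may be None when missing.
    """
    window = [v for h, v in hourly
              if start_hour <= h <= end_hour and v is not None]
    return fmean(window) if window else None

# Hypothetical (hour, relative humidity in %) records for one station-day.
rh = [(8, 71.0), (10, 64.0), (11, 60.0), (12, None),
      (13, 55.0), (14, 53.0), (16, 50.0)]
rh_overpass = overpass_mean(rh)
```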
Finally, after the matchup, the combined PM2.5, AOD and meteorological dataset comprised 2038 records (1019 for 2018 and 1019 for 2019) distributed from 1 January 2018 to 31 December 2019. The multi-linear regression (MLR) model, the random forest (RF) model and the hybrid model have been applied to estimate PM2.5, using 80% of the 2038 records for model calibration and 20% for model validation.
The performance of all three models has been examined based on the correlation of the estimated PM2.5 with the observed PM2.5. Finally, a PM concentration map has been generated using the estimates from the best-performing model.
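The 80/20 calibration and validation split can be sketched as a random partition of record indices. The fixed seed and the helper name `split_indices` are illustrative assumptions; the study does not report how the random split was generated.

```python
import random

def split_indices(n, train_frac=0.8, seed=42):
    """Shuffle 0..n-1 reproducibly and cut into training and validation indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # seed is an arbitrary, hypothetical choice
    cut = int(train_frac * n)
    return idx[:cut], idx[cut:]

# 2038 matched records, as reported in the text.
train_idx, test_idx = split_indices(2038)
```

With 2038 records this yields 1630 calibration and 408 validation records.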

Multi-linear regression (MLR) model
A multi-linear regression relationship between PM2.5 and its driving forces (AOD, temperature (Temp), relative humidity (RH), wind direction (WD) and wind speed (WS)) is established using the day-wise data available for the years 2018 and 2019 from the ground stations and displayed as regression equation (1).
Where β = intercept, ε = error term, Temp = temperature in degrees, RH = relative humidity in per cent, WD = wind direction in degrees, WS = wind speed in kilometres/hour, d = day, and α1 to α9 = the coefficients.
Figure 2. Flow chart of the workflow.
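To make the least-squares fitting behind an MLR model concrete, the following sketch solves the normal equations in plain Python. It is not the study's equation (1): it uses only two predictors (AOD and RH), and the data and coefficients are invented so that the fit can be checked against known values.

```python
def fit_ols(X, y):
    """Return [intercept, coef_1, ...] minimising the sum of squared residuals."""
    rows = [[1.0] + list(r) for r in X]            # prepend intercept column
    p = len(rows[0])
    # Augmented normal-equation system [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(p):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] for i in range(p)]

# Hypothetical (AOD, RH) predictors; PM2.5 generated from made-up coefficients.
X = [(0.2, 40.0), (0.5, 60.0), (0.9, 35.0), (0.4, 80.0), (0.7, 55.0)]
y = [30.0 + 80.0 * aod - 0.5 * rh for aod, rh in X]
beta = fit_ols(X, y)   # approximately [30, 80, -0.5]
```

Because the synthetic response is exactly linear, the recovered coefficients match the generating ones up to rounding; real data would leave a non-zero error term ε.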

Random Forest (RF) model
A random forest regressor is a machine learning-based algorithm that aggregates multiple decision tree regressors. In the RF model, random sampling has been used to create bootstrapped datasets. With the bootstrapped datasets, multiple decision trees have been trained individually, using the best split among randomly selected variables at each node and the subsequent bifurcation of each node into two sub-nodes. Two specifications are required in the RF model: m try, the number of variables considered for the bifurcation of each node, and n tree, the number of trees in the model. In the current study, the reported model accuracy was obtained for m try = 3 and n tree = 1500. In the end, the response for a given dataset is estimated as the average of the regression outcomes from all decision trees. Such aggregation of multiple decision trees minimizes the error due to bias and variance by combining weak learners into a strong learner. The RF model also incorporates in-situ PM2.5, AOD MAIAC, Temp, RH, WD and WS as the variables. To examine the estimation accuracy of the model, several statistical indices, viz. correlation coefficient (R), mean absolute error (MAE), root mean squared error (RMSE) and relative percentage error (RPE), have been derived between the estimated and observed PM2.5.
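The bootstrap-and-average idea can be shown with a deliberately simplified sketch. Each "tree" below is a single split (a depth-one stump) chosen among `m_try` randomly selected variables on a bootstrapped sample, and the forest prediction is the mean over all stumps. Real random forests grow deeper trees; the synthetic data and the small `n_tree` are assumptions made for brevity and are not the study's configuration.

```python
import random
from statistics import fmean

def fit_stump(X, y, feature_ids):
    """Best single split (feature, threshold, left mean, right mean) by squared error."""
    best = None
    for j in feature_ids:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            ml, mr = fmean(left), fmean(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:                       # every candidate feature was constant
        m = fmean(y)
        return (0, float("inf"), m, m)
    return best[1:]

def fit_forest(X, y, n_tree=50, m_try=3, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_tree):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        feats = rng.sample(range(p), min(m_try, p))     # random variable subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Average of the per-stump predictions."""
    return fmean(ml if row[j] <= thr else mr for j, thr, ml, mr in forest)

# Synthetic data: the response depends only on the first of five variables.
rng = random.Random(1)
X = [[i / 40] + [rng.random() for _ in range(4)] for i in range(40)]
y = [100.0 if row[0] > 0.5 else 20.0 for row in X]
forest = fit_forest(X, y, n_tree=50, m_try=3)
low = predict(forest, [-1.0, 0.5, 0.5, 0.5, 0.5])
high = predict(forest, [2.0, 0.5, 0.5, 0.5, 0.5])
```

Stumps that did not draw the informative variable pull both predictions towards the overall mean, which is why averaging over many trees matters.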

Hybrid model
A hybrid model has been developed by combining the MLR and RF models. The estimates from the primary MLR and RF models have been averaged to form a new model, referred to as the Hybrid model. The hybrid model captures the advantages of both MLR and RF by considering both the linear and the nonlinear association between the dependent variable (PM2.5) and the independent variables (AOD, Temp, RH, WD and WS).
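The averaging step is simple enough to state directly in code. The sketch below assumes an unweighted element-wise mean, which matches the description in the text; the function name and the sample estimates are hypothetical.

```python
def hybrid_estimate(mlr_pred, rf_pred):
    """Element-wise mean of the MLR and RF estimates."""
    if len(mlr_pred) != len(rf_pred):
        raise ValueError("prediction lists must have equal length")
    return [(a + b) / 2 for a, b in zip(mlr_pred, rf_pred)]

# Hypothetical daily estimates from the two primary models.
mlr = [60.0, 95.0, 150.0]
rf = [70.0, 105.0, 170.0]
hybrid = hybrid_estimate(mlr, rf)
```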
Statistical measures such as R (correlation coefficient), MAE (mean absolute error), RPE (relative percent error) and RMSE (root mean square error) have been used to examine the results. MAE is the average of all absolute errors, i.e. the sum of the absolute differences between the modelled and observed values of PM2.5 divided by the number of errors. RMSE is the square root of the mean of the squared differences between the estimated and observed values of PM2.5. RPE is the ratio of the RMSE to the mean observed PM2.5, expressed in per cent. Due to the limited availability of in-situ PM2.5 in the study area (ground stations available in the study area: Sector 16 A in Faridabad, Vasundhara in Ghaziabad, Sector 125 in Noida and Knowledge Park III in Greater Noida), prediction of PM2.5 is possible for the station locations only. Therefore, the inverse distance weighted (IDW) interpolation technique has been applied to these few predicted values to generate the predicted PM2.5 map for the whole study area.
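The four validation measures can be computed as in the sketch below, which follows the definitions given above and takes R as the Pearson correlation coefficient (an assumption, since the text does not name the variant). The observed and estimated values are invented for illustration.

```python
from math import sqrt
from statistics import fmean

def evaluate(observed, estimated):
    """Return R, MAE, RMSE and RPE (per cent) for paired observations and estimates."""
    errors = [e - o for o, e in zip(observed, estimated)]
    mae = fmean(abs(d) for d in errors)
    rmse = sqrt(fmean(d * d for d in errors))
    rpe = 100 * rmse / fmean(observed)
    mo, me = fmean(observed), fmean(estimated)
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, estimated))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_e = sum((e - me) ** 2 for e in estimated)
    return {"R": cov / sqrt(var_o * var_e), "MAE": mae, "RMSE": rmse, "RPE": rpe}

# Hypothetical observed and estimated PM2.5 values.
obs = [50.0, 100.0, 150.0, 200.0]
est = [60.0, 90.0, 160.0, 190.0]
scores = evaluate(obs, est)
```

For this toy case every error has magnitude 10, so MAE and RMSE are both 10 and RPE is 8% of the observed mean of 125.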

Results and discussion
In the present research, PM2.5 has been estimated using three models considering AOD MAIAC, in-situ PM2.5 and meteorological variables (Temp, RH, WD and WS). Before model building, AOD MAIAC has been validated against AOD AERONET with a correlation >0.8. Approximately 79% of the total AOD MAIAC retrievals fall within the expected error (EE) envelope in point-based validation. Further, missing values in the AOD MAIAC, in-situ PM2.5 and meteorological data have been filled with the imputation method, as previously published research has shown that a model with imputed data performs better than a model with missing values (Chelani 2019). AOD MAIAC values for Faridabad, Ghaziabad and Noida were missing at random and have been filled by imputation. In Greater Noida, AOD MAIAC data are missing from January 2018 to April 2018 and available from May 2018. As a result, imputation in Greater Noida is implemented from May 2018. The results of imputation have been synchronized with observed data. After outlier removal, the total remaining data (N) is given in Table 2. Figure 3 provides the histograms for AOD MAIAC, in-situ PM2.5, Temp, RH, WD and WS. The majority of PM2.5 values range from 40 to 80 µgm−3. However, a few values above 200 µgm−3 have been observed, in winter only. Annual averages of PM2.5 lie between 70 and 100 µgm−3 for the years 2018 and 2019. AOD varies from 0.2 to 0.9 with an overall annual mean of 0.56. The ambient temperature (Temp) frequency distribution is tilted towards the right, as the temperature ranges from 25°C to 40°C throughout the year. Relative humidity and wind direction are dispersed in their distributions throughout the year.
Using AOD MAIAC together with the in-situ PM2.5, Temp, RH, WD and WS obtained from CPCB, three different models (MLR, RF and Hybrid) have been calibrated to estimate PM2.5. In the MLR model, the meteorological variables and AOD MAIAC (independent variables) and the in-situ measurements of PM2.5 (dependent variable), comprising 2038 points for the years 2018 and 2019, have been sampled at random to form the 80% training set, from which the coefficients of the independent variables have been obtained. The remaining 20% of the observed data has been used as the validation set. The MLR equation has been fitted by the least-squares method. Regression analysis shows a good correlation of PM2.5 with the meteorological variables and AOD MAIAC, with an R-value greater than 0.6 for each of the four stations, Faridabad, Noida, Greater Noida and Ghaziabad (Table 2, Figure 4). However, the RMSE is 36.29, with an overall R-value of 0.788 for MLR. In the MLR model, significant underestimation has occurred in the PM2.5 estimates. The general hypothesis of the MLR model is that the factors and PM2.5 are linearly correlated with each other, which is not true in every case.
To capture the non-linearity of the correlation between PM2.5 and the driving factors, the machine learning-based algorithm RF has been preferred over the traditional MLR. In the Random Forest model, the meteorological variables, AOD MAIAC and in-situ PM2.5 data, comprising 2038 points for the years 2018 and 2019, have been divided randomly into an 80% training set and a 20% testing set, and regression has been performed with multiple trees and nodes. The RF model has overestimated PM2.5. Overall, when compared to the MLR model (R² = 0.62, RMSE = 36.29), the RF model performed significantly better, achieving a better correlation of PM2.5 with the meteorological variables and AOD MAIAC (R² = 0.71, RMSE = 31.56). Similar findings have been observed for each station independently, where RF performs better than MLR (Table 2, Figure 5). A similar observation was made in the IGP region, where RF estimated PM2.5 with good accuracy (2020).
The hybrid model has used the linear and non-linear characteristics of the MLR and RF models. It estimates PM2.5 simply by averaging the MLR-derived and RF-derived PM2.5 values. The outputs from this model show less overestimation and underestimation of PM2.5 concentrations than the other models. Figure 6 presents the scatterplot of the PM2.5 estimated by the Hybrid model. Over the IGP and Delhi region, hybrid models have estimated PM2.5 with great accuracy (Kumar, Mishra, and Singh 2020; 2020). The hybrid model outperformed the other two models, with the lowest RMSE (23.29) and a high correlation coefficient (R = 0.865) (Table 2, Figure 6). Despite the substantial missing data, the hybrid model with imputed data has proved suitable for estimating PM2.5 concentration in Greater Noida, an area with a mix of rural and urban land use, with agricultural activities, vehicular emissions from highways and dust pollution from recent infrastructure improvements, including the Metro train network. Overall, the Hybrid model has performed better than the remaining two models, both in the presence of outliers and for the data-scarce region, by incorporating linear and non-linear correlations between the factors and PM2.5.
Generally, in the present study, all the models underestimate ground-level PM2.5 concentrations, particularly on severely polluted days (PM2.5 >100 µgm−3). However, the underestimation error is more prominent in MLR than in the RF and Hybrid models. With the lowest RPE (Hybrid: 22.41%; RF: 23.42%; MLR: 31.5%), the Hybrid model has proven to be a better model than the independent MLR and RF models. Overall, the Hybrid model performed well, captured the majority of the variation in PM2.5 across the region and maintained a good correlation with the in-situ data (R² >0.75). The hybrid model worked well with the remaining outliers in the data and the data estimated has been scattered compared to the other two models. The highest correlation (R = 0.933 for Hybrid, 0.876 for RF) and a low RPE (32.13 for Hybrid, 33.50 for RF) were found for Greater Noida despite its small amount of data. This confirms the advantage of the Hybrid model in dealing with data scarcity, as reported in earlier research (Zimmerman et al. 2018). The highest concentration of PM2.5 has been found in the winter and the lowest in the monsoon (Figure 7), while AOD MAIAC exhibits a peak during the post-monsoon season and the lowest retrievals occurred in the pre-monsoon season. Similar findings have been observed over the IGP, where winter PM2.5 was the highest (2020). During the monsoon, low PM2.5 concentrations combined with high AOD are due to the abundance of water vapour in the atmosphere, which favours the hygroscopic growth of particles, enhancing scattering and increasing the magnitude of AOD. The meteorological factors and the AOD parameters employed in this investigation are all significantly linked (p < 0.0001) with PM2.5.
After the validation of the three models, the best-performing model (Hybrid model) has been used to estimate PM2.5 for the current study area for the year 2019 using the inverse distance weighted (IDW) interpolation method.
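For readers unfamiliar with IDW, the following sketch shows how a value at an unsampled location can be interpolated from a handful of station estimates. The power parameter of 2, the planar coordinates and the station values are all assumptions; the study does not report the IDW settings it used.

```python
from math import hypot

def idw(stations, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    `stations` is a list of (sx, sy, value) tuples in planar coordinates.
    power=2 is a common default and an assumption here.
    """
    num = den = 0.0
    for sx, sy, value in stations:
        d = hypot(x - sx, y - sy)
        if d == 0.0:
            return value                   # exactly on a station: use its value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Four hypothetical stations at the corners of a 10 km square.
stations = [(0.0, 0.0, 80.0), (10.0, 0.0, 120.0),
            (0.0, 10.0, 100.0), (10.0, 10.0, 140.0)]
centre = idw(stations, 5.0, 5.0)
near_first = idw(stations, 1.0, 1.0)
```

At the centre all four stations are equally distant, so the estimate is their plain mean; close to a station the estimate is dominated by that station's value. With only four stations, values far from any station are weakly constrained.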
Maximum and minimum PM2.5 concentrations of 210.44 µgm−3 and 56.70 µgm−3 have been observed in the yearly mean of estimated PM2.5 (Figure 8). The annual (1 January to 31 December 2019) mean PM2.5 was 76.71 µgm−3, exceeding the Indian National Ambient Air Quality Standards (NAAQS) limit of 40 µgm−3. In 2019, PM2.5 values around Ghaziabad ranged from moderate to high (95-172 µgm−3). PM2.5 concentrations were very high (172-210 µgm−3) near Noida and low to moderate (56-133 µgm−3) in Greater Noida. The concentration in Faridabad was moderate (95-133 µgm−3). Although Gurugram does not have any data for 2019, the extrapolated data indicate that its concentration falls in the moderate to high (95-172 µgm−3) category (Figure 8). In all places, the interpolated and model-estimated findings match the measured PM2.5 levels for the year 2019. The model has a high degree of accuracy in estimating PM2.5. Because of major construction operations, traffic emissions, deforestation and urban growth, elevated PM2.5 concentrations have been seen in Ghaziabad and Noida.
These findings imply that the model may account for increased PM2.5 concentrations in some regions by capturing local particle sources and long-range aerosol transport, particularly during the stubble burning season. The model was less accurate under low PM2.5 (cleaner) conditions than under polluted conditions (Figure 8), with an overall RMSE of 23.29. At Greater Noida, where the data were scarce, the hybrid model produced the highest value of R (0.933) and the lowest values of MAE, RPE and RMSE (17.29, 32.13 and 24.1, respectively). The PM2.5 concentration estimated through the hybrid model is very close to the in-situ observations. Similarly, the estimation through RF also matches the observed values. MLR, however, shows the largest deviation between the modelled and observed PM2.5 concentrations. Based on the present analysis, it can be inferred that the hybrid model is the better model for prediction of PM2.5. Although the study was limited to a few places in India, it could be extended to other cities with a similar climatic setting that have undergone similar patterns of urbanization. The research may be used to fill the geographical and temporal gaps in ground-level PM2.5 data, which can assist in regulating particulate matter pollution in a location.

Conclusion
The present study modelled PM2.5 concentrations using Hybrid, RF and MLR models with in-situ PM2.5, AOD and meteorological data in the selected districts of Haryana and Uttar Pradesh in northwestern India in 2019. Moreover, the distributions of the PM2.5 concentrations have been mapped for the current study area. Across the study area, the modelled PM2.5 map revealed substantial spatial and seasonal variance. PM2.5 concentrations were greater than the Indian NAAQS (annual average: 40 µgm−3) in all seasons. Winter and post-monsoon are the most polluted seasons of the year compared to the monsoon. During the winter and post-monsoon, human activities (open burning of stubble and biomass, coal usage for cooking) combined with a shallow atmospheric boundary layer result in elevated PM2.5 concentrations. Wet deposition, vigorous convection and greater boundary layer heights contribute to lower PM2.5 concentrations in the monsoon.
In the present research, a combination of satellite aerosol products and meteorological data has been used to estimate PM2.5. The hybrid model outperformed RF and MLR in estimating PM2.5 concentrations.
Overall, the hybrid model has produced the highest correlation (R = 0.865) and the lowest RMSE (23.29) and RPE (22.41%). When the stations are analysed separately, Greater Noida achieved the highest correlation (R = 0.933) with a low RPE (32.13%), despite the station's scanty in-situ data, which demonstrates the applicability of the hybrid model in a data-scarce region. Such a model can be helpful for estimating PM2.5 in areas where there are few monitoring stations or where station operation started recently. Further, it can be concluded that the hybrid model is a better choice than a stand-alone statistical or machine learning-based model. The hybrid model's simplicity allows it to be applied globally to locations with climatic settings similar to those of the current study area (semi-arid regions).
However, data related to land use, road density, altitude, etc., and to emissions (agricultural burning, industrial emissions, etc.) may increase the accuracy of the PM2.5 estimation outside the vicinity of the monitoring sites and explain the variance more clearly. Furthermore, missing data on PM2.5 and meteorological variables, caused by a lack of monitoring resources in the present study area, can be considered the biggest challenge.
The precision of the current results may be enhanced by expanding the spatial coverage of ground monitoring stations and by increasing the supply of data related to pollution emissions and other meteorological variables. Despite these limitations, the application of a hybrid approach combining MLR and RF to predict PM2.5 in an area of India with scant monitoring data is of potential interest, as it fills the existing gap in in-situ data and can help agencies to formulate policies for reducing and regulating air pollution accordingly. The present findings will help to fill the spatial and temporal gaps in ground PM2.5 measurements, which will assist in managing particulate matter pollution in any region. People can also benefit from this type of research by being informed about hotspots of PM2.5 concentrations when planning for their future quality of life, either by using advanced technological devices to reduce pollution or by moving to a less polluted location. In future, several additional parameters can be used in modelling PM2.5 to improve the accuracy of the results of this work, including satellite-derived AOD, meteorological parameters and LULC parameters, as well as emission inventory data. Additionally, the use of such models can provide long-term predictions of PM2.5, which help in planning adaptation and mitigation measures.