Evaluation of Machine Learning Models for Estimating PM2.5 Concentrations across Malaysia

Southeast Asia (SEA) is a hotspot region for atmospheric pollution and haze conditions, due to extensive forest, agricultural and peat fires. This study aims to estimate the PM2.5 concentrations across Malaysia using machine-learning (ML) models like Random Forest (RF) and Support Vector Regression (SVR), based on satellite AOD (aerosol optical depth) observations, ground measured air pollutants (NO2, SO2, CO, O3) and meteorological parameters (air temperature, relative humidity, wind speed and direction). The estimated PM2.5 concentrations for a two-year period (2018–2019) are evaluated against measurements performed at 65 air-quality monitoring stations located at urban, industrial, suburban and rural sites. PM2.5 concentrations varied widely between the stations, with higher values (mean of 24.2 ± 21.6 μg m−3) at urban/industrial stations and lower (mean of 21.3 ± 18.4 μg m−3) at suburban/rural sites. Furthermore, pronounced seasonal variability in PM2.5 is recorded across Malaysia, with highest concentrations during the dry season (June– September). Seven models were developed for PM2.5 predictions, i.e., separately for urban/industrial and suburban/rural sites, for the four dominant seasons (dry, wet and two inter-monsoon), and an overall model, which displayed accuracies in the order of R2 = 0.46–0.76. The validation analysis reveals that the RF model (R2 = 0.53–0.76) exhibits slightly better performance than SVR, except for the overall model. This is the first study conducted in Malaysia for PM2.5 estimations at a national scale combining satellite aerosol retrievals with ground-based pollutants, meteorological factors and ML techniques. The satisfactory prediction of PM2.5 concentrations across Malaysia allows a continuous monitoring of the pollution levels at remote areas with absence of measurement networks.


Introduction
Air pollution has become an acute environmental and health issue in developing countries during the last decades due to intense industrialization and urbanization processes [1][2][3]. It is estimated that about 7 million people die every year worldwide because of exposure to fine particulate pollution < 2.5 µm (PM 2.5 ), while about 91% of the world's population live in areas with PM 2.5 concentrations above the allowable limits of 10-20 µg m −3 [4]. Southeast Asia (SEA) is not an exception to high pollution levels and experiences persistent haze conditions, especially during the dry season (June to September) due to extensive forest, agricultural and peat fires [5][6][7][8][9]. Malaysia is located in the main pathway of the SEA pollution the PM 2.5 mass, but not the meteorological factors (rainfall, wind speed and wind direction). Based on the prediction model of [80], temperature, Normalized Difference Vegetation Index (NDVI), humidity and residential area were found to be important parameters affecting the spatial variation of PM 2.5 in Jakarta, Indonesia. The same study [80], also showed that several parameters (PM 10 , NO 2 , SO 2 , UV, rainfall, land use and NDVI) influenced the distribution of PM 2.5 in Taipei, Taiwan. More studies are therefore needed to find out the role of gases and meteorology in affecting the spatial and seasonal patterns of PM 2.5 , even using meteorological normalization techniques in order to exclude the effect of changing meteorology on PM concentrations trends [82]. Recently, ML approaches (random forest regression models) were implemented to predict the large reductions in air pollutants, i.e., PM 10 , NO 2 , O 3 , during the COVID-19 lockdown period [83]. Furthermore, PM 10 , NO 2 and carbonaceous aerosols (organic carbon, elemental carbon) were also used in ML techniques (Lasso, Random Forest, AdaBoost, Support Vector Machine and Partials Least squares) for analysing air pollutants at street canyons [84]. Studies dealing with PM estimations in Malaysia are rather limited [18,30]. Shaziayani et al. [85] has reviewed PM 10 modelling studies in Malaysia, and only four studies used ML techniques in predicting PM 10 . On the other hand, PM 2.5 studies are even fewer and most of them in Malaysia have been performed at small spatial scales [86,87].
The current study is the first in Malaysia and one of the very few works conducted worldwide aiming to estimate the PM 2.5 concentrations at a large (national) scale using pollution gases, AOD and meteorological factors based on machine-learning techniques. In order to extend the spatial coverage to the whole country (both Peninsular and Island Malaysia) and aiming to improve the accuracy of PM 2.5 estimates, this study integrates hourly AOD products from Himawari-8 satellite sensor, along with meteorological parameters and gaseous pollutants using machine learning techniques, i.e., random forest (RF) and Support Vector Regression (SVR). The models were developed separately for urban/industrial, suburban/rural sites and for the four dominant seasons in order to better represent the spatial (between sites) and temporal (between seasons) variation of the PM 2.5 concentrations. Variable importance analysis was conducted to identify the primary parameters that affect the PM 2.5 concentrations in Malaysia in a way to develop regression models for estimations of PM 2.5 . The performance of RF and SVR was evaluated at different seasons and locations against measured PM 2.5 concentrations. The results will assist in representing the spatial and temporal evolution of PM 2.5 in Malaysia and for establishing measures in a way to improve air quality across the country.
Section 2 briefly describes the study area; Section 3 refers to the dataset that was used as variables in the prediction models; Section 4 refers to the PM 2.5 measurements across Malaysia. The results and model evaluation are included in Section 5, while Section 6 summarises the conclusions.

Study Area
Malaysia is one of the developed countries in SEA region with a rapid urbanization rate since 1970 [88]. Consequently, air pollution has become one of the serious environmental and human health concerns across the country [5,89], particularly in urban, industrialised and congested traffic areas such as Klang Valley (in the west coast of Peninsular Malaysia), Johor Bahru (southern tip of the Peninsula) and Georgetown, Penang (north of Peninsula) [90,91]. Air quality deteriorates at several parts of Peninsular Malaysia and in Borneo Island during the dry season mainly due to trans-boundary haze from neighbouring countries and regional/local forest fires. The concentrations of aerosols and air pollutants display a distinct seasonality, influenced by local meteorological conditions, i.e., rainfall, wind speed, relative humidity (RH) and temperature [86], being lower during the monsoon rainy season (November-March). In this study, 65 air quality monitoring stations ( Figure 1) distributed across the Peninsular and Island Malaysia (Labuan, Sabah and Sarawak) were used to analyse the air pollution levels (PM 2.5 , SO 2 , NO 2 , CO and O 3 concentrations). The stations are representative of industrial (7 stations), urban (10 stations), suburban (36 stations) and rural (12 stations) areas.

Dataset
The dataset used in this study consist of PM 2.5 and air pollutant concentrations, along with meteorological parameters from ground stations, as well as AOD data from Himawari-8 satellite.  Figure 1). Furthermore, these stations also measure meteorological parameters (e.g., ambient temperature, TEMP; RH, wind speed, WS; wind direction, WD) and gaseous pollutants (e.g., nitrogen dioxide, NO 2 ; carbon monoxide, CO; sulphur dioxide, SO 2 ; ozone, O 3 ). The stations are strategically distributed to represent urban, industrial, suburban and rural areas [92]. PM 2.5 measurements were performed via the TEOM 1405DF, which is a continuous dichotomous ambient air monitoring system with two Filter Dynamics Measurements Systems [93], able to measure PM 2.5 and PM 10 . SO 2 , NO 2, CO and O 3 are measured using Thermo Scientific model 43i, model 42i, model 48i and model 49i, respectively [93,94]. RH and TEMP were recorded using a Climatronic AIO 2 Weather Sensor (Climatronic Corporation) [95]. All the ground data were obtained on an hourly basis covering the period from January 2018 to December 2019.

Ground Measurements
All air quality and meteorological measurements went through quality assurance and quality control (QA/QC) procedures. Instruments for the detection of gases were manually calibrated once a fortnight. Flow verification for PM 10 and PM 2.5 measurements using TEOM was conducted once a month. The data removal during the QC check was predominantly due to insufficient measurements and instrument failure, while some perturbed data were also excluded as outliers in a second-level QC check [95].

Satellite Data
Himawari-8 is a geostationary satellite operated by the Japan Meteorological Agency. It was launched on 7 October 2014 and carries the Advanced Himawari Imager (AHI) sensor, which is equipped with 16 bands from visible to infrared [36]. Himawari-8 releases AOD products at two levels, namely Level 2 (10 min temporal) and Level 3 (hourly and daily), which have been used for various applications including estimation of PM [96][97][98][99], dust detection [100] and aerosol data assimilation [101]. The L3 product is an improved version of the L2 AOD product that minimized cloud contamination [102] and has a 5 km spatial resolution. Himawari-8 AOD at 500 nm is associated with quality assurance levels namely "very good", "good", "marginal" and "no confident (or no retrieval)" [99]. In this study, only the "very good" L3 AOD 500 retrievals were considered for PM 2.5 estimations, downloaded from the Japan Aerospace Exploration Agency (JAXA) website: available online: http://www.eorc.jaxa.jp/ptree/index.html (accessed on 10 May 2021) for the period January 2018-December 2019. Recently, Himawari-8 AODs were used to estimate the PM 2.5 concentrations over Hubei province, China [103]. Application of the Himawari-8 L2 AOD data over Malaysia revealed an overestimation by 24.2% [104], while the L3 AOD products displayed a better agreement with the Aerosol Robotic Network (AERONET) AODs with a coefficient of determination R 2 = 0.81, root mean square error (RMSE) of 0.13 and an overall overestimation of only 1% [92]. Moreover, Himawari-8 AODs

PM 2.5 Estimation
The overall methodology used for the PM 2.5 estimations over Malaysia is illustrated in Figure 2. The model inputs consist of hourly AOD, SO 2 , NO 2 , CO, O 3 , WS, WD, TEMP and RH values. Hourly AOD data from Himawari-8 were extracted at 5 × 5 km over the air quality monitoring stations and temporally collocated with ground measurements. Wind direction was used in the model because wind blowing from a highly polluted area can influence air quality in other downwind places. Wind speed enables to accelerate pollutants travelling from other places but also contributes to the dilution processes at local level [108]. On the other hand, temperature can trigger biogenic emissions, photochemical reactions and secondary aerosol formation over the region [109] and also control the temperature inversions, which can trap the pollutants near the surface [45,108,110]. Finally, RH may affect the hygroscopic growth of particles and enhance the aerosol scattering [111][112][113]. In this study, we utilized and evaluated two ML models, namely, Support Vector Regression (SVR) and Random Forest (RF) to estimate PM 2.5 concentrations at the 65 air quality monitoring stations across Malaysia. The input variables to the SVR and RF models are selected based on our previous study [18,31] and other literature [114,115]. In total, we developed 7 different models to represent the spatial and seasonal effects on PM 2.5 distributions. Model 1 considers all data from the 65 stations, but other models, i.e., models 2 and 3, only represent urban/industrial and suburban sites, respectively, while models 4 to 7 represent different seasons (wet, dry and two inter monsoon). The models are described in the subsequent sections.

Machine Learning (ML) for PM 2.5 Estimation
Nowadays, several studies have used machine learning (ML) techniques, including RF model, aiming to increase the accuracy in prediction of PM concentrations, since these models are flexible in nonlinear approaches [72,[116][117][118][119]. The SVR and RF techniques were particularly selected in this study to achieve more accurate PM 2.5 estimations using satellitederived AOD, meteorological parameters and gaseous pollutants as predictor variables. The Classification And REgression Training (CARET) package was used to perform the RF and SVR modelling. The data splitting, pre-processing, model tuning and variable importance analysis were executed in a R environment. SVR depends on the kernel function and due to its excellent generalization capability, it is able to minimize the overfitting [120], and therefore, it has been used for PM estimations [69,108,121]. SVR can fit the errors within a certain threshold by finding an appropriate boundary line (between hyperplane) to suit the data. The flexibility of SVR depends on the selection of the parameter such as kernel function, cost function and epsilon value. There are four types of kernel functions namely linear, polynomial, sigmoid and radial basis function (RBF) that were used in this study for capturing the non-linear dynamics [69], whilst cost function was used to avoid any overfitting of the data, as small cost value leads to large margin (or wide boundary line) and causes overfitting in the model. The epsilon value controls the number of support vectors used to develop the regression function, while the smaller epsilon value indicates an optimum accuracy. Initially, the SVR parameters are selected based on trial-and-error values, but we found that the default values (Supplementary Materials Table S1) included in the R package "e1071" provided the most promising results.
Random Forest (RF) is a tree-based ML technique proposed by [122]. Theoretically, RF model is an ensemble of multiple decision trees and uses the majority vote/decision of the trees as the final RF model [123,124]. The algorithm becomes more robust when more decision trees are constructed. RF randomly selects parameters in order to develop each tree, and therefore, it reflects potentially complex effects of predictors on the prediction [125]. The purpose of selecting random predictors instead of all predictors is to reduce the correlation between trees in order to make them disparate [126]. Thus, the variance of the RF prediction can reduce any overfittings. The number of decision trees can be modified to reduce the training time according to a required accuracy and computing capability [127]. In this study, RF model was run using the "Random Forest" in the R package. Since, RF is a non-parametric algorithm, here we only set the two most important parameters-although RF can have more parameters-which are mtry and ntree. Parameter mtry is a number of predictors sampled for splitting at each node while ntree is a number of trees in the forest. If mtry value is too small, it might be none of significant parameters included in the subset, and the insignificant parameters would be selected for a split. Therefore, the trees have poor predictive ability [126]. In this study, we set the mtry = 3 (as default: mtry = p/3, where p is the number of parameters used in the model), while ntree is set as 500 in the model. For tuning, we only tune mtry because the CARET package has automatic tuning for mtry only. Therefore, in this study, for ntree, we used the default value [128,129]. The results were obtained based on best mtry tuning accuracy. It should be mentioned that a limitation of this study is that models were not broadly optimized.

Model Validation
The total number of the matching dataset, covering all parameters at the 65 stations in Malaysia from 2018 to 2019, is 13,376. The matching data were randomly partitioned at a fraction of 70% for model calibration (model development) and 30% for model validation. In the model development, a sample based 10-fold cross validation technique was used, where the calibration data were randomly divided into 10 subsets; at any single moment, one subset was used for validation and the remaining subsets were used for calibrating the model. The average value of the results of the 10 subsets was adopted as the model accuracy. The sample based 10 cross validation (CV) performed validation with matchup sample from both spatial and temporal dimension. This is a commonly used CV-based technique to reveal the overall predictive ability of PM 2.5 estimation models [130]. Then, the final models were validated using the 30% of the remaining data. Statistical indicators such as the coefficient of determination (R 2 ), Root Mean Square Error (RMSE), mean bias error (MBE) and Nash-Sutcliffe Efficiency (NSE) were used to evaluate the accuracy of the models. The NSE is a normalized statistic, which can determine the magnitude of the residual variance to the measured data variance and indicates how well the measured PM 2.5 versus estimated PM 2.5 data fits the 1:1 line (best fit line).
where I m and I c are the measured and estimated PM 2.5 concentrations, respectively, I m is the average of the measured PM 2.5 and N the total number of measurements. NSE = 1 indicates an ideal model performance, while NSE = 0 shows model predictions as accurate as the mean of the observed data. Lower RMSE and MBE values correspond to better performance and to lower biases from the ML models.

Variable Importance
Variable importance statistic was used to analyse the contribution of each variable in PM 2.5 estimations. Since SVR is a kernel-based model, and we do not know the concrete form of its nonlinear mapping function, and the weight vector (ω) cannot be computed directly [128], it is complicated to analyse the variables importance statistic. On the other hand, there are two famous measures for RF, which are mean decrease accuracy (MDA) and mean decrease Gini (MDG). The MDG is based on Gini importance which measures the average gain by splits of a given variable, whilst MDA is based on out of bag (OOB) samples. In RF model, each tree is grown based on a bootstrap sample of the training data, and those data that were not used in the bootstrap sample are known as out of bag (OOB) samples [126]. The MDA measures the accuracy of the model losses by permuting each variable. This technique is considered as most efficient variable importance for random forest [129,131], and it was preferred in this study as less bias compared to Gini importance. The higher percentage value of the variable importance indicates higher influence of the corresponding variable to PM 2.5 estimations. In R script, we used the "varImp" function, which can automatically scale the importance scores in values between 0 and 100.

Descriptive Statistics
The descriptive statistics of the measured variables that are used in SVR and RF models for all stations in Malaysia are summarized in Table 1. The columnar AOD 500 over the Malaysian sites during 2018-2019 exhibits a mean of 0.69, which is above the median value 0.46 due to many episodic aerosol events with AODs above 2, representing thick smoke plumes from extensive fires in Indonesia and Indochina [10,132,133]. The measured PM 2.5 concentrations at the 65 examined sites follow a similar distribution with a higher mean (21.9 µg m −3 ) than median (17.1 µg m −3 ) and a maximum value of 230 µg m −3 (Table 1, Figure 3). These PM 2.5 levels are similar to those reported at several sites in Southeast Asia [86,134]. NO 2 and CO exhibit means of 5.2 ppb and 0.6 ppm, respectively, while tropospheric O 3 levels (25.2 ppb) are considered rather high with deleterious effects on human health [135,136]. Table 1. Statistical values for the measured parameters in all air-pollution monitoring sites.  The box-whisker plots for AOD 500 and air pollutants, separated for urban/industrial and suburban/rural sites and for the four seasons, are shown in Figure 3. The analysis showed that the seasonal-mean PM 2.5 at the urban/industrial sites displayed higher values in all seasons. During the dry season, the maximum PM 2.5 levels were found to be 230.3 µg m −3 for the urban/industrial and 226.3 µg m −3 for the suburban sites, with means of 31.26 µg m −3 and 26.38 µg m −3 , respectively. This is attributed to the prevailing southwest wind carrying biomass-burning aerosols from Indonesia due to extensive forest fires in this season [90]. However, the seasonal mean PM 2.5 concentrations do not notably differ in the other seasons (lying between 17.80 µg m −3 and 22.33 µg m −3 for both urban/industrial and suburban/rural sites), indicating a year-long PM 2.5 laden atmosphere across Malaysia. Regarding the Himawari-8 AOD, it is higher during the dry season, while slightly lower mean values (~0.6 to 0.8) were observed in the other seasons, with marginal differences between urban/industrial and suburban/rural sites (Figure 3b). The seasonal and site variations of the columnar AODs are in agreement with ground PM 2.5 concentrations, indicating that the industrial and traffic emissions are the main pollution sources for urban centres and surrounding areas. In general, heavy precipitation during the wet season only marginally reduced the aerosol levels since severe pollution episodes with PM 2.5 > 100 µg m −3 and AODs > 3 were also present. However, the columnar AOD displayed a very low correlation (R 2 = 0.09) with the surface PM 2.5 , indicating (i) a significant aerosol loading aloft and (ii) different sources and temporal variability between surface PM 2.5 and AODs [137]. In addition, the mean concentrations of CO, NO 2 and O 3 (Figure 3c-e) are higher over the urban/industrial areas compared to suburban/rural sites in each season. Motor vehicle and power plants emissions are the major contributors to CO and NO 2 concentrations in Malaysia with about 95.7% and 66%, respectively [11], thus explaining the higher NO 2 and CO levels in urban/industrial areas. The stronger correlation was found between CO and PM 2.5 (R 2 = 0.33), revealing that the particulate pollution in Malaysia is mostly related to local sources of fossil fuel and biofuel combustion, which enhance CO emissions [86,138]. NO 2 , which is mostly related to vehicular emissions, was negligibly associated with PM 2.5 concentrations (R 2 = 0.1), The overall mean SO 2 concentration was found to be 1.2 ppb, with slightly larger levels in the dry season, exhibiting means of 1.5 ppb and 1.3 ppb for the urban/industrial and suburban sites, respectively.

Models for PM 2.5 Estimation
In this study, seven models were developed for PM 2.5 estimations in Malaysia using ML techniques, namely, SVR and RF. These models were developed in order to better capture the remarkable spatial (between stations of different characteristics) and temporal (between seasons) variations of PM 2.5 and to examine the model's capability in representing the levels and evolution of PM 2.5 . The developed models are:    Table S2). The validation dataset displayed also small differences between the two models and comparable statistics with the calibration datasets i.e., R 2 = 0.66 and RMSE = 12.11 µg m −3 for SVR and R 2 = 0.62, RMSE = 11.40 µg m −3 for RF ( Figure 4). Furthermore, other models for estimating PM 2.5 were also developed by splitting the entire datasets initially into two categories namely urban/industrial (3357 data points) and suburban/rural (N = 9798). The RF model performed slightly better compared to SVR for urban/industrial (model 2) and suburban (model 3) sites, with R 2 = 0.76, RMSE = 11.47 µg m −3 , NSE = 0.735 and R 2 = 0.67, RMSE = 12.47 µg m −3 , NSE = 0.661, respectively. Regarding the calibration datasets, the NSE values of all the RF models were usually above 0.9, indicating an accurate model's performance, compared to the SVR models, which exhibited NSE values in the range of 0.55 to 0.79 for the calibration datasets (Supplementary Materials Table S2). The model validation revealed that RF models performed slightly better than the SVR models for both locations. The validation of SVR and RF models for suburban sites was considered satisfactory with R 2 = 0.61, RMSE = 11.53 µg m −3 and R 2 = 0.64, RMSE= 10.76 µg m −3 , respectively (Figure 4b,c). Both models performed better at urban/industrial sites compared to the suburban sites, while all models underestimated the large PM 2.5 concentrations (PM 2.5 > 60 µg m −3 ). For PM 2.5 below 50 µg m −3 , where the vast majority of the data points lie, the underestimated and overestimated data points are almost equal for all the models and the regression line tends to coincide with the 1-1 line (Figure 4).
In addition, four temporal models were also developed for each season in Malaysia, namely, dry season (June-September; Figure 5a Table S2; Figure 5).
The statistical evaluators of the developed models in Malaysia are mostly comparable to those found from multi-variate models including AOD and several meteorological parameters (Temp, RH, WS, Dew point, mixing height) for PM 2.5 estimations in Indian cities [139]. More recently, [39]  ). MODIS-MAIAC AODs and columnar water vapor (CWV), along with meteorological parameters and land-use data, were included in a linear mixed effect model (LME) and a RF model for daily PM 2.5 estimations at high spatial resolution (1 km × 1 km) over the Indo-Gangetic Plains (IGP), India [118]. The RF model exhibited higher accuracy with R 2 = 0.87 and relative RMSE of 24.5%, compared to LME [118]. The spatial distributions of the R 2 (~0.6 to 0.9) and RMSE (~20 to 40 µg m −3 ) values from PM 2.5 estimations across the IGP [118] were mostly comparable to those observed over Malaysia.
The hourly time series of the measured and predicted PM 2.5 concentrations via the SVR model, separately for the station characteristics and seasons and for the overall model, are shown in Figure 6. The results verify the good performance of the SVR model in predicting the PM 2.5 concentrations across Malaysia-RF performed slightly better with very similar results-also revealing an underestimation at the highest PM 2.5 values. However, the model's underestimation in representing PM 2.5 peaks is not systematic, and in many cases, models reproduce satisfactorily the high PM 2.5 concentrations, even overestimating them ( Figure 6). The residual analysis for the model validation datasets revealed that the frequency of residuals approached the normal distribution peaking around zero for all the models (Supplementary Materials Figure S1). However, the frequency distributions were slightly shifted towards negative values, as the highest model underestimations may reach to −60%, but for very rare cases. In the vast majority of the cases, the predictions were quite accurate, indicating that the used ML models are satisfactory for PM 2.5 estimations in Malaysia.
Atmosphere is a complex system and composed by various substances like air molecules and solid and liquid particles of various sizes, shapes and chemical composition [141,142]. Therefore, combining many auxiliary data such as meteorological factors, aerosols, gases and land use allows for a better estimation of PM 2.5 . To date, most PM prediction studies found that inclusion of meteorological factors has improved the PM estimations, because each meteorology parameter may modulate the PM concentrations in a different way [43,45,71]. Nowadays, ML techniques and RF models are frequently used in estimating PM concentrations at many regions around the world [39,116,119]. The statistical indicators from the models' calibration and validation in this study (Supplementary Materials Table S2; Figures 4 and 5) are mostly comparable to those presented in other studies using various methods and ML approaches for estimates of PM concentrations around the world (Table 2). It should be noted that this study obtained reasonable results at national scale without including land use information compared to previous works [128,[143][144][145]. Besides that, the validation techniques may be also different, as for instance [143] used sample and site based 10 CV in order to assess the spatial performance, whilst our study only used sample based 10 CV since it can be used to reflect the overall predictive ability [130].   Determining the strength of the correlation between PM 2.5 and all the parameters used for its prediction is very important because it can indirectly portray the pollution process and the source of the pollution. The results of the variable importance analysis for the RF model have been included in Supplementary Materials Table S2 and are shown in Figure 7. For the overall model 1, CO is the highest contributor to the PM 2.5 estimations, similar to the other models, as discussed above, and is followed by AOD, O 3 , NO 2 , SO 2 and the meteorological parameters. This is because both PM 2.5 and CO are originated from common sources in Malaysia like biomass burning and traffic which enhance the CO emissions [86,138]. Besides that, CO may stay in the atmosphere for a long period (weeks or months), being able to get transported in high concentrations from biomass-burning areas in Indonesia [152]. Parameters with the least importance in PM 2.5 estimations are RH, WD, TEMP and finally the WS with a zero score (Figure 7a). Similar to the overall model (model 1), the contributions of the meteorological parameters were relatively weak in the spatial models as well, which also revealed CO, AOD and O 3 as the most important variables (Figure 7b,c). CO remains the most important predictor in the seasonal models as well, with minimal contributions from the meteorological variables (Figure 7d-g). The WS and WD have minimal contributions, in agreement with [81], who found that both parameters were not significantly correlated with PM 2.5 in the Klang Valley region in Peninsular Malaysia. Although AOD usually exhibited a high correlation with PM 2.5 [153], in our case, there was not a direct association with PM 2.5 concentrations, implying complicated pollution conditions in the vertical layer over Malaysia [141]. Generally, PM 2.5 is affected more by gaseous pollutants and not so much by the columnar AOD (missing values due to cloud cover and elevated aerosol layers) and meteorological conditions since Malaysia has rather uniform weather conditions throughout the year. Therefore, influence of the meteorological parameters is minimal towards seasonally changing PM 2.5 concentrations. However, previous studies in Malaysia showed that the meteorological parameters affected the coarse particles, e.g., PM 10 concentrations [81], indicating a meteorological-dependent character of the coarse-mode aerosols. In our previous study [18], we found that the estimations of PM 10 concentrations based on satellite AODs were significantly improved after inclusion of the meteorological parameters in the model. Similar to the current results, [154] also found that the parameters with the highest importance in predicting PM 2.5 concentrations are CO, NO 2 , SO 2 and AOD. Inclusion of the pollutant gases improved the performance of their RF model from R 2 = 0.69, RMSE = 41.63 µg m −3 to R 2 = 0.81, RMSE = 32.74 µg m −3 , in a similar way as in the present study. These results were also in agreement with [65] who found that CO was the most important variable that explained 20.65% of the variation in estimated PM 2.5 concentrations in Xi'an, China, exhibiting a strong correlation with AOD. Furthermore, [81] studied the PM 2.5 composition in Klang Valley, Malaysia, and concluded that CO, NO 2 , NO and SO 2 mostly affected the PM 2.5 concentrations. This is the first study conducted in Malaysia aiming to estimate the PM 2.5 concentrations based on machine-learning techniques. The satisfactory accuracy of the estimates, despite the biases and challenges in representing PM 2.5 pollution episodes, is especially important for the development of models aiming to systematically monitor PM 2.5 over the country, especially at remote areas with unavailability of measurements. However, the only 65 operational stations are still insufficient to cover the whole Malaysian territory with an area of 330,290 km 2 . Establishing more air quality monitoring stations is very costly, and a certain station is only capable to satisfactorily represent the pollutant concentrations within a radius of about 15 km [155]. Alternatively, remote sensing data encourages more studies to be conducted on atmospheric particulates and air quality, since satellite technology considers AOD as a key predictor of PM over a large area [156]. This would also help in evaluating the influence of the local/regional emissions from anthropogenic activities against those attributed to natural causes or long-range transported aerosols, mostly smoke, from Indonesia and other parts in northern Indochina.

Conclusions
The current study developed new machine-learning models, namely, Random Forest (RF) and Support Vector Regression (SVR), to estimate PM 2.5 concentrations across Malaysia for the first time covering the years 2018 and 2019. Satellite (Himawari-8) AOD, groundmeasured air pollutants (NO 2 , CO, SO 2 , O 3 ) and meteorological parameters (temperature, relative humidity, wind speed and direction) were used as input variables. Due to the high spatial (between stations with different characteristics like urban/industrial and suburban/rural) and temporal (between seasons) evolution of the PM 2.5 levels across Malaysia, seven sub-models were developed separately for the different sites (urban/industrial, suburban/rural) and seasons (dry, wet and two inter-monsoons (April-May and October)), and one overall model. Of the available dataset, 70% was randomly selected for the model calibration, and the remaining 30% for the model validation. The PM 2.5 predictions of each model are compared to those measured at 65 air pollution monitoring stations, using standard statistical estimators. For the overall model, SVR calibration performed slightly better than RF with R 2 = 0.69 and RMSE = 10.62 µg m −3 against measured PM 2.5 concentrations. Whilst for the spatial models, the RF validation performed slightly better than SVR, with statistical indicators of R 2 = 0.76, RMSE = 11.47 µg m −3 for urban/industrial, and R 2 = 0.64, RMSE = 10.76 µg m −3 for the suburban/rural sites. Therefore, both RF and SVR models displayed slightly higher performance for PM 2.5 estimations at urban/industrial sites with higher levels of AOD and air pollutants. Furthermore, the estimation accuracy of SVR and RF models was lower in the wet (November-March) and inter-monsoon (April-May) seasons compared to the dry (June-September) season. Based on the model accuracy and variable importance analysis, CO was always the most influential predictor variable for PM 2.5 estimations in Malaysia, followed by AOD, O 3 , NO 2 , SO 2 and meteorological parameters but with different order depending on the dataset and model. An important finding was the very weak correlation and contribution of the meteorological variables to PM 2.5 estimations. Furthermore, very low correlation was found between PM 2.5 and columnar AOD, indicating that surface pollution followed a different temporal pattern than AOD and the presence of a significant aerosol layer aloft due to transported smoke plumes from wildfires in southeast Asia. The current results showed that the use of machine-learning techniques for PM 2.5 estimations over Malaysia was promising as these models can satisfactorily represent the values and temporal evolution of PM 2.5 concentrations over both urban/industrial and suburban/rural sites although the underestimation of the highest PM 2.5 levels. In a next step, gaseous pollutants from satellite remote sensing observations will be included in ML approaches in order to estimate PM 2.5 concentrations over large areas, aiming to cover the whole Malaysian territory.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/app11167326/s1, Figure S1: Residual analysis (residuals = predicted PM 2.5 -measured PM 2.5 ) from SVR and RF for the developed Models 1-7. The fitted curve represents the normal distribution, Table S1: T Parameters/functions used for the SVR model, Table S2 Data Availability Statement: Himawari-8 data can be accessed via http://www.eorc.jaxa.jp/ptree/ index.html. Ground-based pollution data can be accessed after request.

Acknowledgments:
The authors would like to thank the Japan Aerospace Exploration Agency (JAXA) and Department of Environment, Malaysia, for providing the Himawari-8 AOD data and surface air pollutant data, respectively. D.G.K. acknowledges support of the project PANACEA (PANhellenic infrastructure for Atmospheric Composition and climatE change; MIS 5021516), under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). Valuable comments from two anonymous reviewers are highly appreciated.

Conflicts of Interest:
The authors declare no conflict of interest.