Comparison of Suspended Particulate Matter Prediction Based on Linear and Non-Linear Models

Air pollution has been a serious problem in recent years. Air pollutants consist of gaseous pollutants, odours, and suspended particulate matter (SPM) such as dust, fumes, mist, and smoke. SPM has the potential to cause environmental and health problems. To anticipate its impact, regular prediction of SPM is needed. In this research, we compared four models for predicting SPM data. The two linear models selected were ARIMA and wavelet, whereas the two non-linear models were neural-network-based models, i.e. Feed Forward Neural Network (FFNN) and General Regression Neural Network (GRNN). All four models were built with the same input, namely past observations at the lags identified by the best ARIMA model. Using the lagged data as input, the goal is to predict the current value of SPM. Model accuracy is measured by RMSE on both training and testing data. The results show the superiority of the non-linear models over the linear models, especially on the training data.


Introduction
Air pollution is defined as the anthropogenic emission of harmful chemicals that alter the chemical composition of the natural atmosphere and have an adverse effect on the health of living things, an adverse effect on anthropogenic or natural non-living structures, or reduce the air's visibility [1]. The characterization of particulate matter is influenced by meteorological conditions, including temperature, humidity, rainfall and wind speed [2]. Suspended particulate matter (SPM) is an important component of the air pollution mixture. It is the most complex of the six pollutants regulated by the US EPA as criteria air pollutants [3]. SPM is one of the most important indoor air pollutants [4]. It also leads to serious health hazards in human beings [5]. It has been connected to various detrimental health outcomes as well as having general environmental effects [6]. The source of the particles plays a role in determining their health effects. As a further effect, it is also responsible for changing the global climate [7]. By definition, particulate matter consists of small solid particles which are able to move freely in the atmosphere [8]. In the simplest terms, particulate air pollution is defined as anything solid or liquid suspended in the air [9].
The development of statistical models for air pollutants such as SPM is very important for predicting future values. Many studies have been carried out to model SPM data [10][11]. In the time series field, various models have also been developed, both linear [12] and non-linear [13]. The main problem in time series modelling is how to obtain the best model among the various types available. Therefore, it is important to design experiments comparing the two basic approaches, i.e. linear and non-linear models, so that the more efficient one can be identified. This research discusses the use of ARIMA and wavelet models as linear models and Feed Forward Neural Network (FFNN) and General Regression Neural Network (GRNN) as representatives of non-linear models. Several previous studies have used ARIMA [14,15] and wavelets [16] for air pollution prediction. Some experiments using neural networks for air quality have also been done [17,18]. In this paper, the two types of time series model were applied to the monthly SPM data in Central Java province, Indonesia. The comparison was carried out using the RMSE criterion for both in-sample and out-sample predictions.

Methods
The monthly SPM data for Semarang City, Central Java province, from January 2008 to December 2017, obtained from the Meteorology Climatology and Geophysics Agency of Central Java Province, were analyzed. The data are divided into two parts for goodness-of-fit purposes. The first 108 observations are used as training data and the remaining 12 as testing data. The training data set is needed for obtaining the best model based on in-sample prediction, whereas the testing data set is required for assessing accuracy based on out-sample prediction. We investigated the data using four linear and non-linear models, i.e. ARIMA, wavelet, FFNN and GRNN. A brief explanation of the models is given in the following sections.
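The chronological split and the RMSE criterion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the series is a synthetic placeholder of 120 monthly values (ten years, January 2008 to December 2017) standing in for the real SPM data, which are not reproduced here.

```python
import numpy as np

def split_series(series, n_train=108):
    """Chronological split: the first n_train points train, the rest test."""
    series = np.asarray(series, dtype=float)
    return series[:n_train], series[n_train:]

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Synthetic placeholder series (assumption): seasonal pattern plus noise.
rng = np.random.default_rng(0)
months = np.arange(120)
spm = 100 + 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 2, 120)

train, test = split_series(spm)

# Example baseline (assumption, not one of the paper's four models):
# a seasonal naive forecast that repeats the value from 12 months earlier.
naive_forecast = spm[96:108]
baseline_rmse = rmse(test, naive_forecast)
```

Because the split is chronological rather than random, the testing RMSE measures genuine out-sample forecasting accuracy.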

ARIMA model
The ARIMA model is of practical importance in forecasting and comprises Autoregressive (AR), Integrated (I) and Moving Average (MA) components. The AR model is suitable for stationary time series, and the Integrated part handles nonstationary data through differencing. In the AR(p) model, the current value of the series $x_t$ is a linear combination of its p most recent values. Combining AR(p) and MA(q) gives the ARMA(p,q) model.
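By considering d = 1, we can write the ARIMA(p,1,q) process as:

$$\phi_p(B)(1 - B)x_t = \theta_q(B)\varepsilon_t$$

where $B$ is the backshift operator ($Bx_t = x_{t-1}$), $\phi_p(B) = 1 - \phi_1 B - \dots - \phi_p B^p$, $\theta_q(B) = 1 - \theta_1 B - \dots - \theta_q B^q$ and $\varepsilon_t$ is a white noise process. When seasonal components are included, the model is called seasonal ARIMA (SARIMA). The generalized form of the SARIMA model can be written as:

$$\phi_p(B)\Phi_P(B^s)(1 - B)^d(1 - B^s)^D x_t = \theta_q(B)\Theta_Q(B^s)\varepsilon_t$$

where $s$ is the seasonal period, $D$ is the order of seasonal differencing, and $\Phi_P(B^s)$ and $\Theta_Q(B^s)$ are the seasonal AR and MA polynomials.

Wavelet model
We can write $f(x)$ as:

$$f(x) = \sum_{k} c_{J,k}\,\varphi_{J,k}(x) + \sum_{j \le J}\sum_{k} d_{j,k}\,\psi_{j,k}(x)$$

where $\varphi_{J,k}$ and $\psi_{j,k}$ are the scaling and wavelet functions, and $c_{J,k}$ and $d_{j,k}$ are collectively known as the approximation and detail coefficients. In its simplest form, a signal can be decomposed as $S = A + D$, where $A$ and $D$ are called the Approximation and the Detail of the given signal $S$. The Approximation is an average of the signal and hence represents low-frequency components, while the Detail is a difference of the signal and hence represents high-frequency components [19].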
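The idea that a signal splits into an averaged part and a differenced part can be illustrated with a one-level, non-decimated Haar step. This is a minimal sketch under stated assumptions: the function name `haar_level1` is hypothetical, a circular boundary is assumed, and the code is plain averaging and differencing of neighbouring values rather than the exact wavelet routine used in the paper.

```python
import numpy as np

def haar_level1(signal):
    """One-level non-decimated Haar split of a signal into (approx, detail).

    approx[t] is the average of x[t] and x[t-1] (low-frequency part),
    detail[t] is half their difference (high-frequency part).
    A circular boundary is assumed, so x[-1] is the last observation.
    """
    x = np.asarray(signal, dtype=float)
    prev = np.roll(x, 1)          # prev[t] = x[t-1], wrapping at t = 0
    approx = (x + prev) / 2.0
    detail = (x - prev) / 2.0
    return approx, detail

x = np.array([4.0, 6.0, 10.0, 12.0])
A, D = haar_level1(x)
# A is [8, 5, 8, 11] and D is [-4, 1, 2, 1]; their sum recovers x exactly.
reconstructed = A + D
```

Applying the same split again to the approximation would give a coarser level, which is how a multi-level decomposition is built up.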

Feed forward neural network
FFNN is the main class of neural network model. There are three layers in this architecture, namely the input layer, the hidden layer and the output layer. In time series modelling, the input consists of lagged values of the series whereas the output is the current value. The hidden layer applies a non-linear activation function that transfers the signal from the input layer to the output layer. The architecture of FFNN can be seen in figure 1.
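The mapping from lagged inputs to the current value can be sketched as a single forward pass. The example assumes a logistic sigmoid in the hidden layer and a linear output unit, and takes lags 1, 12 and 13 as the default, the values reported in the Results section. The function names, the hidden-layer size of three and the random weights are illustrative assumptions; no training step is shown.

```python
import numpy as np

def make_lagged(series, lags=(1, 12, 13)):
    """Build the input matrix X (one column per lag) and the target y."""
    x = np.asarray(series, dtype=float)
    max_lag = max(lags)
    # Row i corresponds to time t = max_lag + i; column j holds x[t - lags[j]].
    X = np.column_stack([x[max_lag - lag: len(x) - lag] for lag in lags])
    y = x[max_lag:]
    return X, y

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def ffnn_forward(X, W_hidden, b_hidden, w_out, b_out):
    """Forward pass: sigmoid hidden layer followed by a linear output unit."""
    hidden = sigmoid(X @ W_hidden + b_hidden)   # shape (n_samples, n_hidden)
    return hidden @ w_out + b_out               # shape (n_samples,)

# Toy series 0, 1, ..., 19 so the lag structure is easy to inspect.
series = np.arange(20.0)
X, y = make_lagged(series)

# Untrained random weights (assumption), 3 inputs -> 3 hidden units -> 1 output.
rng = np.random.default_rng(1)
W_hidden = rng.normal(size=(3, 3))
b_hidden = rng.normal(size=3)
w_out = rng.normal(size=3)
b_out = 0.0
pred = ffnn_forward(X, W_hidden, b_hidden, w_out, b_out)
```

With the toy series, the first row of `X` is the values at times 12, 1 and 0, which are the lag-1, lag-12 and lag-13 predecessors of the first target at time 13.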

Results and discussion
The ARIMA model was developed first. The best model for the data was SARIMA(0,1,0)(0,1,1)$_{12}$. The ARIMA result determines the input of the three other models, i.e. the lagged variables at lags 1, 12 and 13: expanding the differencing operators $(1 - B)(1 - B^{12})$, with $B$ the backshift operator, gives $x_t = x_{t-1} + x_{t-12} - x_{t-13}$ plus moving-average error terms. In the wavelet model, we used the Maximal Overlap Discrete Wavelet Transform (MODWT) with the Haar wavelet for the decomposition. The level of decomposition and the number of coefficients at each level were determined following Murtagh et al. [20]. In the FFNN model, the number of units in the hidden layer was determined by experiments with 1 to 6 units. The activation function in the hidden layer is the logistic sigmoid, whereas a linear function is used as the activation function in the output layer. All other components of the FFNN were set as in the MATLAB routine. In the GRNN model, the number of units in the pattern layer is the same as the input.
The experimental results of the four chosen models are presented in table 1. The main results shown in table 1 can be described as follows. GRNN obtained the best in-sample prediction but was not able to give a good out-sample prediction. In fact, its out-sample value is the worst, worse than those of the linear models. On the other hand, FFNN gave the best out-sample prediction. It also provided a good result for in-sample prediction. Both of its predictions were better than those of the linear models. In this case, the best model is determined by considering not only the in-sample prediction but also the out-sample prediction. With these considerations in mind, we chose FFNN as the selected model. Based on the analysis of table 1, we also noticed that, overall, the non-linear models, or specifically the neural network models, showed superiority over the linear models. The good performance of neural network models for predicting air pollution data, or more specifically particulate matter, has also been reported in previous studies [21,22].
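To make the GRNN behaviour reported above more concrete, the following is a minimal sketch of a GRNN in its standard formulation: a Gaussian-kernel weighted average of the training targets, with one pattern unit per training observation. It is not the MATLAB routine used in this study. The function name `grnn_predict`, the smoothing parameter `sigma` and the toy data are assumptions made for illustration, and the paper does not state that this mechanism explains its results.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_new, sigma=1.0):
    """Standard GRNN prediction: kernel-weighted average of training targets.

    Each training pattern acts as one pattern unit; sigma controls how
    quickly a unit's influence decays with distance from its pattern.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    # Squared Euclidean distances, shape (n_new, n_train).
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    return (weights @ y_train) / weights.sum(axis=1)

# Toy data (assumption): three one-dimensional patterns.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 4.0])

# With a small sigma the network almost reproduces its training targets.
in_sample = grnn_predict(X_train, y_train, X_train, sigma=0.1)

# A new point midway between two patterns gets roughly their average target.
midpoint = grnn_predict(X_train, y_train, [[1.5]], sigma=0.1)

# Predictions are weighted averages, so they stay within the target range.
far_point = grnn_predict(X_train, y_train, [[3.0]], sigma=1.0)
```

In this formulation a small smoothing parameter yields a very low in-sample error by construction, so in-sample RMSE alone says little about how a GRNN will forecast unseen data.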
The non-linear techniques tried for predicting the SPM data outperform the linear techniques. Similar results were also stated by Shahraiyni and Sodoudi [23], where the neural network was one of the recommended techniques. Masih [23] concluded that neural network and Support Vector Machine (SVM) based approaches are preferred for air pollution forecasting. In partial contrast, Ibarra-Berastegi [24] stated that simple linear models work as well as more complicated neural networks, or only slightly worse. That statement contrasts with this research, which found a marked difference between the linear and non-linear models. In a comparison of different machine learning approaches, Karimian et al. [25] concluded that the LSTM model, a new hybrid model based on long short-term memory, achieved the best results, better than multiple additive regression trees (MART) and a deep feedforward neural network (DFNN). In this respect, modelling air pollution, or specifically particulate matter, becomes more interesting when it involves various machine learning-based methods.
Now, we display the performance of the chosen model visually. A good prediction is one that approaches the actual values. Figures 3 and 4 show plots of the in-sample and out-sample predictions against the actual values, respectively. In the plots, the actual values are drawn as the blue line and the predictions as the red line. Figure 3 shows that the in-sample predictions are close to the actual values across the training data. Prediction results at most observation points are very close to the actual values, and the model follows the fluctuation patterns of the data very well. This indicates that the model has succeeded in making in-sample predictions. Figure 4 compares the out-sample predictions with the actual values. As in the modelling stage, the out-sample predictions also approach the pattern of the actual values for the next few steps. Therefore, the chosen model can be said to be robust for modelling and forecasting.

Conclusion
A comparison of several linear and non-linear time series models for SPM data has been carried out. The non-linear models are more suitable for predicting the SPM data. GRNN gave the best in-sample prediction whereas FFNN gave the best out-sample prediction. Further work could vary the gradient-based optimization methods and compare them. As an alternative for future work, hybrids of linear and non-linear models and hybrids of neural networks with other non-linear models are recommended. Heuristic optimization for various neural network models such as RBFNN, CFNN and RNN, and hybrids with other techniques, can also be considered. The use of external variables can also be considered as one way to improve the performance of the model.