Hybridization Model for Air Pollution Prediction Using Time Series Data

In recent years, data science analysis, particularly time series prediction, has been widely employed across various industrial sectors. However, time series data is highly complex, especially when it contains seasonal patterns such as monthly, daily, or hourly fluctuations. Irregular fluctuations and external factors make accurate prediction increasingly difficult. Therefore, this research proposes a hybrid approach combining SVR-SARIMA, SVR-Prophet, LSTM-SARIMA, and LSTM-Prophet to enhance time series prediction accuracy. This study followed the OSEMN methodology: obtaining data, scrubbing (cleaning) data, exploring data, developing models, and interpreting the results. Seasonal effect predictions indicated a rise in SO2 and NO2 during the dry and rainy seasons over the next two years (average daily increments of 0.0831 μg/m3 for SO2 and 0.0516 μg/m3 for NO2). Estimates suggest a decrease in the other three particles. The evaluation showed that, among the four single models, SVR performed best (RMSE 7.765, MAE 5.477, and MAPE 0.261). The best-performing hybrid model was LSTM-Prophet (99.74% accuracy), with an RMSE of 12.319, an MAE of 12.057, and a MAPE of 0.259.


INTRODUCTION
Reducing air pollution levels can lessen symptoms of heart, lung, and acute respiratory disorders such as hay fever, asthma, pneumonia, bronchopneumonia, and others. Air pollution is one of the major environmental dangers to health. According to a 2018 WHO report, 90% of people on Earth breathe contaminated air, with Southeast Asian and Eastern Mediterranean regions having average air pollution levels that are five times higher than WHO guidelines [1]. Many sources contribute to air pollution, including the burning of fossil fuels in power plants, industrial smoke and exhaust, the burning of agricultural land, and vehicle exhaust emissions. The degree of urbanization is another factor influencing the amount of air pollution.
The WHO provides guidelines with thresholds for major air pollutants that are harmful to health. Particulate matter (PM), ozone (O3), nitrogen dioxide (NO2), carbon monoxide (CO), and sulfur dioxide (SO2) are some of these contaminants [2]. There are two forms of PM, PM2.5 and PM10, which are distinguished by particle size: PM2.5 particles are smaller than 2.5 µm, and PM10 particles are smaller than 10 µm. Because PM2.5 and PM10 particles can enter the lung cavity directly, they are extremely hazardous [2]. According to the Air Quality Index, Indonesia's PM2.5 and PM10 pollution levels are currently 6.1 times higher than the WHO norm. Jakarta, with an average AQI of 124, ranked among the top 10 most polluted cities in the world as of September 22, 2022, at 16:11. The severity of the situation makes it necessary to apply sophisticated analytical tools such as time series analysis to predict and manage pollution levels accurately. Time series analysis offers a methodical way to identify patterns, trends, and seasonal fluctuations in pollution data over time. By using it, decision-makers and environmental authorities can lower the health risks associated with prolonged exposure to polluted air, attenuate pollution spikes, and make well-informed decisions and focused responses.

This demonstrates the importance of data science analysis, particularly time series analysis for the predictive analysis of periodically recorded data, in comprehending and investigating such events. Since standard statistical techniques cannot be optimized, this approach is necessary for a fuller understanding of the dynamic nature of air pollution and its changes over time. Numerous researchers have employed a range of machine learning models and techniques, such as Prophet, SARIMA, the Vector Autoregressive Model (VAR), Support Vector Regression (SVR), and Long Short-Term Memory (LSTM), to aid in the predictive analysis of time series data. Research comparing these models, specifically LSTM with VAR, ARIMA, and SVR, has consistently shown LSTM to be more accurate than the other models [3][4][5]. Furthermore, research on air pollution prediction in California using SVR with a Radial Basis Function (RBF) kernel reported an accuracy of 94.1% [6]. Similarly, in a comparative study between SARIMA and Prophet, the Prophet model exhibited higher accuracy [7]. These results demonstrate the effectiveness of sophisticated analytical methods in predicting air pollution levels, particularly when time series analysis is combined with machine learning. By using these approaches, researchers and decision-makers can reduce the harmful consequences of air pollution on the environment and public health by implementing targeted interventions and making better-informed decisions.
Other studies have proposed hybridized models for prediction, such as the LSTM-ARIMA hybrid model [8] and the Prophet-SVR hybrid model [9], with the hybrid models achieving significantly higher accuracy than the single models tested. These findings suggest that hybridized models are more accurate than single models; however, the studies rely on time series data without attempting to identify the seasonal trends within it. Understanding seasonal trends within time series data is crucial for accurate predictions, which calls for a comprehensive approach that integrates various models to address the dynamic nature of air pollution. Hybridized models, which combine elements of both linear and non-linear models, can yield more accurate predictions than a single model [9]. Non-linear models are used in hybrid implementations to capture relationships that linear models cannot [10], and a recent study emphasizes how well hybrid models capture both linear and non-linear relationships in time series data [11]. Non-linear models in particular are essential in hybrid implementations because they capture complex relationships that linear models could miss [12].
Time series data can be decomposed into three main components: trend, seasonal, and residual [13]. Seasonal patterns appear clearly in data influenced by recurring periods, such as the wet and dry seasons of the year, or by calendar characteristics, such as the days of the week, months, or quarters of the year. Modeling each component independently simplifies the model. The study intends to 1) examine the trend of air pollution levels and forecast future pollution levels by detecting seasonal patterns; 2) conduct experiments to evaluate and compare the predictive abilities of the Prophet, SARIMA, SVR, and LSTM models; and 3) develop a hybridization model that combines the strengths of each model to improve performance. The results of the study can be used as a basis for decisions about mitigating the risks brought on by air pollution, especially in large cities like Jakarta.

Research Design
The proposed hybridized model forecasts air pollution by decomposing the data and incorporating seasonal influences. As part of the experimental study design, the models applied to the time series data were put through several trials and tests. After normalizing the data, the time series was decomposed into residual, trend, and seasonal components using the Seasonal-Trend decomposition using Loess (STL) method.
ISSN: 1978-1520
In this study, the data was divided into seasonal periods following the seasons common in Indonesia, namely the rainy and dry seasons. Figure 1 explains the methodology of splitting the dataset into training and testing sets and then breaking the data down into seasonal periods. Once the data has been prepared, prediction analysis is carried out using the Prophet, SARIMA, SVR, and LSTM prediction models. The mean squared error (MSE), one of the evaluation metrics, is given by:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (1)$$
where $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and $n$ is the number of observations.
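The evaluation metrics named in this study (MSE, RMSE, MAE, and MAPE) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and sample values are hypothetical, and MAPE is expressed here as a percentage, which the text does not specify.

```python
import math


def mse(actual, predicted):
    """Mean squared error: average of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumed unit).

    Assumes no actual value is zero, since each error is divided by it.
    """
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical values, for illustration only (not taken from the ISPU dataset).
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]

print("MSE :", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
print("MAE :", mae(actual, predicted))
print("MAPE:", mape(actual, predicted))
```

RMSE and MAE are in the same unit as the data (here μg/m3), whereas MAPE is scale-free, which is why the two kinds of metric can rank models differently.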

Model
This study employed four models, which were hybridized to enhance prediction accuracy.

A) Prophet
Facebook developed the Prophet model, which belongs to the family of generalized additive models (GAM) [13]. Three elements make up this model: trend, seasonality, and holidays [7]:
$$y(t) = g(t) + s(t) + h(t) + \epsilon_t$$
where $g(t)$ is a trend function that represents non-periodic changes in the value of the time series, and $s(t)$ represents periodic (seasonal) changes, such as daily, weekly, or annual patterns.
The effect of holidays that fall on a potentially irregular schedule over one or more days is represented by $h(t)$, while changes that the model is unable to account for are represented by the error term $\epsilon_t$. With $\epsilon_t$ assumed to be normally distributed, $y(t)$ is the predicted value.

B) SARIMA
Similar to ARIMA, the seasonal auto-regressive integrated moving average (SARIMA) model is applied to time series data that exhibits seasonal characteristics [14]. Seasonal models are typically written as ARIMA$(p, d, q)(P, D, Q)$, where $p$, $d$, and $q$ are the non-seasonal auto-regressive, differencing, and moving-average orders, and $P$, $D$, and $Q$ are the corresponding seasonal orders [10].
The SARIMA model makes it possible to difference data at seasonal frequencies (e.g., 12 months, 24 hours).
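Seasonal differencing, the operation behind the seasonal part of SARIMA, can be illustrated with a short sketch. The helper name `seasonal_difference` and the toy series are assumptions made for illustration; a full SARIMA fit would normally rely on a statistics library rather than this hand-written step.

```python
def seasonal_difference(series, period):
    """Return y[t] - y[t - period] for every t >= period.

    Subtracting the value one full season earlier removes a pattern
    that repeats every `period` steps.
    """
    if period <= 0 or period >= len(series):
        raise ValueError("period must be between 1 and len(series) - 1")
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical series: a trend of +1 per step plus a repeating 4-step pattern.
pattern = [0, 5, 10, 5]
series = [t + pattern[t % 4] for t in range(12)]

diffed = seasonal_difference(series, period=4)
print(diffed)  # the repeating pattern cancels; only the trend over 4 steps remains
```

In this toy case the differenced series is constant, which shows that the seasonal pattern has been removed and only the trend contribution over one period is left.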

C) SVR
Support Vector Regression (SVR) is an extension of SVM used to solve regression problems. To find the optimal function $f(x)$, the data can be transformed to a higher dimension using the kernel function in SVR [9]. Introducing the kernel function allows $f(x)$ to be expressed as a linear function in that higher-dimensional space. Linear, polynomial, radial basis function (RBF), and sigmoid kernels can be used.
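The four kernel types mentioned for SVR can be written out directly. The sketch below uses common textbook forms; the parameter names (`gamma`, `coef0`, `degree`) and their default values are assumptions for illustration and are not settings reported in this study.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * squared distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


# Hypothetical feature vectors, for illustration only.
u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))
print(polynomial_kernel(u, v, degree=2))
print(rbf_kernel(u, v, gamma=0.5))
print(sigmoid_kernel(u, v, gamma=0.01))
```

The RBF kernel equals 1 for identical inputs and decays toward 0 as the inputs move apart, which is what lets SVR fit non-linear relationships.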

D) LSTM
An expanded version of the RNN model, Long Short-Term Memory (LSTM) is linked in a temporal sequence and features an intricate recursive structure. The hidden state $h_t$, which varies over time, and the cell state $c_t$, which preserves long-term memory, are two crucial characteristics of LSTM. The input gate $i_t$, forget gate $f_t$, and output gate $o_t$, together with the preceding states $h_{t-1}$ and $c_{t-1}$, define the cell state $c_t$. The input data and the cell state $c_t$ define the hidden state $h_t$ [5,15].
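The gate equations themselves are not reproduced in this text. For reference, a commonly used formulation of the LSTM cell is given below; it is stated as an assumption about the variant intended here, with $x_t$ the input at time $t$, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```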
where $W$ and $b$ stand for the weight matrices and bias vectors, respectively, that are learned during model training.

E) Hybrid Model

The hybrid model proposed in this study uses a modified model configuration [10,16]. It takes the weighted average of the two prediction models:
$$y = w_1 f(x) + w_2 g(x) + e$$
where:
− $y$ is the value to be predicted (output).
− $x$ is the input or feature.
− $f(x)$ is the result of model 1 (non-linear).
− $g(x)$ is the result of model 2 (linear).
− $w_1$ and $w_2$ are weights used to determine the contribution of each model.
− $e$ is a constant.
This configuration combines the outputs of the two models and determines the relative contribution of each model to the result by assigning a weight to each. In this experiment, we used $e = 0$ and assigned a weight of 50% to both $w_1$ and $w_2$ to ensure a balanced preference for each model. Because each prediction model contributes the same amount, this equal weighting avoids bias towards either model and keeps the evaluation fair.
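The weighted combination described above can be sketched as follows. The function name `hybrid_forecast` and the sample predictions are hypothetical; the default weights of 0.5 and the constant of 0 follow the values stated in the text.

```python
def hybrid_forecast(nonlinear_pred, linear_pred, w1=0.5, w2=0.5, e=0.0):
    """Combine two forecasts element-wise as w1 * f(x) + w2 * g(x) + e."""
    if len(nonlinear_pred) != len(linear_pred):
        raise ValueError("both forecasts must have the same length")
    return [w1 * f + w2 * g + e for f, g in zip(nonlinear_pred, linear_pred)]


# Hypothetical forecasts, for illustration only.
lstm_pred = [20.0, 22.0, 24.0]     # stands in for the non-linear model f(x)
prophet_pred = [24.0, 26.0, 20.0]  # stands in for the linear model g(x)

hybrid = hybrid_forecast(lstm_pred, prophet_pred)
print(hybrid)  # [22.0, 24.0, 22.0]
```

With equal weights the hybrid is simply the midpoint of the two forecasts at each time step, so an over-prediction by one model can be offset by an under-prediction from the other.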
The selection of models for air pollution prediction is a crucial decision that should be based on the strengths and capabilities of each model in addressing specific challenges.The Prophet algorithm is chosen for its effectiveness in handling data with strong trends and seasonal patterns, as well as its ability to deal with missing or irregularly spaced data [17].SARIMA is selected as a classical model capable of handling time series data with complex seasonal and trend patterns [18].SVR is preferred for its capacity to handle nonlinear relationships in data, often encountered in air pollution prediction contexts [19].LSTM is chosen for its capability to capture complex temporal patterns and nonlinear relationships within time series data, making it suitable for modeling air pollution influenced by multiple factors [20].
Analyzing the strengths and weaknesses of each model reveals that Prophet excels in handling complex trends and seasonality but lacks flexibility with non-periodic patterns [17].SARIMA performs well with complex seasonal and trend patterns but struggles with data exhibiting unstable or changing trends [18].SVR is adept at handling nonlinear relationships but can be sensitive to parameter tuning and computationally intensive [19].LSTM, while effective in capturing complex temporal patterns, is prone to overfitting and requires substantial data for effective training [20].
To enhance the accuracy and stability of air pollution forecasts, a hybridization process is proposed that combines predictions from Prophet, SARIMA, SVR, and LSTM using ensemble methods or by weighting each prediction according to the confidence in the respective model [17]. In summary, Prophet, SARIMA, SVR, and LSTM were selected for air pollution forecasting because each handles a different aspect of the data well, and combining them through hybridization makes it possible to build a more robust forecasting system that capitalizes on the strengths of each model.

OSEMN Framework
OSEMN is a framework that can be used to make data analysis easier [21]. Its use in this study, shown in Figure 2, is intended to support the planning and management of the research design and activities described above. The steps involved in conducting the research are obtaining the data, scrubbing (cleaning) it, exploring it through analysis and visualization, modeling, and interpreting the outcomes of the predictive analysis.

RESULT AND DISCUSSION
The findings of the analyses, organized by the stages of the study identified above, are discussed below.

Obtain
The Air Pollution Standard Index (ISPU) data, which is saved in the dataspku.csv file, is a time series spanning from January 1, 2016, to December 31, 2021. It is sourced from the DKI Jakarta Provincial Environment Office and is made available through Jakarta Open Data (https://data.jakarta.go.id). Table 1 displays the variables of the ISPU dataset, which consists of 10,960 observation rows and 10 columns. Since the amount of missing data is small, the deletion approach is employed to deal with missing values. Some particle variables were converted to numeric data types.

Scrub
The scrub process cleans and modifies the data so that the analysis can be completed correctly based on the features of the variables and data types in the dataset. During this process, the variables pm10, so2, co, o3, and no2 were converted to numeric types, while the date variable was changed to a date type to reflect daily data. Several of these variables have missing or unavailable (NA) values. Specifically, the pm10, so2, co, o3, no2, max, critical, and category variables contain NA values, affecting 1365 rows in total. After removing the rows with NA values, the original dataset of 10,960 rows was reduced to 9595 observation rows and 10 columns, covering the period from January 1, 2016, to December 31, 2021. The rows deleted due to NA values amount to 12.45% of the data (1365 rows). Notably, the variables with the most missing values are max and critical, but their absence does not interfere with the analysis process.
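The deletion approach and the share of rows removed can be sketched as follows. This is an illustration only: the helper names and the four sample rows are hypothetical, and only the row counts passed to `percent_removed` come from the text.

```python
def drop_incomplete_rows(rows, required):
    """Keep only the rows in which every required field has a value."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def percent_removed(total_rows, removed_rows):
    """Share of rows removed, as a percentage of the original row count."""
    return 100.0 * removed_rows / total_rows


# Hypothetical rows, for illustration only; None marks a missing (NA) value.
rows = [
    {"date": "2016-01-01", "pm10": 59.0, "so2": 25.0},
    {"date": "2016-01-02", "pm10": None, "so2": 27.0},
    {"date": "2016-01-03", "pm10": 61.0, "so2": None},
    {"date": "2016-01-04", "pm10": 55.0, "so2": 30.0},
]

clean = drop_incomplete_rows(rows, required=("pm10", "so2"))
print(len(clean), "of", len(rows), "rows kept")

# Row counts reported for the ISPU dataset: 10,960 rows, 1365 of them with NA values.
share = percent_removed(10960, 1365)
print(f"{share:.2f}% of rows removed")
```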

Explore
Variables for the prediction model are selected after the data has been cleaned. Figure 3 shows the particle data distribution from the obtained dataset. Figure 4 indicates that SO2 particles rank higher than the other particles from 2016 to 2021, which directs the emphasis of this study's prediction analysis. High amounts of SO2 are associated with long-term respiratory health problems, which is in line with the goal of the study. Although PM10 and NO2 were considered, Figure 3 highlights the persistent presence of SO2, giving it priority in the analysis.
This study examines the seasonal attributes of SO2 particle concentrations by dividing the data into wet and dry seasons. In Indonesia, the dry season occurs from April to September, while the rainy season spans from October to March, and this forms the basis for the division. The seasonal division aims to determine whether the rainy or dry season influences the increase in air pollution particles in Jakarta. The average SO2 concentration from 2016 to 2021, together with the lowest and highest values (1.19 μg/m3 and 50.87 μg/m3, respectively) and the standard deviation (11.81 μg/m3), shows that levels are highest from June to September, which suggests that they peak during the dry season.
The visualization in Figure 7 shows that the SVR model's predictions closely align with the real values, whereas the Prophet model's results deviate from the actual values considerably more than those of SARIMA and LSTM. This indicates that SVR predicts the actual values more accurately. According to Table 2, the SVR model performs better than the other three models, with lower average MAE and RMSE values. With significantly reduced MSE and RMSE values, the SVR model outperforms the others and exhibits a lower prediction error rate. A lower MAE signifies a smaller deviation between predicted and actual values, while a lower MAPE implies a smaller relative prediction error.
The analyses of air pollution-causing particles, which take Indonesia's dry and rainy seasons into account, projected an increase in SO2 and NO2 particles over the next two years, accompanied by a decline in the other three particles. When predicting particles, it is essential to consider seasonal fluctuations and long-term trends, which may affect model accuracy and the reliability of the results. It should also be acknowledged that these model assessments might not capture the full complexity of real-world air pollution patterns, given factors such as seasonal variations, unpredictable weather, regulatory changes, and shifts in human behavior, such as transportation habits. Table 3 summarizes the test results and predictions for the five particles.
To ensure consistent prediction lengths across the SVR, SARIMA, and Prophet models, linear interpolation is applied before they are merged with the LSTM model. All models then have the same prediction length, and a hybrid approach with equal weights (50%) is applied. After the hybrid models are created, their predictions are compared against the actual values to assess accuracy. Hybrid models are favored for their ability to mitigate the flaws of individual models and to improve forecast precision by combining multiple models to reduce bias and variance. Leveraging the strengths of each model, such as LSTM's handling of temporal patterns and Prophet's modeling of seasonality, contributes to improved predictions. Figure 8 compares the predicted values of the SVR-SARIMA and LSTM-SARIMA hybrid models. The results suggest that the LSTM-SARIMA model is more accurate and aligns better with the actual values than the SVR-SARIMA model. The LSTM-Prophet hybrid model outperforms the other three hybrid models, as shown by its lower RMSE, MAE, and MAPE values in Table 4.
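The length-alignment step mentioned above, in which predictions are brought to a common length by linear interpolation before the models are combined, could look like the sketch below. The function `resample_linear` and the sample values are assumptions for illustration; the text does not describe how the interpolation was actually carried out.

```python
def resample_linear(values, target_len):
    """Linearly interpolate `values` onto `target_len` evenly spaced points.

    The first and last points are preserved; intermediate points are
    weighted averages of their two nearest source neighbours.
    """
    if len(values) < 2 or target_len < 2:
        raise ValueError("need at least two source points and two target points")
    step = (len(values) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * step
        lo = min(int(pos), len(values) - 2)  # index of the left neighbour
        frac = pos - lo                      # distance from the left neighbour
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


# Hypothetical short forecast stretched to five points, for illustration only.
short_pred = [10.0, 20.0, 30.0]
aligned = resample_linear(short_pred, 5)
print(aligned)  # [10.0, 15.0, 20.0, 25.0, 30.0]
```

Once two forecasts have the same length, they can be averaged point by point with the equal weights used in this study.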

iNterpret.
Based on the outcomes of the tests conducted on the models, the results can be interpreted as follows:
a. Estimates for the next two years suggest a growth in SO2 and NO2 particle concentrations in both the dry and wet seasons. Across both seasons, the average daily rise is 0.0831 μg/m3 for SO2 particles and 0.0516 μg/m3 for NO2 particles. The increase in both particles is more pronounced during the dry season, likely attributable to heightened fuel combustion, industrial activities, and energy consumption. However, the precise influence of these factors on elevated NO2 and SO2 levels during the dry season warrants further investigation, contingent upon the support of a comprehensive dataset.
b. Based on independent testing of the Prophet, SARIMA, SVR, and LSTM models, SVR performs best, with an RMSE of 7.765, an MAE of 5.478, and a MAPE of 0.261.
c. The LSTM-Prophet hybrid model achieves a prediction accuracy of 99.74%. With an RMSE of 12.319, an MAE of 12.057, and a MAPE of 0.259, it outperforms the other three hybrid models.
d. Hybridization with non-linear models such as SVR and LSTM can enhance the performance of the SARIMA and Prophet models. The results showed that the Prophet and SARIMA models alone were not as effective as the SVR-SARIMA, SVR-Prophet, LSTM-SARIMA, and LSTM-Prophet models.
e. SVR and LSTM excel in short-term predictions and pattern detection across various time intervals, while Prophet and SARIMA are adept at analyzing long-term data, with Prophet in particular detecting seasonal patterns automatically. Combining LSTM with Prophet effectively addresses seasonal variations by leveraging LSTM's capacity for capturing non-linear relationships, resulting in improved predictions for datasets with complex temporal and seasonal patterns.

Figure 1. Research Design
Hybridization is the next step after testing each prediction model. The evaluation metrics of mean squared error (MSE), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) were used for each prediction model.

Figure 3. Particle Data Distribution
The categories in Figure 5 help to explain why, from 2016 to 2021, DKI Station 4 (Lubang Buaya) saw the highest concentration of SO2 particles, with 55973 μg/m3 in the medium group.

Figure 6. Seasonality Index Particle SO2
Figure 7 presents the comparison results of the single models that forecast SO2 particles.

Figure 8. Comparison of Hybrid Model (SVR-SARIMA and LSTM-SARIMA)
Figure 9 presents a comparison of the prediction values of the hybrid SVR-Prophet and LSTM-Prophet models. The results indicate that the LSTM-Prophet model outperforms the SVR-Prophet model in terms of accuracy and alignment between predicted and actual values.

Table 2. Comparison of Single Model Evaluation Metrics

Table 3. Estimated Average/Day Particles in Dry and Rainy Seasons

Table 4. Comparison of Hybrid Model Evaluation Metrics