Prediction of air quality in Jakarta during the COVID-19 outbreak using long short-term memory machine learning

Air pollution is a global problem, not one confined to a single location. It is caused by pollutants that are harmful to human health and the environment. The most influential pollutants are particulate matter, ground-level ozone, carbon monoxide, sulfur dioxide, and nitrogen dioxide. When COVID-19 was declared a pandemic, several countries decided to lock down. In Jakarta, Indonesia, large-scale social restrictions (PSBB) were applied. The resulting impact was a drastic reduction in air pollution. This paper aims to predict air quality during the COVID-19 outbreak in Jakarta using long short-term memory (LSTM) machine learning. The LSTM model is evaluated using the root mean square error (RMSE). The results show that the Adam optimizer brings the prediction results closer to the dataset used.


Introduction
Air pollution is an unresolved global problem that is likely to persist. It is a form of pollution in which air contamination causes physical, biological, and chemical changes to the atmosphere [1]. Pollution levels in an area can be measured with sensors [2,3]. The sensor readings are then processed and sent via internet of things technology to determine the levels of particles and gases produced by certain pollutants [4]. The resulting pollution levels can also be displayed on a map that can be accessed via a browser [5]. In addition, to analyze the effect of the number of vehicles on the level of pollution produced, vehicles can be detected in images or videos taken at traffic lights [6], so that the traffic lights can be set automatically to let the busiest lane run first. This can regulate the level of pollution produced by vehicles. Sources of air pollution can be classified into stationary sources and mobile sources. The stationary sources consist of industry, power plants, and households.
Meanwhile, the mobile sources are motor vehicle activities and sea transportation. There are two main types of air pollutants: gaseous compounds and solid compounds. They cause air pollution in industrial areas and big cities, which affects health, including cancer of various organs of the body, heart disease, hypertension, reproductive disorders, respiratory disorders, and so on [7]. The United States Environmental Protection Agency (US EPA) lists five most influential pollutants: particle pollution, also known as particulate matter (including PM2.5 and PM10), ground-level ozone (O3), carbon monoxide (CO), sulfur dioxide (SO2), and nitrogen dioxide (NO2). These pollutants were reported to have decreased drastically during the COVID-19 outbreak [8]. The COVID-19 outbreak has affected human life and the environment almost everywhere in the world, including Indonesia. Even though the recovery rate for COVID-19 in Indonesia is higher than the death rate [9], if the spread is not watched and anticipated, the number of cases can grow considerably while the capacity of hospitals to accommodate patients is limited. The number of COVID-19 patients in Indonesia is predicted to keep increasing over time if there are no proper regulations. One of the ways to prevent the spread of COVID-19 in several countries is a lockdown. In Indonesia, the spread is anticipated through large-scale social restrictions (PSBB), which the Jakarta government applied. Jakarta is one of the big cities in Indonesia and contributes heavily to pollution. Various methods have been used to reduce congestion and to enforce the odd-even license plate regulation. The PSBB has reduced the level of pollution in Jakarta. This paper aims to predict air pollution during the COVID-19 outbreak in Jakarta using long short-term memory (LSTM) machine learning. The model obtained is evaluated using the root mean square error (RMSE).

Related Works
Air quality prediction is a complex issue. Air quality in an industrial area is a challenge for policymakers, who need the right information to make decisions. Machine learning has found a place in forecasting such information, including air quality [10]. Conditions during normal daily activities can be used to predict and monitor air quality in urban areas. Hourly predictions can be modeled using support vector regression (SVR), as was done in [11]. The right dataset and variables must be selected to build an accurate air pollution prediction model. Predicting air pollutant concentrations falls into two main classes: estimation and forecasting. Ensemble learning algorithms and linear regression are suitable for pollution estimation, while neural networks (NN) and support vector machines (SVM) are suitable for forecasting [12]. A prediction system for PM10 and PM2.5 air pollution was developed in [13] to predict their occurrence in Korea. It uses deep learning with stacked autoencoders for learning and training on the data. These studies indicate that machine learning predictions are quite acceptable for air quality modeling.
To suppress spikes in the COVID-19 outbreak, governments around the world have issued various regulations. These have been quite effective in reducing travel, and economic activity has also decreased. On the other hand, air quality improved during COVID-19, marked by a reduction in air pollution during the months in which these restrictions were in force [14]. The study in [15] used the LSTM method to predict air quality in Madrid. Each type of pollutant behaves differently, and differently at each location, so the machine learning implementation requires care. The research in this paper uses the same method, LSTM, but differs in that it takes the air quality in Jakarta during COVID-19, using data from six locations. The rate of deaths caused by COVID-19 is predicted to increase over time [16], so the necessary mitigation must be taken to anticipate it.

Research Methodology
This research follows four steps: preprocessing the data, initializing the parameters, training the LSTM network, and testing on the test data. The root mean square error (RMSE) is used to measure the error between the model and the original data. This paper compares three optimizers, adaptive moment estimation (Adam), stochastic gradient descent (SGD), and root mean square propagation (RMSProp), and obtains the RMSE at 10, 20, 30, 50, and 100 epochs. The dataset used in this paper is obtained from the website https://data.jakarta.go.id/dataset/indeksstandar-pencemaran-udara-ispu-tahun-2020 [17]. It consists of the air pollution standard index (indeks standar pencemar udara (ISPU)) obtained from six air quality monitoring locations in Jakarta: Bunderan HI, Kelapa Gading, Jagakarsa, Lubang Buaya, Kebun Jeruk, and DKI Jakarta. The data cover January 2020 until early June 2020, the period in which the COVID-19 outbreak emerged. Some of the data in the dataset are missing. Because the missing data have no significant effect, the rows with missing values are deleted to obtain continuous data. Likewise, for duplicated data, only the first record is kept. Information on the air quality dataset at Bunderan HI is given in Table 1, which shows the amount of data for each pollutant; 148 records are used. The air quality dataset for Kelapa Gading is shown in Table 2, with 147 records used for observation. The dataset for Jagakarsa is shown in Table 3, with 133 records, and the dataset for Lubang Buaya is shown in Table 4, with 122 records.
Meanwhile, the dataset used in Kebun Jeruk is shown in Table 5, with 141 records used for observation. The air quality dataset for the Province of DKI Jakarta is shown in Table 6; 168 records are used, the largest number among the locations. The air quality levels can be seen in Table 7. Table 7 shows that the higher the air quality index value, the greater the level of air pollution and the greater the health risk for people and the environment. The standard levels used by the United States Environmental Protection Agency (US EPA) and by Indonesia are almost the same. They differ for index values between 101 and 200: the US EPA divides this range into two air pollution levels, unhealthy for sensitive groups and unhealthy, while Indonesia has only one level, unhealthy.
The LSTM module has four gates: input gates, cell gates, forget gates, and output gates. The memory cell is designed using linear and logistic units with multiplicative interactions. As in the RNN, the LSTM network consists of repeating modules formed from LSTM cells. The computational working principle of LSTM can be stated as follows.

Long short-term memory network machine learning
• The value of the input can only be stored in the memory cell if the input gate allows it. The activation function in the forget gate is a sigmoid function whose output lies between 0 and 1. If the output is 0, all data will be discarded, whereas if the output is 1, all data will be stored.
The input gate $i_t$ and the candidate value $\tilde{c}_t$ of the cell state are calculated using equations 1 and 2, respectively.
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (1)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (2)$$
Where $\sigma$ is a sigmoid function, $W_i$ is a weight of input at time $t$, $x_t$ is an input value at time $t$, $U_i$ is a weight of output at time $t-1$, $h_{t-1}$ is an output value at time $t-1$, and $b_i$ is a bias in the input gate. Where $\tanh$ is a hyperbolic tangent function, $W_c$ is a weight of input at the cell, $U_c$ is a weight of the output value from the previous cell, $h_{t-1}$ is an output value from the previous cell, and $b_c$ is a bias in the cell.
• The forget gate value $f_t$ is calculated as shown in equation 3.
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (3)$$
Where $W_f$ is a weight of input at time $t$, $U_f$ is a weight of output at time $t-1$, $h_{t-1}$ is an output value at time $t-1$, and $b_f$ is a bias in the forget gate.
• The memory cell state $c_t$ is calculated using equation 4.
$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1} \qquad (4)$$
Where $c_{t-1}$ is the memory cell state at the previous cell and $\odot$ is element-wise multiplication.
• The output gate value $o_t$ is calculated as shown in equation 5.
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (5)$$
Where $W_o$ is a weight of input at time $t$, $U_o$ is a weight of output at time $t-1$, $h_{t-1}$ is an output value at time $t-1$, and $b_o$ is a bias in the output gate.
• The final output $h_t$ is calculated as in equation 6.
$$h_t = o_t \odot \tanh(c_t) \qquad (6)$$
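The gate computations described above can be traced in a few lines of code. The following is a minimal sketch of a single LSTM time step in plain Python, not the implementation used in this paper. The names are assumptions made for illustration: `W_*` are input weights, `U_*` are weights on the previous output, `b_*` are biases, with suffixes `i`, `c`, `f`, and `o` for the input gate, candidate cell value, forget gate, and output gate; `zero_params` is a hypothetical helper.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """Advance one time step; p maps names such as "W_i" to weights."""
    # Input gate and candidate cell value.
    i_t = [sigmoid(z) for z in affine(p["W_i"], x_t, p["U_i"], h_prev, p["b_i"])]
    c_hat = [math.tanh(z) for z in affine(p["W_c"], x_t, p["U_c"], h_prev, p["b_c"])]
    # Forget gate.
    f_t = [sigmoid(z) for z in affine(p["W_f"], x_t, p["U_f"], h_prev, p["b_f"])]
    # New cell state: admitted candidate plus retained previous state.
    c_t = [i * g + f * c for i, g, f, c in zip(i_t, c_hat, f_t, c_prev)]
    # Output gate and final output.
    o_t = [sigmoid(z) for z in affine(p["W_o"], x_t, p["U_o"], h_prev, p["b_o"])]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(n_in, n_hidden):
    """Hypothetical helper: all-zero weights, handy for checking shapes."""
    p = {}
    for gate in "icfo":
        p["W_" + gate] = [[0.0] * n_in for _ in range(n_hidden)]
        p["U_" + gate] = [[0.0] * n_hidden for _ in range(n_hidden)]
        p["b_" + gate] = [0.0] * n_hidden
    return p


# One step with 1 input feature and 4 hidden units.
h_t, c_t = lstm_step([0.5], [0.0] * 4, [2.0] * 4, zero_params(1, 4))
```

With all weights set to zero, every gate evaluates to 0.5 and the candidate value to 0, so the new cell state is half of the previous one. In practice the weights are learned during training.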

Optimizer
The adaptive moment estimation (Adam) optimization algorithm is an extension of stochastic gradient descent (SGD) that is also used in deep learning. It was introduced by Kingma and Ba [20]. Adam is an optimization algorithm developed by leveraging the advantages of the adaptive gradient (AdaGrad) and root mean square propagation (RMSProp) algorithms. Like RMSProp, Adam adapts the learning rate of each parameter using a moving average of the squared gradient (the second moment, or uncentered variance); in addition, it uses a moving average of the gradient itself (the first moment, or mean). The algorithm calculates exponential moving averages of the gradient and of the squared gradient, and the parameters $\beta_1$ and $\beta_2$ control the decay rates of these moving averages. The pseudo code of Adam is given in [20].
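The update rule described above can be sketched in plain Python. This is an illustration of the algorithm published by Kingma and Ba, not the code used in this paper. The function name `adam_minimize`, its signature, and the toy objective are assumptions; the default values of `lr`, `beta1`, `beta2`, and `eps` are the ones suggested in the Adam paper.

```python
import math


def adam_minimize(grad, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a function given its gradient, using Adam updates.

    grad: callable returning the gradient (a list) at a parameter list.
    theta: initial parameters (a list of floats).
    """
    theta = list(theta)
    m = [0.0] * len(theta)  # moving average of the gradient
    v = [0.0] * len(theta)  # moving average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(theta)
        for k in range(len(theta)):
            m[k] = beta1 * m[k] + (1.0 - beta1) * g[k]
            v[k] = beta2 * v[k] + (1.0 - beta2) * g[k] ** 2
            # Bias correction compensates for m and v starting at zero.
            m_hat = m[k] / (1.0 - beta1 ** t)
            v_hat = v[k] / (1.0 - beta2 ** t)
            theta[k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective (an assumption for illustration): f(theta) = (theta - 3)^2.
theta_min = adam_minimize(lambda th: [2.0 * (th[0] - 3.0)], [0.0], lr=0.1, steps=2000)
```

On this toy objective the returned parameter ends close to 3. The learning rate of 0.1 is chosen only to make the example converge quickly and is not a setting reported in this paper.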

Root Mean Square Error (RMSE)
The error of the prediction results is calculated using the root mean square error (RMSE). The model with the best prediction accuracy is the one with the smallest RMSE value, that is, the value closest to 0. The RMSE is given in equation 7,
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\Omega_i - \Lambda_i)^2} \qquad (7)$$
where $\Omega_i$ is the value predicted by the model, $\Lambda_i$ is the value of the observed $i$-th data point, and $N$ is the amount of data.
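As a small illustration of this metric, the sketch below computes the RMSE of two equally long sequences in plain Python. The function name `rmse` and the sample values are assumptions for illustration and are not taken from the dataset.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: a perfect prediction, and one that is off by 3 everywhere.
perfect = rmse([10.0, 20.0, 30.0], [10.0, 20.0, 30.0])
off_by_three = rmse([3.0, 3.0], [0.0, 0.0])
```

A perfect prediction gives 0, and a constant error of 3 gives an RMSE of 3, which shows that the metric is expressed in the same unit as the data.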

Result and Discussion
This paper uses only the PM10 and O3 data from the six locations in Jakarta. The graphs of the PM10 data for these six locations are shown in Figure 2, while the graphs of the O3 data are shown in Figure 3. Figures 2 and 3 show that in the period March-April there was a drastic decrease in PM10 and O3 pollution. This is because large-scale social restrictions (PSBB) were implemented in Jakarta at that time, taking effect from Friday (10/4/2020) until Thursday (23/4/2020) [21]. They were imposed after the World Health Organization (WHO) declared COVID-19 a global pandemic and patients were found in Jakarta. However, after the PSBB period ended, PM10 and O3 pollution increased again to the earlier levels. The model testing in this paper analyzes the impact of the parameters on the accuracy obtained. The parameters tested were the composition of the data, the number of time series patterns, the number of hidden neurons, and the number of epochs, in order to determine the LSTM weights with the lowest RMSE. The data composition was 70% training data and 30% test data. The number of hidden neurons in the LSTM was 4, and the numbers of epochs used were 10, 20, 30, 50, and 100. The RMSE results for the PM10 pollutant can be seen in Table 8, and those for the O3 pollutant in Table 9. The validation of the model output with the best RMSE value for the PM10 and O3 data is shown in Figure 4.
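The 70%/30% data composition and the framing of a series as time series patterns can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the split is taken chronologically, each pattern is a window of `n_steps` past values paired with the next value, and the window length of 3 is hypothetical because the paper does not state it. The function names and the placeholder series are likewise assumptions.

```python
def chronological_split(series, train_fraction=0.7):
    """Split a series into leading training data and trailing test data."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def make_windows(series, n_steps):
    """Pair each window of n_steps past values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(series) - n_steps):
        inputs.append(series[start:start + n_steps])
        targets.append(series[start + n_steps])
    return inputs, targets


# Placeholder series with 148 points, the number of records at Bunderan HI.
series = [float(day) for day in range(148)]
train, test = chronological_split(series, train_fraction=0.7)

# Hypothetical window length of 3 past values per pattern.
train_inputs, train_targets = make_windows(train, n_steps=3)
```

Splitting before windowing keeps test values out of the training patterns, which matters when the goal is to judge predictions on data the model has not seen.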

Conclusion
This paper has produced a prediction model for air quality in Jakarta during the COVID-19 pandemic using LSTM networks. The best prediction model is obtained with the Adam optimizer. The RMSE values obtained with the Adam and RMSProp optimizers are almost the same, but Adam is slightly superior.