PM 2.5 Concentration Forecasting Using Weighted Bi-LSTM and Random Forest Feature Importance-Based Feature Selection

Particulate matter (PM) in the air can cause various health problems and diseases in humans. In particular, the small size of PM 2.5 particles enables them to penetrate deep into the lungs, causing severe health impacts. Exposure to PM 2.5 can result in respiratory, cardiovascular, and allergic diseases, and prolonged exposure has also been linked to an increased risk of cancer, including lung cancer. Therefore, forecasting the PM 2.5 concentration in the surrounding environment is crucial for preventing these adverse health effects. This paper proposes a method for forecasting the PM 2.5 concentration 1 h ahead using bidirectional long short-term memory (Bi-LSTM). The proposed method involves selecting input variables based on the feature importance calculated by random forest, classifying the data to assign weight variables that reduce bias, and forecasting the PM 2.5 concentration using Bi-LSTM. To evaluate the performance of the proposed method, two case studies were conducted. The first compares forecasting performance with and without the proposed preprocessing. The second compares forecasting performance between deep learning models (long short-term memory, gated recurrent unit, and Bi-LSTM) and conventional machine learning models (multi-layer perceptron, support vector machine, decision tree, and random forest). In case study 1, the proposed method improves the performance indices (RMSE: 3.98%p, MAE: 5.87%p, RRMSE: 3.96%p, and R²: 0.72%p) because weights are assigned according to the input variables before forecasting is performed. In case study 2, Bi-LSTM, which considers both directions (forward and backward), forecasts more effectively than the conventional models (RMSE: 2.70, MAE: 0.84, RRMSE: 1.97, R²: 0.16). These results show that the proposed method can effectively forecast PM 2.5 even when the data in the high-concentration section are insufficient.


Introduction
Particulate matter (PM) refers to materials scattered throughout the atmosphere. PM with a diameter of 2.5 micrometers or less is defined as PM 2.5 [1]. PM 2.5 is readily absorbed during breathing owing to its small size and light weight. PM 2.5 absorbed into the body can cause bronchitis, pneumonia, chronic obstructive pulmonary disease, heart disease, stroke, and other respiratory diseases [2][3][4]. Therefore, many industrialized countries have made significant efforts to reduce the risk of PM exposure. The U.S. Environmental Protection Agency defines daily average PM 2.5 concentrations of 35 µg/m³ or more as high concentrations and regulates daily average concentrations so that they do not exceed 12 µg/m³. The European Environment Agency regulates PM 2.5 emissions when the daily average PM 2.5 concentration exceeds 25 µg/m³. The South Korean Government defines high concentrations of PM 2.5 as 36 µg/m³ or more. When the PM 2.5 concentration is 75 µg/m³ or higher and lasts for 2 h, the government issues a PM 2.5 alert and implements air pollution [...]

The effectiveness of PM 2.5 forecasting models relies heavily on the distribution of the training data. When a model is trained with imbalanced data, it becomes biased towards the majority class. Because too little data from the minority classes is available for training, predictions for those classes are often incorrect. Furthermore, since the model is biased toward the majority class, its generalization ability may suffer. These data imbalance issues can limit the model's performance and reduce its reliability. Various methods have been proposed to solve this data imbalance problem, and they fall into sampling-based methods and cost-sensitive learning methods [33]. Sampling-based methods adjust the proportion of the data via data sampling and are divided into undersampling and oversampling.

Undersampling uses only a portion of the majority class data to balance the ratio between the majority and minority classes; examples include Tomek links and cluster centroids. These methods are easier to scale than oversampling; however, they result in data loss because they reduce the existing data. Oversampling addresses data imbalance by augmenting the minority class data and includes synthetic minority oversampling, adaptive synthetic sampling, borderline synthetic minority oversampling techniques, and Kriging [34]. Unlike undersampling, oversampling does not result in data loss. However, because it replicates data from the minority classes, it may overfit the training data and degrade performance on the test data. Moreover, the replicated data for minority classes may not resemble the existing data. When this occurs, the newly generated data act as noise, which may negatively affect the overall forecasting performance of the model. Cost-sensitive learning allocates greater weight to minority classes to improve the classification performance of minority classes in imbalanced data. The advantage of this method is that it preserves existing data and does not result in information loss. Additionally, it avoids the loss of generalization ability caused by duplicate data.

The contributions of this study are summarized as follows:

1. Traditional time-series forecasting methods, such as ARMA, ARIMA, and MLR, often have limited capabilities in accounting for non-linear relationships. Meanwhile, machine learning models, such as SVM and decision trees, cannot incorporate past time points during the forecasting process. In contrast, Bi-LSTM, a recurrent model that uses past output values as inputs to the hidden layer, can effectively overcome these limitations. Among recurrent methods, LSTM and GRU, which are unidirectional models that consider past output values in the hidden layer, may suffer from deteriorating prediction performance as the forecast horizon becomes longer. On the other hand, Bi-LSTM, which is trained bidirectionally, can reflect more information than unidirectional models.

2. Selection of input variables is necessary to accurately forecast PM 2.5. In general, the wrapper method, in which input variables are selected heuristically based on experience, is time-consuming and computationally expensive. RF can select variables that are effective for prediction by calculating the importance of each variable. In particular, RF can reduce the time cost compared with heuristic wrapper methods. In addition, unlike the filter method, it is effective when implementing a PM 2.5 concentration forecast model that accounts for non-linearity.

3. To address the data imbalance problem, the proposed method uses the weighting method, which is a cost-sensitive learning method applied during model training. Unlike the sampling method, which adjusts the proportion of data, the weighting method does not result in information loss. Accordingly, the approach prevents bias towards the majority class data during model training, which is a common problem with imbalanced datasets.
The remainder of this paper is organized as follows. In Section 2, we provide detailed descriptions of the study site, the data used in this study, and the proposed method for forecasting PM 2.5 concentration. The experimental results are presented in Section 3, where the performance of the proposed method is evaluated using various performance indices. Finally, Section 4 discusses the results and their implications, as well as the conclusions drawn from this study.

Figure 1 shows a flowchart of the PM 2.5 concentration forecast method proposed in this study. Figure 1a shows a flowchart for training a model to forecast PM 2.5 concentrations, and Figure 1b shows a flowchart for testing the trained models. In the preprocessing step, outliers were removed from the air pollution and meteorological datasets. Any missing values in the data were filled using linear interpolation. The preprocessed data were then normalized using the min-max normalization method. To select the input variables for the PM 2.5 concentration forecast model and generate weight variables to handle the data imbalance problem, we employed a random forest model to classify the data into four grades (Good, Normal, Bad, and Worst). The classified data were assigned weight variables, and a Bi-LSTM model was applied to the selected input variables to train the PM 2.5 concentration forecast model.

Study Sites and Data
In South Korea, air quality monitoring stations operated by the South Korean Ministry of Environment are used to measure the average air quality concentrations in urban areas. This helps in understanding the air pollution status, its changes, and whether air quality standards are met. In this study, we used air quality monitoring stations operated by the Korean Ministry of Environment in Seoul's Gangnam-gu, Geumcheon-gu, Seocho-gu, and Songpa-gu neighbourhoods. The station locations are shown in Figure 2, and information on their locations is listed in Table 1. The stations represent the air quality in the southern region of Seoul, specifically in the area located south of the Han River.

Table 2 presents the sampling times and data units used in this study. Meteorological data were provided by the Republic of Korea Meteorological Administration [35], which provides eight variables (precipitation type, relative humidity, precipitation, sky condition, temperature, thunderbolt, wind direction, and wind speed) at 1 h intervals for each region. Three of the eight variables included in the dataset (precipitation type, sky condition, and thunderbolt) are graded on a scale. Precipitation type is indexed from 0 to 3, where 0 indicates no precipitation, 1 represents rain, 2 represents sleet, and 3 represents snow. The sky condition is indicated by an index ranging from 1 to 4. A value closer to 1 indicates clearer skies, whereas a value closer to 4 indicates cloudier conditions. Thunderbolt is represented by a Boolean, which indicates the presence or absence of thunder. Air pollution data were provided by Airkorea [36], a service of the Korean Ministry of Environment, which measures the concentration of six pollutants (PM 2.5, PM 10, sulphur dioxide (SO 2), ozone (O 3), nitrogen dioxide (NO 2), and carbon monoxide (CO)) at monitoring stations every hour.
In this study, both the air pollution data and the meteorological data were used, except for the thunderbolt variable.

Table 3 presents the range of data observed at each monitoring station. Precipitation type, sky condition, and wind direction exhibit ranges of 0-3, 1-4, and 0-360, respectively, across all stations. Relative humidity ranges from 9 to 100, while temperature and precipitation range from −17.4 to 40.6 and from 0 to 63.4, respectively. Gangnam-gu exhibits the lowest maximum precipitation among all stations, whereas Songpa-gu has the highest. The wind speed ranges from 0 to 11.6, with Geumcheon-gu and Gangnam-gu reporting the lowest and highest maximum values, at 7.1 and 11.6, respectively. Regarding air pollution data, PM 10 ranges from 1 to 993, with the lowest maximum value observed in Geumcheon-gu at 329 and the highest in Seocho-gu at 993, a difference of 664. PM 2.5 ranges from 1 to 175, and the maximum value varies across stations, with the lowest reported in Songpa-gu at 140 and the highest in Gangnam-gu at 175. The range for O 3 is 0.001-0.169, while the range for NO 2 is 0-0.169. CO ranges from 0.1 to 3.4, and SO 2 ranges from 0.001 to 0.028. Owing to the variability in the data range across monitoring stations, individual forecast models are required to provide accurate forecasts. In this study, min-max normalization was applied separately to the data from each monitoring station. The normalization equation used in this study is expressed in Equation (1), where y max and y min represent the maximum and minimum values of the normalization range, which were set to 1 and −1, respectively. Moreover, max(X) and min(X) represent the maximum and minimum values of the variable X, respectively.
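The normalization described above can be illustrated with a short sketch. It assumes the usual min-max form, in which X is shifted by min(X), divided by max(X) − min(X), and then rescaled to the interval from y min to y max; the function name `min_max_scale` and the handling of a constant variable are assumptions made for this example, not details given in the text.

```python
import numpy as np

def min_max_scale(x, y_min=-1.0, y_max=1.0):
    """Rescale x linearly so that min(x) maps to y_min and max(x) maps to y_max."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Assumption: a constant variable is mapped to the middle of the range.
        return np.full_like(x, (y_min + y_max) / 2.0)
    return (y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min

# Example with the PM 2.5 range reported for Gangnam-gu (1 to 175).
pm25 = [1, 88, 175]
scaled = min_max_scale(pm25)
```

In a forecasting setting, the minimum and maximum would normally be taken from the training period only and reused for the test period, so that no information from the test data leaks into the scaling.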

Input Selection
To achieve accurate forecasting of PM 2.5 concentrations, it is crucial to carefully select the influential input variables. Including unnecessary input variables in the model can increase the complexity and reduce forecasting performance. Thus, selecting appropriate input variables is essential for implementing the forecast model. In this study, the feature importance is calculated as shown in Equation (2) [37] to select the necessary input variables when classifying data labels. This is an embedded method that selects the input variables by calculating their importance while the model is being trained. In Equation (2), FI j indicates the j-th feature importance and T m indicates the m-th decision tree. I indicates the indicator function. p m represents the weight difference when splitting the t-th node in the m-th decision tree. Equation (3) is used to find p m. In Equation (3), p left,m (t), p right,m (t), and p parent,m (t) represent the weight ratios of the left, right, and parent of the t-th node, respectively, of the m-th decision tree. Moreover, i left,m (t), i right,m (t), and i parent,m (t) represent the impurity for the left, right, and parent of the t-th node, respectively, of the m-th decision tree.
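As an illustration of importance-based input selection, the sketch below uses scikit-learn's `RandomForestClassifier`, whose `feature_importances_` attribute reports impurity-based importances. The use of scikit-learn, the synthetic data, the variable names, and the helper `select_by_importance` are all assumptions for this example; the study's own implementation and hyperparameters are not reproduced here. Variables with non-zero importance are kept.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_by_importance(X, y, names, n_estimators=50, seed=0):
    """Fit a random forest and keep the variables whose importance is non-zero."""
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    rf.fit(X, y)
    importance = dict(zip(names, rf.feature_importances_))
    selected = [name for name in names if importance[name] > 0.0]
    return importance, selected

# Synthetic stand-in data: one informative variable, one noise variable,
# and one constant variable that can never be used for a split.
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
noise = rng.normal(size=n)
constant = np.zeros(n)
X = np.column_stack([signal, noise, constant])
y = (signal > 0).astype(int)

importance, selected = select_by_importance(X, y, ["signal", "noise", "constant"])
```

Because a constant variable offers no split that reduces impurity, its importance is exactly zero and it is dropped, which mirrors the rule of keeping only variables with non-zero importance.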

Imbalanced Data
In South Korea, PM 2.5 levels are managed through classification based on concentration. The concentration range for each grade is presented in Table 4, where "Good" corresponds to a concentration of 0-15, "Normal" to 16-35, "Bad" to 36-75, and "Worst" to 76 or higher.

Table 5 presents the number of data points and the percentage of data in each grade for each of the stations. The proportion of data classified as "Normal" was the highest at all stations. The proportion of low-concentration data (PM 2.5 ≤ 35; Good, Normal) is consistently higher than that of high-concentration data (PM 2.5 ≥ 36; Bad, Worst) across all stations. Training a model with such a proportion of data will lead to a low-concentration bias. We used the weighting method, which is cost sensitive, to solve the data imbalance problem. The weighting method assigns larger weights to the minority classes so that the model is not biased towards the majority class. In addition, unlike the sampling method, it does not lose information about the data and does not suffer from the deterioration of generalization ability caused by generating redundant data. To assign weights, we categorized the data into four grades (Good, Normal, Bad, and Worst). The random forest method was used for data classification. Random forest [38] is a supervised ensemble learning method that aggregates the outputs of multiple decision trees and can be used for both classification and regression. Since this method combines multiple decision trees to make forecasts, it reduces the variance of the model and mitigates the overfitting problem, resulting in a relatively high forecast performance. Additionally, the importance of a variable can be calculated using this model.

Since the data distribution and range of each station differ, the models were trained separately for each station, and their parameters were tuned using Bayesian optimization [39]. Table 6 lists the parameters of the model used in each station to train the random forest.

To assign weight variables to the data classified into four grades (Good: 1, Normal: 2, Bad: 3, Worst: 4), the probability of each class was calculated by computing the proportion of data in each class. We then used Equation (4), where c denotes a class, N c denotes the number of data points in the c-th class, and k represents the total number of data classes. Equation (4) is simple to calculate and intuitive. The resulting value of Equation (4) gives a high weight to minority class data, which can lead to better learning. By applying Equation (4), we obtained the weight variable value (cw c) for the c-th class.
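The grading and weighting steps can be sketched as follows. The grade boundaries follow the ranges described for Table 4. The weight formula is an assumption: since Equation (4) is not reproduced here, the sketch uses a common inverse-frequency form, N / (k · N c), with N the total number of data points; the study's actual Equation (4) may differ. The names `pm25_grade` and `class_weights` are hypothetical, and grades are derived directly from concentrations here rather than predicted by a random forest.

```python
from collections import Counter

# Upper concentration bound of each grade (Good: 1, Normal: 2, Bad: 3); above 75 is Worst: 4.
GRADE_UPPER_BOUNDS = [(15, 1), (35, 2), (75, 3)]

def pm25_grade(concentration):
    """Map a PM 2.5 concentration to its grade index (1 to 4)."""
    for upper, grade in GRADE_UPPER_BOUNDS:
        if concentration <= upper:
            return grade
    return 4

def class_weights(grades, k=4):
    """Assumed inverse-frequency weight: N / (k * N_c) for each observed class c."""
    counts = Counter(grades)
    n_total = len(grades)
    return {c: n_total / (k * counts[c]) for c in sorted(counts)}

concentrations = [5, 10, 12, 14, 20, 30, 50, 90]
grades = [pm25_grade(v) for v in concentrations]
weights = class_weights(grades)
# One weight value per data point, usable as an additional weight variable.
weight_variable = [weights[g] for g in grades]
```

With this form, a class holding half of the data receives a weight of 0.5, while a class holding one eighth of the data receives a weight of 2, so the rarer grades are emphasized.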

Bidirectional Long Short-Term Memory
To address the challenge of long-term dependence in traditional recurrent neural networks (RNNs), which arises owing to the vanishing or exploding gradient problem when processing long sequence data, Hochreiter and Schmidhuber proposed LSTM [40]. LSTM consists of a forget gate (f t), input gate (i t), update gate (g t), output gate (o t), and a cell state (c t). Equations (5)-(9) compute each gate and the cell state of the LSTM at time t. In each equation, σ represents the sigmoid function, and tanh is the hyperbolic tangent function. x t represents the input vector at time t, and h t−1 represents the hidden layer output at time t − 1. W and b denote the weight and bias of the equations, respectively. Equation (5) expresses the forget gate operation, which determines which information to retain from the previous time point.
The input gate is calculated using Equation (6) and is responsible for determining which of the new information should be stored in the cell state.
The update gate is a function that determines the amount of information to store in the current cell state, calculated using Equation (7).
The output gate is calculated using Equation (8) and is the gate that determines what information to output.
The cell state is calculated by multiplying the cell state of the previous time by a value of the forget gate, as shown in Equation (9), and then adding the product of the outputs of the input gate and value of the update gate to add new information. The cell state contains information from the previous time to the current time.
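For reference, the gate and cell state descriptions above match the standard LSTM formulation, written out below in the same order (forget gate, input gate, update gate, output gate, cell state). This is a sketch in a common notation rather than a reproduction of the study's typeset equations: the concatenation [h_{t-1}, x_t], the per-gate subscripts on W and b, and the element-wise product symbol are assumptions.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
g_t &= \tanh\left(W_g\,[h_{t-1}, x_t] + b_g\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t
\end{align}
```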
The hidden layer output h t at time t is calculated from the cell state and the output gate, as expressed in Equation (10).

Although LSTM was proposed to address the long-term dependence problem encountered by traditional RNNs when processing long sequence data, it is still constrained by its unidirectional processing. To overcome this limitation, Bi-LSTM was introduced [41]. Bi-LSTM performs operations on both a forward LSTM and a backward LSTM. Figure 3 shows the backward LSTM in Bi-LSTM. In Figure 3, the backward layer consists of the same four gates (forget gate, input gate, update gate, and output gate) and cell state as the forward layer. Unlike the gate operation of the forward layer, the backward layer uses the output value of the hidden layer at t + 1 as the input of each gate. The output values of the forward and backward layers are then combined to determine the output value of the hidden layer. Since the Bi-LSTM can consider both the forward and backward directions of the data, it can better reflect the information generated at both ends. It can also utilize a large amount of information because it uses the input data once again.
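The forward and backward passes can be made concrete with a small NumPy sketch. It is a minimal illustration under stated assumptions, not the study's implementation: the weights are random, the four gates share one stacked weight matrix, the hidden output is taken as o t multiplied by tanh(c t), and the two directions are combined by concatenation, which is one common choice. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H+D); rows are stacked as f, i, g, o."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])            # forget gate
    i = sigmoid(z[H:2 * H])        # input gate
    g = np.tanh(z[2 * H:3 * H])    # update gate (candidate values)
    o = sigmoid(z[3 * H:4 * H])    # output gate
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden output
    return h, c

def run_lstm(xs, W, b, H):
    """Run an LSTM over a sequence xs of shape (T, D); return hidden outputs (T, H)."""
    h, c = np.zeros(H), np.zeros(H)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        outputs.append(h)
    return np.stack(outputs)

def run_bilstm(xs, params_fwd, params_bwd, H):
    """Forward pass over xs, backward pass over reversed xs, concatenated per time step."""
    fwd = run_lstm(xs, *params_fwd, H)
    bwd = run_lstm(xs[::-1], *params_bwd, H)[::-1]  # re-align to original time order
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
T, D, H = 24, 5, 8  # 24 time steps, 5 input variables, 8 hidden nodes (illustrative sizes)
xs = rng.normal(size=(T, D))

def make_params():
    return rng.normal(scale=0.1, size=(4 * H, H + D)), np.zeros(4 * H)

params_fwd, params_bwd = make_params(), make_params()
hs = run_bilstm(xs, params_fwd, params_bwd, H)  # shape (T, 2*H)
```

At the last time step, the backward half of the output has seen only the final input, while the forward half has seen the whole window; at the first time step the roles are reversed.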
The Bi-LSTM is a type of neural network whose performance varies depending on the number of nodes in the hidden layer. To select the appropriate number of hidden nodes, we performed forecasts while doubling the number of nodes in the hidden layer from 16 (2^4) to 128 (2^7) [42,43]. The optimal number of hidden nodes was determined as the one that resulted in the lowest root mean square error (RMSE). Each model was optimized using the Adam (adaptive moment estimation) method [44]. Tables 7 and 8 show the RMSE results by the number of hidden layer nodes for each station, where RMSE represents the RMSE of the training data and bold indicates the lowest RMSE per station. Table 9 shows the training options for the model used to select the number of hidden layer nodes. Table 10 presents the number of nodes for each hidden layer, where FC denotes a fully connected layer.
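The selection procedure amounts to a small search over candidate sizes. In the sketch below, `evaluate_rmse` is a hypothetical callable that would train a model with the given number of hidden nodes and return its training RMSE; a toy stand-in is used here so that the example runs on its own, and its values carry no meaning.

```python
def select_hidden_nodes(evaluate_rmse, candidates=(16, 32, 64, 128)):
    """Evaluate each candidate size and return the one with the lowest RMSE."""
    scores = {n: evaluate_rmse(n) for n in candidates}
    best = min(scores, key=scores.get)
    return best, scores

def toy_rmse(n_nodes):
    # Stand-in for "train a Bi-LSTM with n_nodes hidden nodes and return its RMSE".
    return 5.0 + abs(n_nodes - 64) / 64.0

best_nodes, rmse_by_nodes = select_hidden_nodes(toy_rmse)
```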

Experiments and Results
Air pollution data and meteorological data were used to forecast the PM 2.5 concentration. Data from four years (2015-2018) were used: the training data covered three years (2015-2017), and the test data covered 2018. To evaluate the performance of the proposed model, we performed experiments for two cases. In case study 1, we selected input variables to reduce the complexity of the model and increase its comprehensibility, added weight variables to solve the data imbalance problem, and compared the forecasting results with those of a model that uses neither step. Case study 2 compares, for each station, the performance of three deep learning models (LSTM, Bi-LSTM, and GRU) and conventional machine learning models (MLP, SVM, decision tree, and random forest) as forecast models within the proposed method. In both experiments (case studies 1 and 2), to account for past time points, the data from 23 h before the current time point (t−23) up to the current time point (t) were used as the input to forecast the concentration one hour later (t+1).
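The input construction described above (24 hourly time points, t−23 to t, used to forecast t+1) corresponds to a sliding window over the hourly records. The following is a minimal sketch; the function name `make_windows`, the array layout (rows are hours, columns are variables), and the toy data are assumptions for illustration.

```python
import numpy as np

def make_windows(data, target_col, lookback=24, horizon=1):
    """Build (samples, lookback, variables) inputs and the target `horizon` steps ahead."""
    data = np.asarray(data, dtype=float)
    n_samples = data.shape[0] - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("not enough rows for the requested lookback and horizon")
    # Window i covers rows i .. i+lookback-1, i.e. time points t-23 .. t.
    X = np.stack([data[i:i + lookback] for i in range(n_samples)])
    # The target is the value of target_col at t+horizon.
    first_target = lookback + horizon - 1
    y = data[first_target:first_target + n_samples, target_col]
    return X, y

# Toy data: 30 hourly rows and 2 variables; column 0 plays the role of PM 2.5.
hourly = np.arange(60).reshape(30, 2)
X, y = make_windows(hourly, target_col=0)
```

With 30 hourly rows, six windows are produced: the first uses rows 0-23 to predict row 24, and the last uses rows 5-28 to predict row 29.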

Performance Index
To numerically compare the experimental results of the case studies, we used four performance indices commonly used in regression: RMSE, mean absolute error (MAE), relative root mean square error (RRMSE), and R². The RMSE was obtained by averaging the squared errors between the forecast and actual values and taking the square root of the result. The MAE was calculated as the mean of the absolute errors. The RRMSE is the RMSE between the forecasted and actual values divided by the average of the actual values. The lower the values of RMSE, MAE, and RRMSE, the better the forecasting performance. R² is an index that evaluates the extent to which the forecast describes the true value. R² has a value between 0 and 1. The closer it is to 1, the better the model describes the data. Equations (11)-(14) are used to determine RMSE, MAE, RRMSE, and R². In each of these formulas, ŷ i refers to the i-th forecasted value, y i refers to the i-th observed value, and ȳ denotes the mean of the observed values.
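The four indices can be computed directly from their verbal definitions above. The sketch below assumes the usual form of R², one minus the ratio of the residual sum of squares to the total sum of squares around ȳ; the function name and the sample values are illustrative only.

```python
import numpy as np

def performance_indices(y_true, y_pred):
    """Return RMSE, MAE, RRMSE, and R^2 for observed and forecasted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = y_pred - y_true
    rmse = np.sqrt(np.mean(error ** 2))
    mae = np.mean(np.abs(error))
    rrmse = rmse / np.mean(y_true)                    # RMSE relative to the observed mean
    ss_res = np.sum(error ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "RRMSE": rrmse, "R2": r2}

observed = [10, 20, 30, 40]
forecast = [12, 18, 33, 39]
indices = performance_indices(observed, forecast)
```

For these sample values the squared errors sum to 18, so the RMSE is the square root of 4.5, the MAE is 2, and R² is 1 − 18/500 = 0.964.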

Case Study 1: Comparing the Conventional Method with the Proposed Method
Case study 1 compares the proposed method with the conventional method. The conventional method forecasts using all the variables in the data. The proposed method uses a random forest to select the input variables, adds a weight variable to address the data imbalance, and uses Bi-LSTM to forecast the PM 2.5 concentration. The results of calculating the feature importance using a random forest to select the input variables are shown in Table 11. In Table 11, PM 2.5 has the highest value for all monitoring stations. PM 10 is the second highest at all stations except Geumcheon-gu. Among the meteorological variables, the temperature had the highest value at all stations except Songpa-gu. Among the variables in Table 11, those with non-zero importance, together with the weight variable, were used as input variables for the forecasting model. In addition, to assign weight variables, the data must be classified by the PM 2.5 class. Therefore, this paper performs the classification using a random forest. The input variables are the same as those selected for the forecasting model. Since each monitoring station has a different data range and distribution, the classifiers were trained separately, and the input variables at time t were used to classify the PM 2.5 class at time t+1. To compare the classification accuracy according to the selection of input variables, we conducted experiments before and after the selection of input variables. Figures 4 and 5 show the confusion matrices before and after the selection, respectively. In each figure, (a) and (b) show the confusion matrices for Gangnam-gu, (c) and (d) for Geumcheon-gu, (e) and (f) for Seocho-gu, and (g) and (h) for Songpa-gu. Using input selection for data classification improved the classification accuracy on both the training and test data. The most considerable improvement was observed for Gangnam-gu, where the accuracy increased by 4.34%p on the training data and 2.37%p on the test data.
The results of calculating the weight variables are shown in Table 12. In Table 12, "Worst" has the smallest data percentage and the highest weight variable.

Figures 6-9 show the forecast results of each station using the conventional and proposed methods. In Figures 6-9, (a) shows the test period; (b) shows the period between the two green dashed lines, which is a high-concentration section; and (c) shows the period between the two yellow dashed lines, which corresponds to a low-concentration section. The x-axis represents time, and the y-axis represents the concentration. The black solid line represents the actual PM 2.5 concentration measured at each monitoring station, while the red dashed line represents the value forecasted using the conventional method. The blue circled line represents the values forecasted using the proposed method. The magenta dotted line indicates the threshold value of the PM 2.5 concentration at 35, which is the standard for high concentration. Both the proposed and conventional methods under-forecast the PM 2.5 concentration in Gangnam-gu. Between 445 and 450 h, the target value continues to increase, whereas the forecast value of the conventional method decreases. The proposed method, however, exhibits a lower forecasting error because it follows the increase in the target value. In the low-concentration period (Figure 6c), the conventional method under-forecasts, and its forecast performance is clearly lower than that of the proposed method in the 5250-5300 h range. In the high-concentration section of Geumcheon-gu (Figure 7b), which spans from 1950 to 1980 h, the conventional method shows a sharp decrease in forecasted values, while the proposed method follows the target PM 2.5 concentrations.

In Figure 7c, which shows the forecast result of the low-concentration section, the proposed method forecasts better than the conventional method. In the high-concentration section of Seocho-gu, the conventional and proposed methods primarily under-forecast. The proposed method has a better forecast performance from 330 to 380 h, where the concentration is above 75 and changes rapidly. For the low-concentration section, the conventional method over-forecasts compared with the proposed method and has a lower forecast performance. For the high-concentration section of Songpa-gu, between 1090 and 1095 h, the conventional method does not forecast more than 60, while the proposed method forecasts up to 80, exhibiting a lower error. However, from 1165 to 1170 h, where the concentration is above 100, both methods have errors due to under-forecasting.
In the high-concentration range of each station (Figures 6b-9b), the conventional method under-forecasts relative to the proposed method because it is trained with a bias towards low concentrations, which form the majority class.

As shown in Table 13, the average RMSE of the proposed method is 0.2095 (3.98%p) lower than that of the conventional method. Specifically, in the high-concentration section, the proposed method is 0.3011 (3.21%p) lower than the conventional method. Notably, the Gangnam-gu forecast model shows the most significant difference, with the proposed method differing from the conventional method by 0.3262 (6.88%p) for the overall RMSE and 0.534 (6.42%p) for the RMSE of the high-concentration section. Furthermore, even in the low-concentration section, the average RMSE of the proposed method is 0.1931 (4.78%p) lower than that of the conventional method, with the largest difference of 0.38 (7.43%p) observed for Songpa-gu. Additionally, the performance improvement of the proposed method is more significant in the high-concentration range for the Geumcheon-gu and Seocho-gu stations. Table 14 indicates that the proposed method has, on average, a 5.87% lower MAE than the conventional method. In the high-concentration section, the proposed method outperforms the conventional method by 4.36% for all stations. Moreover, the proposed method exhibits a 6.63% lower MAE than the conventional method in the low-concentration section. The largest differences in MAE between the proposed and conventional methods are observed for the station in Gangnam-gu, with reductions of 10.49, 8.65, and 11.29% for the entire test period, the high-concentration section, and the low-concentration section, respectively. Among the high-concentration sections, the station in Songpa-gu shows the largest difference of 0.0011 (0.11%p). Table 15 shows that the RRMSE of the proposed method is, on average, 0.0097 (3.96%p) lower than that of the conventional method.

In the high-concentration range, the RRMSE of the proposed method is lower on average by 0.0056 (3.21%p), with the largest difference (0.0096, 6.46%p) in Gangnam-gu. In the low-concentration range, the proposed method is also lower than the conventional method by 0.013 (4.76%p), with the most significant difference (0.0263, 7.42%p) in Seocho-gu. Table 16 shows that the R² value of the proposed method is 0.0066 (0.72%p) higher than that of the conventional method. In the high-concentration range, the proposed method is higher by 0.0007 (0.07%p), with Songpa-gu showing the most significant difference of 0.0011 (0.11%p). In the low-concentration range, the proposed method is also higher than the conventional method by 0.018 (2.23%p). In Seocho-gu in particular, the proposed method is higher than the conventional method by 0.0094 (1.06%p) and 0.0398 (5.50%p) in the test period and the low-concentration range, respectively.

In case study 2, we evaluate the forecasting accuracy of PM 2.5 concentrations using deep learning models with superior forecasting performance (LSTM, GRU, and Bi-LSTM) and conventionally used machine learning models (MLP, SVM, DT, and RF). The input variables of the models are those proposed in case study 1. The forecast results of the LSTM, GRU, and Bi-LSTM models for each station are illustrated in Figures 10-13. The x-axis represents time, and the y-axis represents the PM 2.5 concentration. The black line represents the actual values of each station, while the red and blue dashed lines indicate the forecasting results of the LSTM and GRU models, respectively. The forecast result of Bi-LSTM is indicated by the purple dashed line. The magenta dotted line represents the point where the PM 2.5 concentration value is 35. In each figure, (a) represents the test period; (b) represents the period between the two green dashed lines, which is a high-concentration section; and (c) shows the period between the two yellow dashed lines, which is a low-concentration section.
Comparing the forecast performance of the different models in case study 2, all models underestimate the actual PM 2.5 concentration in the high-concentration section of Gangnam-gu shown in Figure 10b. However, the Bi-LSTM model provides the best forecasting performance in the areas with a concentration above 75, particularly between 350 and 390 h. In the low-concentration section, all models over-forecast. For the high-concentration section of Geumcheon-gu, the Bi-LSTM model performs better than the other two models during 1940-2010 h, where the concentration changes rapidly. On average, in the low-concentration section, the Bi-LSTM over-forecasts, while LSTM and GRU under-forecast. The Bi-LSTM exhibits better forecasting performance in the Normal range, between 15 and 35. For PM 2.5 concentrations between 0 and 15, LSTM exhibits the most accurate forecasting performance.
In the high-concentration section of Seocho-gu, all models under-forecast on average, and Bi-LSTM outperforms LSTM and GRU in the range of 330-400 h. All models over-forecast in the low-concentration section, and the GRU exhibits the best performance in the range of 0-10, followed by Bi-LSTM. For Songpa-gu, GRU exhibits the best forecasting performance in the high-concentration section of 1080-1095 h, and LSTM exhibits the best forecast performance in the increasing section of 1240-1285 h, followed by Bi-LSTM with the second-best forecasting performance. In the low-concentration section, LSTM and GRU over-forecast on average, while Bi-LSTM under-forecasts and exhibits the best overall forecasting performance in this section.

The values in Tables 17 and 18 are the averages of the results from 10 replicates. In terms of RMSE, MAE, and RRMSE, Bi-LSTM performs better at all stations except for Geumcheon-gu. Comparing the average RMSE, Bi-LSTM outperforms LSTM and GRU by 0.1405 (2.6977%p) and 0.15 (2.8748%p), respectively. Comparing the average MAE of the models, Bi-LSTM outperforms LSTM and GRU by 0.0295 (0.844%p) and 0.1302 (3.6264%p), respectively. The RRMSE of Bi-LSTM is lower than that of LSTM and GRU by 0.0070 (2.8866%p) and 0.0074 (3.0266%p), respectively. For R², the LSTM performance is best for the Geumcheon-gu and Songpa-gu stations, whereas the Bi-LSTM performance is best for the Gangnam-gu and Seocho-gu stations. With respect to the average R², Bi-LSTM outperforms LSTM and GRU by 0.0029 (0.3196%p) and 0.0044 (0.4759%p), respectively. Tables 19 and 20 show the performance index values of the conventional machine learning models. Among the machine learning methods, RF shows the best performance indices. Bi-LSTM, the best performer among the deep learning models, was better than RF by RMSE: 26.96%, MAE: 32.56%, R²: 5.02%, and RRMSE: 20.83%.
To summarize case study 2, Bi-LSTM achieves the best average performance in terms of RMSE, MAE, RRMSE, and R 2 compared with the other deep learning and machine learning methods, which we attribute to its use of both the forward and backward directions when making a forecast.
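The four performance indices and the percentage differences quoted in this case study can be computed along the lines of the following minimal sketch. It assumes that RRMSE is the RMSE divided by the mean of the observations and that a percentage improvement is expressed relative to the baseline model's value; if the definitions given earlier in the paper differ, those take precedence. The function names and sample values are illustrative and are not taken from the experiments.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rrmse(actual, forecast):
    """Relative RMSE. Assumption: RMSE normalised by the mean observation."""
    return rmse(actual, forecast) / (sum(actual) / len(actual))


def r2(actual, forecast):
    """Coefficient of determination."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def relative_improvement(baseline, proposed):
    """Reduction of an error index, in percent of the baseline value.

    Assumption: the baseline is the denominator. For R 2, where larger is
    better, the sign of the result has to be flipped.
    """
    return (baseline - proposed) / baseline * 100.0


# Illustrative values only (not measurements from the monitoring stations).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
indices = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "RRMSE": rrmse(observed, predicted),
    "R2": r2(observed, predicted),
}
```

Averaging each index over the replicates and stations before calling `relative_improvement` would mirror the way the averaged tables are compared here.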

Discussion
This study aims to forecast, 1 h ahead, the concentration of PM 2.5, which can harm the human body. The proposed method consists of two steps: (1) selection of appropriate input variables and weight assignment using random forest (RF), and (2) forecasting of PM 2.5 using Bi-LSTM. Appropriate input variables for forecasting were selected by calculating the importance of each variable using RF. However, PM 2.5 data are usually imbalanced, that is, the concentration grades are not evenly represented. Imbalanced data can introduce bias and degrade predictive performance. To mitigate this problem, a weight variable was added according to the grade classified through RF and used as an input variable for the forecast. Finally, the PM 2.5 concentration was forecasted by applying Bi-LSTM to the input and weight variables obtained through RF. To validate the proposed method, two case studies were conducted at monitoring stations in South Korea. Case study 1 (Section 3.2) compares the forecasting performance according to the selection of the input variables. Case study 2 (Section 3.3) compares the forecasting performance between the deep learning and conventional machine learning methods. The experimental results confirm that the proposed method improves on conventional methods such as LSTM, GRU, MLP, SVM, DT, and RF. In particular, they show that forecasting can be performed effectively despite data imbalance by assigning weights using RF. In future work, we will investigate multi-step-ahead forecasting strategies, such as recursive, direct, and multi-input multi-output, to perform long-term forecasting.
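The weight-variable idea can be illustrated with the following minimal sketch. Several points are assumptions made for illustration and are not the paper's procedure: the grade is obtained here by thresholding the concentration at 15, 35, and 75 (the ranges discussed in case study 2) rather than by an RF classifier, the grade labels are hypothetical, and the weight is an inverse-frequency value chosen so that rarer grades receive larger weights. The actual weight values used in the proposed method are not reproduced here.

```python
from collections import Counter

# Assumed upper bounds of the first three grades; values above the last
# bound fall into the highest grade. Labels are hypothetical.
GRADE_UPPER_BOUNDS = [(15.0, "good"), (35.0, "normal"), (75.0, "bad")]
HIGHEST_GRADE = "very bad"


def grade_of(concentration):
    """Map a PM 2.5 concentration to a grade label (stand-in for RF)."""
    for upper, label in GRADE_UPPER_BOUNDS:
        if concentration <= upper:
            return label
    return HIGHEST_GRADE


def inverse_frequency_weights(grades):
    """Weight per grade: n_samples / (n_grades_present * count_of_grade)."""
    counts = Counter(grades)
    n_samples = len(grades)
    n_grades = len(counts)
    return {g: n_samples / (n_grades * c) for g, c in counts.items()}


def add_weight_variable(rows, grades):
    """Append the grade weight to each input row as an extra variable."""
    weights = inverse_frequency_weights(grades)
    return [list(row) + [weights[g]] for row, g in zip(rows, grades)]


# Illustrative concentrations only; the single high value mimics the
# scarcity of high-concentration samples.
concentrations = [8.0, 12.0, 20.0, 30.0, 14.0, 80.0]
grades = [grade_of(c) for c in concentrations]
weights = inverse_frequency_weights(grades)
weighted_rows = add_weight_variable([[c] for c in concentrations], grades)
```

In this toy sample the single high-concentration row receives the largest weight, which is the intended effect when the high-concentration section is under-represented.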

Conclusions
As the incidence of disease caused by PM 2.5 exposure increases, it is essential to forecast PM 2.5 concentrations to prevent exposure. In this study, we proposed a method for forecasting the PM 2.5 concentration after 1 h from imbalanced PM 2.5 data. Appropriate input variables were selected through RF and then used to add weight variables to improve the forecasting performance. Consequently, using RF reduces model complexity and improves the forecasting performance. Then, PM 2.5 forecasting was performed using Bi-LSTM, one of the deep learning models. The number of nodes in the hidden layer was chosen by trial and error as the value yielding the smallest RMSE. The performance of the proposed method was verified through two case studies at four monitoring stations in Korea: forecasting performance according to the preprocessing of input variables, and forecasting performance between deep learning and machine learning models. The experimental results showed that the proposed method improved on the conventional method by RMSE: 3.98%, MAE: 5.87%, RRMSE: 3.96%, and R 2 : 0.72%. In particular, at high concentrations, the proposed method improved RMSE by 3.21%, MAE by 4.36%, R 2 by 0.07%, and RRMSE by 3.21%. In addition, the proposed method outperforms the other deep learning models on average by RMSE: 2.79%, MAE: 2.25%, RRMSE: 2.96%, and R 2 : 0.40%. Furthermore, compared to the conventional machine learning models, the proposed method improved RMSE by 27.38%, MAE by 27.57%, RRMSE by 27.60%, and R 2 by 7.71%.
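The trial-and-error choice of the hidden-layer size can be sketched as a simple search over candidate node counts. In the sketch below, `toy_validation_rmse` is a hypothetical stand-in for training a Bi-LSTM with the given number of nodes and measuring its RMSE; its values and the candidate list are assumptions for illustration and do not reflect the experiments.

```python
def select_hidden_nodes(candidates, validation_rmse):
    """Evaluate each candidate node count and return the one with the
    smallest RMSE, together with all scores."""
    scores = {n: validation_rmse(n) for n in candidates}
    best = min(scores, key=scores.get)
    return best, scores


def toy_validation_rmse(n_nodes):
    """Hypothetical RMSE curve with a minimum at 64 nodes.

    In practice this function would train the forecasting model with
    n_nodes hidden nodes and return the RMSE on held-out data.
    """
    return 5.0 + ((n_nodes - 64) / 64.0) ** 2


best_nodes, all_scores = select_hidden_nodes([16, 32, 64, 128],
                                             toy_validation_rmse)
```

Because results vary between training runs, averaging the RMSE over several replicates per candidate, as done for the reported tables, would make the selection less sensitive to a single run.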