An LSTM-based neural network method of particulate pollution forecast in China

Particulate pollution has become more than an environmental problem in rapidly developing economies. Large-scale, long-lasting episodes of high particulate concentration now occur much more frequently, affecting not only human health but also economic production. As PM10 is one of the main pollutants, predicting its concentration is of great significance. In this study, we present a PM10 forecast model based on the long short-term memory (LSTM) neural network method and evaluate its performance in predicting daily PM10 concentrations at five representative cities (Beijing, Taiyuan, Shanghai, Nanjing and Guangzhou) in China. Our model shows excellent adaptability to various regions in China. The predicted PM10 concentrations correlate well with observations (R = 0.81–0.91). We also achieve a prediction accuracy of 70%–80% for the next-day trend (increase or decrease), and the model performs best in heavy pollution situations (PM10 > 100 μg m−3). In addition, a comparison of the LSTM-based method with other statistical and machine learning methods indicates that our model is not only robust to different pollution intensities and geographic locations, but also has great potential for pollution forecasting with temporally correlated features.


Introduction
Epidemiological investigations, animal toxicology tests and human clinical observations show that PM10 has obvious and direct toxic effects on human health and can cause extensive damage to the respiratory system (Leng et al 2017), the heart and blood system, the immune system and the endocrine system (Li et al 2002). In a study of the Utah River Valley, mortality could increase by 4%-5% on average for every 50 µg m−3 increase in the daily average PM10 concentration (Maynard 1997). The exposure level is also important: when the PM10 mass concentration is greater than 100 µg m−3, the mortality rate is 11% higher than when it is less than 50 µg m−3 (Maynard 1997). In addition, because aerosols are a major component of air pollution, aerosol load trends also have an important impact on the prediction of climate change (Rotstayn et al 2015, Westervelt et al 2015). Therefore, accurate and timely prediction of PM10 is of great significance for both climatology and socioeconomic development. In recent years, statistical methods have been used in air pollution prediction and have gradually become a research trend because of their high efficiency, convenience and low cost. In addition to single numerical prediction models (Li et al 2016), machine learning methods have also been gradually applied to air pollution prediction (Mallet et al 2009, Papaleonidas and Iliadis 2013, Nieto et al 2018). Deep learning, one of the most sought-after classes of machine learning algorithms, shows great potential in fitting the complex nonlinear relationships between influencing factors and pollution concentrations (Hinton et al 2006, Li et al 2018). One example is the recurrent neural network (RNN) method, which in this context usually considers the influence of the current pollution concentration in the areas geographically adjacent to the area of interest (Tong et al 2019).
Long short-term memory (LSTM) is an RNN with long-term and short-term memory. In recent years, it has become an effective and scalable model for learning problems involving sequential data. Its predictive ability has recently been demonstrated in atmospheric science (Asanjan et al 2018, Alhirmizy and Qader 2019, Mohapatra et al 2020). These studies show that LSTM achieves better statistical scores (such as correlation coefficient and mean squared error) than a traditional RNN and also performs better than a single environmental meteorological model (Dai et al 2019). The LSTM method therefore offers another opportunity to better predict the PM10 trend in China.
In recent years, air particulate pollution in China has shown a clearly regional character. For example, the sources of particulate pollution in the Jing-Jin-Ji area are similar, and the seasonal distribution shows region-specific characteristics (Dao 2015). Li et al (2017) found that the aerosol extinction coefficient in Northeast and Northwest China continued to decline. Around 2006, the trend in the Pearl River Delta and Southwest China changed from increasing to decreasing, while the trend in the North China Plain (NCP) and the Yangtze River Delta (YRD) was still increasing. These results indicate the diversity of PM10 in different regions and the complexity of its prediction. PM10 is significantly correlated among the cities in the Yangtze River Delta region, and PM10 concentration is also correlated with other pollutants (e.g. NO2 and SO2) (Shi et al 2008). Similar relationships also appear in the Pearl River Delta region (Hu et al 2011) and Northwest China (Qiu 2010), which are lightly and heavily PM10-polluted regions, respectively.
This paper aims to explore the ability of an LSTM-based method to predict daily PM10 concentrations and to develop a general prediction model that can be applied to different regions of China. We evaluate the model's adaptability in five representative regions of China in section 4. This study represents important progress in applying neural network methods to air pollution forecasting. Although it focuses on forecasting PM10 concentrations, the approach has the potential to be generalized to a broader area and to other air pollutants.

Data source
We obtain the national urban pollution data from the national urban air quality real-time release platform (http://106.37.208.233:20035/, last access: 22 November 2020). The meteorological data are from the China surface climatic data daily data set V3.0, provided by the China Meteorological Data Network (https://data.cma.cn, last access: 22 November 2020). We select the national urban pollution data and meteorological data of 359 cities for the period 2015 to 2017. This gives 1095 data samples for each city, and our LSTM model operates along the time dimension. The details of the data are summarized in table S1 (available online at stacks.iop.org/ERL/16/044006/mmedia). The data preprocessing and the temporal features are described in supplemental text S1. The combined data sets for our research are saved at figshare (Zhu 2021).

Spatial patterns of different city metropolitans
Data collection sites are mainly distributed in the eastern region of China, especially the Yangtze River Delta, the Pearl River Delta and the Beijing-Tianjin-Hebei region, with relatively few sites in the western region. Therefore, in this paper, we select a few representative cities (Beijing, Taiyuan, Shanghai, Nanjing and Guangzhou) from different geographic regions of China (Jing-Jin-Ji, Northwest China, Yangtze River Delta and Pearl River Delta), together with the cities surrounding them within a certain distance (e.g. 100 km, 200 km) (figure 1(a)). Because Beijing and Taiyuan, as well as Nanjing and Shanghai, are geographically close, some of the areas selected within 200 km overlap (the red circle in figure 1(a)).

The principle of LSTM
The long short-term memory (LSTM) neural network is a widely used RNN architecture in the field of deep learning. It has a tree-like hierarchical structure consisting of an input layer, hidden layers and an output layer (figure 1(b)), and its network nodes are recurrent, with the input information connected in sequential order. Each hidden layer contains an LSTM structure and a dropout layer. The LSTM structure is the core of the entire neural network. Each LSTM neuron contains three control gates: the input gate, the forget gate and the output gate. The input gate learns when to let the activation signal into the cell, the output gate learns when to let the activation signal out of the cell, and the forget gate learns when to carry the cell state of the previous time step into the next time step. Through the forget gate, only the information that the learned gate passes is retained, while the rest is forgotten, which effectively addresses the problem of long-term dependence (Greff et al 2017). In the dropout layer, neural network units are temporarily dropped from the network with a certain probability during training, which helps prevent overfitting and gives the network good fault tolerance.
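The gate mechanism described above can be illustrated with a minimal NumPy sketch of one forward step of a standard LSTM cell. This is not the implementation used in this study: the function name lstm_step, the weight layout (W, U, b) and the gate ordering are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward step of a standard LSTM cell (illustrative layout).

    W: (4*n_hidden, n_input), U: (4*n_hidden, n_hidden), b: (4*n_hidden,).
    Assumed gate order in the stacked weights: input, forget, candidate, output.
    """
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate: how much new signal enters the cell
    f = sigmoid(z[n:2 * n])      # forget gate: how much of the previous cell state is kept
    g = np.tanh(z[2 * n:3 * n])  # candidate cell update
    o = sigmoid(z[3 * n:4 * n])  # output gate: how much of the cell state is exposed
    c_t = f * c_prev + i * g     # new cell state
    h_t = o * np.tanh(c_t)       # new hidden state
    return h_t, c_t

# Run three time steps on random inputs (hypothetical sizes).
rng = np.random.default_rng(0)
n_input, n_hidden = 10, 4
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_input))
U = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x_t in rng.normal(size=(3, n_input)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because the cell state is updated additively through the forget gate, information can persist over many time steps, which is the property the paragraph above refers to as handling long-term dependence.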

Model design
In our model, the prediction target is the pollutant concentration of a central city, and the predictors are the historical daily meteorological and pollution data of the central city and its surrounding n_SITE stations over the past r days. Specifically, the target pollutant in this study is daily PM10. The d input factors include daily average temperature, daily average humidity, maximum wind speed, maximum wind direction, SO2 concentration, NO2 concentration, CO concentration, 8 h O3 concentration, PM2.5 concentration and PM10 concentration (table 1).
We constructed the input factor matrix W with dimension r × n_SITE × d (figure 1(c)). One or more LSTM layers constitute the middle part of the prediction model, and the number of layers is denoted n_LAYER. Each LSTM layer contains n_NEU neurons and is followed by a dropout layer. The output layer yields the predicted value, denoted p_PRE, while the corresponding observed value is denoted p_OBS. After e predictions, the prediction sequence p_PRE is obtained. The R-squared (R²) and root mean square error (RMSE) between p_PRE and p_OBS are calculated to evaluate the prediction performance of different model configurations.
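The input construction and the two evaluation metrics can be sketched as follows. This is a minimal illustration under stated assumptions: the array layout (days, sites, factors), the helper names make_windows, rmse and r_squared, and the choice of the central city as site 0 with PM10 as the last factor are all hypothetical. R² is computed here as the squared Pearson correlation, an assumption that is consistent with the R and R² ranges reported in this paper; the coefficient of determination is another common definition.

```python
import numpy as np

def make_windows(data, r, target_site=0, target_feature=-1):
    """Build samples of shape (r, n_site, d) from daily data of shape (L, n_site, d).

    Each sample holds the r days before the forecast day; the target is the value
    of one factor (assumed: PM10, last column) at one site (assumed: site 0).
    """
    L = data.shape[0]
    X = np.stack([data[t - r:t] for t in range(r, L)])
    y = data[r:, target_site, target_feature]
    return X, y

def rmse(p_pre, p_obs):
    p_pre, p_obs = np.asarray(p_pre, float), np.asarray(p_obs, float)
    return float(np.sqrt(np.mean((p_pre - p_obs) ** 2)))

def r_squared(p_pre, p_obs):
    # Squared Pearson correlation between predictions and observations (assumed definition).
    r = np.corrcoef(p_pre, p_obs)[0, 1]
    return float(r ** 2)

# Toy data: 6 days, 2 sites, 3 factors.
data = np.arange(6 * 2 * 3, dtype=float).reshape(6, 2, 3)
X, y = make_windows(data, r=1)
```

Recurrent layers in Keras take input of shape (samples, time steps, features), so one option is to flatten the site and factor axes with X.reshape(len(X), X.shape[1], -1) before training.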
Wind direction is recorded as a number representing one of 16 directions and is then converted to degrees (wind from the north is 0°). During the training process, the optimization function used in the LSTM model is the 'Adam' algorithm and the loss function is 'Mean_Squared_Error'.
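A minimal sketch of the wind direction conversion is given below. The exact numeric coding used in the data set is not specified here, so the assumption that codes run from 0 to 15 clockwise starting at north, and the function name direction_code_to_degrees, are hypothetical.

```python
def direction_code_to_degrees(code, n_directions=16):
    """Convert a compass sector code to degrees, with north as 0 degrees.

    Assumes codes 0..n_directions-1 increase clockwise from north (hypothetical coding).
    """
    if not 0 <= code < n_directions:
        raise ValueError(f"direction code must be in [0, {n_directions - 1}]")
    return code * 360.0 / n_directions

east = direction_code_to_degrees(4)         # 4 sectors of 22.5 degrees each
north_northwest = direction_code_to_degrees(15)
```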
In general, the prediction performance of the model depends on the parameters r, n_SITE, n_LAYER and n_NEU. In this study, we used Python and its deep learning libraries TensorFlow and Keras to build and implement the above neural network algorithms. We set up controlled experiments with multiple parameter configurations and then discuss the capabilities of our LSTM model in different regions of China.
We divided the data set, with a time length of L days, into two sets: 70% of the available data as the training set and the remaining 30% as the test set. A summary of the experiment settings is shown in table S3. We applied the control-variable approach to the selected representative cities (Beijing, Taiyuan, Shanghai, Nanjing and Guangzhou), changing only one variable in each group of experiments. Finally, we used R² and RMSE to evaluate the experimental results and to explore the influence of parameters such as the number of LSTM layers (supplemental text S2), the scope of surrounding cities (supplemental text S3), the number of days of input data, and the region to which the model is applied (supplemental text S4). Based on the experimental results, we determined that the best model is a double-layer LSTM that takes the observed data of the previous one day within a certain geographical range as input (DLP1-LSTM model; DLP1, double-layer with previous ONE observed data).
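The 70%/30% division can be sketched as a chronological split, in which the earliest part of the series is used for training and the latest part for testing. The function name chronological_split is hypothetical, and rounding the boundary down is an assumption.

```python
def chronological_split(samples, train_fraction=0.7):
    """Split a time-ordered sequence into an earlier training part and a later test part."""
    n_train = int(len(samples) * train_fraction)  # boundary index, rounded down (assumption)
    return samples[:n_train], samples[n_train:]

days = list(range(1095))  # one index per daily sample
train_days, test_days = chronological_split(days)
```

Keeping the time order, rather than shuffling, means that no day in the test set precedes any day in the training set.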

Other machine learning methods
To evaluate the superiority of the LSTM model in PM10 prediction, we compared it with five other methods, including traditional statistical methods and machine learning models: ordinary least squares regression (OLSR), Bayesian ridge regression, support vector regression (SVR), multi-layer perceptron (MLP) and random forest regression. We briefly introduce these methods in the following paragraphs.
OLSR is a mathematical optimization technique that seeks the best fit to the data by minimizing the L2 norm of the errors (Moutinho and Hutcheson 2011). Here, OLSR fits a linear model whose coefficients minimize the sum of squared differences between the observed data and the predicted data.
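As an illustration of the least squares principle, the following sketch fits a linear model with NumPy's least squares solver. It is not the scikit-learn implementation used for the comparison; the helper names fit_ols and predict_ols and the toy data are assumptions.

```python
import numpy as np

def fit_ols(X, y):
    """Return [intercept, coefficients...] minimizing the sum of squared residuals."""
    A = np.column_stack([np.ones(len(X)), X])       # prepend a column of ones for the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return coef[0] + np.asarray(X, float) @ coef[1:]

# Toy data generated from y = 2 + 3x, so the fit should recover these values.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 + 3.0 * X[:, 0]
coef = fit_ols(X, y)
```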
Bayesian ridge regression addresses the regression problem by imposing a penalty on the regression coefficients on top of OLSR. In other words, a regularization term is added to the residual sum of squares, which is used to adjust the parameters and remove correlated terms (Massaoudi et al 2020).
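The effect of a penalty on the coefficients can be sketched with plain ridge regression in closed form. Note that this is a simplification: Bayesian ridge regression additionally estimates the strength of the penalty from the data, whereas here the penalty alpha is fixed by hand. The function name fit_ridge and the synthetic data are assumptions, and no intercept is fitted.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 (plain ridge, fixed penalty, no intercept)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_ols = fit_ridge(X, y, alpha=0.0)     # no penalty: ordinary least squares
w_ridge = fit_ridge(X, y, alpha=10.0)  # penalty shrinks the coefficients toward zero
```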
SVR differs from traditional regression methods. In SVR, as long as the prediction f(x) does not deviate too much from the observed value y, the prediction is regarded as correct and no error is counted. SVR is still a convex optimization problem, which guarantees a global solution, and it offers nonlinear prediction ability and good generalization performance (Uçak and Günel 2016).
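The tolerance described above corresponds to the epsilon-insensitive loss commonly used in SVR, sketched below. The function name and the value of epsilon are assumptions chosen for illustration; the settings used in this study are not given here.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Loss is zero inside the tolerance band |y - f(x)| <= epsilon and linear outside it."""
    residual = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    return np.maximum(0.0, residual - epsilon)

# The first prediction falls inside the band and incurs no loss; the other two do not.
losses = epsilon_insensitive_loss([1.0, 1.0, 1.0], [1.05, 1.3, 0.5], epsilon=0.1)
```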
The feedforward neural network of an MLP is built up from its most basic form, a single perceptron. A perceptron has one or more inputs, a bias, an activation function and a single output. It takes the inputs, multiplies them by certain weights, and passes the weighted sum to the activation function to produce the output. The input layer of an MLP consists of neurons for all the input variables, and the output layer consists of the response variables. The input layer and the hidden layer each include a constant neuron associated with an intercept synapse, or bias (that is, a synapse not directly affected by any covariate, called the intercept) (Egbo 2018).
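A single perceptron and a one-hidden-layer forward pass can be sketched as follows. The tanh activation, the layer sizes and the function names are assumptions for illustration only, and the weights are fixed by hand rather than trained.

```python
import numpy as np

def perceptron(x, weights, bias, activation=np.tanh):
    """Weighted sum of the inputs plus a bias, passed through an activation function."""
    return activation(np.dot(weights, x) + bias)

def mlp_forward(x, W_hidden, b_hidden, w_out, b_out):
    """One hidden layer of perceptrons followed by a linear output neuron."""
    hidden = np.tanh(W_hidden @ x + b_hidden)
    return float(w_out @ hidden + b_out)

x = np.array([1.0, 2.0])
single = perceptron(x, weights=np.array([0.5, -0.25]), bias=0.0)
output = mlp_forward(x, W_hidden=np.zeros((3, 2)), b_hidden=np.zeros(3),
                     w_out=np.ones(3), b_out=1.5)
```

With all hidden weights set to zero, the output reduces to the output bias, which shows the role of the intercept term described above.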
Random forest regression is an ensemble method that builds a forest of multiple decision trees during training. Each tree contains an arrangement of decision nodes, at which the tree splits into different branches until a termination point (leaf) is reached. Each decision node tests whether the value of an input feature exceeds a certain threshold. The final prediction is obtained by averaging the predictions of the individual trees, each of which is trained independently (Pillai et al 2020).
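The averaging of independently trained trees can be sketched with a deliberately simplified ensemble: each "tree" is a single threshold split (a stump) on one feature, trained on a bootstrap sample. A full random forest uses deeper trees and also subsamples features at each split; the function names and the toy step-function data here are assumptions.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree on a single feature by minimizing squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # candidate decision thresholds
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    if best is None:                     # all x identical: always predict the mean
        return np.inf, y.mean(), y.mean()
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_forest(x, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(x), size=len(x))  # bootstrap sample for this tree
        forest.append(fit_stump(x[idx], y[idx]))
    return forest

def predict_forest(forest, x):
    # Final prediction: average of the individual tree predictions.
    return np.mean([predict_stump(stump, x) for stump in forest], axis=0)

x = np.linspace(0.0, 1.0, 40)
y = np.where(x < 0.5, 0.0, 10.0)  # step function as toy target
forest = fit_forest(x, y)
pred = predict_forest(forest, np.array([0.1, 0.9]))
```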
In this paper, the modules of the scikit-learn library are used to implement the models for comparison. To ensure the reliability of the comparison, the data preprocessing and input factors of the other machine learning methods are kept consistent with those of our LSTM model.

DLP1-LSTM model performance between representative cities
In order to study the applicability of the double-layer LSTM model to PM10 prediction in different regions of China, we applied our DLP1-LSTM model to several typical metropolitan areas that represent different pollution features in China: Beijing, Taiyuan, Shanghai, Nanjing and Guangzhou. We configured the model to use observation data from cities within 100 km of the target city, taken from the day before the prediction day. The training convergence curves (figure S2) show that the DLP1-LSTM prediction model gradually converges for all five target cities, which further reflects the good applicability of the prediction model in China. Figure 2 shows the time series of the predictions and observations for all five cities. In general, the predictions follow the observed trends in all five cities and match most of the peaks and valleys. Figure 2 also shows that the observed PM10 in Guangzhou is lower than in the other cities and has fewer sharp changes. Figure S3 shows that, in the representative cities, the predicted values of the DLP1-LSTM model are positively correlated with the observations and the slope of the fitted line is close to 1. Nanjing has the highest R² (0.828), while Shanghai has the lowest R² (0.642) among these five cities. The predictions of PM10 concentrations in these cities all correlate well with observations (R = 0.81-0.91). Guangzhou has the smallest RMSE, while Taiyuan has the largest RMSE.
In summary, our double-layer LSTM configuration is able to make accurate predictions of daily PM10 concentrations for various metropolitan areas across a wide range of pollution levels in China.

DLP1-LSTM model performance of next-day trend prediction at different pollution levels
To further explore the predictive power of the DLP1-LSTM model, we analyzed its sensitivity to different levels of pollution and its performance in the five representative cities of different regions. We divided pollution into three levels, mild (0-50 µg m−3), moderate (50-100 µg m−3) and polluted (>100 µg m−3), and then tested the accuracy of our model in predicting the next-day trend (increase or decrease from the previous day) at each level. Overall, the model has the best trend prediction capability for PM10 concentration in heavy pollution situations (table S7). Figure S4 compares the predicted and observed next-day changes. A positive value means that the PM10 concentration increases relative to the previous day, while a negative value means that it decreases. Beijing has more days with moderate and heavy pollution, Nanjing has more days with moderate and mild pollution, and Taiyuan has more days with heavy pollution. The prediction accuracy of the next-day trend is much better for the polluted cases in these three cities. Guangzhou, however, has the most days of mild pollution, and its prediction correlation for mild pollution is much better than that of the other cities (figure S3 and table 2). Shanghai has the lowest accuracy of the next-day trend forecast, at about 70%. The prediction ability of the model is weak for mild (R² = 0.12) and moderate (R² = 0.17) pollution days.
Figure S3 and table 2 show that the next-day trend prediction is weaker in Shanghai than in the other cities. This is due to Shanghai's large coastal area and mostly flat terrain, which are conducive to the diffusion of pollutants. In addition, there is a lot of rain in summer, especially in the Meiyu season. The relative humidity is high and is negatively correlated with the concentration of PM10 (Jian et al 2019); that is, it favors the deposition of PM10. Moreover, Shanghai has less pollution in general and fewer days of heavy pollution, which reduces the ability of the model to capture its characteristics. Guangzhou also faces the sea and has a relatively low degree of pollution, but it has more hills, with terrain that is high in the northeast and low in the southwest. As a result, the terrain of Guangzhou is not conducive to the diffusion of pollutants, and the DLP1-LSTM model is able to capture the data features well.
In general, when the PM10 concentration of the current day is known, the accuracy of predicting whether the concentration will increase or decrease on the next day is about 70%-80%, varying slightly among the representative cities. The model is more suitable for predicting heavy pollution events (PM10 concentration >100 µg m−3), with some regional differences.
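The next-day trend accuracy at each pollution level can be sketched as follows. Several details are assumptions rather than the procedure of this study: the predicted trend is taken as the predicted value minus the previous day's observation, each day is assigned to a level by its observed concentration, boundary values (exactly 50 or 100 µg m−3) fall in the lower level, and the function names and toy series are hypothetical.

```python
import numpy as np

def pollution_level(pm10):
    """Classify a daily PM10 concentration (µg m−3) into the three levels used in the text."""
    if pm10 <= 50:
        return "mild"
    if pm10 <= 100:
        return "moderate"
    return "polluted"

def trend_accuracy_by_level(p_obs, p_pre):
    """Fraction of days whose predicted direction of change matches the observed one."""
    p_obs, p_pre = np.asarray(p_obs, float), np.asarray(p_pre, float)
    hits, counts = {}, {}
    for t in range(1, len(p_obs)):
        observed_change = p_obs[t] - p_obs[t - 1]
        predicted_change = p_pre[t] - p_obs[t - 1]  # previous day's observation is known
        level = pollution_level(p_obs[t])
        counts[level] = counts.get(level, 0) + 1
        if np.sign(observed_change) == np.sign(predicted_change):
            hits[level] = hits.get(level, 0) + 1
    return {level: hits.get(level, 0) / counts[level] for level in counts}

# Toy series (not data from this study).
p_obs = [40, 60, 120, 110, 45]
p_pre = [42, 55, 100, 125, 70]
accuracy = trend_accuracy_by_level(p_obs, p_pre)
```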

Comparison of LSTM with other methods
In this section, we compare the prediction performance of LSTM with that of several traditional statistical prediction methods (OLSR, Bayesian ridge regression, SVR) and two machine learning methods (MLP, random forest regression). Here we show the comparison results for Guangzhou as an example. We used the same input factors for DLP1-LSTM and the other five methods: the meteorological and pollution factors recorded on the previous day in Guangzhou and at a subset of stations (Zhongshan city, Dongguan city, Huizhou city and Foshan city) within 100 km of it (table 1). As before, the first 70% of the data set is the training set and the remaining 30% is the test set.
In the case of Guangzhou, the DLP1-LSTM method predicts the daily PM10 concentrations much better than the other five methods, with an R² of 0.838, whereas the other methods have R² values in the range 0.466-0.537 (figure 3). The time series plots (figure S5) also make it clear that the DLP1-LSTM model performs much better than the other five models, which show large discrepancies with observations and consistently miss the peaks, either lagging or leading them.
Besides the methods applied in this section, other studies have used many combined methods to better predict air quality. Guo et al (2020) explored the potential of including the wavelet method in artificial neural networks (ANNs) and evaluated the performance of several combined wavelet-ANN algorithms. The R² of the best prediction of the Air Pollution Index is 0.78 and 0.79 at Xi'an and Lanzhou in China, respectively. Cortina-Januchs et al (2015) used a combination of a multilayer perceptron neural network and a clustering algorithm to predict the daily PM10 concentration at three monitoring stations in the city of Salamanca, Mexico, with R² between 0.49 and 0.59. That study reported that combining ANNs with clustering algorithms gave better generalization capacity than a simple ANN method.
In our study, the LSTM-based method shows clear superiority in PM10 prediction and good adaptability to different regions of China. In the future, we would like to explore combining LSTM with data analysis algorithms, as in the studies mentioned above.
Table 2. Prediction accuracy of next-day trend on increasing or decreasing for five representative cities at three pollution levels: 0-50 µg

Conclusions
In this paper, we explored the application of the LSTM neural network method to atmospheric particulate pollution prediction. Our DLP1-LSTM model shows excellent adaptability to various regions in China with different geographical conditions and PM10 characteristics. First, the predictions of PM10 concentrations in cities of different regions all correlate well with observations (R = 0.81-0.91). Second, the DLP1-LSTM model also performs well in predicting the next-day trend of PM10 concentration: the accuracy of predicting whether the concentration will increase or decrease on the next day is 70%-80%. In addition, when pollution is divided into three levels (mild: <50 µg m−3, moderate: 50-100 µg m−3 and polluted: >100 µg m−3), the model has the best trend prediction capability for PM10 concentration in heavy pollution situations. This shows the great potential of our model to contribute to the prediction, protection and regulation of seasonally concentrated pollution in areas heavily polluted by PM10, such as Northwest China.
Among the prediction methods compared (LSTM, OLSR, Bayesian ridge, SVR, MLP and random forest), the DLP1-LSTM model outperforms the others, which indicates the strong application prospects of the LSTM method for pollution forecasting with temporally correlated features.
The significance of this study lies not only in the application of the LSTM method to daily PM10 concentration prediction, but also in the great potential of RNN methods for better forecasting of particulate matter pollution. In the future, our research will focus more on adjusting the model structure and tuning hyperparameters in order to improve the spatio-temporal scale of model prediction (e.g. hourly prediction over a wider geographical area). In addition, the capability of the DLP1-LSTM model to predict PM2.5 and other pollutants is worth exploring.