1 Introduction

Currently, environmental pollution poses a serious threat to human health (Zheng et al. 2016). Air quality prediction models mainly adopt either classic regression statistical models or the Numerical Air Prediction (NAP) model (Demuzere et al. 2009). The NAP model uses mathematical models of the atmosphere and oceans to predict air quality based on current atmospheric conditions and pollutant sources (Zhang et al. 2014). Using cloud computing to implement high-performance computing for NAP will be a key research area in the future (Li et al. 2017; Liu et al. 2017; Li et al. 2017; Cui et al. 2016). However, NAP cannot be adopted in many Chinese cities because the basic pollutant concentration data cannot be acquired in a short time (Huchao et al. 2015; Niska et al. 2005). Hence, statistical prediction methods based on historical data should be studied in order to obtain high prediction performance.

Shi et al. (2012) proved that an Artificial Neural Network (ANN) can provide better results than traditional multiple linear regression models for pollutant concentration prediction based on historical monitoring data. The combined ARMA and BPNN model based on historical monitoring data studied by Zhu and Lu (2016) has a smaller prediction error than the traditional BPNN. Feng et al. (2013) decomposed time series data into wavelet coefficients and ran prediction experiments with three types of neural network models (Multi-layer Perceptron, Elman and Support Vector Machine) based on historical data, showing that the improved models yield lower RMSE and mean absolute error than the original models. Grivas and Chaloulakou (2006) employed a novel ANN that selects factors of historical monitoring data via a genetic algorithm, proving that the model produces a smaller error than a regression model in the prediction of PM\(_{10}\) concentration. Shi et al. (2012) used a feed-forward BPNN with the hyperbolic tangent sigmoid activation function and the Levenberg–Marquardt optimization method to predict the PM\(_{10}\) concentration at each station. However, none of these studies accounts for the interaction between a central station and its neighbor stations.

Monitoring data from the multiple stations in a city is often partially lost. Nejadkoorki and Baroutian (2011) removed the missing data, resulting in a waste of data. The Newton interpolation algorithm has a good interpolation effect when the data loss ratio is not too high (Breu et al. 2016). Thus, in this study, the missing data is effectively interpolated by the INI algorithm. In order to enlarge the data, we take the neighbor stations' data into consideration. Because stations at different distances affect the central station differently, a weight is assigned to each station's data. An oversized input dimension would harm the generalization ability of the BPNN; therefore, the PCA algorithm is utilized to extract the eigenvectors from the neighbor stations effectively (Skrobot et al. 2016). The eigenvectors extracted from the neighbor stations and the eigenvectors of the central station are used together as the input to the BPNN, which not only keeps as much information as possible but also ensures that the input dimension of the BPNN is suitable, increasing its generalization ability. Finally, an early stopping strategy (Yang and Zhang 2017) is applied during training: when the error reaches the value set by the program, training is terminated in order to avoid overfitting. The learning rate gradient descent strategy (Huang and Lin 2009) is employed to avoid slow convergence.

In the rest of this paper, the INI algorithm for dealing with missing data is first presented in detail. The multiple stations' data and the AW method enlarge the data available for modeling; however, this affects the speed and efficiency of the BPNN, so the Neighbor-PCA algorithm is employed to reduce the dimension of the data in order to avoid overfitting. The BPNN model and the NN-BPNN model are used for comparison. The effectiveness of the NN-BPNN model shows that the integration of the INI algorithm and the AW method is persuasive, and the effectiveness of the NNP-BPNN model shows that the combination of the INI algorithm, the AW method and the Neighbor-PCA algorithm is persuasive.

2 Related algorithms

In this section, the basic principles of the BPNN, the Newton interpolation algorithm and the PCA algorithm are introduced.

2.1 BPNN principles

A BPNN is composed of a series of simple units densely connected with each other. In data mining, neural networks have been employed by Chang and Yang (2017). Each unit has a certain number of inputs and outputs. The structure of a neuron is shown in Fig. 1 and the structure of the BPNN is shown in Fig. 2.

Fig. 1 The neuron map of the BPNN

Fig. 2 The structure of the BPNN

The stimulus delivered to the neuron is denoted x\(_i\), the connection weight w\(_i\), and the accumulated output a\(_j\). Sigmoid(x) is used as the activation function. The error is fed back to modify the connection weights and complete the learning process, which is the feedback mechanism of the BPNN (Guan et al. 2016). The internal state of the neuron also affects the output, so an extra bias term b is introduced, giving formula (1).

$$\begin{aligned} a_{j}=f\left( \sum _{i=1}^{n}(x_{i}\times w_{i})+b\right) \end{aligned}$$
(1)
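As a concrete illustration of formula (1), the following minimal sketch computes a single neuron's output with a sigmoid activation; the array values are illustrative, not data from the paper.

```python
import numpy as np

def sigmoid(a):
    # Sigmoid activation function, as used in formula (1).
    return 1.0 / (1.0 + np.exp(-a))

def neuron_output(x, w, b):
    # a_j = f(sum_i x_i * w_i + b) for a single neuron.
    return sigmoid(np.dot(x, w) + b)

x = np.array([0.5, 0.1, 0.9])   # stimuli x_i
w = np.array([0.4, -0.2, 0.7])  # connection weights w_i
print(neuron_output(x, w, b=0.1))
```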

In practice, however, the traditional BPNN has several shortcomings, such as long training time, slow convergence, susceptibility to local minima and poor stability. The initial weights and the learning rate are hard to tune (Xiao et al. 2012). A high input dimension makes training even more challenging.

2.2 Newton interpolation algorithm

There are many types of interpolation function; Li et al. (2017) introduced multiple methods of interpolation. Using basis functions to obtain the Lagrange interpolation polynomial is common in theoretical analysis, but the basis functions change whenever the nodes change, which changes the formula. The Newton interpolation algorithm overcomes this shortcoming (Varsamis and Karampetakis 2012). The Newton interpolation function is determined by the independent and dependent variables. The first-order, second-order and kth-order divided differences are shown in formulas (2), (3) and (4), respectively. The Newton interpolation formula, defined in formula (5), is derived from formulas (2), (3) and (4).

$$\begin{aligned} f[x_{i},x_{j}]=\frac{f(x_{j})-f(x_{i})}{x_{j}-x_{i}} \end{aligned}$$
(2)
$$\begin{aligned} f[x_{i},x_{j},x_{k}]=\frac{f[x_{j},x_{k}]-f[x_{i},x_{j}]}{x_{k}-x_{i}} \end{aligned}$$
(3)
$$\begin{aligned} f[x_{0},x_{1},\ldots ,x_{k}]=\frac{f[x_{0},\ldots ,x_{k-2},x_{k}]-f[x_{0},x_{1},\ldots ,x_{k-1}]}{x_{k}-x_{k-1}} \end{aligned}$$
(4)
$$\begin{aligned} f(x)={}&f(x_{0})+f[x_{0},x_{1}](x-x_{0})+\ldots +f[x_{0},\ldots ,x_{n}](x-x_{0})\ldots (x-x_{n-1}){}\nonumber \\&+f[x,x_{0},\ldots ,x_{n}](x-x_{0})\ldots (x-x_{n})=N_{n}(x)+R_{n}(x) \end{aligned}$$
(5)

However, the high-degree terms of formula (5) can make the interpolation results unstable during the interpolation process (Hlbach 1979).
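For clarity, formulas (2)–(5) can be implemented directly: the sketch below builds the divided-difference table and evaluates the Newton polynomial N\(_n\)(x). It is a standard implementation under our own variable names, not code from the paper.

```python
import numpy as np

def divided_differences(x, y):
    # Divided-difference coefficients f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n],
    # built with the recurrence of formulas (2)-(4).
    coef = np.array(y, dtype=float)
    for j in range(1, len(x)):
        coef[j:] = (coef[j:] - coef[j - 1:-1]) / (x[j:] - x[:-j])
    return coef

def newton_eval(x_nodes, coef, x):
    # Evaluate N_n(x) of formula (5) with Horner's scheme.
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - x_nodes[k]) + coef[k]
    return result

x_nodes = np.array([0.0, 1.0, 2.0, 3.0])
y_nodes = np.array([1.0, 2.0, 5.0, 10.0])   # samples of y = x^2 + 1
coef = divided_differences(x_nodes, y_nodes)
print(newton_eval(x_nodes, coef, 1.5))      # 3.25, matching x^2 + 1
```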

2.3 PCA algorithm

The PCA algorithm maps high-dimensional data to a low-dimensional space (Nejadkoorki and Baroutian 2011). The steps of the PCA algorithm are summarized in Table 1.

Table 1 The steps of PCA algorithm
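As a concrete companion to Table 1, the sketch below carries out the standard PCA steps (centering, covariance, eigen-decomposition, projection) with numpy; the shapes and names are illustrative assumptions.

```python
import numpy as np

def pca(X, n_components):
    # X: (samples x features). Returns the projection onto the
    # first n_components principal components.
    X_centered = X - X.mean(axis=0)           # step 1: center the data
    cov = np.cov(X_centered, rowvar=False)    # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # step 3: eigen-decomposition
    order = np.argsort(eigvals)[::-1]         # step 4: sort by explained variance
    return X_centered @ eigvecs[:, order[:n_components]]  # step 5: project

X = np.random.rand(100, 9)   # e.g. nine features from neighbor stations
V = pca(X, n_components=3)
print(V.shape)               # (100, 3)
```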

3 NNP-BPNN model

In order to improve data utilization and prediction accuracy, the NNP-BPNN model is proposed for pollutant concentration prediction. In this paper, the INI algorithm is utilized to handle missing data, the weights of the AW method are calculated from geographical locations, and the Neighbor-PCA algorithm is used to process the data of the neighbor stations.

3.1 INI algorithm

For various reasons, the data may contain outliers or even missing values. Applying the Newton interpolation algorithm directly would introduce a large error. The INI algorithm is defined in formula (6).

$$\begin{aligned} f(x)= {\left\{ \begin{array}{ll} f_{newton}(x)&{} \frac{f_{newton}(x)- {\overline{f(x)}}}{f(x)} \le \varepsilon \quad (I)\\ {\overline{f(x)}}=\sum _{i=1}^{k}x_{i}p_{i} &{} \frac{f_{newton}(x)- {\overline{f(x)}}}{f(x)} > \varepsilon \quad (II) \end{array}\right. } \end{aligned}$$
(6)
$$\begin{aligned} {\overline{f(x)}}=\frac{1}{k}\sum _{i=1}^{k}f(x_{i}) \end{aligned}$$
(7)

f\(_{newton}\)(x) is the interpolation result of formula (5). \({\overline{f(x)}}\) is the expectation of the k samples. x\(_i\) (\(i=1,2,3,\ldots ,k-1\)) are the independent variables from the nearest k hours, and the data of the kth hour is the dependent variable. p\(_i\) is the probability, a constant equal to \(\frac{1}{k}\).

If condition (I) holds, the (\(k-1\)) samples are used as the independent variables and the kth sample as the dependent variable to obtain f\(_{newton}\)(x); the expectation of the (\(k-1\)) samples is used as the input to f\(_{newton}\)(x). If condition (II) holds, f(x) is calculated by formula (7). The data before interpolation is denoted D\(_{original}\) and the data after interpolation is denoted D.
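The decision rule of formulas (6) and (7) can be sketched as below. Two details are our assumptions, since the paper leaves them implicit: the relative deviation is measured against the sample mean, and the Newton polynomial is evaluated at the mean of the node positions. The interpolating polynomial is obtained with numpy's polyfit, which for k−1 points and degree k−2 coincides with the Newton form of formula (5).

```python
import numpy as np

def ini_interpolate(recent, eps):
    # INI rule of formula (6): accept the Newton interpolation result when
    # its relative deviation from the mean (formula (7)) is within eps,
    # otherwise fall back to the mean.
    recent = np.asarray(recent, dtype=float)
    k = len(recent)
    mean_fx = recent.mean()                    # formula (7)
    x_nodes = np.arange(k - 1, dtype=float)    # first k-1 samples: independent variables
    # Exact interpolating polynomial through the k-1 points; equivalent
    # to the Newton polynomial of formula (5).
    poly = np.polyfit(x_nodes, recent[:-1], deg=k - 2)
    f_newton = np.polyval(poly, x_nodes.mean())         # assumed evaluation point
    if abs(f_newton - mean_fx) / abs(mean_fx) <= eps:   # condition (I)
        return f_newton
    return mean_fx                                      # condition (II)

print(ini_interpolate([3.0, 3.2, 2.9, 3.1], eps=0.5))
```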

3.2 Multiple stations data and the dimension reduction

In this part, we mainly introduce the data composition from multiple stations, Neighbor-PCA algorithm and the training process of BPNN model.

3.2.1 Multiple stations data and AW method

Because their distances differ, neighbor stations have different impacts on the central station. The air pollutant concentrations of all monitoring stations are therefore reweighted by the AW method, defined in formula (8).

$$\begin{aligned} k_{ij}=1-\frac{d_{ij}}{\sum _{j=1}^{n}d_{ij}} \end{aligned}$$
(8)

k\(_{ij}\) is the weight between the ith station and the jth station, and d\(_{ij}\) is the distance between them, calculated from latitude and longitude as defined in formula (9).

$$\begin{aligned} d_{ij}=637814 \times \arccos \left( 1-\left( \sin \left( \frac{(90-C) \times \pi }{180}\right) \times \cos \left( \frac{B \times \pi }{180}\right) -\sin \left( \frac{(90-F) \times \pi }{180}\right) \times \cos \left( \frac{E \times \pi }{180}\right) \right) ^{2}\right) \end{aligned}$$
(9)

C and B are the latitude and longitude of the ith station; F and E are the latitude and longitude of the jth station.

The closer a station is, the greater its impact on the central station. k\(_{ij}\) in formula (8) decreases with distance, so neighbor stations closer to the central station receive greater weights, enhancing the ability of the BPNN model.
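To make the geographic weighting concrete, the sketch below transcribes formula (9) literally (including the printed constant 637814 and the degree-to-radian conversions) and applies formula (8); the coordinates are illustrative.

```python
import math

def distance(c_lat, b_lon, f_lat, e_lon):
    # d_ij of formula (9), transcribed as printed.
    # C, B: latitude/longitude of station i; F, E: of station j.
    term = (math.sin(math.radians(90 - c_lat)) * math.cos(math.radians(b_lon))
            - math.sin(math.radians(90 - f_lat)) * math.cos(math.radians(e_lon)))
    return 637814 * math.acos(1 - term ** 2)

def aw_weights(distances):
    # k_ij of formula (8): k_ij = 1 - d_ij / sum_j d_ij.
    total = sum(distances)
    return [1 - d / total for d in distances]

# Three illustrative neighbor stations around a central station at (30, 120).
d = [distance(30.0, 120.0, lat, lon)
     for lat, lon in [(30.1, 120.1), (30.2, 119.9), (29.8, 120.3)]]
print(aw_weights(d))   # closer stations receive larger weights
```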

D\(_i\) is the data of the ith station, composed as in formula (10). D is the collection of the data of the n stations, as in formula (11). \(\phi _{i}\) is the weighted data D\(_{j}\) \(\times\) k\(_{ij}\) for j ranging from 0 to n, as in formula (12), where the central station (j = 0) has weight 1. \(\phi _{neighbor}\) is the weighted data D\(_{j}\) \(\times\) k\(_{ij}\) for j ranging from 1 to n, i.e., the neighbor stations only.

$$\begin{aligned} D_{i}=(F_{pollutant},F_{neighbor}) \end{aligned}$$
(10)
$$\begin{aligned} D=[D_{1},\ldots ,D_{i},\ldots ,D_{n}] \end{aligned}$$
(11)
$$\begin{aligned} \phi _{i}=[D_{0} \times 1,D_{1} \times k_{i1},\ldots ,D_{i} \times k_{ij},\ldots ,D_{n} \times k_{in}] \end{aligned}$$
(12)

Compared with traditional methods, we consider the neighbor stations based on geographic information, which improves the accuracy of the BPNN model.
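A short sketch of the data composition of formulas (10)–(12), assuming each station contributes one feature vector; station 0 is the central station with weight 1, and the weights k are illustrative:

```python
import numpy as np

# One feature vector per station (formula (10)); station 0 is central.
D = [np.random.rand(9) for _ in range(4)]   # formula (11): D = [D_0, ..., D_n]
k = [1.0, 0.7, 0.5, 0.3]                    # AW weights; k = 1 for the central station

# Formula (12): phi_i = [D_0 x 1, D_1 x k_i1, ..., D_n x k_in].
phi_i = [D_j * k_j for D_j, k_j in zip(D, k)]
phi_neighbor = phi_i[1:]                    # neighbor stations only (j = 1..n)
print(len(phi_i), len(phi_neighbor))        # 4 3
```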

3.2.2 Dimension reduction process

If the dimension is too high or the sample size is too small, the BPNN is unable to learn general rules (Meng and Meng 2010). In this paper, the PCA algorithm is used to process the data \(\phi _{neighbor}\).

The eigenvectors extracted from the data \(\phi _{neighbor}\) by the PCA algorithm are denoted V, defined in formula (13). \(\phi\), defined in formula (14), is used as the input to the BPNN. A three-layer BPNN, as described above, is applied in this paper. The number of hidden-layer nodes is determined by formula (15), where m is the number of input nodes, q is the number of output nodes, and a is a constant between 1 and 10 (Wang et al. 2017). The correlation coefficient, the error and the RMSE between the predicted and actual values are defined in formulas (16), (17) and (18), respectively.

$$\begin{aligned} V=(v_{1},v_{2},\ldots ,v_{m}) \end{aligned}$$
(13)
$$\begin{aligned} \phi =[D_{0},V] \end{aligned}$$
(14)
$$\begin{aligned} N=\sqrt{m+q}+a \end{aligned}$$
(15)
$$\begin{aligned} R=\frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2\sum (y-\bar{y})^2}} \end{aligned}$$
(16)
$$\begin{aligned} error=(y-\hat{y}) \end{aligned}$$
(17)
$$\begin{aligned} RMSE=\sqrt{\frac{\sum _{1}^{n}error^{2}}{n}} \end{aligned}$$
(18)
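The evaluation quantities of formulas (16)–(18) amount to the following short computations; the sample arrays are illustrative.

```python
import numpy as np

def correlation(x, y):
    # R of formula (16).
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

def rmse(y, y_hat):
    # RMSE of formula (18), built from the errors of formula (17).
    error = y - y_hat
    return np.sqrt((error ** 2).sum() / len(error))

y_true = np.array([35.0, 40.0, 55.0, 60.0])   # actual concentrations
y_pred = np.array([33.0, 42.0, 50.0, 64.0])   # predicted concentrations
print(correlation(y_true, y_pred), rmse(y_true, y_pred))
```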

4 Modeling methods

In order to verify the effectiveness of the NNP-BPNN model, three models are established: the BPNN model; the NN-BPNN model, with the INI algorithm and the AW method; and the NNP-BPNN model, with the INI algorithm, the AW method and the Neighbor-PCA algorithm.

Modeling is performed in MATLAB 7.0, with the data D as the input to the models. The traingda function is chosen as the gradient-descent training function, with the interval parameter set to 100. The learning rate is 0.1, the maximum number of training epochs is 1000, and the target error is 0.0001. The number of input nodes is 15, the number of output nodes is 1, and a is 8, so the number of hidden-layer nodes is 12 according to formula (15). Half of the data is used for training and the other half for evaluating the models.

The learning rate (lr) takes a value between 0 and 1 and determines the step size of each update iteration. If lr is too large, training tends to oscillate; if it is too small, convergence is too slow. The adaptive gradient-descent method not only alleviates this problem to a certain extent, but also pushes the results closer to the global minimum (Table 2).
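The traingda function implements gradient descent with an adaptive learning rate. The following sketch shows the idea in Python, under assumed growth and shrink factors, together with the target-error early stop described above; it is an illustration of the strategy, not the paper's code.

```python
import numpy as np

def train_adaptive(grad_fn, loss_fn, w, lr=0.1, epochs=1000, goal=1e-4):
    # Grow lr while the loss keeps falling; reject the step and shrink lr
    # when the loss rises; stop early once the target error is reached.
    loss = loss_fn(w)
    for _ in range(epochs):
        if loss <= goal:                     # early stopping at the target error
            break
        w_new = w - lr * grad_fn(w)
        new_loss = loss_fn(w_new)
        if new_loss < loss:
            w, loss, lr = w_new, new_loss, lr * 1.05   # assumed growth factor
        else:
            lr *= 0.7                                  # assumed shrink factor
    return w, loss

# Toy quadratic problem to show the loop converging.
target = np.array([1.0, -2.0])
loss_fn = lambda w: ((w - target) ** 2).sum()
grad_fn = lambda w: 2 * (w - target)
print(train_adaptive(grad_fn, loss_fn, np.zeros(2)))
```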

4.1 Modeling of BPNN model

Table 2 The steps of BPNN model

4.2 Modeling of NN-BPNN model

Assuming the data of the ith hour is missing, we take the data from the (\(i-4\))th to the (\(i-2\))th hours, named x, and the data of the (\(i-1\))th hour, named y, which are used to construct the Newton interpolation polynomial; the value of f\(_{newton}\)(\(\bar{x}\)) can then be calculated. The threshold \(\varepsilon\) ranges from 0 to 1. The missing data is obtained by formulas (6) and (7). Different weights are then assigned to each station's data to obtain the data \(\phi\) (Table 3).

Table 3 The steps of NN-BPNN model

4.3 Modeling of NNP-BPNN model

In this section, the PCA algorithm is applied to the data \(\phi _{neighbor}\) to obtain the data V, and the NNP-BPNN model is trained as follows (Table 4).

Table 4 The steps of NNP-BPNN model

5 Experiments and analysis

This section describes the data used in the experiments and discusses the results.

5.1 Data description

Data from 16 stations in one city, including pollutant concentrations and meteorological parameters over the 254 days from July 30, 2015 to April 10, 2016, are employed in this paper. The basic information consists of station name, time, longitude and latitude. The pollutants consist of CO, SO\(_2\), O\(_3\), NO, NO\(_2\), NOX, PM\(_{2.5}\), PM\(_{10}\) and VOC. The meteorological parameters consist of relative humidity, wind direction, wind speed, temperature, pressure, visibility and total rainfall. Among them, the longitude and latitude are used to calculate the distances. A continuous 24 h of data is missing across the 16 stations.

5.2 Results and discussions

Three models are established with the same data and parameters: the BPNN model, the NN-BPNN model and the NNP-BPNN model. We first discuss the effectiveness of the NN-BPNN model with the INI algorithm and the AW method, and then the validity of the NNP-BPNN model with the INI algorithm, the AW method and the Neighbor-PCA algorithm. The RMSE defined in formula (18) is used as the evaluation criterion.

5.2.1 The effectiveness of NN-BPNN model

Figure 3 compares the RMSE of the BPNN model and the NN-BPNN model. In a neural network, a local minimum of the error function is not necessarily the global minimum, so the experiments were repeated ten times; they show that the NN-BPNN model has a smaller sum of RMSE than the BPNN model, as shown in Fig. 5. The RMSE of the three models is shown in Fig. 4. Meanwhile, the experimental results on the PM\(_{2.5}\) concentration show that the sum of RMSE of the NNP-BPNN model is the smallest, as shown in Fig. 5. The correlation coefficients of the three models are shown in Table 5.

Fig. 3 The RMSE of the BPNN model and the NN-BPNN model

Fig. 4 The RMSE of the three models

Fig. 5 The RMSE of the three models

Table 5 The values of the correlation coefficient

Across the ten experiments, the area under the RMSE curve of the BPNN model is the largest of the three models. The average RMSE of the NN-BPNN model is 18% lower than that of the BPNN model, the average RMSE of the NNP-BPNN model is 24% lower than that of the BPNN model, and the average RMSE of the NNP-BPNN model is 7% lower than that of the NN-BPNN model. The BPNN model took 20 s to run, the NN-BPNN model 170 s and the NNP-BPNN model 47 s, so the time used by the NNP-BPNN model is 72% lower than that of the NN-BPNN model. Empirically, the NNP-BPNN statistical forecasting method based on historical monitoring data is simple and practical.

6 Conclusions

In this paper, we study how to handle missing data and how to utilize historical data effectively for prediction with a BPNN. The achievements of this paper are as follows. First, the INI algorithm is adopted to deal with missing data, avoiding the waste of data. Second, the AW method not only enriches the experimental data and improves the utilization rate of the data, but also reduces the RMSE: the average RMSE of the NN-BPNN model is 18% lower than that of the BPNN model. Third, the NNP-BPNN model, combining the INI algorithm, the AW method and the Neighbor-PCA algorithm, improves the prediction accuracy and correlation: its average RMSE is 7% lower than that of the NN-BPNN model and 24% lower than that of the BPNN model.

There are still some limitations in this research. More dimension-reduction methods should be studied in future work, the NNP-BPNN model should be investigated further to find the global minimum error, and multiple machine learning models should be explored for air pollutant concentration prediction.