Multi-Scale Variation Prediction of PM 2.5 Concentration Based on a Monte Carlo Method

: Haze concentration prediction, especially PM 2.5 , has always been a significant focus of air quality research, which is necessary to start a deep study. Aimed at predicting the monthly average concentration of PM 2.5 in Beijing, a novel method based on Monte Carlo model is conducted. In order to fully exploit the value of PM 2.5 data, we take logarithmic processing of the original PM 2.5 data and propose two different scales of the daily concentration and the daily chain development speed of PM 2.5 respectively. The results show that these data are both approximately normal distribution. On the basis of the results, a Monte Carlo method can be applied to establish a probability model of normal distribution based on two different variables and random sampling numbers can also be generated by computer. Through a large number of simulation experiments, the average monthly concentration of PM 2.5 in Beijing and the general trend of PM 2.5 can be obtained. By comparing the errors between the real data and the predicted data, the Monte Carlo method is reliable in predicting the PM 2.5 monthly mean concentration in the area. This study also provides a feasible method that may be applied in other studies to predict other pollutants with large scale time series data.


Introduction
With the rapid development of social economy and the improvement of people's living standards, a healthy and comfortable living environment has become the main pursuit goal of the public. However, air quality problems, especially haze problems, are the key factors that plague people's healthy life. According to satellite statistics, about 30% of the region, and nearly 800 million people in China suffer from smog damage like respiratory and cardiovascular diseases [Xie, Chen and Li (2014)], especially in Beijing and Shanghai. The frequency of fog in these regions is more than 50%, causing huge losses to the national economy [He, Wang, Wang et al. (2013)]. Therefore, predicting the concentration and trend of haze effectively, and rationally formulating prevention measures have far-reaching significance for people's health and social economic development. Generally, the methods used to predict haze concentration at home and abroad mainly include meteorological methods [Saide, Carmichael, Spak et al. (2011)], statistical methods [Li, Bai, Shi et al. (2010)] and machine learning methods [Ordieres, Vergara, Capuz et al. (2005); Du, Lu and Dou (2017)]. In most of the previous studies, they mainly conducted short-term predictions of haze pollutant concentrations. Besides, some prediction models are a little complex and converge slowly, failing to mine potential information from data [Wang, Liu, Zhang et al. (2018)]. Therefore, motivated from these solutions, we propose an algorithm named Monte Carlo to predict the monthly average value of haze concentration. The Monte Carlo algorithm not only transforms some complex objects into the calculation of random numbers and digital features, but also is simple and easy to operate and has the characteristics of being able to visually explain the randomness of objects. The rest of this paper is organized as follows. Section 2 mainly presents related work in haze prediction. Section 3 briefly introduces the related theories of Monte Carlo. Section 4 gives the modeling method of Monte Carlo. In Sections 5 and 6, the experimental results for prediction are reported. Finally, Section 7 concludes this paper.

Related work
In fact, the study of smog began very early. Limited by early detection equipment, Fuller et al. [Fuller, Carslaw and Lodge (2002)] designed an empirical model to predict the daily average concentrations of PM2.5 and PM10 in London and the southeastern United Kingdom and also studied the relationship between various pollutants and PM2.5 and PM10 and finally realized the daily average concentration of PM10. Jian et al. [Jian, Zhao, Zhu et al. (2012)] proposed a statistical model, called Autoregressive Integrated Moving Average (ARIMA), whose results indicated that barometric pressure and wind velocity were anti-correlated and temperature and relative humidity were positively correlated with PM10 mass concentrations. Fu [Fu (2016)] gave an online update multivariate linear regression method and used meteorological elements as the influence factor of haze. Without the large amount of data, the model could be updated according to the results of the day, which improved the prediction accuracy. In recent years, some methods about machine learning and neural network have also been introduced for haze prediction by domestic and foreign scholars. Among them, Dong et al. [Dong, Yang, Kuang et al. (2009)] established a hidden Markov model to predict the high PM2.5 concentration in haze weather and established the function mapping between parameters and variables. Then the trained hidden Markov model can better predict the high PM2.5 concentration in the next 24 hours. Ai et al. [Ai and Shi (2015)] put forward a prediction model of BP artificial neural network and established a haze prediction system based on time series. The results showed that the model can accurately predict haze weather, but the convergence speed of the model is slow and easy to fall into local optimum situation. Ganesh et al. [Ganesh, Arulmozhivarman and Tatavarti (2018)] presented several ensemble models of neural network and regression predictors to predict the concentration of the PM2.5 pollutant in Houston and New York on the basis of meteorological data and found that ensemble approach would perform promising prediction results compared to single model.

Basic principle
The Monte Carlo Method is called the statistical simulation method. It is based on the law of large numbers in probability statistics and uses a random number (pseudo-random number) for numerical calculation [Yi, Guan, Zhang et al. (2002)]. The basic idea is to establish a probability model related to the problem and let the parameters of the model be equal to the results of the problem, then calculate the statistical characteristics of parameters by a large number of repeated sampling tests on the model, finally obtain the approximate value of the problem. Combined with the characteristics of the Monte Carlo algorithm, the purpose of our study is not to accurately gain the concentration of a certain day in the future, but to make the predicted value of haze concentration still consistent with the regularity of historical data and the distribution characteristics.

Monte Carlo method theory
Set the sample of random variables 1 , 2 , ⋯ , , its arithmetic mean is According to the law of large numbers: lim {| � − ( )| < } = 1 (2) Then when is large enough, � converges to ( ) according to probability. If random variable 1 , 2 , ⋯ , independently identically distributed with non-zero finite variance 2 , from the central limit theorem, we have: when is sufficiently large, there is an approximate expression as follows: Generally, the error formula for the Monte Carlo method can be written as: Therefore, reducing the error of the Monte Carlo method can generally be achieved by reducing the sample standard deviation and increasing the sample size . In the actual situation, the sample size is often certain, so the error can be decreased by reducing in most cases.

Monte Carlo method application
In general, as long as we can establish a suitable probability model for our problem, the Monte Carlo method can almost be applied to almost any field, including deterministic mathematical problems and stochastic problems. For the deterministic mathematical problems, according to the Monte Carlo method, firstly, construct a probability model related to the solution, so that the solution is the probability distribution or mathematical expectation of the model we build, and then generate random numbers according to the known probability distribution. Perform multiple random samplings, and finally use the arithmetic mean value as the approximated estimate. For the stochastic problems, the numerical simulation method is widely adopted, that is, a random sampling can be conducted to obtain the statistical characteristics of the parameters, according to the probability distribution of the actual problem, and finally we can obtain an approximate solution with a specific expected value. Since the haze concentration can be affected by different factors such as temperature, wind levels, humidity, CO concentration and so on, the prediction of haze concentration in this paper is a stochastic problem.

Monte Carlo method modeling 4.1 General steps of the Monte Carlo method
(1) Collect historical haze data and process the sample data; (2) Establish a suitable probability model = ( ) for the random variable X ; (3) Use a random number generator to generate random numbers that are uniformly distributed between 0 and 1; (4) According to the generated random numbers and the established probability model, the sampling is repeatedly simulated times in the sample data in order to obtain a large number of sampling function values 1 , 2 , ⋯ , ;

Probabilistic model establishment method
By consulting the literature, there are two commonly used methods for establishing a probabilistic model: The first one is the historical empirical method, that is to refer to the probabilistic model generally used by the predecessors when they review the related problems [Chi, Wang, Chen et al. (2015)]; The second is to use the statistical method to solve the probability model through historical data, usually including the parameter estimation method and the non-parametric estimation method. Given that there are few references on the probability model of haze concentration, it is quite difficult to achieve the probability model of the research object based on the experience of the predecessors and the non-parametric estimation does not need to rely on the prior knowledge of the relevant historical data. Therefore, this paper uses the kernel density estimation method in non-parametric estimation to estimate the probability density of the haze concentration data. The principle of kernel density estimation [Wu and Wang (1996)] is as follows: Let K(•) be a given Borel measurable function on R, ℎ > 0 is a constant related to , satisfying lim →∞ ℎ = 0 Then, we have where K(•) is kernel function, ℎ is bandwidth.
We can denote kernel density estimation ( ) by ̂( ), that is, ̂( ) is kernel density estimation. Generally, we tend to select Gaussian kernel as our kernel function. The expression is as follows: Actually, under certain conditions of satisfying the kernel function, when the sample size is large enough, the different choices of the kernel function are insensitive to the estimation results. The real impact on the performance of kernel estimation depends on the bandwidth ℎ , because if ℎ is too small, it will lead to the irregular shape of ̂( ) due to the increase of randomness. On the contrary, if ℎ is too large, then ̂( ) will be overaveraged so that the properties of ( ) cannot be well reflected.
According to the above principle, we use Python to draw a histogram of frequency distribution and perform kernel density estimation on the frequency histogram, and finally obtain a probability model of haze concentration.

Random number generation and sampling
There are many ways to generate random numbers, but most of them are pseudo-random numbers generated by mathematical recursive formulas. Although these pseudo-random numbers are not random numbers in the true sense, as long as their cycles are long enough, then they can reflect the randomness of sampling. In this way, we use the 'uniform' function in Python to generate a uniformly distributed random number between 0 and 1, then divide the PM2.5 concentration interval into parts, the probability corresponding to each interval is . Set then represents the sum of the top probability values. Let = − −1 (10) If the random number falls on the interval [ , −1 ) , that means a PM2.5 concentration data is successfully extracted, and the corresponding probability is . In this way, the random number is matched with the PM2.5 concentration data one by one and we successfully complete one sampling.

Data preprocessing
This paper mainly studies the distribution of haze concentration in Beijing. Because the composition of haze is very complicated and PM2.5 is the most important component of haze, we choose Beijing PM2.5 concentration as the research object. According to the PM2.5 daily concentration data (https://www.zq12369.com/) that can be collected, we select Beijing PM2.5 daily concentration data from January 1, 2014 to March 31, 2018 as our training set, the daily concentration data from April 1 to June 30, 2018 as test set. However, there are a few missing data in the training set collected in this way. For time series data such as haze, the concentration of haze today tends to be closely related to that of yesterday and tomorrow. Therefore, the single missing value for the training data can be filled by selecting the average of the two-day observations before and after the missing value. Fig. 2 shows a line chart of Beijing PM2.5 daily concentration data: From this figure, we find that from January 2014 to the end of December 2016, there are many data with high peaks. Maybe people from all walks of life don't pay enough attention to the harm of haze and don't take effective preventive measures, which cause a large number of haze pollution in Beijing. In 2017, although Beijing PM2.5 daily concentration peaks still have a slow upward trend, compared with previous years, the peak value has dropped sharply and the PM2.5 concentration is basically stable between 0 and 200 µg/ 3 . This phenomenon is closely related to the Chinese governments effective prevention measures against haze problems. The importance of the haze hazard has been continuously improved in all sectors of society and the frequency of high peaks of haze concentration has also been effectively reduced in the past year. From the overall perspective of historical data, we also find a rather regular phenomenon: Beijing PM2.5 concentration shows a downward trend at the beginning of the year and gradually increased in the middle and late years, which has a certain reference value for predicting the concentration of PM2.5. Due to the large fluctuation of historical data, according to the error Eq. (5), in order to minimize the error caused by Monte Carlo prediction, we need to reduce the sample variance in the case of the same number of samples. Therefore, we can convert the historical data to logarithmic data and draw the result in Fig. 3.  Fig. 3, we could calculate the maximum PM2.5 logarithmic concentration in Beijing as 6.17 µg/ 3 , the minimum value is 1.61 µg/ 3 , thus the range is 4.56 µg/ 3 . Take 0.2 as the group distance, the total is divided into 24 groups, then the PM2.5 logarithmic concentration interval can be divided. Then we can draw the histogram of PM2.5 logarithmic concentration in Beijing, as is shown in Fig. 4. According to the characteristics of the single peak and the symmetric distribution appearing in Fig. 4, we find that the PM2.5 daily logarithmic concentration in Beijing is approximately normal distribution. Fig. 5 shows a comparison of the probability density function of the actual data with its normal distribution function with the same mean and standard deviation, and by comparing the theoretical probability with the actual data frequency value in Tab. 1, we can know that the degree of agreement between the two is higher. Meanwhile, PM2.5 concentration is subject to short-term changes due to natural factors and human factors, so it is appropriate to use a normal distribution as the probability distribution of PM2.5 daily logarithmic concentration in Beijing to a certain extent. Its mean value is 3.92 µg/ 3 and the standard deviation is 0.88 µg/m 3 . Further, it is found that a large amount of data is concentrated in the interval [3.6, 4.8]; that is, the actual value [36.6, 121.5], which shows that the PM2.5 concentration in most of Beijing's time period is concentrated in the middle level, only a small amount of extreme values are in the picture.

Random number generation and sampling
After data preprocessing, we can get a total of 1551 valid data from January 1, 2014 to March 31, 2018. In order to predict the monthly average haze concentration, we need to generate 30 pseudo-random numbers (1 ≤ ≤ 30), which obeys uniform distribution. Then, the quantile of the normal distribution with a mean value of 3.92 and a standard deviation of 0.88 corresponds to the random number . Each corresponds to a concentration value, so that the PM2.5 daily logarithmic concentration is calculated in turn, that is, 30 samples are randomly extracted. As the program simulates a sufficient number of times, the sample data amount is gradually increasing. The mean of the simulation results gradually tends to a more stable value and this value is our predicted value.

Simulation prediction
This paper use Python software programming and set the number of simulations to 10000 times in order to obtain the predicted mean of PM2.5 concentration next month in Beijing. Take April as an example: we can predict the average concentration in April 2018 based on the historical data from January 1, 2014 to March 31, 2018. In the same way, we also predict the concentration in May and June. What needs to be changed is to expand the scope of our historical data from March 31 to April 30 and May 31, 2018. Finally, we can obtain predicted values from April to June. Unit: g/ 3 (Tab. 2). From the data comparison in Tab. 2, the mean relative error of the predicted values from April to June 2018 is calculated to be 0.036. Therefore, the Monte Carlo method to predict the average monthly concentration of PM2.5 by virtue of the PM2.5 concentration value has a certain reliability. However, in Fig. 8, we find that the PM2.5 daily concentration predicted value has a small fluctuation range and can't roughly reflect the daily variation trend of Beijing PM2.5 concentration, which indicates that the PM2.5 concentration as a random variable cannot effectively simulate the change of PM2.5 concentration in the region although the PM2.5 concentration may be predicted well.

Monte Carlo simulation based on the chain development speed
In order to further study the change trend of PM2.5 concentration in a month, it is not significant to rely solely on the concentration data as variables to simulate. Therefore, we consider the development speed of Beijing PM2.5 logarithmic concentration value as a new random variable. The chain development speed is studied here. Assume that the chain development speed is recorded as t S , then we have where and −1 indicates the PM2.5 concentration levels for the and − 1 phases respectively.

Data processing
According to the Eq. (11), the PM2.5 daily logarithmic concentration data need to be converted into the PM2.5 daily logarithmic concentration chain development speed. The total distance is 0.1, which is divided into 15 groups. In this way, we draw the PM2.5 daily concentration chain development speed histogram figure (Fig. 6). The histogram in Fig. 6 is still a single peak, and symmetrically distributed left and right, what's more, the distribution is still approximately normal distribution. In Tab. 3, we can also find that the theoretical probability is very close to the frequency of the actual data. Therefore, it is appropriate to use a normal distribution to simulate the probability distribution of the PM2.5 daily concentration chain development speed. Similarly, we figure out that the mean value of the sample data at this time is = 1.024 and the standard deviation is = 0.232. By comparing the chain development speed of PM2.5 logarithmic concentration with the normal distribution chart of the same mean and variance, Fig. 7 can be obtained. We can see that the sample data have higher peaks than the normal distribution curve, and the rest are almost coincident, indicating that the PM2.5 concentration has a significant jump, but it has long-term stability as a whole. Among them, the jump performance in the PM2.5 concentration may appear large fluctuations in the short term, and the stability performance in the PM2.5 concentration is maintained at a low level in the long term.
In the above two figures, based on the mean value Zof the sample data = 1.024 > 1, we can know that the chain development speed of PM2.5 daily logarithmic concentration is mainly concentrated in the interval [1, 1.2], which shows that the daily concentration of PM2.5 in Beijing has a slight increase compared with the previous period. The PM2.5 concentration in Beijing has had a slow growth trend over the past period of time.

Random number generation and sampling
According to the probability distribution of the chain development speed of PM2.5 daily logarithmic concentration (subject to the mean distribution of µ = 1.024, standard deviation = 0.232), According to the Eq. (12), a random number (1 ≤ ≤ 30) satisfying (0, 1) uniform distribution can be generated by computer, then the quantile function of the normal distribution corresponds to the random number . In addition, we also need to calculate the chain development speed of PM2.5 daily logarithmic concentration , then repeat the simulation times, we can achieve the mean value of the chain development speed of the PM2.5 daily logarithmic concentration at different times in a month. Here take data in June as an example and = 10000. Tab. 4 gives the partial chain development speed predicted values.

Simulation results
Assume that the logarithm of the PM2.5 concentration at the current time is 0 , the calculation formula of the haze concentration in the phase from the current time is as follows: = 0 ∏ , = 1,2, ⋯ ,30 =1 (13) Let the logarithm of PM2.5 concentration on May 31, 2018 be the starting value 0 . According to Eq. (13) and the data in Tab. 4, the PM2.5 daily logarithmic concentration in the next month can be calculated, then we can obtain the PM2.5 daily concentration predicted value in Tab. 5 by indexing it. Based on the above predicted values and combined with Fig. 8, the average monthly concentration in June is 53.66 g/ 3 and the relative error is 18.12%. Compared with the predicted value of 51.39 g/ 3 calculated directly from PM2.5 concentration, we find that although the second method is slightly higher than before in the concentration prediction, the concentration of PM2.5 in Beijing shows a slow growth phenomenon in the period of June, and it is better to match the trend of the actual curve. Therefore, it is considered that the chain development speed of PM2.5 daily logarithmic concentration can be used as a random object to more comprehensively evaluate the PM2.5 concentration change.

Conclusion
In order to solve the prediction of PM2.5 concentration in Beijing, this paper adopts the Monte Carlo method, which is simple and easy to implement, to establish a normal distribution probability model based on PM2.5 logarithmic concentration data and the chain development speed of PM2.5 logarithmic concentration as two different random variables. Through a large number of random samplings, the mean value of PM2.5 monthly concentration can be calculated, and the validity of the model is verified by comparing the relative error between the predicted value and the true value. The results show that based on historical effective data, Monte Carlo method can be used to predict the monthly average value of PM2.5 concentration and the trend of PM2.5 concentration change with certain reliability. Therefore, combined with Monte Carlo's broad applicability, as long as we can obtain valid data of PM2.5 (also can be PM10, SO2 concentration, etc.) and calculate its corresponding probability model, then we can reasonably predict the mean and trend of PM2.5 monthly concentration in this area.