Elsevier

Atmospheric Environment

Volume 150, February 2017, Pages 146-161
Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data

https://doi.org/10.1016/j.atmosenv.2016.11.054

Highlights

  • Correlation analysis of PM2.5 and potentially related factors from different source types.

  • PM2.5 had the highest mathematical correlation with average wind speed, CO, NO2, PM10, and the daily number of specific microblog entries.

  • Short-term prediction of PM2.5 was studied using ARIMA.

Abstract

The PM2.5 problem is proving to be a major public crisis of great public concern that requires an urgent response. Information about, and prediction of, PM2.5 from the perspective of atmospheric dynamic theory is still limited due to the complexity of the formation and development of PM2.5. In this paper, we attempt a relevance analysis and short-term prediction of PM2.5 concentrations in Beijing, China, using multi-source data mining. A correlation analysis model relating PM2.5 to physical data (meteorological data, including regional average rainfall, daily mean temperature, average relative humidity, average wind speed, and maximum wind speed, and other pollutant concentration data, including CO, NO2, SO2, and PM10) and social media data (microblog data) is proposed, based on the Multivariate Statistical Analysis method. The study found that, among these factors, the average wind speed, the concentrations of CO, NO2, and PM10, and the daily number of microblog entries with the key words ‘Beijing; Air pollution’ show a high mathematical correlation with PM2.5 concentrations. The correlation analysis was then extended using a machine learning model from the big data field, the Back Propagation Neural Network (hereinafter referred to as BPNN) model. The BPNN method was found to perform better in correlation mining. Finally, an Autoregressive Integrated Moving Average (hereinafter referred to as ARIMA) Time Series model was applied to explore the short-term prediction of PM2.5 as a time series. The predicted results were in good agreement with the observed data. This study helps to realize real-time monitoring, analysis, and pre-warning of PM2.5, and it also helps to broaden the application of big data and multi-source data mining methods.

Introduction

High levels of haze and PM2.5, usually induced by coal combustion, vehicle emissions, industrial processes, and petroleum usage, have become progressively more serious public crises in urban areas (Kong et al., 2015, Tan et al., 2015). In 2009, the annual average concentration of PM2.5 in Milan was 30 μg/m3 (Perrone et al., 2012), which was higher than the World Health Organization's short-term (24-hour) guideline concentration value of 25 μg/m3 (Caggiano et al., 2011). In January 2013, Beijing experienced a severe and long-lasting haze episode during which the daily mean PM2.5 concentrations exceeded 75 μg/m3 (the Second Grade National Standard for China) for 22 days, and exceeded 35 μg/m3 (the First Grade National Standard for China) for 27 days (He et al., 2014). Moreover, as has been shown in many studies, the health risks from PM2.5 exposure can be serious, due primarily to the chemical composition of PM2.5. A higher risk of pulmonary disease, emphysema, and lung and nasal cancer could result from the carcinogenic constituents in PM2.5 (Diaz and Dominguez, 2009, Matus et al., 2012, Gursumeeran Satsangi et al., 2014). To respond efficiently to the PM2.5 problem, both individually and governmentally, it is necessary to develop effective models to analyze the correlation between PM2.5 and potentially related factors, and to be able to predict PM2.5 concentrations as a time series.

Most of the research that has been done on the PM2.5 problem focuses on either Sampling Analysis or Atmospheric Dispersion Model Simulation. Sampling Analysis can describe the characteristics of PM2.5 in certain regions to a certain extent. A PM-visibility correlation study was carried out in the Yangtze River Delta in China using sampling PM data and meteorological data from 1980 to 2012. The results showed that fine PM, such as PM2.5, was the key influence on visibility in this region. Fine PM affects both the PM itself and the relative humidity, ultimately altering visibility (Cheng et al., 2013). Additional studies with sampling PM data showed that PM2.5 follows a characteristic seasonal cycle (Zhao et al., 2013), and that PM2.5 conditions differ greatly according to the region in which a large Chinese city is located (Yang et al., 2011). Atmospheric Dispersion Model Simulation is used mostly because it represents the dynamic mechanisms that drive the dispersion and transport of PM2.5. To model the distribution and dispersion of PM2.5 in certain regions, the CMAQ Model (Pun et al., 2006, Liu et al., 2008, Wang et al., 2012, Chemel et al., 2014), the WRF-Chem Model (Saide et al., 2011, Marcelo et al., 2012), the GEOS-CHEM Model (Hu et al., 2009, Zhang et al., 2013), and some other atmospheric dispersion models have been applied.

Despite the studies mentioned above, the ability to analyze and predict PM2.5 from the perspective of atmospheric dynamics is still limited because of the complexity of the formation and development of PM2.5 (Kirk and Peter, 2007, Kirk and Kristen, 2011, Zhang et al., 2016). Recently, studies of air pollution from the perspective of Big Data have appeared. Some studies described the qualitative correlation between PM2.5 and meteorological factors, traffic factors, human mobility, etc., through a Data Co-Training Method (Zheng et al., 2013). Another way to qualitatively analyze the correlation between PM2.5 and related factors is to perform a Correlation Coefficient Analysis, the result of which can show which factors are most related to PM2.5 (Chen et al., 2014, Yang et al., 2015). Some quantitative methods have been developed based on Neural Network models to forecast the concentrations of PM2.5 (Russo et al., 2013, Arhami et al., 2013, Fu et al., 2015). In the field of correlation analysis and estimation of the current PM2.5 from related factors using data mining methods, the RMSE between the predicted and observed PM2.5 was approximately 20 μg/m3 to 30 μg/m3 (Xie et al., 2015, You et al., 2016, Fu et al., 2015).
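As an illustration of the Correlation Coefficient Analysis mentioned above, the following is a minimal sketch that ranks candidate factors by the absolute value of their Pearson correlation with PM2.5. The function name `pearson_r` and all data values are hypothetical, invented for this example only; they are not the data or the code used in the studies cited.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily values, invented for illustration (not the paper's data).
pm25 = [35.0, 80.0, 150.0, 60.0, 20.0, 110.0]   # ug/m3
factors = {
    "avg_wind_speed": [3.5, 2.0, 1.0, 2.5, 4.5, 1.5],  # m/s
    "co": [0.6, 1.2, 2.4, 1.0, 0.4, 1.8],              # mg/m3
}

# Rank factors by strength of linear association, regardless of sign.
correlations = {name: pearson_r(pm25, values) for name, values in factors.items()}
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)

for name in ranked:
    print(f"{name}: r = {correlations[name]:+.3f}")
```

A negative coefficient (as for wind speed in this made-up sample) indicates that the factor tends to fall as PM2.5 rises; the coefficient measures linear association only and says nothing about causation.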

Big data encompasses many different data types, ranging from social media data (mobile app data, microblog data, internet search engine data, etc.) to purely physical data. For example, as the concentration of PM2.5 increases, complaints and comments about it in microblogs increase substantially. Studies of the correlation between PM2.5 and social media data help in understanding PM2.5 and thus facilitate information distribution and pre-warning of PM2.5. This paper established a correlation analysis model relating PM2.5 to physical data (meteorological data, other pollutant concentration data) and social media data (especially microblog data) based on the Multivariate Statistical Analysis method and the BPNN method. The correlation analysis models were evaluated using the RMSE. The RMSE between the estimated and actual concentrations of PM2.5 reached 26.69 μg/m3 for the Multivariate Statistical Analysis method and 24.06 μg/m3 for the BPNN method in the case study in Beijing, China, a city that has faced a complex PM2.5 problem in recent years. A short-term prediction of PM2.5 was made from historical PM2.5 data using the ARIMA Time Series model, and the relationship between the RMSE and the prediction time lag was also studied. This study helps to realize real-time monitoring, analysis, and pre-warning of PM2.5, and also broadens the application of big data and multi-source data mining methods.
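To make the RMSE indicator and the idea of short-term time series prediction concrete, the sketch below computes the RMSE and produces a multi-step forecast with a deliberately simplified ARIMA(1,1,0)-style recursion: an AR(1) coefficient fitted by least squares to the first differences, with no intercept and no moving-average term. This is a stand-in written for illustration under those assumptions, not the ARIMA model fitted in the paper; the function names are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

def fit_ar1_on_differences(series):
    """Least-squares AR(1) coefficient (no intercept) on first differences."""
    if len(series) < 3:
        raise ValueError("need at least 3 observations")
    diffs = [b - a for a, b in zip(series, series[1:])]
    numerator = sum(prev * cur for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev ** 2 for prev in diffs[:-1])
    return numerator / denominator if denominator else 0.0

def forecast(series, steps):
    """Forecast `steps` values ahead by integrating predicted differences."""
    phi = fit_ar1_on_differences(series)
    last_value = series[-1]
    last_diff = series[-1] - series[-2]
    predictions = []
    for _ in range(steps):
        last_diff = phi * last_diff          # AR(1) step on the differenced series
        last_value = last_value + last_diff  # undo the differencing
        predictions.append(last_value)
    return predictions
```

One way to study how the error grows with the prediction time lag is to call `forecast(history, h)` from many different starting days, collect the h-th predicted value each time, and compute `rmse` against the observed values separately for each `h`.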

Overview

This paper focused on relevance analysis and short-term prediction of PM2.5 using multi-source data. In this study, physical data were used, including meteorological data (regional average rainfall, daily mean temperature, average relative humidity, average wind speed, maximum wind speed) and other pollutant concentration data (CO, NO2, SO2, PM10). We selected microblog data to represent social media data, since microblogging is a widely used platform with a large number of users in China. Fig. 1 shows the

Study area and data

Beijing covers an area of 16 808 square kilometers and has a population of nearly 21.7 million. PM2.5 levels are currently extremely high in highly-populated city clusters such as the Beijing-Tianjin-Hebei region (Zhang and Cao, 2015). Beijing is thus a typical city faced with a serious PM2.5 problem. Fig. 3 shows the study area. The black marks on the map represent the main air monitoring stations for each district in Beijing.

The time period used in this study was from 1st January to 31st

Relevance analysis of PM2.5 and the related factors based on Multivariate Statistical Analysis

We collected physical data (meteorological data, including regional average rainfall, daily mean temperature, average relative humidity, average wind speed, maximum wind speed, and other pollutant concentration data, including CO, NO2, SO2, PM10) and social media data (microblog data) from 2014. Correlation scatter diagrams of PM2.5, other physical data, and social media data are plotted using the database in Fig. 4. For the content transmission of microblog is common in specific news,

Conclusion

Correlation analysis and short-term prediction of PM2.5 based on multi-source data mining were explored in this paper. Two methods were developed to analyze the correlation of PM2.5 and other possibly related factors: physical data (meteorological data, including regional average rainfall, daily mean temperature, average relative humidity, average wind speed, maximum wind speed), other pollutant concentration data (CO, NO2, SO2, PM10) and social media data (microblog data). This paper also

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 71473146), the China Clean Development Mechanism Foundation (Grant No. 2013049), and the Collaborative Innovation Center of Public Safety.

References (33)

  • Pablo E. Saide et al. (2011). Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model. Atmos. Environ.

  • Tijian Wang et al. (2012). Urban air quality and regional haze weather forecast for Yangtze River Delta region. Atmos. Environ.

  • Xiuming Zhang et al. (2016). Characterization of haze episodes and factors contributing to their formation using a panel model. Chemosphere.

  • Mohammad Arhami et al. (2013). Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations. Environ. Sci. Pollut. Res.

  • George E.P. Box et al. (1976). Time series analysis, forecasting and control. J. R. Stat. Soc. Ser. A General.

  • Minglei Fu et al. (2015). Prediction of particulate matter concentrations by developed feed-forward neural network with rolling mechanism and gray model. Neural Comput. Appl.