Temporal Patterns in Fine Particulate Matter Time Series in Beijing: A Calendar View

Extremely high fine particulate matter (PM2.5) concentration has become synonymous with Beijing, the capital of China, posing critical challenges to its sustainable development and raising major public health concerns. Formulating mitigation measures and policies requires knowledge of how PM2.5 concentration varies over time. Previous studies are limited either by data availability or by problematic a priori assumptions that PM2.5 concentration follows seasonal, monthly, or weekly patterns. Our study instead examines the data on a daily basis through visualization rather than imposing subjective periodic patterns upon it. To achieve this, we conduct two time-series cluster analyses on full-year PM2.5 data for Beijing in 2014 and provide an innovative calendar visualization of PM2.5 measurements throughout the year. The analysis of temporal variation in PM2.5 concentration shows that there are three diurnal patterns and no weekly patterns; seasonal patterns exist, but they do not follow a strict temporal division. These findings advance current understanding of temporal patterns in PM2.5 data and offer a different perspective that can help with policy formulation on PM2.5 mitigation.

Our study offers an innovative calendar visualization of PM2.5 concentration on a daily basis over the year 2014, which yields important insights into the temporal variation patterns of PM2.5 concentration.
The contribution of our study is two-fold. First, it presents an innovative and straightforward calendar visualization of daily PM2.5 time-series in Beijing in 2014. This technique provides a useful tool for visualizing and understanding the data and can be applied to examine temporal patterns of other air pollutants. Second, the insights generated from the two calendar plots advance our understanding of Beijing's PM2.5 concentration. Compared with previous studies on Beijing's PM2.5 concentration, our study offers a different perspective and yields insights that are more complete and convincing.
Figure 1 shows two calendar views of the cluster analyses using the correlation distance and the Euclidean distance, and the two corresponding trend curves of averaged PM2.5 concentrations. The analysis based on correlation distance yields three clusters containing 162, 117, and 86 time-series (days), respectively. We name these S1, S2, and S3, as shown in Fig. 1a,c. The analysis based on Euclidean distance yields nine clusters containing 255, 82, 15, 5, 2, 2, 2, 1, and 1 time-series, respectively. They are named L1, L2, L3, L4, O1, O2, O3, O4, and O5 (Fig. 1b,d). The clusters with fewer than three time-series, namely O1, O2, O3, O4, and O5, are considered "outliers" that either have extremely high PM2.5 concentration or exhibit odd variation patterns. We discuss these "outliers" later.

Results and Discussion
Interpretation of the calendar visualization. The calendar plot based on correlation distance (Fig. 1a) and the corresponding curves (Fig. 1c) show the cluster result based on shape differences among the 365 PM2.5 time-series. The result reveals three distinct variation patterns. An increasing pattern from 0 AM to 11 PM is most likely to be observed from January to March and from September to December (S1 in Fig. 1a,c). On days that show this increasing pattern, the maximum PM2.5 concentration usually occurs at night. The decreasing pattern can be observed in all months throughout the year (S2 in Fig. 1a,c) and attains its minimum value in the afternoon. The third pattern, shaped like an inverted V, often takes place from April to August (S3 in Fig. 1a,c), and the PM2.5 concentration on these days usually peaks at noon. These results show that the diurnal pattern of PM2.5 varies from day to day through the year and that PM2.5 concentration in the daytime can be higher than at night on many days. This complements previous studies concluding that the diurnal variation of PM2.5 changes by season and that PM2.5 concentration at night is higher than in the daytime 10,16.
Figure 1. (a) shows the PM2.5 time-series cluster result based on correlation distance; the letter S denotes "shape". (b) shows the cluster result based on Euclidean distance; L denotes "level" and O refers to "outlier". (c) shows the averaged PM2.5 trend for clusters based on correlation distance, and (d) shows the averaged PM2.5 trend for clusters based on Euclidean distance. The colours and labels are matched for each cluster for consistency, and the lines for O1, O2, O3, O4, and O5 are dashed for clear presentation.
Our findings are consistent with previous research that identified a 'sawtooth cycle' of PM2.5 variation 17. During a 'sawtooth cycle', the PM2.5 concentration first rises over a few days, which corresponds to the increasing pattern in our study (S1 in Fig. 1a,c), and then falls, which matches the decreasing pattern in our study (S2 in Fig. 1a,c) 17. One possible interpretation is that the increasing and decreasing patterns (S1 and S2 in Fig. 1a,c) are largely shaped by the passage of cold fronts. When a cold front arrives, the high-speed wind associated with it blows the pollution away, and the PM2.5 concentration decreases. Once the cold front moves on, the cold air, being denser and heavier, lies beneath the warm air, which leads to a temperature inversion. The inversion traps PM2.5 pollution near the surface, and the PM2.5 concentration increases.
Human activities such as heating and combustion, as well as weather conditions including wind and boundary layer height, are closely linked to the variation of PM2.5 concentration 18,19. As Fig. 1a,c shows, not all variation patterns in PM2.5 concentration match the daily cycle of human activities such as transportation, which usually peaks in the morning and afternoon. The third pattern (S3 in Fig. 1a,c) comes closest to matching the daily cycle of human activities, but it usually occurs from April to August. This finding suggests that the effect of human activities on variations of PM2.5 concentration may differ between time periods. We speculate that from January to March and from September to December, weather conditions including cold fronts, wind, and boundary layer height may be the major factors determining variations in PM2.5 concentration. From April to August, however, these weather conditions (e.g., cold fronts) weaken, and human activities might therefore have a stronger impact on PM2.5 variation.
The cluster result based on differences in PM2.5 concentration levels can be found in the calendar plot based on Euclidean distance (Fig. 1b) and the corresponding curves (Fig. 1d). The majority of days in the year have an averaged PM2.5 concentration of around 50 μg/m³ (L1 in Fig. 1b,d), a figure far above the WHO (25 μg/m³) and USA (15 μg/m³) air quality standards. The calendar plot also indicates that high averaged PM2.5 concentrations of around 150 μg/m³ (L2 in Fig. 1b,d) are likely to occur in every month throughout the year. In addition, extremely high PM2.5 concentrations above 250 μg/m³ (L3, O1, O2, O3, O4, and O5 in Fig. 1b,d) are usually observed in January, February, March, October, November, and December. This finding is consistent with previous studies concluding that PM2.5 concentration is generally highest during winter and lowest during summer 15,16.
Outliers. A few "outliers" (O1, O2, O3, O4, and O5 in Fig. 1b,d) can be found in Fig. 1b. For example, two notable "outliers", O4 and O5, on January 15 and February 26, 2014, respectively, show drastic variations across the day. Extremely high PM2.5 concentrations (O5 has a maximum PM2.5 concentration of 534 μg/m³) were observed on these two days, and the two incidents were reported by the Guardian 20, Time magazine 21, and Financial Times 22.
One event of particular interest is the Asia-Pacific Economic Cooperation (APEC) Summit held on 10 and 11 November 2014 in Beijing. It is reported that, in order to maintain a blue sky in Beijing during the APEC Summit, coordinated efforts were made by the governments of Beijing and six surrounding provinces before the summit 23. Measures included restrictions on road traffic and plant operations. The two calendar plots in our study indicate that PM2.5 concentration was very high in mid-October before the summit. For example, on October 19, 24, and 25, the PM2.5 concentration was over 150 μg/m³. After the emission control measures were enforced, the PM2.5 concentration was greatly reduced on November 1. On November 4, however, a sharp increase to around 150 μg/m³ was observed. A significant reduction followed on November 5, and PM2.5 concentration returned to a lower level afterwards until November 15, four days after the summit. These interpretations can also be obtained from local observations, but we note that the two calendar visualizations in our study offer a much more straightforward view of the whole picture of PM2.5 variations over time than other tools.
Seasonal and weekly patterns? As the two cluster results show, both shape and level variations exhibit a rough seasonal pattern, but the pattern does not follow strict seasonal divisions. As Fig. 1a shows, the S1 pattern usually occurs around winter (from January to March and from September to December), and the S3 pattern often occurs around summer (from April to August). Figure 1b shows a rough seasonal pattern too. Days in the L3 cluster usually occur near winter (in February, March, October, November, and December, but not January), whereas days in the L1 and L2 clusters can be found in any month of the year and do not exhibit a clear seasonal pattern.
There may be significant differences in PM2.5 concentration levels between seasons 10,15,16. However, we argue that an arbitrary seasonal division of the variation in PM2.5 concentration may result in information loss and conceal potentially important insights. The calendar visualization used in our study, by contrast, provides an informative and straightforward way to look into the variation patterns of air pollutants.
Several studies reported weekly patterns in PM2.5 concentration in Beijing 9,13, but their findings are not consistent with each other. One study stated that the lowest concentrations occurred on Mondays while the highest concentrations appeared from Thursdays to Saturdays 9; another concluded that PM2.5 concentrations on weekdays were lower than those on weekends 13. We do not observe these reported weekly patterns. Figure 1b shows that among all 52 weeks in 2014, PM2.5 concentrations were higher on weekdays than on weekends in at least 18 weeks. For example, from March 24 to 30, the lowest PM2.5 concentrations were observed on the weekend while the highest were on weekdays (Fig. 1b). We did not find any explicit and universal weekly variation pattern after visual inspection of the two calendar plots (Fig. 1a,b) and further calculations. This finding suggests that the weekly cycle of human activities may not play a key role in determining variations in PM2.5 concentration. Our finding complements and improves on previous studies that report weekly patterns in PM2.5 concentration in Beijing 9,13.
Future research. PM2.5 pollution can be measured in terms of optical properties and chemical composition in addition to mass concentration [24][25][26][27]. With the help of the calendar visualization technique used in this study, these informative properties and other air pollutants such as NO2, SO2, and O3 can help provide a better understanding of the air pollution problem.

Data and Methods
Data. The PM2.5 measurement data for Beijing used in this study were originally obtained from the official hourly air quality reporting platform (http://zx.bjmemc.com.cn/) run by Beijing Environment Protection Agency. This platform is part of the national air quality monitoring network initiated in late 2012. The data are rich, reporting hourly concentrations of six pollutants at 35 stations across Beijing (Fig. 2): particulate matter with aerodynamic diameter no greater than 2.5 microns (PM2.5), particulate matter with aerodynamic diameter less than 10 microns (PM10), sulphur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3), and carbon monoxide (CO). However, the data are not easily accessible, because the online reporting system only reports the air quality of the current day; historical data are not shown and are unavailable to the public. Fortunately, third parties created by civic efforts, such as PM25.in, AQISTUDY.cn, and EPMAP.org, have been crawling these data since late 2013.
Our study uses one year of air quality monitoring data, from 1 January 2014 to 31 December 2014, from AQISTUDY.cn, EPMAP.org, and the US Embassy Beijing Air Quality Monitor (Fig. 2). We noticed that hourly measurements are missing in all three data sources. Therefore, we combined them to obtain complete PM2.5 measurement data covering all 24 hours of all 365 days in 2014. The US Embassy Beijing Air Quality Monitor is operated by the US Department of State. The US Department of State requires that the following disclaimer be included in any publication that uses these data: "State Air observational data are not fully verified or validated; these data are subject to change, error, and correction. The data and information are in no way official".
A comprehensive data quality check was conducted on the raw data to reduce the impact of problematic data points, including duplicated data records, missing measurements recorded with a placeholder, and implausible zeros. After the data quality check, the hourly PM2.5 measurements from all 35 stations were aggregated into one averaged PM2.5 concentration per hour for the cluster analysis explained below.
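The quality check and aggregation described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the record layout `(station, hour, value)`, the placeholder value `-999`, and the function name `hourly_city_average` are all assumptions introduced here.

```python
from collections import defaultdict

# ASSUMPTION: missing measurements are marked with this placeholder value.
PLACEHOLDER = -999


def hourly_city_average(records):
    """Average cleaned station readings into one PM2.5 value per hour.

    `records` is a hypothetical list of (station_id, hour_key, value) tuples.
    """
    seen = set()
    by_hour = defaultdict(list)
    for station, hour, value in records:
        key = (station, hour)
        if key in seen:  # skip duplicated data records
            continue
        seen.add(key)
        if value is None or value == PLACEHOLDER:  # missing measurement
            continue
        if value <= 0:  # treat zero (or negative) readings as implausible
            continue
        by_hour[hour].append(value)
    # One averaged concentration per hour across the remaining stations.
    return {hour: sum(v) / len(v) for hour, v in sorted(by_hour.items())}


# Made-up readings, for illustration only.
records = [
    ("A", "2014-01-01 00", 80.0),
    ("A", "2014-01-01 00", 80.0),   # duplicate record
    ("B", "2014-01-01 00", 100.0),
    ("C", "2014-01-01 00", -999),   # placeholder for a missing value
    ("A", "2014-01-01 01", 0.0),    # implausible zero
    ("B", "2014-01-01 01", 60.0),
]
city = hourly_city_average(records)
print(city)
```

Hours for which every station reading is rejected simply do not appear in the result, so a caller would still need to check that all 24 hours of each day are present.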
Method. Since we have 24 hourly PM2.5 measurements for each day, we have 365 time-series objects with 24 data points each to analyse. We would like to group together time-series objects with similar variation patterns of PM2.5 concentration and separate those with dissimilar patterns into different groups. Thus, we employ a time-series clustering technique to mine the data.
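As a small sketch of how the 365 time-series objects could be formed, the snippet below cuts a flat list of hourly values into one 24-point series per day. The function name `split_into_days` and the placeholder input values are assumptions made for illustration.

```python
from datetime import date, timedelta


def split_into_days(hourly, start):
    """Cut a flat list of hourly values into {date: [24 values]}."""
    if len(hourly) % 24 != 0:
        raise ValueError("expected a whole number of days")
    days = {}
    for i in range(len(hourly) // 24):
        days[start + timedelta(days=i)] = hourly[24 * i: 24 * (i + 1)]
    return days


# Placeholder values only (not real measurements): 365 days x 24 hours.
hourly = [float(h % 24) for h in range(365 * 24)]
daily = split_into_days(hourly, date(2014, 1, 1))
print(len(daily), "daily series of", len(daily[date(2014, 1, 1)]), "points")
```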
In general, there are two essential components in a clustering analysis: the clustering algorithm and the distance measure 28. The clustering algorithm controls how similar objects are grouped, while the distance measure establishes the resemblance between two objects. Several algorithms and distance measures are available in the field of cluster analysis; our study employs a straightforward and suitable clustering method and metrics. Specifically, we use average-linkage agglomerative hierarchical clustering because, unlike K-means, this method generates repeatable and consistent results and does not require the number of clusters to be specified 29, and it is usually able to obtain more robust cluster results than other hierarchical clustering methods 30.
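To make the clustering step concrete, here is a naive sketch of average-linkage agglomerative clustering on a precomputed distance matrix. It is written for readability rather than speed and is not the implementation used in the study. The `cut` helper, which stops at a requested number of groups, is a hypothetical convenience: how the dendrogram was cut is not stated in the text.

```python
def average_linkage(dist):
    """Naive average-linkage clustering on a symmetric distance matrix.

    Returns the merge history as (members_a, members_b, height) tuples.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                height = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


def cut(merges, n_points, n_clusters):
    """Replay the earliest merges until only n_clusters groups remain."""
    groups = [[i] for i in range(n_points)]
    for left, right, _ in merges[: n_points - n_clusters]:
        groups = [g for g in groups if g != left and g != right]
        groups.append(left + right)
    return sorted(sorted(g) for g in groups)


# Toy data: five points on a line, with absolute difference as distance.
points = [0, 1, 10, 11, 50]
dist = [[abs(p - q) for q in points] for p in points]
merges = average_linkage(dist)
print(cut(merges, len(points), 3))
```

Because the procedure is deterministic for a given distance matrix, repeated runs give the same merge history, which is the repeatability property mentioned above.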
Distance measures were selected based on two basic features of the PM2.5 time-series data: level and shape. Level refers to the magnitude of PM2.5 concentration, and the Euclidean distance is used to identify the level difference between PM2.5 time-series. Shape refers to the trend of PM2.5 concentration over time, and we use a Pearson's correlation-based distance to capture the shape difference between PM2.5 time-series. We derived a generalized correlation-based dissimilarity function from a previous study 31 by making the coefficient α and the power β adjustable (equation (1)).
where ρ is the correlation coefficient. This dissimilarity function satisfies all the requirements for a dissimilarity measure: non-negativity, symmetry, and identity 32,33. When both α and β are set to 1, this dissimilarity function becomes the classic Pearson's correlation coefficient distance that has been used in several studies 34. In our study, however, we deliberately set α and β to 0.5 and 0.25, respectively, in order to attain a desirably robust cluster result.
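The two distance measures can be sketched as follows. The exact form of the generalized dissimilarity function is not reproduced in the text here, so the expression (α(1 − ρ))^β used below is an assumption: it was chosen only because it reduces to the classic distance 1 − ρ when α = β = 1, as described above. The function names are likewise illustrative.

```python
import math


def euclidean(x, y):
    """Level difference: Euclidean distance between two daily series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient (undefined for constant series)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_dissimilarity(x, y, alpha=0.5, beta=0.25):
    """Shape difference under an ASSUMED form (alpha * (1 - rho)) ** beta."""
    rho = pearson(x, y)
    # Clamp at zero so rounding error cannot produce a negative base.
    return (alpha * max(0.0, 1.0 - rho)) ** beta


# Toy series: same shape at different levels, and opposite shapes.
rising = [1.0, 2.0, 3.0, 4.0]
shifted = [11.0, 12.0, 13.0, 14.0]
falling = [4.0, 3.0, 2.0, 1.0]

print(euclidean(rising, shifted))                  # large: levels differ
print(correlation_dissimilarity(rising, shifted))  # near zero: same shape
print(correlation_dissimilarity(rising, falling))  # near one: opposite shape
```

The toy series illustrate why two measures are used: `rising` and `shifted` are far apart in level but nearly identical in shape, so they would be separated by the Euclidean distance and grouped by the correlation-based one.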
We employ the cophenetic correlation coefficient to examine the validity and robustness of the cluster analysis. The cophenetic correlation coefficient measures how faithfully the hierarchical cluster result represents the dissimilarity among observations 35. It is defined as the linear correlation coefficient between the original pairwise dissimilarities and the cophenetic dissimilarities obtained from the dendrogram. The value of this coefficient varies between 0 and 1. A higher cophenetic correlation coefficient indicates a better cluster solution, and a value of 0.8 or higher is usually regarded as a successful cluster application 36.
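The definition above can be illustrated with a short sketch. The cophenetic dissimilarity of two observations is the dendrogram height at which they are first joined. The merge-history format `(members_a, members_b, height)` and the function names are assumptions made for this example; the toy dendrogram is written out by hand.

```python
import math


def cophenetic_matrix(merges, n_points):
    """Height at which each pair of observations is first merged."""
    coph = [[0.0] * n_points for _ in range(n_points)]
    for left, right, height in merges:
        for i in left:
            for j in right:
                coph[i][j] = coph[j][i] = height
    return coph


def cophenetic_correlation(dist, merges):
    """Pearson correlation between original and cophenetic dissimilarities."""
    n = len(dist)
    coph = cophenetic_matrix(merges, n)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = [dist[i][j] for i, j in pairs]
    y = [coph[i][j] for i, j in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy example: three points at 0, 1 and 10 on a line.
dist = [[0, 1, 10], [1, 0, 9], [10, 9, 0]]
# Average-linkage dendrogram: join 0 and 1 at height 1, then add 2 at 9.5.
merges = [([0], [1], 1.0), ([0, 1], [2], 9.5)]
coefficient = cophenetic_correlation(dist, merges)
print(round(coefficient, 3))
```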
It turns out that the cophenetic correlation coefficients for Euclidean-distance-based and correlation-distance-based cluster analyses are 0.86 and 0.81, respectively, suggesting that both cluster results are robust and valid.
We used Python version 2.7.5 to process and analyse the data, and R version 3.2.2 to draw the calendar plots.
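The calendar layout itself can be approximated in plain text. The sketch below is written in Python rather than R and is not the plotting code used for Fig. 1; it only shows how each day's cluster label can be placed on a Monday-to-Sunday month grid. The function name `month_grid` and the `--` marker for unlabelled days are assumptions; the two labels are the outlier days named in the text.

```python
import calendar
from datetime import date


def month_grid(year, month, labels):
    """Return calendar rows (Monday to Sunday) of cluster labels for a month."""
    rows = []
    weeks = calendar.Calendar(firstweekday=0).monthdayscalendar(year, month)
    for week in weeks:
        # Day number 0 marks a cell that belongs to a neighbouring month.
        rows.append([labels.get(date(year, month, d), "--") if d else "  "
                     for d in week])
    return rows


# Only the two outlier days named in the text are labelled here.
labels = {date(2014, 1, 15): "O4", date(2014, 2, 26): "O5"}

grid = month_grid(2014, 1, labels)
print("Mo Tu We Th Fr Sa Su")
for row in grid:
    print(" ".join(row))
```

In a graphical version, each cell would be filled with the colour of its cluster instead of a text label, so that runs of similar days and isolated outliers stand out at a glance.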