Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Directional dependence between major cities in China based on copula regression on air pollution measurements

  • Jong-Min Kim,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Supervision, Visualization, Writing – original draft

    Affiliation Statistics Discipline, Division of Sciences and Mathematics, University of Minnesota-Morris, Morris, MN, United States of America

  • Namgil Lee ,

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Visualization, Writing – original draft

    namgil.lee@kangwon.ac.kr

    Affiliation Department of Information Statistics, Kangwon National University, Chuncheon, Gangwon, South Korea

  • Xingyao Xiao

    Roles Data curation, Formal analysis, Software, Visualization

    Affiliation Applied Statistics and Psychometrics, Lynch School of Education, Boston College, Chestnut Hill, MA, United States of America

Abstract

Air pollution is well-known as a major risk to public health, causing various diseases including pulmonary and cardiovascular diseases. As social concern increases, the amount of air pollution data is increasing rapidly. The purpose of this study is to statistically characterize dependence between major cities in China based on a measure of directional dependence estimated from PM2.5 measurements. As a measure of the directional dependence, we propose the so-called copula directional dependence (CDD) using beta regression models. An advantage of the CDD is that it does not rely on strict assumptions of specific probability distributions or linearity. We used hourly PM2.5 measurement data collected at four major cities in China: Beijing, Chengdu, Guangzhou, and Shanghai, from 2013 to 2017. After accounting for autocorrelation in the PM2.5 time series via nonlinear autoregressive models, CDDs between the four cities were estimated to produce directed network structures of statistical dependence. In addition, a statistical method was proposed to test the directionality of dependence between each pair of cities. From the PM2.5 data, we could discover that Chengdu and Guangzhou are the most closely related cities and that the directionality between them has changed once during 2013 to 2017, which implies a major economic or environmental change in these Chinese regions.

1 Introduction

Recently, air pollution has become a significant environmental and social problem in China. PM2.5 refers to the concentration of atmospheric fine particulate matter (PM) whose diameter is ≤ 2.5μm. Exposure to PM2.5 is associated with increased mortality rates caused by lung cancer and cardiopulmonary diseases [14]. It is generally accepted that PM2.5 is more harmful to human health than PM with diameter > 2.5μm and ≤ 10μm (PM10) [5, 6]. Sources of particulate matter include residential wood burning, coal-fired thermal power generation, agricultural burning, diesel fuel combustion, and natural/industrial dust. PM2.5 can also be generated indirectly when gases and particles interact in the air. China’s air pollution situation is so extreme that ambient air pollution ranked the second highest and household air pollution ranked the third highest in the G20 in terms of disability-adjusted life-years (DALYs) in 2010 [7]. It has been estimated that the air pollution in China contributed to 1.6 million deaths/year in 2014 [8]. In addition to the harmful effects of air pollutants on human health and mortality, pollutants can also have diverse effects on climate and weather due to their complex composition and sources [912].

China has regulated ambient air quality since 1982, when it set limits on total suspended particulates (TSP), SO2, NO2, lead, and Benzopyrene [13]. In February 2012, China adopted a new Ambient Air Quality Standard [14], where limits on PM2.5 were set for the first time. In 2012, major cities in China including the Beijing-Tianjin-Hebei region, Yangtze River delta region, Pearl River delta region, and provincial capitals were required to implement the standards. The current standards were implemented nationwide in 2016; see [15] for more detail.

Nowadays, a large amount of air quality monitoring data is being accumulated, where air pollutants are monitored directly on ground level; see Section 2 for more detail. Due to the increase in the amount of air quality monitoring data in China, numerous studies have been conducted to characterize China’s air quality status and recent trends [8, 1618]. However, most of the studies focus on analyzing temporal trends of air pollutant concentration and visualizing its spatial distributions at various time scales.

In recent years, there have been efforts to uncover possible sources of air pollution in China [8, 18]. However, it is a challenging task to identify causal effects based on observational data compared to controlled experimental data due to the effects of unobserved variables [19, 20]. On the other hand, under the consideration that a wide variety of factors may influence the PM2.5 level, which may be closely related to environmental and industrial factors, we focus on inferring statistical dependence and causal relations between four major cities in China based on the PM2.5 measurement data as observational evidence. In this study, we included Beijing, Chengdu, Guangzhou, and Shanghai as the target cities in China because they are the four major cities which represent main industrial regions in China.

Approaches to the determination of causal relations based on obserational data include graphical causal modeling [20] such as causal Bayesian networks and structural equation models. Graphical causal models determine directed graph structures satisfying certain conditional independence assumptions called the Markov assumption and faithfulness assumption [1921]. However, it is well known that a number of directed graph structures cannot be distinguished even if the conditional independence assumptions are satisfied, e.g., XY and YX cannot be distinguished.

Alternatively, the concept of directional dependence has been studied in linear regression settings in the statistics community [22, 23]. Measures of directional dependence using copula regression were investigated in [24]. More generally, regression-based approaches for determining cause and effect variables between two random variables have been suggested in the machine learning community [2531]. The basic principle is to compare the regression models in alternate directions and investigate asymmetry in the joint distribution.

The purpose of this study is to infer and validate a statistical measure of directional dependence between the selected cities based on time series of PM2.5 concentrations. Copula directional dependence (CDD) is a measure of directional dependences between two regions of interest, which can be applied to non-normal distributions and nonlinear relationships [24, 32, 33]. Contrary to other dependence measures such as correlations and mutual information, the CDD yields bi-directional dependences which measure statistical influence from one region to another.

Copula-based modeling has been widely studied and applied in many fields such as macroeconomics and finance [34, 35] and genetics and biology [36, 37]. Gaussian copula generalized linear models for longitudinal data analysis were introduced in [38]. Multivariate regression analysis using Bayesian inference with Gaussian copulas was proposed subsequently [39]. Later, a framework for joint regression analysis of one-dimensional generalized linear models using Gaussian copulas was proposed [40]. More recently, Gaussian copula marginal regression (GCMR) models were developed [41].

On the other hand, a regression model called the beta regression was proposed [42], where continuous responses are assumed to take values in unit intervals. The beta regression is effective for modeling bounded responses such as rates and proportions due to its flexibility in modeling the shapes and asymmetries of distributions. Various extensions of beta regression models have been proposed [4346]. For time series data, a beta regression model using the Gaussian copula for bounded time series was proposed in [47], where stationary autoregressive moving average (ARMA) models were adopted for addressing the serial correlation.

In this paper, we apply the CDD proposed in [32, 33]. The suggested CDD measure uses GCMR models [41] and the beta regression model [47]. By using the suggested CDD measure, non-normal and non-linear relationships in the data can be efficiently modeled without any complicated processes of structure search or hyperparameter selection. Moreover, since the CDD is a directional dependence structure, we can further infer directed networks of statistical dependence between the major cities in China.

The rest of this paper is organized as follows. In Section 2, we describe the PM2.5 data and their descriptive statistics. In Section 3, we explain the proposed measure of directional dependence using beta regression. And we describe statistical procedures for processing the time series data and inferring directional dependences between the four cities in China. In Section 4, we present the analyses and results. Discussion and conclusions are given in Section 5.

2 Data

2.1 Air pollution data

Nowadays, air pollutants are monitored at various locations in many countries. A national real-time air quality monitoring system version 1.0, known as the Air Reporting System, has been operating on the China National Environmental Monitoring Center (CNEMC) website (http://www.cnemc.cn/) since January 2013. The Air Reporting System version 2.0 has been operating since January 2014 and covers 945 sites in 190 cities according to the Ambient Air Quality Standard (GB3095-2012). Other approaches to air quality monitoring include satellite data-based approaches [4850] and geoscientific modeling [51]. Current air quality monitoring systems such as the Air Reporting System can provide the hourly concentration of various types of air pollutants measured directly at ground level.

Even though the Air Reporting System of China monitors the ambient air quality in a wide range of Chinese regions and cities in real time, most air quality monitoring history is not publicly available. Independently from the Chinese national air quality monitoring system, the US Mission in China started monitoring air quality in 2008 at the US Embassy in Beijing. Since then, monitoring stations have been subsequently established at the US consulates in Shanghai (2011), Guangzhou (2011), Chengdu (2012), and Shenyang (2013) [17]. The air quality status and hourly PM2.5 concentration are available at http://www.stateair.net/.

The PM2.5 time series data analyzed in this study were obtained from the US Mission China air quality website (http://stateair.net). Note that the US Mission China air quality website states that the data are not fully verified or validated; these data are subject to change, error, and correction (http://stateair.net/web/assets/USDOS_AQDataUseStatement.pdf). The historical data files on the stateair.net site can be downloaded manually after agreeing to the data use statement. The obtained data consist of hourly PM2.5 time series of four major Chinese cities, Beijing, Chengdu, Guangzhou, and Shanghai, from January 2013 until June 2017. The original data are hourly measurements of PM2.5 concentration levels measured in μg/m3. We removed the data with missing values and then took the daily maximum of the PM2.5 levels.

We note that about 62% of the daily maximum values fell in the night hours (between 21:00 and 4:00) for the 2013 Beijing data. One may consider that air pollution levels during business hours are an indicator of poor health and not values falling in the night hours when the population is asleep. In this sense, it will be very interesting to analyze subsets of the air pollution data measured during business hours for applications to public health in future work. In this paper, on the other hand, we took the daily maximum values to analyze statistical dependence between major cities in China. Since the air pollution level is measured at ground-level, it fluctuates sensitively depending on daily air temperature change. Hence, we remove the effect of daily air temperature change by taking the daily maximum values, which can be indicators of other factors such as coal fuel burning, transportation, industrial activity, and wind.

2.2 Descriptive statistics

Sample time series data of the PM2.5 levels for the year 2013 are illustrated in Fig 1. Hourly measurements of the PM2.5 levels were obtained in μg/m3. Missing values were set to zero in the figure. We can see that the PM2.5 levels are relatively high during the winter season (January and December), and Guangzhou and Shanghai show relatively low overall PM2.5 values compared to the other two cities.

thumbnail
Fig 1. Time series data of PM2.5 levels in the four Chinese cities (Beijing, Chengdu, Guangzhou, and Shanghai) during the year 2013.

https://doi.org/10.1371/journal.pone.0213148.g001

The histograms for the hourly PM2.5 level measurements in the year 2013 are presented in Fig 2. The marginal distribution for each of the Chinese cities is highly skewed and non-normally distributed, which implies that typical statistical dependence measures such as the Pearson correlation coefficient are not suitable.

thumbnail
Fig 2. Histograms of PM2.5 levels in the four Chinese cities during the year 2013.

https://doi.org/10.1371/journal.pone.0213148.g002

We have presented summary statistics of the PM2.5 level data for the years 2013 to 2017 in Table 1. During 2013 and 2014, the median value of Beijing was the highest among the four cities. However, it constantly decreased to the second rank in the following years, while Chengdu ranked the first from 2015 to 2017. The median PM2.5 levels of Guangzhou and Shanghai were relatively low and kept decreasing from 2013 to 2016. On the other hand, the maximum of Beijing’s PM2.5 levels was the highest during the years 2013 to 2017 except 2014. The maximum of Shanghai’s PM2.5 levels was the second highest in 2013, but it kept decreasing rapidly and ranked 4th among the four cities by 2016. In summary, the overall PM2.5 levels of the four cities showed a decreasing trend, but each city had its own statistical characteristics.

After removing the data with missing values and taking the daily maximum PM2.5 values, we generated scatter plots between each pair of cities, which are illustrated in Fig 3. The data points are distributed in a single cluster, which implies that they were sampled from an identical distribution. In addition, we can find that the distribution is highly skewed, and linear relationship between the variables is not apparent.

thumbnail
Fig 3. Scatterplot matrix for daily time series data of the PM2.5 levels between the four Chinese cities (BJ: Beijing, CD: Chengdu, GZ: Guangzhou, SH: Shanghai) during the year 2013.

https://doi.org/10.1371/journal.pone.0213148.g003

As a measure of pairwise correlation between the PM2.5 levels of the four cities, we computed Spearman correlation coefficients, which are summarized in Table 2. Note that the Spearman correlation coefficient is a correlation coefficient between the ranks of the measured values, so it is independent of the skewed marginal distributions.

thumbnail
Table 2. Spearman correlation coefficients of daily maximums of the PM2.5 levels between each pair of the four Chinese cities (BJ: Beijing, CD: Chengdu, GZ: Guangzhou, SH: Shanghai) during the years 2013 to 2017.

https://doi.org/10.1371/journal.pone.0213148.t002

We can find that

  1. Chengdu (CD) had relatively high Spearman correlations with Beijing (BJ) and Guangzhou (GZ) from 2013 to 2017.
  2. Correlations between Beijing (BJ), Guangzhou (GZ), and Shanghai (SH) were relatively low from 2013 to 2017.
  3. Spearman correlations between Beijing (BJ) and Guangzhou (GZ) slightly increased more recently in 2016 and 2017.

3 Methods

A copula is a multivariate joint distribution function, which is an effective approach for describing statistical dependencies within a set of non-normal random variables [52, 53]. Statistical dependence structures can be modeled by choosing a copula function independent of the marginal distributions of each random variable. Since a normal distribution assumption or linearity assumption is not required, copulas can be applied to a wide variety of statistical dependence modeling tasks.

3.1 Directional dependence based on copula regression

In this paper, we consider bivariate copulas and suggest a measure of directional dependence between a pair of random variables. A generalization to a measure of directional dependence using multivariate copulas is left for future works, where computationally efficient methods need to be developed.

According to Sklar’s theorem [54, 55], any joint distribution function FXY(x, y) of two random variables X and Y can be represented by a bivariate function, C(u, v) composed with the marginal distribution functions, FX(x) and FY(y), as (1) where C(x, y) is called the copula. Note that U = FX(X) and V = FY(Y) have a uniform distribution on [0, 1]. We can see that copulas are independent and invariant of marginal distributions. Hence, a copula determines the dependency structure between two random variables independently of any one-to-one continuous transformations of each variable.

Directional dependence refers to asymmetric dependence between random variables. The concept of directional dependence was first investigated under linear regression models in [22, 23]. A copula can be used to define directional dependence in terms of regression [24]. Let C(u, v) denote a copula function, which defines the joint distribution of two random variables (U, V) whose marginal distributions have a uniform distribution on [0, 1]. Let Cu(v) be defined by the conditional distribution of V given U = u as

The conditional expectation of V given U = u can be expressed by the copula as [24] (2)

Note that the conditional expectation function Cu(v) is the regression function of V on U.

In [24], the directional dependence from U to V is defined, based on the copula regression function of V, by (3)

Note that the copula directional dependence (CDD) in Eq (3) is the proportion of the variance of V which has been explained by the copula regression function rV|U(u). In the same way, the directional dependence from V to U is defined by the proportion of the variance of U which has been explained by the copula regression function rU|V(v).

Moreover, note that the CDD defined in Eq (3) is a version of the Spearman’s correlation coefficient. The Spearman’s correlation coefficient can also be expressed in terms of a copula irrespectively of the marginal distributions. The Spearman’s correlation coefficient, ρS, is the ordinary (Pearson) correlation between U = FX(X) and V = FY(Y). It is well known that ρS can be expressed in the following form [5658]: (4)

Note that if U and V are statistically independent, then C(u, v) = uv on 0 ≤ u, v ≤ 1, which leads to rV|U(u) = rU|V(v) = 0.5. This implies that the directional dependence in Eq (3) and the Spearman’s ρS in Eq (4) are measures of the deviations from the joint probability distribution function for indenpendent random variables. See, e.g., [57] for relations between Spearman’s ρS and Kendall’s τ.

Moreover, for a pair of random variables U and V, we can estimate and compare the two CDDs, and , to find which copula regression provides a better fit to the data and better explains the variance of the response variable.

3.2 Gaussian copula beta regression

The definition of CDD in Eq (3) between two uniformly distributed random variables U and V can be extended to a pair of continuous random variables X and Y by defining U = FX(X) and V = FY(Y) and (5)

On the other hand, for estimating the CDD, it is necessary to determine a parametric form of the copula regression function, rV|U(u). Note that both U = FX(X) and V = FY(Y) are uniformly distributed and take their values in the unit interval [0, 1]. Since rV|U(u) is a conditional expectation of V given U = u, both the response variable and predictor variable have bounded ranges.

In beta regression, the conditional distribution of a response variable V given U = ut is expressed by a beta distribution, Beta(μt, κt), with the mean 0 < μt < 1 and the precision κt > 0 [42]. A beta distribution can flexibly represent a wide range of continuous probabilty distributions defined on the unit interval, that is, various shapes and asymmetries of distributions can be expressed by a beta distribution. The probability density function of V given U = ut can be written as (6) where Γ(⋅) is the gamma function. The mean parameter μt is a function of the covariate ut, through the logit function as (7)

The parameters β0 and β1 can be efficiently estimated based on the maximum likelihood approach in the Gaussian copula marginal regression (GCMR) [41, 47]. In the literatures about copula directional dependences, it has been known that the Gaussian copula regression [39] and GCMR [41] are computationally efficient approaches compared to previously known copula directional dependence measures using a family of asymmetric copulas [37, 59, 60]. In the GCMR [41], the cumulative distribution function, F(vt|μt, κt), for the beta distribution in Eq 6 is used to transform the response variable vt into wt = F(vt|μt, κt). Then, the transformed variable is related with a standard normal random variable ϵt by the inverse of the probability integral transform (8) where Ψ is the cumulative distribution function of the standard normal distribution [47]. The above relationship between the responses vt and the normal random variables ϵt is used to formulate the sampling distribution and the likelihood function. See [32, 33, 47] for more details.

3.3 Neural network autoregression

In Fig 1, we can see that the air pollution data are far from white noise processes, while their mean and variance change dramatically over time. In order to guarantee a feasible maximum likelihood inference for the beta regression, the serial dependence in the time series data should be removed via a proper modeling of the data [61].

In [32, 33], financial time series data showing conditional heteroscedasticity and serial dependence were preprocessed by employing the asymmetric GARCH(p,q) model [62], and then standardized residuals were generated. In [61], it is noted that GARCH models are employed frequently to remove serial dependence when dealing with financial log-return time series.

In the case of meteorological time series data, nonlinear time series models have been widely applied for addressing irregular or chaotic behavior in the observations [63]. Nonlinear time series models consider that irregularity in the observations can be attributed to nonlinear dynamics occuring on a low dimensional chaotic attractor, which can be reconstructed and used to forecast future observations under appropriate conditions [64, 65]. Traditional linear stochastic time series models such as ARIMA models [66] have influenced the forecasting community significantly; however, they cannot capture the nonlinear dynamics underlying real life observations [67]. On the other hand, many useful nonlinear time series models have been proposed [67, 68]. Among them, artificial neural networks (ANNs) such as feedforward neural networks have been suggested as a promising approach for modeling irregular behavior in time series [6972]. We adopted the neural network approach for estimating nonlinear autocorrelations in the data by using the nnetar function in the R package forecast [73].

Let denote the PM2.5 concentration level at a city i = 1, 2, 3, 4, day t = 1, 2, …, Ts and year s = 2013, …, 2017. The nonlinear feed-forward neural network model for a lagged time series can be written as (9) where L is the lag order and f is a neural network with H hidden nodes in a single layer. The neural network model with L lags and H hidden nodes is denoted by NNAR(L,H). The model parameters L and H were determined automatically by default values, i.e., L by the optimal value according to AIC for a linear AR(p) model, and H by half of the numbers of input values plus one. In addition, the error process is assumed to be homoscedastic. See the forecast package documentation [73] for default preprocessing procedures and default hyper-parameter values for the nnetar function.

After fitting the neural network model in Eq (9) to the data, the fitted values, , are the predicted values which can be written as (10)

The term denotes a value randomly drawn from normal distributions, which can be used for obtaining prediction intervals. We have because we do not need predicted values for t > Ts. The residuals can be written as

Due to the model in Eq (9), the residuals, , can be considered to be not serially correlated.

3.4 Bootstrap confidence interval

Let denote the difference, . It measures the difference of the two directional dependences between the two cities U and V. If , then it means that city U affects the other city V more than V affects U.

For a statistical test of the difference, we use the bootstrap resampling method. We resampled a fixed rate from the data, then computed the CDDs repeatedly. In this way, we can estimate the 95% confidence interval for the difference, i.e., (11) where LB(ΔU,V) and UB(ΔU,V) are the lower and upper limits of the 95% confidence interval. If the confidence interval does not include zero, then we can reject the null hypothesis that there is no difference between the directional dependences and with a 95% confidence level.

3.5 Statistical properties: A review

The proposed CDD using beta regression was proposed recently in [32, 33]. Statistical properties of the proposed CDD measure using beta regression have been assessed numerically in [32] by using simulated data sets. The experimental setting and the results can be summarized as follows. First, an asymmetric bivariate copula was constructed based on the formula of Durante [59], which is written as (12) where C1 and C2 are two symmetric copulas and α, β ∈ (0, 1) are asymmetry parameters. C1 and C2 were selected as a Frank copula with parameter θ3 = 5 and a Gumbel copula with parameter θ4 = 40, respectively [32, 59]. It is challenging to obtain a theoretical CDD value, , of random variables (U, V) having the specific distribution C(u, v). Instead, a sample of 8,888 correlated pairs, {(ui, vi)}, was generated, and the CDD value was estimated by and . In order to assess the accuracy of the proposed CDD measure, the CDD value was estimated from each of the thousand independent subsamples with sample size 1000. Based on the thousand estimated CDD values, , the accuracy of the CDD was evaluated via the bias and standard error. In summary, the bias was estimated as and , and the standard error was estimated as and .

On the other hand, theoretical properties of the CDD using beta regression are under development. For example, the exact CDD value of a given bivariate copula is not easy to obtain in general. In the same way, given a CDD value, it may be interesting to construct an asymmetric or symmetric copula function in a parametric form.

4 Results

We applied the neural network approach for autocorrelation estimation described in Section 3.3. Details on the feedforward neural networks and their hyper-parameters used in this paper are presented in Table 3.

thumbnail
Table 3. Details on the feedforward neural networks used for autocorrelation estimation.

L: the number of input nodes, i.e., the lag order, H: the number of nodes in the hidden layer.

https://doi.org/10.1371/journal.pone.0213148.t003

The PM2.5 values predicted by the estimated feedforward neural networks are illustrated in Fig 4, which are for the year 2013. After the prediction, we carried out the Durbin-Watson test for serial correlation based on the residuals. The obtained p-values for the year 2013 were 0.274 (Beijing), 0.686 (Chengdu), 0.275 (Guangzhou), and 0.318 (Shanghai), all of which were larger than the significance level α = 0.05. For the other years, we obtained p-values larger than 0.05 as well.

thumbnail
Fig 4. Predicted values (black straight line) of PM2.5 levels of the four Chinese cities during the year 2013 and the observed PM2.5 measurements (red dotted line).

https://doi.org/10.1371/journal.pone.0213148.g004

Next, we computed the CDDs between each pair of Chinese cities based on the residuals of the daily PM2.5 levels in each year. The results are summarized in Table 4. Contrary to the correlation coefficients presented in Table 2, the CDDs are not symmetric and are computed in both directions, U to V and V to U for each pair (U, V) of the cities. We remark that the CDD values presented in Table 4 are compared with the square of correlation coefficients rather than correlation coefficients themselves since the definition in Eq (3) is a version of the coefficient of determination in multiple linear regression. In addition, we have computed 95% bootstrap confidence intervals for the difference, . We can find that most of the confidence intervals do not include zero, which means that the difference of directional dependences is significantly different from zero.

thumbnail
Table 4. Copula directional dependences (CDDs) between ordered pairs of Chinese cities (BJ: Beijing, CD: Chengdu, GZ: Guangzhou, SH: Shanghai) during the years 2013 to 2017, together with 95% bootstrap confidence intervals (CIs) for the difference of directional dependences.

The larger value between the two directional dependences is marked in bold font.

https://doi.org/10.1371/journal.pone.0213148.t004

The estimated CDD values are visually illustrated in Figs 5 to 9. The labels on the arrows between the four Chinese cities represent the directional dependences between the cities. Several findings from the CDD values are as follows:

  1. The CDD between Beijing (BJ) and Chengdu (CD) was relatively high from 2013 to 2017 except in 2016. The CDD from CD to BJ was higher in 2013, but the CDD from BJ to CD became higher in the later years from 2014 to 2017.
  2. The CDD between Chengdu (CD) and Guangzhou (GZ) was relatively high in the years 2013, 2015, and 2017. Similarly, the CDD from GZ to CD was higher in 2013 and 2014, but the CDD from CD to GZ became higher in the later years from 2015 to 2017.
  3. The CDD between Chengdu (CD) and Shanghai (SH) was relatively high in 2014, in which the CDD from CD to SH was higher than the other direction. The CDD between CD and SH was relatively low in the other years.
  4. The CDD between Beijing (BJ) and Shanghai (SH) was relatively high during 2014, 2015, and 2017, in which the CDD from BJ to SH was higher in 2014, but the CDD from SH to BJ was higher in 2015 and 2017.
  5. The CDD between Beijing (BJ) and Guangzhou (GZ) was relatively high in 2014, 2016, and 2017, in which the CDD from GZ to BJ was higher in 2014, but the CDD from BJ to GZ became higher from 2015 to 2017.
thumbnail
Fig 5. A connectivity map of China based on the directional dependences between the four cities in the year 2013.

Base map figure source: CIA Maps (https://www.cia.gov/library/publications/resources/cia-maps-publications/China.html).

https://doi.org/10.1371/journal.pone.0213148.g005

thumbnail
Fig 6. A connectivity map of China based on the directional dependences between the four cities in the year 2014.

Base map figure source: CIA Maps (https://www.cia.gov/library/publications/resources/cia-maps-publications/China.html).

https://doi.org/10.1371/journal.pone.0213148.g006

thumbnail
Fig 7. A connectivity map of China based on the directional dependences between the four cities in the year 2015.

Base map figure source: CIA Maps (https://www.cia.gov/library/publications/resources/cia-maps-publications/China.html).

https://doi.org/10.1371/journal.pone.0213148.g007

thumbnail
Fig 8. A connectivity map of China based on the directional dependences between the four cities in the year 2016.

Base map figure source: CIA Maps (https://www.cia.gov/library/publications/resources/cia-maps-publications/China.html).

https://doi.org/10.1371/journal.pone.0213148.g008

thumbnail
Fig 9. A connectivity map of China based on the directional dependences between the four cities in the year 2017.

Base map figure source: CIA Maps (https://www.cia.gov/library/publications/resources/cia-maps-publications/China.html).

https://doi.org/10.1371/journal.pone.0213148.g009

The findings described above are overall consistent with the results obtained from the correlation coefficients in Table 2. Contrary to the correlation coefficients, however, the CDDs could infer directionality between each pair of cities and its structure change over time.

5 Discussion and conclusions

In this work, we presented a statistical measure of directional dependence which is called copula directional dependence (CDD). We obtained the PM2.5 concentration level data for the four major Chinese cities, Beijing, Chengdu, Guangzhou, and Shanghai, from January 1, 2013 to June 30, 2017. We preprocessed the data by using feedforward neural networks with lagged inputs, and we statistically analyzed directional dependence between the cities based on the proposed CDDs. An R source code for preprocessing and analyzing the PM2.5 data in this study is available at S1 Appendix.

Particulate matter consisting of PM2.5 can be generated from various sources, and its concentration can be changed by numerous factors such as wind direction, wind speed, seasonal change, weather, and industrialization [8, 18]. On the other hand, the proposed CDD can infer dependence structures between the major Chinese cities conveniently without requiring complicated models involving the diverse economic or environmental factors in a meteorological system. The inferred CDD structure can characterize economic and environmental relationships between Chinese cities, which may indirectly reflect interrelationships between underlying hidden factor variables.

In this study, we applied CDD using bivariate copulas. A multivariate generalization can be computationally challenging due to the large number of combinations of the variables. Kim and Jung [74] extended the bivariate copula-based directional dependence to a multivariate version by combining it with time-varying partial correlations. It is necessary to develop computationally efficient methods further in future works.

Based on the directional dependence structures obtained by the CDDs, we could visually present the interactions between the four Chinese cities over different time periods. Moreover, we could provide levels of confidence on the directionality of the dependences between the cities by computing bootstrap confidence intervals.

We remark that the CDD values need to be interpreted with caution. The CDD values range between 0 and 1, and a larger value implies a stronger correlation between the variables. However, since the CDD is a version of the coefficient of determination in multiple linear regression according to the definition in Eq (3), it should be expressed with higher precision, e.g., in three to four digits as in Table 4, than the Pearson and Spearman correlation coefficients. Note that the methodology presented in this paper consists of several statistical methods such as copulas, beta regression, and ANNs, each of which can be a source of uncertainty that affects the CDD values. Therefore, statistical decisions made by the bootstrap confidence interval are not always reliable, especially when the CDD values are relatively small. An alternative method for a reliable decision is to introduce a multiple comparision procedure to determine significant nonzero CDD values; see, e.g., [75, 76].

It is remarkable that we could see a structural change in the CDD networks, especially in the directions of the edges of the networks, between the period 2013-2014 and the period 2015-2017. Such a structural change indicates a certain dramatic shift in the states of the underlying latent factors affecting PM2.5 levels around 2014. Trends in air quality in Chinese major cities were analyzed in [17] and it was found that Beijing experienced decreased PM2.5 from 2013 to 2015. Overall, Beijing, Chengdu, and Guangzhou experienced improvements in air quality in the recent decade. Note that China released an ambient air quality standard, GB 3095-2012, in 2012 [14], and key cities including the four major cities analyzed in this study were required to implement the standard. In September 2013, the State Council of China issued the Air Pollution Prevention and Control Action Plan (APPCAP). Implementation of the APPCAP led to a control of ambient air pollution and substantial reductions in mortality and years-of-life-lost (YLLs) [77]. We can guess that continuous efforts in China have resulted in dramatic changes in air quality and the constituents of air pollutants. Further studies on the nature of the changes in air quality are necessary in order to make better public health policies. The proposed CDD can be effectively applied to any further analysis in general settings or for a more detailed analysis of air pollution.

Supporting information

S1 Appendix. Source code.

An R source code for preprocessing and analyzing the PM2.5 data in this study.

https://doi.org/10.1371/journal.pone.0213148.s001

(PDF)

References

  1. 1. Dockery DW, Pope CA, Xu X, Spengler JD, Ware JH, Fay ME, et al. An Association between Air Pollution and Mortality in Six U.S. Cities. New England Journal of Medicine. 1993;329(24):1753–1759. pmid:8179653
  2. 2. Pope CA, Burnett RT, Thun MJ, Calle EE, Krewski D, Ito K, et al. Lung cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution. JAMA. 2002;287(9):1132–1141. pmid:11879110
  3. 3. U S EPA. Integrated Science Assessment (ISA) for Particulate Matter (Final Report). Washington, DC: U.S. Environmental Protection Agency; 2009.
  4. 4. Zanobetti A, Schwartz J. The effect of fine and coarse particulate air pollution on mortality: A national analysis. Environmental Health Perspectives. 2009;117(6):898–903. pmid:19590680
  5. 5. Cifuentes LA, Vega J, Köpfer K, Lave LB. Effect of the Fine Fraction of Particulate Matter versus the Coarse Mass and Other Pollutants on Daily Mortality in Santiago, Chile. Journal of the Air & Waste Management Association. 2000;50(8):1287–1298.
  6. 6. Schwartz J, Dockery DW, Neas LM. Is daily mortality associated specifically with fine particles? Journal of the Air & Waste Management Association. 1996;46(10):927–939.
  7. 7. Yang G, Wang Y, Zeng Y, Gao GF, Liang X, Zhou M, et al. Rapid health transition in China, 1990–2010: findings from the Global Burden of Disease Study 2010. The Lancet. 2013;381(9882):1987–2015.
  8. 8. Rohde RA, Muller RA. Air Pollution in China: Mapping of Concentrations and Sources. PLoS ONE. 2015;10(8):e0135749. pmid:26291610
  9. 9. Eatough Jones M, Paine TD. Detecting changes in insect herbivore communities along a pollution gradient. Environmental Pollution. 2006;143(3):377–387. pmid:16459003
  10. 10. Tao J, Zhang L, Ho K, Zhang R, Lin Z, Zhang Z, et al. Impact of PM2.5 Chemical Compositions on Aerosol Light Scattering in Guangzhou—The Largest Megacity in South China. Atmospheric Research. 2014;135–136:48–58.
  11. 11. Zhao X, Zhang X, Pu W, Meng W, Xu X. Scattering properties of the atmospheric aerosol in Beijing, China. Atmospheric Research. 2011;101(3):799–808.
  12. 12. World Bank. Cost of pollution in China: economic estimates of physical damages (English); 2007.
  13. 13. Environment Protection Office. Ambient air quality standard; 1982.
  14. 14. Ministry of Environment Protection. Ambient air quality standard; 2012.
  15. 15. Zhao B, Su Y, He S, Zhong M, Cui G. Evolution and comparative assessment of ambient air quality standards in China. Journal of Integrative Environmental Sciences. 2016;13(2–4):85–102.
  16. 16. Hu J, Ying Q, Wang Y, Zhang H. Characterizing multi-pollutant air pollution in China: Comparison of three air quality indices. Environment International. 2015;84:17–25. pmid:26197060
  17. 17. Lowsen DH, Conway GA. Air Pollution in Major Chinese Cities: Some Progress, But Much More to Do. Journal of Environmental Protection. 2016;7(13):2081–2094. pmid:28845334
  18. 18. Lv B, Liu Y, Yu P, Zhang B, Bai Y. Characterizations of PM2.5 Pollution Pathways and Sources Analysis in Four Large Cities in China. Aerosol and Air Quality Research. 2015;15:1836–1843.
  19. 19. Pearl J. Causality: Models, Reasoning, and Inference. Cambridge University Press; 2000.
  20. 20. Spirtes P. Introduction to Causal Inference. Journal of Machine Learning Research. 2010;11:1643–1662.
  21. 21. Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search. Adaptive Computation and Machine Learning. The MIT Press; 2000.
  22. 22. Dodge Y, Rousson V. Direction dependence in a regression line. Communications in Statistics—Theory and Methods. 2000;29(9-10):1957–1972.
  23. 23. Dodge Y, Rousson V. On Asymmetric Properties of the Correlation Coeffcient in the Regression Setting. The American Statistician. 2001;55(1):51–54.
  24. 24. Sungur EA. A note on directional dependence in regression setting. Communications in Statistics—Theory and Methods. 2006;34(9-10):1957–1965.
  25. 25. Kano Y, Shimizu S. Causal inference using nonnormality. In: Higuchi T, Iba Y, Ishiguro M, editors. Proceedings of the International Symposium on Science of Modeling—The 30th Anniversary of the Information Criterion. Tokyo, Japan; 2003. p. 261–270.
  26. 26. Shimizu S, Hoyer PO, Hyvärinen A, Kerminen AJ. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research. 2006;7:2003–2030.
  27. 27. Sun X, Janzing D, Schölkopf B. Causal Inference by Choosing Graphs with Most Plausible Markov Kernels. In: Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics. Fort Lauderdale, FL; 2006. p. 1–11.
  28. 28. Hoyer P, Janzing D, Mooij J, Peters J, Schölkopf B. Nonlinear causal discovery with additive noise models. In: Advances in Neural Information Processing Systems (NIPS). Vancouver, Canada: MIT Press; 2009. p. 689–696.
  29. 29. Mooij J, Janzing D, Peters J, Schölkopf B. Regression by dependence minimization and its application to causal inference. In: Proceedings of the 26th International Conference on Machine Learning (ICML). Montreal: Omnipress; 2009. p. 745–752.
  30. 30. Zhang K, Hyvärinen A. On the identifiability of the post-nonlinear causal model. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. Arlington, Virginia, United States: AUAI Press; 2009. p. 647–655.
  31. 31. Blöbaum P, Janzing D, Washio T, Shimizu S, Schölkopf B. Cause-effect inference by comparing regression errors. In: Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS); 2018. p. 900–909.
  32. 32. Kim JM, Hwang SY. Directional dependence via Gaussian copula beta regression model with asymmetric GARCH marginals. Communications in Statistics—Simulation and Computation. 2017;46(10):7639–7653.
  33. 33. Kim JM, Hwang SY. The copula directional dependence by stochastic volatility models. Communications in Statistics—Simulation and Computation. 2018; p. 1–23.
  34. 34. Cherubini U, Luciano E, Vecchiato W. Copula Methods in Finance. New York: John Wiley & Sons; 2004.
  35. 35. Cherubini U, Mulinacci S, Gobbi F, Romagnoli S. Dynamic Copula Methods in Finance. Chichester, UK: John Wiley & Sons; 2012.
  36. 36. Li M, Boehnke M, Abecasis GR, Song PXK. Quantitative Trait Linkage Analysis Using Gaussian Copulas. Genetics. 2006;173:2317–2327. pmid:16751671
  37. 37. Kim JM, Jung YS, Sungur EA, Han KH, Park C, Sohn I. A copula method for modeling directional dependence of genes. BMC Bioinformatics. 2008;9:225. pmid:18447957
  38. 38. Song PXK. Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics. 2000;27:305–320.
  39. 39. Pitt M, Chan D, Kohn R. Efficient Bayesian inference for Gaussian copula regression models. Biometrika. 2006;93:537–554.
  40. 40. Song PXK, Li M, Yuan Y. Joint Regression Analysis of Correlated Data Using Gaussian Copulas. Biometrics. 2009;65:60–68. pmid:18510653
  41. 41. Masarotto G, Varin C. Gaussian copula marginal regression. Electronic Journal of Statistics. 2012;6:1517–1549.
  42. 42. Ferrari SLP, Cribari-Neto F. Beta regression for modelling rates and proportions. Journal of Applied Statistics. 2004;31:799–815.
  43. 43. Cribari-Neto F, Zeileis A. Beta Regression in R. Journal of Statistical Software. 2010;34(2):1–24.
  44. 44. Casarin R, Valle LD, Leisen F. Bayesian Model Selection for Beta Autoregressive Processes. Bayesian Analalysis. 2012;7(2):385–410.
  45. 45. Figueroa-Zuniga JI, Arellano-Valle RB, Ferrari SLP. Mixed beta regression: A Bayesian perspective. Computational Statistics & Data Analysis. 2013;61:137–147.
  46. 46. Casarin R, Leisen F, Molina G, ter Horst E. A Bayesian Beta Markov Random Field Calibration of the Term Structure of Implied Risk Neutral Densities. Bayesian Analalysis. 2015;10(4):791–819.
  47. 47. Guolo A, Varin C. Beta regression for time series analysis of bounded data, with application to Canada Google flu trends. The Annals of Applied Statistics. 2014;8(1):74–88.
  48. 48. Ma Z, Hu X, Huang L, Bi J, Liu Y. Estimating Ground-Level PM2.5 in China Using Satellite Remote Sensing. Environmental Science & Technology. 2014;48(13):7436–7444.
  49. 49. Wang J, Xu X, Spurr R, Wang Y, Drury E. Improved algorithm for MODIS satellite retrievals of aerosol optical thickness over land in dusty atmosphere: Implications for air quality monitoring in China. Remote Sensing of Environment. 2010;114(11):2575–2583.
  50. 50. Zhang Q, Geng GN, Wang SW, Richter A, He KB. Satellite remote sensing of changes in NOx emissions over China during 1996-2010. Chinese Science Bulletin. 2012;57(22):2857–2864.
  51. 51. Liu XH, Zhang Y, Cheng SH, Xing J, Zhang Q, Streets DG, et al. Understanding of regional air pollution over China using CMAQ, part I performance evaluation and seasonal variation. Atmospheric Environment. 2010;44(20):2415–2426.
  52. 52. Nelsen RB. Properties and applications of copulas: A brief survey. In: Dhaene J, Kolev N, Morettin PA, editors. Proceedings of the First Brazilian Conference on Statistical Modeling in Insurance and Finance. São Paulo: University Press USP; 2003. p. 10–28.
  53. 53. Nelsen RB. An Introduction to Copulas. 2nd ed. New York: Springer; 2006.
  54. 54. Sklar A. Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de statistique de l’Université de Paris. 1959;8:229–231.
  55. 55. Sklar A. Random variables, joint distribution functions, and copulas. Kybernetika. 1973;9(6):449–460.
  56. 56. Cuadras CM. On the Covariance between Functions. Journal of Multivariate Analysis. 2002;81(1):19–27. https://doi.org/10.1006/jmva.2001.2000.
  57. 57. Fredricks GA, Nelsen RB. On the relationship between Spearman’s rho and Kendall’s tau for pairs of continuous random variables. Journal of Statistical Planning and Inference. 2007;137(7):2143–2150. https://doi.org/10.1016/j.jspi.2006.06.045.
  58. 58. Schweizer B, Wolff EF. On Nonparametric Measures of Dependence for Random Variables. The Annals of Statistics. 1981;9(4):879–885.
  59. 59. Durante F. Construction of non-exchangeable bivariate distribution functions. Statistics Papers. 2009;50(2):383–391.
  60. 60. Kim D, Kim JM. Analysis of Directional Dependence using Asymmetric Copula-based Regression Models. Journal of Statistical Computation and Simulation. 2014;84(9):1990–2010.
  61. 61. Kojadinovic I, Yan J. Modeling Multivariate Distributions with Continuous Margins Using the copula R Package. Journal of Statistical Software. 2010;34(9):1–20.
  62. 62. Bollerslev T. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics. 1986;31(3):307–327.
  63. 63. Pérez-Muñuzuri V, Gelpi IR. Application of nonlinear forecasting techniques for meteorological modeling. Annales Geophysicae. 2000;18(10):1349–1359.
  64. 64. Abarbanel HDI. Analysis of Observed Chaotic Data. Institute for Nonlinear Science. New York: Springer-Verlag; 1996.
  65. 65. Kantz H, Schreiber T. Nonlinear Time Series Analysis. 2nd ed. Cambridge University Press; 2003.
  66. 66. Box GEP, Jenkins GM, Reinsel GC, Ljung GM. Time Series Analysis: Forecasting and Control. 5th ed. Wiley Series in Probability and Statistics. Wiley; 2015.
  67. 67. De Gooijer JG, Hyndman RJ. 25 years of time series forecasting. International Journal of Forecasting. 2006;22(3):443–473.
  68. 68. Ben Taieb S, Bontempi G, Atiya AF, Sorjamaa A. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Systems with Applications. 2012;39(8):7067–7083.
  69. 69. Lapedes A, Farber R. Nonlinear signal processing using neural networks: Prediction and system modelling. Los Alamos, NM: Los Alamos National Laboratory; 1987.
  70. 70. Zhang G, Patuwo BE, Hu MY. Forecasting with artificial neural networks:: The state of the art. International Journal of Forecasting. 1998;14(1):35–62.
  71. 71. Zhang GP, Patuwo BE, Hu MY. A simulation study of artificial neural networks for nonlinear time-series forecasting. Computers & Operations Research. 2001;28(4):381–396.
  72. 72. Liang F. Bayesian neural networks for nonlinear time series forecasting. Statistics and Computing. 2005;15(1):13–29.
  73. 73. Hyndman R, Athanasopoulos G, Bergmeir C, Caceres G, Chhay L, O’Hara-Wild M, et al. R Package: forecast; 2018. http://cran.r-project.org/web/packages/forecast/index.html.
  74. 74. Kim JM, Jung H. Directional time-varying partial correlation with the Gaussian copula-DCC-GARCH model. Applied Economics. 2018;50(41):4418–4426.
  75. 75. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1995;57(1):289–300.
  76. 76. Benjamini Y. Discovering the false discovery rate. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2010;72(4):405–416.
  77. 77. Huang J, Pan X, Guo X, Li G. Health impact of China’s Air Pollution Prevention and Control Action Plan: an analysis of national air quality monitoring and mortality data. Lancet. 2018;2(7):e313–e323. pmid:30074894