Introduction

In recent years, air pollution has become a more and more serious problem around the world. The new air quality model presented by the World Health Organization in 2016 confirmed that 92% of the world’s population lives in areas where air quality levels exceed their limits1. Fortunately, more and more governments have realized the importance of managing air pollution and some actions have been placed. Nowadays, a common topic around the world is the governance of the air pollution source such as smog (the main ingredient is fine particulate matter). Many researchers have been involved in the study on the cause and propagation of smog2,3,4,5,6,7,8,9. Modern statistic methods provide some new perspectives to assess smog trends and propagation characteristics. Among them, most studies have focused on studying the correlations among various air pollution indicators including air pollution index (API), air quality index (AQI), fine particulate matter of PM2.5 (diameter ≤ 2.5 μm) concentrations, and PM10 (diameter ≤ 10 μm) concentrations, and very limited studies considered the correlations among neighboring areas. A common sense is that smog produced at one source place can spread to surrounding areas6,7,8,9,10. Therefore, it is more practical to explore the dependence of air pollution indicators among adjacent cities as it helps assess the causes of local smog and its spread behavior. It has been found by a newly proposed time-lagged cross-correlation coefficient in ref.10 that there are different degrees of correlation for PM2.5 series between four neighboring cities in Northern China. However, what has not been investigated is how the PM2.5 series of one city depends on those of the neighbouring cities. In this work, we will develop a detrended fluctuation analysis (DFA)-based bivariate regression model to investigate this dependence.

The simplest and maturest method to describe the dependence of variables is the linear regression. However, the information gained from the traditional linear regression cannot fully meet our need of investigation on the dependence among different variables at different time periods. On the other hand, note that the DFA proposed in 1990s11,12 performs excellently in analyzing the long-range correlations13 of a nonstationary series with fractality and multifractality14,15 at different time-scales. To obtain the cross-correlation between two nonstationary series, DFA was extended to the detrended cross-correlation analysis (DCCA)16. By defining scale-dependent detrended fluctuation functions, the methods of DFA and DCCA together with their extensions have been applied in a wide range of disciplines17,18,19,20,21,22,23,24,25,26,27,28,29,30. Since the ordinary least squares (OLS) method expresses the estimated parameters of standard regression framework as a form of variances and covariances, it builds a bridge between the regression framework and the family of DFA/DCCA as the latter can also produce variances and covariances. Then, the idea of estimating multiple time scale regression coefficients can be achieved by the DFA/DCCA. Recently, Kristoufek31 constructed a simple DFA-based regression framework exactly by this bridge. The selected examples show the relationship between the pair of variables varies strongly across scales.

In this work, we focus on the interaction of PM2.5 series of three adjacent cities in Northern China, namely, Beijing, Tianjin, and Baoding. The three cities form a triangular shape in the map. The distances between Beijing and Tianjin, Beijing and Baoding, and Tianjin and Baoding are about 115 km, 140 km, and 150 km, respectively. All three cities have a population of more than 10 million and have been greatly affected by heavy smog in recent years. The real-time data of PM2.5 series of these three cities from December 1, 2013 to November 30, 2016 are chosen for our consideration, which are taken from the Ministry of Environmental Protection of the People’s Republic of China (http://datacenter.mep.gov.cn). The original data show an obvious periodic characteristic and roughly similar trends among the three cities, which imply that there is a possible relevance between per two cities of them. To verify that, the partial correlation technique is employed to get the intrinsic relations between two cities by deleting the interference from the third variable. Four seasons, classified as winter (December, January, and February), spring (March, April, and May), summer (June, July, and August), and fall (September, October, and November), are considered. The results are listed in Table 1.

Table 1 Partial correlation coefficients and t-statistics between per two cities of Beijing, Tianjin, and Baoding in four seasons.

In Table 1, we also list the t-statistics (\(t={r}_{12,3}\sqrt{\tfrac{N-3}{1-{r}_{12,3}^{2}}}\), where r12,3 denotes the partial correlation coefficient between the first and second variables eliminating the effects of the third one, N − 3 is the degree of freedom) of the partial correlation coefficients to assess the statistical significance at the given significance level. Unsurprisingly, Table 1 shows that the correlations of PM2.5 between per two cities are of statistical significance. It explains that the air quality in one city of Northern China cannot be irrelevant to that of its neighbouring cities, which implies potential dependence among the three cities. However, we also note in Table 1 that the degrees of relevance are different among different cities and in different seasons though all of them are significant.

To fully detect and quantify the dependence among the PM2.5 series of the above-mentioned three cities, in this work, we construct a new bivariate regression framework which prevails the DFA method and allows us to investigate the dependence of three nonstationary series with multiple time scales. With the DFA-based variance instead of the standard variance, this new DFA bivariate regression model provides more information on the dependence among variables at different time scales. We organize the rest of this paper as follows. The performance of the proposed DFA regression model and the results on the application to PM2.5 series analysis are reported and discussed in the following section, which is followed by our conclusions. The methodologies including the standard regression method, the DFA method, and the DFA-based regression method are introduced at the end of this paper.

Results and Discussions

Performance of DFA estimators

The bivariate DFA-based regression model produces two time scale-based regression coefficients. This allows us to detect the dependence of a response variable and two independent variables at different time scales. In order to examine the validity of the model and show its advantages, in this section, we perform two numerical tests on the non-stationary bivariate regression frameworks Y = β0 + β1X1 + β2X2 + ε.

In the first test, we investigate the performance of the DFA estimators under different levels of long-term dependence in X1, X2, and Y. According to31, the setting I is given as below: two artificial series X1 and X2 with length 10000 are generated by ARFIMA(0, d, 0) process with identical fractional integration parameter (d) and independent Gaussian noises (ξ i (t), i = 1 and 2) as \({X}_{i}(t)={\sum }_{n=0}^{\infty }\,{a}_{n}(d){\xi }_{i}(t-n)\). The quantity a n (d) is defined by a n (d) = Γ(n − d)/[Γ(−d)Γ(n + 1)], where Γ() is the Gamma function. The error-term ε is set as a standard Gaussian noise so that the response variable Y has the same parameter d as the two independent variables. The regression coefficients are set as β0 = β1 = 1 and β2 = 2. Figure 1 shows mean values and standard deviation of the two DFA estimators \({{\hat{\beta }}_{i}}^{DFA}\) (i = 1 and 2) for the generated series with d ranging from −0.5 to 0.5 (at the step size of 0.1). The estimators are averaged over scales between 10 and 1000 with a logarithmic isometric step. Each case is run 1000 times to eliminate the noise interference. It is clear that the two estimators locate the two given regression coefficients of 1 (Fig. 1a) and 2 (Fig. 1b) unbiasedly, and are independent of the value of d. In addition, the standard deviations of both estimators decrease with the increasing memory. The good performance shows that the method is feasible. On the other hand, to investigate the performance of the DFA estimators faced with a long-range dependent error-term ε, we use setting II given as: the memory parameter d is fixed at 0.4 for both X1 and X2, and the ε is produced by an ARFIMA process with d ε varying from −0.5 to 0.5. Other settings are as those in setting I. Figure 2 records similar information as that in Fig. 1. Although the fluctuation of DFA estimators increases with d ε , which is expected due to an increasing weight of the error-term in the dynamics of Y with the increasing memory of the error-term, we are satisfied to find that the two estimators are still unbiased pointing to the given values with a narrow range for each level of memory of the error-terms.

Figure 1
figure 1

Estimated two DFA Regression coefficients with setting I of ARFIMA model. Mean values of the DFA estimators and standard deviation are shown as solid line (left axis) and dashed line (right axis), respectively. X1 and X2 are two independent variables generated by ARFIMA model with the same changing fractional integration parameter (x-axis) and independent Gaussian noise. Y = 1 + X1 + 2X2 + ε, ε is a standard Gaussian noise error-term. Both of them are of length 10000 and repeated 1000 times. (a) is the result for the estimated \({{\hat{\beta }}_{1}}^{DFA}\) and (b) is for the estimated \({{\hat{\beta }}_{2}}^{DFA}\). The DFA estimators β1 and β2 are unbiased at 1 and 2 with the error-term ±0.002 and ±0.005, respectively, and their standard deviations decrease with the memory strength.

Figure 2
figure 2

Estimated two DFA Regression coefficients with setting II of ARFIMA model with the same legend as in Fig. 1. X1 and X2 are generated by ARFIMA series with the same fixed d = 0.4. ε is an ARFIMA process with changing parameter d ε (x-axis). The remaining settings are the same as those in case I. Results show that the DFA estimators β1 and β2 are unbiased at 1 and 2 with the error-term ±0.002 and ±0.003, respectively, while the standard deviations increase with the error-term memory strength.

Our second numerical test aims to show that the DFA estimators are able to identify the dependence of studied variables at different time scales whereas the classical method cannot. To this end, a binomial multifractal series (BMFs) is employed to be regarded as the independent variable X1, which is constructed as X1 = pnn[k−1](1 − p)n[k−1], k = 1, 2, …, 2n, where the parameter p (0, 0.5) (We take p = 0.3 in our test), n[k] denotes the number of digit 1 in the binary representation of the index k. The variable X2 is a Gauss variable with 0 mean and 0.0001 standard deviation. Both X1 and X2 are of length 215. The bivariate regression framework Y = β0 + β1X1 + β2X2 + ε is set with the same coefficients as the first test (β0 = β1 = 1 and β2 = 2). The error-term ε is the Gauss noise of the same strength as X2. For the BMFs X1, we remove all values smaller than 0.00001 so that only a few of the largest elements are left. In their places, we substitute Gaussian distributed random numbers with 0 mean and 0.0001 standard deviation. Then we obtain a binomial cascade series embedded in random noise. We analyze the dependence between the response variable Y and two independent variables and find that the estimated \({{\hat{\beta }}_{2}}^{DFA}\) is unbiased at 2 with a desirable error bar for every time scale, as shown in Fig. 3. However, the performance of \({{\hat{\beta }}_{1}}^{DFA}\) has changed a lot. The dependence between Y and X1 is obviously less than the given value at the smaller scales contrary to the larger ones. This is because in the smaller scales, the dependency has been destroyed by the random noise. Our DFA estimators have the capability to recognize this effect while the classical estimators fail to do so (see the errorbar with circle symbol in Fig. 3).

Figure 3
figure 3

Estimated two DFA Regression coefficients with BMFs model.

Performance of the three models’ regression coefficients

As mentioned above, air pollution in Northern China is very serious in recent years. Fine particulate matter from industrial exhaust and smoke dust forms smog to fill in the air. We now apply our DFA regression model to investigate the dependence of PM2.5 series in these three cities. We build three bivariate models for Beijing, Tianjin, and Baoding, respectively. In Model I, the dependent variable (Y) is the PM2.5 series of Beijing while the two independent variables are the PM2.5 series of Tianjin (X1) and Baoding (X2); in Model II, Y is the PM2.5 series of Tianjin, X1 is the PM2.5 series of Beijing and X2 is the PM2.5 series of Baoding; in Model III, Y is the PM2.5 series of Baoding, X1 and X2 stand for the PM2.5 series in Beijing and Tianjin, respectively. In this section, we first show the performance of the regression coefficients at different scales in the three models and then make two statistical tests for the two regression coefficients in each model. Some evaluations for the DFA-based regression and the standard regression are conducted at the end of this section.

The two regression coefficient estimators together with their standard deviations of the three models are sketched in Figs 46, respectively. As expected, the effect is obviously positive. However, a strong variation across scales is found in different seasons. More specifically,

  1. (a)

    In the Beijing’s model, Tianjin (X1) has strongly positive effect in every season, especially for the larger time scales. On the contrary, Baoding (X2) has different effects on Beijing. Compared to spring and summer, the effect is quite weak in the other two seasons, especially in winter, \({{\hat{\beta }}_{2}}^{DFA}(n)\) is nearly 0 when the scale is more than 800 hours.

  2. (b)

    In the Tianjin’s model, Baoding (X2) presents more unstable effect at different scales. Particularly in summer, \({{\hat{\beta }}_{2}}^{DFA}(n)\) is close to 0 from the smaller scale to the larger scale at about 50 days (1200 hours), which implies that the positive correlation between Tianjin and Baoding can last less than 50 days. In addition, the two coefficients are less than 0.5 in most days, which indicates that Beijing and Baoding have little impact on the PM2.5 in Tianjin.

  3. (c)

    For the model of Baoding, the effect of Tianjin (X2) to Baoding is similar to that of Baoding to Tianjin in model II. However, the fact that after approximately 17 days (408 hours) the effect reaches the value greater than 1 indicates that an increase of 1 unit PM2.5 concentration of Tianjin will lead to the increase of more than 1 unit PM2.5 concentration in Baoding. In this regard, Tianjin has more impact on Baoding. In addition, the narrow confidence intervals and low standard deviations (less than 0.02) shown in all sub-plots suggest satisfied reliability of the estimates.

Figure 4
figure 4

Bivariate DFA regression of Beijing. Main planes of subplots (ad) show estimated DFA regression coefficients β1(n) and β2(n) of winter, spring, summer, and fall, respectively. Gray zones denote 95% confidence intervals. Inserts are standard deviations of \({\hat{\beta }}_{1}^{DFA}(n)\) and \({\hat{\beta }}_{2}^{DFA}(n)\). Subscripts 1 and 2 denote Tianjin and Baoding, respectively.

Figure 5
figure 5

Bivariate DFA regression of Tianjin with the same legend as in Fig. 4. Here, subscripts 1 and 2 denote Beijing and Baoding, respectively.

Figure 6
figure 6

Bivariate DFA regression of Baoding with the same legend as in Fig. 4. Here, subscripts 1 and 2 denote Beijing and Tianjin, respectively.

Statistic significance tests of regression coefficients

As mentioned above, the estimated \({\hat{\beta }}^{DFA}(n)\) is able to theoretically describe the dependence between the impulse variables and the response variables at different time scales. In theory, as long as \({\hat{\beta }}_{j}^{DFA}(n)\) is not equal to zero, the independent variable X j will affect Y. However, for finite time series, \({\hat{\beta }}_{j}^{DFA}(n)\) is not always equal to 0 even in the absence of relationship between X j and Y due to the size limitation. Therefore, we perform a hypothesis test for the estimated \({\hat{\beta }}^{DFA}(n)\) to ensure the significance. The standard regression analysis provides a so-called t statistic defined as \({t}_{j}=\frac{{\hat{\beta }}_{j}-{\beta }_{j}}{\sqrt{{var}({\hat{\beta }}_{j})}}\) (j = 1, 2) for this purpose. We have \({t}_{j}\sim t(N-3)\) for the bivariate regression model as \({\hat{\beta }}_{j}\sim N({\beta }_{j},{var}({\hat{\beta }}_{j}))\). In general, if |t j | > t1−α/2(N − 3) with a given α, we should reject the null hypothesis of β j  = 0 and the dependence between X j and Y is considered to be statistically significant. However, since lots of time scales are taken accounted in the DFA regression model, using a single critical value of t1−α/2(N − 3) is inappropriate. A correct way is to generate a critical value tc(n) for each time scale. To this end, inspired by the idea proposed by Podobnik et al.32, we shuffle the considered PM2.5 series and repeat the DFA regression calculations for 10,000 times. Then let the integral of probability distribution function (PDF) from −tc(n) to tc(n) be equal to 1 − α (here, we take α = 0.01). As an example, we show the PDF of tc(n) with five given n’s produced by the shuffled PM2.5 series of fall in Fig. 7.

Figure 7
figure 7

PDF of critical points t- statistics critical values at different scales for the statistical test with 10000 times of the shuffled PM2.5 series of fall.

As expected, the symmetrical PDF of tc(n) converges to a Gaussian distribution according to the central limit theorem. In addition, the critical value increases as n increases. This implies that large time scale may strengthen dependence between two variables. By using tc(n), we can determine whether the dependence between the impulse variable and the response variable is significant or not. In practice, the dependence between X j and Y is present when \({t}_{j}(n)(=\tfrac{{\hat{\beta }}_{j}^{DFA}(n)-{\beta }_{j}}{\sqrt{{var}({\hat{\beta }}_{j}^{DFA}(n))}})\) is larger than tc(n). For the four seasons, the scale-dependent t-statistics of regression coefficient together with the scale-dependent critical value tc(n) are presented in Fig. 8.

Figure 8
figure 8

t-statistical test of the estimated DFA-based bivariate regression coefficients. (ad) Are for winter, spring, summer, and fall, respectively. The dashed line represents the tc(n) with 0.01 significant levels. Above this line means to decline the null hypothesis β j  = 0.

Note that in Model I (for Beijing), the t(n)-statistics of \({{\hat{\beta }}_{1}}^{DFA}(n)\) (Tianjin’s coefficient) is equal to that of \({{\hat{\beta }}_{1}}^{DFA}(n)\) (Beijing’s coefficient) in Model II (for Tianjin), the t(n)-statistics of \({{\hat{\beta }}_{2}}^{DFA}(n)\) (Baoding’s coefficient) is equal to that of \({{\hat{\beta }}_{1}}^{DFA}(n)\) (Beijing’s coefficient) in Model III (for Baoding), and in Model II, the t(n)-statistics of Baoding’s coefficient \({{\hat{\beta }}_{2}}^{DFA}(n)\) is equal to that of Tianjin’s coefficient \({{\hat{\beta }}_{2}}^{DFA}(n)\) in Model III (for Baoding). Here the three colored lines with different symbols represent the t(n)-statistics between each per two cities while the black dashed line stands for tc(n).

The partial DCCA coefficient ρ PDCCA (n) is recently developed to uncover the intrinsic relation for two nonstationary series at different time scales. We also calculate the partial DCCA coefficients ρ PDCCA (n) of Beijing and Tianjin, Beijing and Baoding, and Tianjin and Baoding, respectively, and present the results in Fig. 9. For the same purpose of testing the statistical significance, we also produce a critical value for the four seasons. Similarly, the PM2.5 data are shuffled 10,000 times in the PDCCA calculations repeatedly, and thus \({\rho }_{PDCCA}^{c}(n)\) for 99% confidence level is obtained, which is also shown in Fig. 9.

Figure 9
figure 9

Statistical test of DPCCA coefficients among the three cities. (ad) Are for winter, spring, summer, and fall, respectively. The dashed line represents the critical value of ρ DPCCA which is obtained from 10000 times Monte-Carlo simulations with 99% confidence level. Below this line suggests no cross-correlated significance.

Comparing results in Figs 8 and 9 gives amazing similarities, which are also in agreement with the results shown in Figs 46. Based on the results, we can draw the following three main points.

  1. (a)

    The dependence between Beijing and Tianjin (the blue square line) gradually increases with the increasing time scales in all seasons. However, the dependence between the two cities is lower than other cities. This finding uncovers that the reason for the serious air pollution in these two cities are mainly due to their own heavy smog or are impacted by other cities.

  2. (b)

    The dependence between Beijing and Baoding (the green triangle line) is significant in spring, summer, and fall. In winter, however, the dependence disappears at long time scale, which implies that the two cities can only affect each other at relatively short term. Moreover, compared to winter and fall, the dependence is much stronger in spring and summer, especially at long time scales, which indicates that they affect much longer in warm weather.

  3. (c)

    In spring and summer, the t(n)-statistics and ρ PDCCA (n) of Tianjin vs. Baoding (the red circle line) go down through the critical lines of tc(n) and \({\rho }_{PDCCA}^{c}(n)\) at about 800 hours, respectively. This suggests that the dependence between Tianjin and Baoding will disappear when it’s more than one month. However, the exact opposite occurs in winter and fall. In these two seasons, both t(n)-statistic and ρ PDCCA (n) increase with the increasing time scales, which demonstrates that the interaction of bad air quality between the two cities will last longer in cold days.

Evaluations of DFA-based regression model

To evaluate our estimated DFA-based bivariate regression model, we plot the scale-dependent determination coefficient \({R}_{DFA}^{2}(n)\), and the beta coefficient β*DFA(n) and the average elasticity coefficient ηDFA(n) in Fig. 10, Figs 1113, respectively.

Figure 10
figure 10

Determination coefficients of bivariate DFA and standard regression model. (ad) Are for models of Beijing, (eh) are for models of Tianjin, and (il) are for models of Baoding. The solid line denotes \({R}_{DFA}^{2}(n)\) and the dashed line denotes R2.

Figure 11
figure 11

Beta coefficients and elasticity coefficients of bivariate DFA and standard regression model of Beijing. The four columns from left to right are for winter, spring, summer, and fall, respectively. The subscript 1 of β* and η denotes Tianjin and 2 denotes Baoding.

Figure 12
figure 12

Beta coefficients and elasticity coefficients of bivariate DFA and standard regression model of Tianjin with the same legend as in Fig. 11. Here the subscripts 1 and 2 denote Beijing and Baoding, respectively.

Figure 13
figure 13

Beta coefficients and elasticity coefficients of bivariate DFA and standard regression model of Baoding with the same legend as in Fig. 11. Here the subscripts 1 and 2 denote Beijing and Tianjin, respectively.

To show the new model provides more information than the standard regression model does, we also include the three corresponding coefficients of standard bivariate regression model in these figures. As seen from Fig. 10 that \({R}_{DFA}^{2}(n)\) is superior to the standard R2 at most time scales. The good performance illustrates that one will gain richer information in explaining the response variable when using our DFA-based regression model. On the other hand, we can conclude from Figs 1113 that (1) Baoding has more influence than Tianjin on Beijing in all seasons except for winter. (2) Tianjin is more sensitive to Baoding’s changes in air quality than Beijing’s in winter and fall. (3) Tianjin affects Baoding more than Beijing does in winter and fall, but less in the other two seasons. In addition, Figs 1113 illustrate that the standard \({\beta }_{j}^{\ast }\) and η j can be seen as the mean values of the DFA-based \({\beta }_{j}^{\ast DFA}(n)\) and \({\eta }_{j}^{DFA}(n)\), respectively. This means that \({\beta }_{j}^{\ast DFA}(n)\) and \({\eta }_{j}^{DFA}(n)\) are able to measure the dependence degree of the studied independent variable on the dependent variable in all directions. Thus one can access the measurement according to his/her needs. For example, in winter of Model I, we find that the \({\beta }_{2}^{\ast DFA}(n)\) and \({\eta }_{2}^{DFA}(n)\) are larger than \({\beta }_{1}^{\ast DFA}(n)\) and \({\eta }_{1}^{DFA}(n)\), respectively, at smaller scales but much smaller at larger scales, which shows that the sensitivity of Y to X2 (Baoding) is greater than that of Y to X1 (Tianjin) for short term (≤300 hours) but Tianjin is more sensitive to Beijing at the long term. This can help air quality inspectors make the correct analysis for Beijing’s PM2.5 at different periods.

Conclusions

The study of dependence between variables helps expose the causal relationship and correlation of the variables of interest in the real world. The linear regression model is undoubtedly one of the simplest methods among many approaches. However, single variety of regression coefficient and evaluation index cannot show all aspects of the dependence between independent variables and dependent variable. As a meaningful extension, we design a new framework for bivariate regression model using the prevailing DFA method. The proposed bivariate DFA regression model allows us to estimate multi-scale regression coefficients and other corresponding scale-dependent evaluation indicators. It has been shown via two artificial tests that these DFA-based regression coefficients are able to describe the dependence between the response variable and two independent variables exactly; and can capture different dependence at different time scales.

An application of the new model to the study of dependence of PM2.5 series among three heavily air polluted cities in Northern China unveils that huge difference of the dependence exists in per two cities in different seasons and at different periods. Three new indicators of the scale-dependent determination coefficient, the scale-dependent beta coefficient, and the scale-dependent elasticity coefficient are proposed, which turned out to be more practical than those in standard regression models. Three main points can be concluded as (1) Beijing and Baoding have little impact on the PM2.5 in Tianjin while Tianjin takes more impact on Baoding and the air quality of Beijing is more sensitive to the changes in Baoding. (2) In contrast, the air quality in Beijing and Tianjin is not significantly relevant, while the air quality in Tianjin and Baoding has a very significant impact on each other especially in the cold weather. (3) In comparison, the fluctuation of PM2.5 in Baoding has the greatest impact on the other two cities in most days. While Baoding’s air quality is more sensitive to Beijing’s changes in spring and summer, and is more sensitive to Tianjin’s changes in winter and fall. These findings may provide some useful insights on understanding air pollution sources among cities in Northern China.

Methods

The standard bivariate regression model

To study the dependence of air quality among three neighboring cities, we consider a bivariate linear regression model as

$$Y={\beta }_{0}+{\beta }_{1}{X}_{1}+{\beta }_{2}{X}_{2}+\varepsilon ,$$
(1)

where Y is a dependent variable, X1 and X2 are two independent variables, ε is a Gaussian error term with zero mean value, and β j (j = 1, 2) is the partial regression coefficient characterizing the dependence on X j . The most critical work in empirical studies is to estimate β1 and β2. The OLS method gives

$$\{\begin{array}{rcl}{\hat{\beta }}_{1} & = & \tfrac{({\sum }_{t=1}^{N}\,{x}_{1t}\,{y}_{t})\cdot ({\sum }_{t=1}^{N}\,{x}_{2t}^{2})-({\sum }_{t=1}^{N}\,{x}_{2t}\,{y}_{t})\cdot ({\sum }_{t=1}^{N}\,{x}_{1t}\,{x}_{2t})}{({\sum }_{t=1}^{N}\,{x}_{1t}^{2})\cdot ({\sum }_{t=1}^{N}\,{x}_{2t}^{2})-{({\sum }_{y=1}^{N}{x}_{1t}{x}_{2t})}^{2}}\sim \tfrac{\widehat{{\sigma }_{{X}_{1}Y}}\cdot \widehat{{\sigma }_{{X}_{2}}^{2}}-\widehat{{\sigma }_{{X}_{2}Y}}\cdot \widehat{{\sigma }_{{X}_{1}{X}_{2}}}}{\widehat{{\sigma }_{{X}_{1}}^{2}}\cdot \widehat{{\sigma }_{{X}_{2}}^{2}}-{\widehat{{\sigma }_{{X}_{1}{X}_{2}}^{2}}}^{2}},\\ {\hat{\beta }}_{2} & = & \tfrac{({\sum }_{t=1}^{N}\,{x}_{2t}\,{y}_{t})\cdot ({\sum }_{t=1}^{N}\,{x}_{1t}^{2})-({\sum }_{t=1}^{N}\,{x}_{1t}\,{y}_{t})\cdot ({\sum }_{t=1}^{N}\,{x}_{1t}\,{x}_{2t})}{({\sum }_{t=1}^{N}\,{x}_{1t}^{2})\cdot ({\sum }_{t=1}^{N}\,{x}_{2t}^{2})-{({\sum }_{t=1}^{N}{x}_{1t}{x}_{2t})}^{2}}\sim \tfrac{\widehat{{\sigma }_{{X}_{2}Y}}\cdot \widehat{{\sigma }_{{X}_{1}}^{2}}-\widehat{{\sigma }_{{X}_{1}Y}}\cdot \widehat{{\sigma }_{{X}_{1}{X}_{2}}}}{\widehat{{\sigma }_{{X}_{1}}^{2}}\cdot \widehat{{\sigma }_{{X}_{2}}^{2}}-{\widehat{{\sigma }_{{X}_{1}{X}_{2}}^{2}}}^{2}},\end{array}$$
(2)

where 〈·〉 denotes the mean value of the whole time period, x1t = X1t − 〈X1〉, x2t = X2t − 〈X2〉, and y t  = Y t  − 〈Y〉. Then the estimator of residuals can be determined by \({\hat{e}}_{t}={Y}_{t}-{\hat{\beta }}_{1}{X}_{1t}-{\hat{\beta }}_{2}{X}_{2t}-\langle {Y}_{t}-{\hat{\beta }}_{1}{X}_{1t}-{\hat{\beta }}_{2}{X}_{2t}\rangle \). With it one can obtain the estimators of variance of the two regression coefficients as

$$\{\begin{array}{rcl}{var}({\hat{\beta }}_{1}) & = & \tfrac{({\sum }_{t=1}^{N}\,{x}_{2t}^{2})\cdot \tfrac{{\sum }_{t=1}^{N}\,{\hat{e}}_{t}^{2}}{N-3}}{({\sum }_{t=1}^{N}\,{x}_{1t}^{2})\cdot ({\sum }_{t=1}^{N}\,{x}_{2t}^{2})-{({\sum }_{t=1}^{N}{x}_{1t}{x}_{2t})}^{2}}\sim \tfrac{1}{N-3}\cdot \tfrac{\widehat{{\sigma }_{{X}_{2}}^{2}}\cdot \widehat{{\sigma }_{\varepsilon }^{2}}}{\widehat{{\sigma }_{{X}_{1}}^{2}}\cdot \widehat{{\sigma }_{{X}_{2}}^{2}}-{\widehat{{\sigma }_{{X}_{1}{X}_{2}}^{2}}}^{2}},\\ {var}({\hat{\beta }}_{2}) & = & \tfrac{({\sum }_{t=1}^{N}\,{x}_{1t}^{2})\cdot \tfrac{{\sum }_{t=1}^{N}\,{\hat{e}}_{t}^{2}}{N-3}}{({\sum }_{t=1}^{N}\,{x}_{1t}^{2})\cdot ({\sum }_{t=1}^{N}\,{x}_{2t}^{2})-{({\sum }_{t=1}^{N}{x}_{1t}{x}_{2t})}^{2}}\sim \tfrac{1}{N-3}\cdot \tfrac{\widehat{{\sigma }_{{X}_{1}}^{2}}\cdot \widehat{{\sigma }_{\varepsilon }^{2}}}{\widehat{{\sigma }_{{X}_{1}}^{2}}\cdot \widehat{{\sigma }_{{X}_{2}}^{2}}-{\widehat{{\sigma }_{{X}_{1}{X}_{2}}^{2}}}^{2}}.\end{array}$$

The variance illustrates the accuracy of the estimated parameters. The estimated regression coefficients together with their variances can be further employed for hypothesis test and model evaluation. As an important indicator to evaluate the regression model, the determination coefficient R2 is defined by

$${R}^{2}=1-\frac{{\sum }_{t=1}^{N}\,{\widehat{{e}_{t}}}^{2}}{{\sum }_{t=1}^{N}\,{y}_{t}^{2}}=1-\frac{\widehat{{\sigma }_{\varepsilon }^{2}}}{\widehat{{\sigma }_{Y}^{2}}},$$
(3)

with the range of [0, 1]. R2 measures a proportion of variance of Y explained by X1 and X2 and higher value of R2 implies better model interpretation ability. Moreover, to quantify sensitivity of explained variable to each explaining variable, two quantities, namely, the beta coefficient (denoted as \({\beta }_{j}^{\ast }\)) and the average elasticity coefficient (denoted as η j ), are defined

$${\beta }_{j}^{\ast }={\hat{\beta }}_{j}\sqrt{\frac{{\sum }_{t=1}^{N}\,{x}_{jt}^{2}}{{\sum }_{t=1}^{N}\,{y}_{t}^{2}}}\,{\rm{for}}\,j=1\,{\rm{and}}\,2\,$$
(4)

and

$${\eta }_{j}={\hat{\beta }}_{j}\frac{\langle {X}_{j}\rangle }{\langle Y\rangle }\,{\rm{for}}\,j=1\,{\rm{and}}\,2,\,$$
(5)

which can explain the relative importance of variables X1 and X2 to Y. According to31, the advantage of translating the standard notation into variance and covariance shown on the right-hand side of Eqs (2)–(5) is available to use the DFA/DCCA methods based on the same idea.

The DFA-based variance and DCCA-based covariance functions

DFA and DCCA methods are described as follows. For a time series {z t }, t = 1, 2, …, N, we split its profile \({Z}_{t}={\sum }_{i=1}^{t}\,({z}_{i}-\langle z\rangle )\) into N n  = [N/n] nonoverlapping segments of equal length n, denoted as Zj,k, k = 1, 2, …, n. The same procedure is repeated starting from the opposite end to avoid disregarding a short part of the series in the end and thus 2N n segments are obtained altogether. In the jth segment, we have Zj,k = Z(j−1)n+k for j = 1, 2, …, N n and \({Z}_{j,k}={Z}_{N-(j-{N}_{n})n+k}\) for j = N n  + 1, N n  + 2, …, 2N n , where k = 1, 2, …, n. In each segment, the local linear (or other) trend33,34 can be fitted as \(\widehat{{X}_{j,k}}\) (in our work, we use 2nd order polynomial to fit the trend in each segment). Fluctuation function \({f}_{Z}^{2}(n,j)\) is then defined for each segment as

$${f}_{Z}^{2}(n,j)=\frac{1}{n}\,\sum _{k=1}^{n}\,{({Z}_{j,k}-\widehat{{Z}_{j,k}})}^{2}.$$
(6)

Averaging the fluctuation \({f}_{Z}^{2}(n,j)\) over all segments yields

$${F}_{Z}^{2}(n)=\frac{1}{2{N}_{n}}\,\sum _{j=1}^{2{N}_{n}}\,{f}_{Z}^{2}(n,j),$$
(7)

which is the so-called DFA-based scale-dependent variance function. To obtain the scale-dependent covariance of two equal length series {z1t} and {z2t}, t = 1, 2, …, N, we only need to translate the univariate fluctuation function in each segment and average fluctuation into the bivariate case directly,

$${f}_{{Z}_{1}{Z}_{2}}^{2}(n,j)=\frac{1}{n}\,\sum _{k=1}^{n}\,({Z}_{1j,k}-\widehat{{Z}_{1j,k}})\,({Z}_{2j,k}-\widehat{{Z}_{2j,k}}),$$
(8)
$${F}_{{Z}_{1}{Z}_{2}}^{2}(n)=\frac{1}{2{N}_{n}}\,\sum _{j=1}^{2{N}_{n}}\,{f}_{{Z}_{1}{Z}_{2}}^{2}(n,j).$$
(9)

The scale-characteristic fluctuation \({F}_{{Z}_{1}{Z}_{2}}^{2}(n)\) is the so-called DCCA-based covariance, which expresses the cross-correlation fluctuations between the series of {z1t} and {z2t}. Thus we have obtained all objects to create the DFA-based regression model. But for purpose of testing, we need some accessories of the DFA process. The DCCA cross-correlation coefficient ρ(n), proposed by Zebende35, can measure the cross-correlation between two nonstationary series at multiple time scales, which is defined as

$${\rho }_{DCCA}({Z}_{1},{Z}_{2},n)=\frac{{F}_{{Z}_{1}{Z}_{2}}^{2}(n)}{\sqrt{{F}_{{Z}_{1}}^{2}(n){F}_{{Z}_{2}}^{2}(n)}}.$$
(10)

To access intrinsic relations between the two time series on time scales of n, Yuan et al.36 and Qian et al.37 developed a so-called partial DCCA coefficient independently, which applies partial correlation technique to delete the impact of other variables on the two currently studied variables. This coefficient is defined as

$${\rho }_{PDCCA}({Z}_{1},{Z}_{2},n)=-\,\frac{{C}_{{j}_{1},{j}_{2}}(n)}{\sqrt{{C}_{{j}_{1},{j}_{1}}(n){C}_{{j}_{2},{j}_{2}}(n)}},$$
(11)

where C is the inverse matrix of the cross-correlation matrix produced by ρ DCCA (n) of Z1, Z2, …, and subscripts j1 and j2 stand respectively for the row and column of the location of ρ DCCA (Z1, Z2, n).

The DFA-based bivariate regression model

We now translate the standard bivariate regression process described above into the DFA-based bivariate regression model. The two estimators in Eq. (2) can be extended to the scale-dependent estimators in the following way using the scale-dependent variance and covariance defined in Eqs (7) and (9),

$$\{\begin{array}{rcl}{{\hat{\beta }}_{1}}^{DFA}(n) & = & \frac{{F}_{{X}_{1}Y}^{2}(n)\cdot {F}_{{X}_{2}}^{2}(n)-{F}_{{X}_{2}Y}^{2}(n)\cdot {F}_{{X}_{1}{X}_{2}}^{2}(n)}{{F}_{{X}_{1}}^{2}(n)\cdot {F}_{{X}_{2}}^{2}(n)-{[{F}_{{X}_{1}{X}_{2}}^{2}(n)]}^{2}},\\ {{\hat{\beta }}_{2}}^{DFA}(n) & = & \frac{{F}_{{X}_{2}Y}^{2}(n)\cdot {F}_{{X}_{1}}^{2}(n)-{F}_{{X}_{1}Y}^{2}(n)\cdot {F}_{{X}_{1}{X}_{2}}^{2}(n)}{{F}_{{X}_{1}}^{2}(n)\cdot {F}_{{X}_{2}}^{2}(n)-{[{F}_{{X}_{1}{X}_{2}}^{2}(n)]}^{2}}.\end{array}$$
(12)

Similarly, the scale-dependent residuals are

$${\hat{e}}_{t}(n)={Y}_{t}-{\hat{\beta }}_{1}^{DFA}(n){X}_{1t}-{\hat{\beta }}_{2}^{DFA}(n){X}_{2t}-\langle {Y}_{t}-{\hat{\beta }}_{1}^{DFA}(n){X}_{1t}-{\hat{\beta }}_{2}^{DFA}(n){X}_{2t}\rangle $$
(13)

with zero mean value. Inserting the calculated \({\hat{e}}_{t}(n)\) into the DFA process, we obtain the fluctuation \({F}_{\varepsilon }^{2}(n)\) to estimate the variances of \({{\hat{\beta }}_{1}}^{DFA}(n)\) and \({{\hat{\beta }}_{2}}^{DFA}(n)\) via Eq. (12) as

$$\{\begin{array}{rcl}{var}({{\hat{\beta }}_{1}}^{DFA}(n)) & = & \frac{1}{N-3}\cdot \frac{{F}_{{X}_{2}}^{2}(n)\cdot {F}_{\varepsilon }^{2}(n)}{{F}_{{X}_{1}}^{2}(n)\cdot {F}_{{X}_{2}}^{2}(n)-{[{F}_{{X}_{1}{X}_{2}}^{2}(n)]}^{2}},\\ {var}({{\hat{\beta }}_{2}}^{DFA}(n)) & = & \frac{1}{N-3}\cdot \frac{{F}_{{X}_{1}}^{2}(n)\cdot {F}_{\varepsilon }^{2}(n)}{{F}_{{X}_{1}}^{2}(n)\cdot {F}_{{X}_{2}}^{2}(n)-{[{F}_{{X}_{1}{X}_{2}}^{2}(n)]}^{2}}.\end{array}$$
(14)

Then Eqs (3)–(5) can be translated into the DFA regression form as

$${R}_{DFA}^{2}(n)=1-\frac{{F}_{\varepsilon }^{2}(n)}{{F}_{Y}^{2}(n)},$$
(14)
$${\beta }_{j}^{\ast DFA}(n)={{\hat{\beta }}_{j}}^{DFA}(n)\sqrt{\frac{{F}_{{X}_{j}}^{2}(n)}{{F}_{Y}^{2}(n)}}\,{\rm{for}}\,j=1\,{\rm{and}}\,2,\,$$
(15)

and

$${\eta }_{j}^{DFA}={{\hat{\beta }}_{j}}^{DFA}(n)\frac{\langle {X}_{j}\rangle }{\langle Y\rangle }\,{\rm{for}}\,j=1\,{\rm{and}}\,2.\,$$
(16)

Comparing to the standard R2, β*, and η, the scale-dependent \({R}_{DFA}^{2}(n)\), β*DFA(n), and ηDFA(n) express more abundant information on model interpretation from multiple time scales.