Time-Varying Multivariate Causal Processes

In this paper, we consider a wide class of time-varying multivariate causal processes which nests many classic and new examples as special cases. We first prove the existence of a weakly dependent stationary approximation for our model, which lays the foundation for the theoretical development. Afterwards, we consider the QMLE estimation approach, and provide both point-wise and simultaneous inferences on the coefficient functions. In addition, we demonstrate the theoretical findings through both simulated and real data examples. In particular, we show the empirical relevance of our study with an application evaluating the conditional correlations between the stock markets of China and the U.S. We find that the interdependence between the two stock markets is increasing over time.


Introduction
The family of vector autoregressive (VAR) models and the family of multivariate (G)ARCH models are among the most popular frameworks for modelling dynamic interactions of multiple variables. The VAR family usually captures the dynamics by imposing structures on the time series itself, while the (G)ARCH family imposes restrictions on the conditional second moments. We acknowledge the vast literature on both families, and have no intention to exhaust all relevant studies in this paper for the sake of space. We refer interested readers to Stock & Watson (2001) and Bauwens et al. (2006) for excellent reviews of both families.
Although both families have rich literatures of their own, to the best of the authors' knowledge not much work has been done to bridge them. Among the limited attempts (e.g., Ling & McAleer 2003, Bardet & Wintenberger 2009), most (if not all) of these studies rely on the stationarity assumption. While the stationarity assumption comes in handy when deriving asymptotic properties, it may not be very realistic in practice (Preuss et al. 2015, Chen et al. 2021). For example, economic and financial data often include various macro shocks, so their behaviour can be quite volatile; climate data may contain a time trend, which has recently attracted much attention due to greenhouse gas emissions; and so on. In short, some degree of nonstationarity may always occur. To account for nonstationarity, locally stationary processes have received considerable attention since the seminal work of Dahlhaus (1996); see also Dette et al. (2011), Zhang & Wu (2012), Truquet (2017), Dahlhaus et al. (2019), among others. In contrast to the unit root process, the locally stationary process nicely balances stationarity and nonstationarity by allowing for the simultaneous presence of both types of behaviour in one time series process.
In a very recent paper, Karmakar et al. (2022) consider simultaneous inference for a general class of univariate p-Markov processes with time-varying coefficients, which covers several time-varying versions of the classical univariate models (e.g., AR, ARCH, AR-ARCH) as special cases. Despite its generality, their study still rules out the time-varying versions of some widely used models (e.g., ARMA, GARCH, ARMA-GARCH). Also, it is worth mentioning that this line of research focuses heavily on univariate time series, which somewhat limits the applicability of locally stationary processes.
That said, it is reasonable to call for a framework which can marry the VAR family and the (G)ARCH family while allowing for nonstationarity. To provide a concrete example, consider a time-varying multivariate GARCH model, which can model the co-movements of financial returns. A detailed investigation of such a model can help answer research questions like: (i) Is the volatility of one market leading the volatility of other markets? (ii) Do the correlations between asset returns change over time? (iii) Are they increasing in the long run, perhaps because of the globalization of financial markets? These questions are of great practical importance for both investors and policymakers (Bauwens et al. 2006, Diebold & Yilmaz 2009).
To allow for as much flexibility as possible from the modelling perspective, we consider a class of multivariate causal processes as follows:

x_t = µ(x_{t−1}, x_{t−2}, …; θ(τ_t)) + H(x_{t−1}, x_{t−2}, …; θ(τ_t)) ε_t,  for t = 1, …, T,
x_t = µ(x_{t−1}, x_{t−2}, …; θ(0)) + H(x_{t−1}, x_{t−2}, …; θ(0)) ε_t,    for t ≤ 0,   (1.1)

where τ_t = t/T, µ(·) is an m-dimensional random vector, H(·) is an m × m random matrix, θ(τ) is a d × 1 time-varying parameter of interest with each element belonging to C^3[0, 1], and {ε_t} is a sequence of independent and identically distributed (i.i.d.) random vectors. Note that the value of d usually depends on the value of m, and the connection becomes clear once a specific model is considered. As far as we are concerned, both m and d are fixed throughout the paper. Notably, both µ(·) and H(·) are known, and share the same unknown parameter θ(·). The setting for t ≤ 0 regulates the time series over the periods that we do not observe, which is commonly adopted when certain nonstationarity is involved (e.g., Vogt 2012). Essentially, it requires that the process does not diverge in the initial time periods.
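To fix ideas, the following is a minimal sketch (not the authors' code) simulating a univariate special case of (1.1), in which the hypothetical coefficient functions `a(·)` and `s(·)` play the roles of µ and H, and the `t ≤ 0` regime is represented by initializing at θ(0):

```python
import numpy as np

def simulate_tv_ar1(T, a=lambda u: 0.5 * np.cos(np.pi * u),
                    s=lambda u: 1.0 + 0.5 * u, seed=0):
    """Simulate x_t = a(t/T) x_{t-1} + s(t/T) eps_t, a univariate (m = 1)
    instance of model (1.1) with hypothetical smooth coefficient functions."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T + 1)  # x[0] stands in for the t <= 0 initialization with theta(0)
    for t in range(1, T + 1):
        u = t / T        # tau_t = t / T
        x[t] = a(u) * x[t - 1] + s(u) * rng.standard_normal()
    return x[1:]

x = simulate_tv_ar1(1000)
```

The coefficient functions here are purely illustrative; any C^3[0, 1] choices satisfying the stability conditions of Section 2 would do.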
Before proceeding further, we provide two examples to briefly illustrate the rationality behind (1.1), and leave the detailed investigation on these examples to Section 2.4. We refer interested readers to Ling (2003), Ling & McAleer (2003) and Bardet & Wintenberger (2009) for extensive investigation on the parametric counterparts of these examples.
Example 2: Consider the time-varying multivariate GARCH(p, q) model, where h_{j,t} stands for the j-th element of h_t, and η_t = Ω^{1/2}(τ_t) ε_t. The model (1.4) generalizes the models of Bollerslev (1990) and Jeantheau (1998). Similar to Example 1, we show that (1.4) admits a representation in the form of (1.1), with

θ(τ) = vec(c_0(τ), C_1(τ), …, C_p(τ), D_1(τ), …, D_q(τ), Ω(τ)).   (1.5)

In view of the development of Example 1 and Example 2 in Section 2.4, one may further show that the time-varying counterparts of the parametric models mentioned in Bardet & Wintenberger (2009) are also covered by (1.1). To this end, we argue that (1.1) not only allows for nonstationarity and conditional heteroskedasticity, but also provides sufficient flexibility to cover many well-adopted models in the literature.
In this paper, our contributions are four-fold: (1) we consider a wide class of time-varying multivariate causal processes which nests many classic and new examples as special cases; (2) we prove the existence of a weakly dependent stationary approximation for the model (1.1); (3) we consider the QMLE estimation approach, and provide both point-wise and simultaneous inferences on the coefficient functions; (4) we demonstrate the theoretical findings through both simulated and real data examples.

The paper is organized as follows. Section 2 presents the theoretical findings associated with the stationary approximation, estimation and inferences. In Section 3, we conduct extensive simulation studies to examine the theoretical findings, and further investigate the time-varying conditional correlations between the Chinese and U.S. stock markets. Section 4 concludes. Due to the space limit, we relegate the proofs of the main results to the online appendices of the paper.

Before proceeding further, it is convenient to introduce some notation: the symbol |·| denotes the Euclidean norm of a vector or the spectral norm of a matrix; ‖v‖_q := (E|v|^q)^{1/q}, and ‖·‖ := ‖·‖_2 for short; ⊗ denotes the Kronecker product, and ⊙ denotes the Hadamard product; I_a stands for an a × a identity matrix; 0_{a×b} stands for an a × b matrix of zeros, and we write 0_a for short when a = b; for a function g(w), let g^{(j)}(w) be the j-th derivative of g(w), where j ≥ 0 and g^{(0)}(w) ≡ g(w); K_h(·) = K(·/h)/h, where K(·) and h stand for a nonparametric kernel function and a bandwidth, respectively; let c̃_k = ∫_{−1}^{1} u^k K(u) du; diag(a) is a diagonal matrix with the vector a on its main diagonal, while diag(A) creates a vector from the diagonal of the matrix A; finally, →_P and →_D denote convergence in probability and convergence in distribution, respectively.

Estimation and Asymptotics
In this section, we first prove the existence of a weakly dependent stationary approximation for the model (1.1) in Section 2.1; we then provide the estimation approach using the local linear quasi-maximum-likelihood estimation and establish the asymptotic properties of the proposed estimator in Section 2.2; Section 2.3 provides results on both point-wise and simultaneous inferences; Section 2.4 gives some detailed examples to justify the usefulness of our study.

Stationary Approximation
To study (1.1), the first challenge lies in the fact that the model may not be stationary.
Therefore, for any τ ∈ [0, 1], we initiate our analysis by finding a stationary approximation for each x_t with t ≥ 1. By doing so, we are able to measure the weak dependence of {x_t} using the nonlinear system theory of Wu (2005), which then provides us with a framework to derive the asymptotic properties accordingly.
To be clear about the dependence measure, consider an example in which e_t is a stationary process admitting a causal representation e_t = J(ε_t, ε_{t−1}, …), with J(·) being a measurable function. See Tong (1990) for a discussion of nonlinear time series of this kind. For k ≥ 0, we define the following dependence measure:

δ_q(k) = ‖e_k − e_k^*‖_q,  where  e_k^* = J(ε_k, …, ε_1, ε_0^*, ε_{−1}, …),   (2.1)

and ε_0^* is an independent copy of ε_0, independent of {ε_j}. Being able to measure the time series dependence as in (2.1) is the starting point for time series analyses.
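The coupling in (2.1) can be made concrete with a small Monte Carlo sketch (our illustration, not the paper's code): for a stationary AR(1), the coupled copy replaces only the innovation at time 0, and the dependence measure δ_2(k) decays geometrically.

```python
import numpy as np

def delta2_ar1(a, k, n_mc=20000, burn=200, seed=1):
    """Monte Carlo estimate of delta_2(k) = ||e_k - e_k*||_2 for the AR(1)
    e_t = a e_{t-1} + eps_t; the coupled copy replaces eps_0 by an
    independent eps_0*, exactly as in the definition (2.1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_mc, burn + k + 1))   # column `burn` is eps_0
    eps_star = eps.copy()
    eps_star[:, burn] = rng.standard_normal(n_mc)     # swap eps_0 for eps_0*

    def run(e):
        x = np.zeros(n_mc)
        for j in range(e.shape[1]):                   # iterate the recursion forward
            x = a * x + e[:, j]
        return x

    diff = run(eps) - run(eps_star)
    return np.sqrt(np.mean(diff ** 2))

# For i.i.d. N(0,1) innovations, theory gives delta_2(k) = |a|^k * sqrt(2).
```

In this linear example the coupled difference is exactly a^k (ε_0 − ε_0^*), which is the geometric decay that Proposition 2.1 generalizes to multivariate causal processes.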
We now introduce some basic assumptions.
where z_j and z'_j are the j-th columns of z and z', respectively.
Assumption 1.1 is standard when studying dynamic time series models (Lütkepohl 2005).
In Assumption 1.2, ϑ is a generic d × 1 vector, and has the same length as θ(·). This assumption imposes Lipschitz-type conditions on µ(·) and H(·), which are rather minor, and can be easily fulfilled by a variety of models such as those mentioned in Section 1.
stationary approximation for each x_t, but also ensures that the approximating process has proper moments. Similar conditions have also been adopted in Bardet & Wintenberger (2009).
With these conditions in hand, we present the following proposition which facilitates the development in what follows.
Proposition 2.1. Let Assumption 1 hold. For any τ ∈ [0, 1], there exists a stationary process

It is worth mentioning that, for a univariate p-Markov process, Karmakar et al. (2022) show that there exists 0 < ρ < 1 such that sup_{τ∈[0,1]} δ_r^{x_p(τ)}(k) = O(ρ^k), based on the development of Wu & Shao (2004). From a methodological viewpoint, we give a set of new proofs which allow us to measure the dependence of multivariate causal processes with infinite memory. The term Σ_{j=p+1}^∞ [α_j(θ(τ)) + β_j(θ(τ))] in the second result of Proposition 2.1 arises due to the infinite memory structure of x_t(τ). Thus, the dependence δ_r^{x(τ)}(k) relies on the choice of p and on the decay rates of the coefficients α_j(θ(τ)) and β_j(θ(τ)).
To ensure x t (τ ) can approximate x t reasonably well, we impose more structure below.
Using Assumptions 1-2, we can measure the distance between x t (τ ) and x t as follows.
Proposition 2.2. Suppose Assumptions 1-2 hold. Then

We can regard Proposition 2.2 as a stochastic version of Hölder continuity. Having established the stationary approximation in Proposition 2.2, we move on to investigate the estimation theory in the next subsection.

Estimation
We point out a few facts to facilitate the setup of the likelihood function. First, let z_t = (x_t, x_{t−1}, …) collect all the information on x_t up to time period t. However, in practice we only observe x_t from t = 1 onwards, so we have to work with a truncated version of z_t for each t ≥ 1. Second, we note that when τ_t is sufficiently close to τ, θ(τ_t) is well approximated by its local linear expansion around τ. Therefore, we are able to parametrize θ(·) locally, and consider the maximum-likelihood estimation for each given τ. Finally, since ε_t may not be normally distributed, we consider the local linear quasi-maximum-likelihood estimation (QMLE) method.
We impose more structures in order to derive the asymptotic distribution.
2. There exists a nonnegative sequence {χ_j}_{j=1}^∞ with χ_j = O(j^{−(2+s)}) for some s > 0 such that, for any z, z' ∈ (R^m)^∞ and any ϑ, ϑ' ∈ Θ_r, and k = 1, 2. Assumption 4 imposes Lipschitz-type conditions on the first and second order derivatives of µ(·) and H(·) to ensure the smoothness of their functional components.
Assumption 5 is a set of regular conditions on the kernel function and the bandwidth.
With these conditions in hand, we summarize the first theorem of this paper below.

Inference
In this section, we first discuss how to conduct point-wise inference, and then move on to derive the asymptotic results associated with the simultaneous inference. Specifically, for some preassigned significance level α ∈ (0, 1), we shall construct a 100(1 − α)% asymptotic simultaneous confidence band (SCB). Notably, the simultaneous inference nests the traditional constancy test as a special case.
It not only allows one to examine whether a time-varying model should be preferred to its parametric counterpart, but also allows one to test any particular functional form of interest. For example, if a horizontal line can be embedded in the SCB {Υ(τ)}, then we cannot reject the hypothesis that some elements of θ(τ) are constant.
Point-wise Inference: First, we construct a bias-corrected estimator in order to remove the asymptotic bias in Theorem 2.1. Specifically, we let where θ̂_{h/√2}(τ) is defined in the same way as θ̂(τ), but using the bandwidth h/√2.
After some tedious development (Lemma B.7 of Appendix B), we have, uniformly over τ ∈ [0, 1], an expansion whose equivalent kernel is essentially a fourth-order kernel. It then follows that, under the conditions of Theorem 2.1, where It is noteworthy that the construction of (2.6) is different from directly using a fourth-order kernel in the regression. In terms of bandwidth selection, the traditional methods (e.g., cross-validation) remain valid for (2.6). However, if one directly employs a fourth-order kernel in the regression, it remains unclear how to select the optimal bandwidth in practice.
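Since (2.6) combines estimates at bandwidths h and h/√2, a construction consistent with the text is the standard jackknife combination θ̃(τ) = 2 θ̂_{h/√2}(τ) − θ̂_h(τ), which cancels the O(h²) bias term. The sketch below (our illustration, with hypothetical helpers `epanechnikov` and `local_linear`) demonstrates the idea on a toy local linear regression rather than the paper's QMLE:

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel, supported on [-1, 1]
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)

def local_linear(tau, x, y, h):
    """Local linear estimate of the regression function at tau."""
    u = (x - tau) / h
    w = epanechnikov(u)
    X = np.column_stack([np.ones_like(x), x - tau])
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta[0]  # intercept = fitted value at tau

def jackknife_bc(tau, x, y, h):
    """2 * theta_hat_{h/sqrt(2)} - theta_hat_h removes the O(h^2) bias term."""
    return 2 * local_linear(tau, x, y, h / np.sqrt(2)) - local_linear(tau, x, y, h)
```

On a quadratic target (where the local linear bias is exactly of order h²), the combined estimator is essentially unbiased while each single-bandwidth estimate is not.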
Now we discuss how to estimate Σ_θ(τ), which is constructed from Σ(τ) and Ω(τ). Intuitively, we consider the following estimator. Note that we consider a local constant estimator in (2.8) rather than a local linear one, in order to avoid an implementation issue in finite sample studies (i.e., a non-positive definite covariance estimate may occur when the local linear approach is employed). Such a numerical problem has been well explained and investigated in the literature; see Chen & Leng (2015) for example.
In Theorem 2.2, ν is slightly smaller than 1/2, as we only require r to be slightly larger than 6. Hence, the usual optimal bandwidth h_opt = O(T^{−1/5}) satisfies the conditions (log T)^4/(T^ν h) → 0 and T h^7 log T → 0.
As shown in Theorem 2.2, the convergence rate of the simultaneous confidence intervals for θ_C(·) is logarithmic and therefore slow. In order to improve upon it, we consider a bootstrap method which exhibits a much better finite sample performance. We summarize the result in the following corollary.
By Corollary 2.2, we propose the following numerical procedure to construct the SCB of θ_C(τ):

Step 1. Use the sample {x_t}_{t=1}^T to estimate θ_C(τ) by (2.5), and compute θ̃_C(τ) based on (2.6).
Step 2. Generate i.i.d. k-dimensional standard normal variables {v_t^*} and calculate the corresponding bootstrap quantity.
Step 3. Repeat Step 2 R times to obtain the empirical (1 − α) quantile.
Step 4. Calculate Σ̂_C(τ) using (2.8), and construct the SCB of θ_C(τ), where B_k = {u ∈ R^k : |u| ≤ 1} is the unit ball, and k is the rank of C.
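The multiplier-bootstrap idea behind Steps 2-3 can be sketched as follows (our simplified illustration, not the paper's exact statistic): perturb the estimated score terms with i.i.d. standard normal multipliers, take the sup of the resulting process over a grid of τ values, and use the empirical (1 − α) quantile of the sups as the critical value. The helper names `resid_scores` and `grid_weights` are hypothetical placeholders for the quantities computed in Step 1.

```python
import numpy as np

def multiplier_scb(resid_scores, grid_weights, R=500, alpha=0.05, seed=0):
    """Multiplier-bootstrap critical value for a sup-type statistic.
    resid_scores: (T, k) array of estimated score/residual terms (from Step 1);
    grid_weights: (G, T) kernel weights over a grid of tau values."""
    rng = np.random.default_rng(seed)
    T = resid_scores.shape[0]
    sups = np.empty(R)
    for r in range(R):
        v = rng.standard_normal(T)                         # Step 2: multipliers v_t*
        boot = grid_weights @ (resid_scores * v[:, None])  # perturbed local averages
        sups[r] = np.abs(boot).max()                       # sup over the tau grid
    return np.quantile(sups, 1 - alpha)                    # Step 3: (1 - alpha) quantile
```

The returned quantile then scales the band half-width in Step 4; by construction the critical value grows as α shrinks.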

Examples
Below, we demonstrate the usefulness of the aforementioned results by considering Example 1 and Example 2 of Section 1.
Proposition 2.3. Let ‖ε_t‖_r < ∞ for some r > 4. Suppose that there is a compact set on which the AR and MA lag polynomials are coprime and satisfy some necessary identification conditions.
Then, the results of Theorems 2.1 and 2.2 hold for model (1.2).
We note that the detailed identification conditions required for VARMA processes (e.g., the final equations form or echelon form) can be found in Lütkepohl (2005). We do not discuss them here in order not to deviate from our main goal.
Example 2 (Cont.) -We further let (2.14) For ∀τ ∈ [0, 1], the corresponding approximated stationary process is defined as Note that Ψ j (τ ) is generated as follows: Consequently, we can present the following proposition.
Proposition 2.4. Suppose that there is a compact set for some r > 6; (3) all the roots of |I_m − Σ_{j=1}^p C_j L^j − Σ_{j=1}^q D_j L^j| are outside the unit circle, with the C_j's and D_j's being square matrices with nonnegative elements; (4) c_0 is a vector of positive elements; (5) C(L) and D(L) are coprime and the formulation of the GARCH part is minimal, where C(L) := C_1 L + · · · + C_p L^p and D(L) := D_1 L + · · · + D_q L^q. Then the results of Theorems 2.1 and 2.2 hold for model (1.4).
For the identification conditions of the GARCH process, we refer readers to Proposition 3.4 of Jeantheau (1998), who proves that assuming the minimal representation is enough for ensuring Assumption 3 holds.
In the following section, we conduct numerical studies using both simulated and real data to evaluate the finite-sample performance of the proposed estimation and inferential methods.
In this section, we first present the details of the numerical implementations in Section 3.1, and then conduct extensive simulations in Section 3.2. Section 3.3 presents a real data example on the conditional correlations between the Chinese and U.S. stock markets.

Numerical Implementation
Throughout the numerical studies, we use the Epanechnikov kernel K(u) = 0.75(1 − u²)1(|u| ≤ 1), where h is the bandwidth selected by the cross-validation method of Richter et al. (2019).
Specifically, define the leave-one-out local linear QMLE. Then, the bandwidth is chosen by minimizing the resulting criterion. As shown in Richter et al. (2019), this cross-validation method works well as long as {∇l_t} is uncorrelated, a property that holds in our case.
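A leave-one-out bandwidth search of this kind can be sketched as follows. This is our simplified stand-in, using a local linear least squares fit with squared prediction error in place of the QMLE criterion of Richter et al. (2019); the function name `cv_bandwidth` and the grid are hypothetical.

```python
import numpy as np

def cv_bandwidth(x, y, h_grid):
    """Leave-one-out cross-validation over a bandwidth grid for a local
    linear fit (a simplified proxy for the QMLE-based criterion)."""
    n = len(x)

    def loo_error(h):
        err = 0.0
        for i in range(n):
            mask = np.arange(n) != i               # drop observation i
            u = (x[mask] - x[i]) / h
            w = 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)  # Epanechnikov weights
            if w.sum() <= 0:
                return np.inf
            X = np.column_stack([np.ones(mask.sum()), x[mask] - x[i]])
            XtW = X.T * w
            beta = np.linalg.solve(XtW @ X + 1e-10 * np.eye(2), XtW @ y[mask])
            err += (y[i] - beta[0]) ** 2           # predict the held-out point
        return err / n

    return h_grid[int(np.argmin([loo_error(h) for h in h_grid]))]
```

On a smooth target with modest noise, the criterion penalizes both undersmoothing (high variance) and oversmoothing (high bias), so very large bandwidths are rejected.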
Notably, when considering some specific models, the implementation may be further simplified. We provide more discussions along this line in Appendix B.4.

Simulation Results
In the simulation studies, we examine the empirical coverage probabilities of the simultaneous confidence intervals for nominal levels 90% and 95%. We consider the time-varying VARMA(2, 1) and multivariate GARCH(1, 1) models as follows: Here we use the final equations form to ensure the uniqueness of the VARMA representation.
2. DGP 2 : Let the sample size be T ∈ {500, 1000} (T ∈ {1000, 2000, 4000}) for the VARMA model (the GARCH model). We conduct 1000 replications for each choice of T . Several different bandwidths close to h are reported to check the sensitivity of bandwidth selection.
We present the empirical coverage probabilities associated with the SCB in Tables 1-2. For the vector- or matrix-valued unknown coefficients, we take an average across the elements. A few facts emerge from the tables. First, the finite sample coverage probabilities are smaller than their nominal levels when T = 500 (T = 1000, 2000) for the VARMA model (the GARCH model), but are fairly close to their nominal levels when T = 1000 (T = 4000) for the VARMA model (the GARCH model). Second, the behaviour of the estimated simultaneous confidence intervals is not sensitive to the choice of bandwidth. Third, the GARCH model requires more data to reach a reasonable finite sample performance.

A Real Data Example
In this subsection, we investigate the time-varying conditional correlations between the Chinese and U.S. stock markets using the time-varying multivariate GARCH model. Recently, there is a growing literature studying the relationship between the two stock markets (e.g., Zhang & Li 2014, Pan et al. 2022), as the Chinese stock market has grown into one of the world's largest.

Previous research documents a strong positive link between the degree of globalization and equity market interdependence (Baele 2005). Along this line of research, one important question is whether the interdependence between the Chinese and U.S. stock markets has increased over time due to globalization, so that estimates from historical data are unreliable for modern policy analysis, asset pricing and risk management. The existing results present many discrepancies, which may be due to the fact that the relationship evolves over time. Apparently, the results also indicate that one should use a time-varying GARCH model to accommodate the potential nonstationarity inherited in these financial variables. In addition, as pointed out by Caporin & McAleer (2013), the dynamic conditional correlation (DCC) GARCH model represents the dynamic conditional covariances of the standardized residuals and hence does not yield dynamic conditional correlations; DCC yields inconsistent two-step estimators; and DCC has no established asymptotic properties. In what follows, we address these issues using the newly proposed approach. The estimation is conducted in exactly the same way as in Section 3.1, so we do not repeat the details.
We calculate the Chinese and U.S. stock returns based on the weekly Shanghai Stock Exchange (SSE) Composite Index and the S&P 500 Index, as they are the most comprehensive and diversified stock indices. The sample employed in this study spans January 2000 to February 2022 and provides 1,119 observations. Figure 1 plots the two weekly return series as well as the sample autocorrelation functions of the squared data, which show the typical "volatility clustering" phenomenon.
We next fit the data to a time-varying multivariate GARCH(1, 1) model and are particularly interested in the estimates of the time-varying conditional correlation ρ_{1,2}(τ), where ρ_{1,2}(·) is defined in (2.14).
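Given a fitted path of 2 × 2 conditional covariance matrices, the conditional correlation is obtained by the usual normalization. A minimal sketch (our illustration; the fitted covariance path itself would come from the estimated time-varying GARCH model):

```python
import numpy as np

def conditional_correlation(sigma_path):
    """Compute rho_{1,2}(tau_t) = sigma_{12,t} / sqrt(sigma_{11,t} sigma_{22,t})
    from a (T, 2, 2) array of conditional covariance matrices."""
    s11 = sigma_path[:, 0, 0]
    s22 = sigma_path[:, 1, 1]
    s12 = sigma_path[:, 0, 1]
    return s12 / np.sqrt(s11 * s22)
```

Plotting this series against τ_t = t/T is how one would visualize whether the interdependence between the two markets increases over time.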

Conclusions
In this paper, we consider a wide class of time-varying multivariate causal processes which nests many classic and new examples as special cases. We first prove the existence of a weakly dependent stationary approximation for the model (1.1), which lays the foundation for establishing the corresponding asymptotic properties. Afterwards, we consider the QMLE estimation approach, and provide both point-wise and simultaneous inferences on the coefficient functions. In addition, we demonstrate the theoretical findings through both simulated and real data examples. In particular, we show the empirical relevance of our study with an application evaluating the conditional correlations between the stock markets of China and the U.S. We find that the interdependence between the two stock markets is increasing over time.
There are several directions for possible extensions. The first is to consider quantile regression methods for such locally stationary multivariate causal processes. The second is to propose a more powerful L2 test based on weighted integrated squared errors for testing whether some coefficients are time-invariant. We leave such issues for future study.

In what follows, M and O(1) always stand for some bounded constants, and may differ at each appearance.

A.1 Technical Tools
Projection Operator: Define the projection operator P_t(·) := E(· | F_t) − E(· | F_{t−1}), where F_t = σ(ε_t, ε_{t−1}, . . .). By Jensen's inequality and the stationarity of x_t(τ), for l ≥ 0, we have

The Class H(C, χ, M): Recall that we have defined Θ_r in Assumption 1. Let χ = {χ_j}_{j=1}^∞ be a sequence of nonnegative real numbers with |χ|_1 := Σ_{j=1}^∞ χ_j < ∞, and let M > 0 be some finite constant. Let |z|_χ := Σ_{j=1}^∞ χ_j |z_j| for any z ∈ (R^m)^∞ and C ≥ 1, where z_j is the j-th column of z. A function g(z, ϑ) : If g is vector- or matrix-valued, g ∈ H(C, χ, M) means that every component of g is in H(C, χ, M).

Analytical Gradient: Let M(z; ϑ) = H(z; ϑ)H(z; ϑ)′. Then the first partial derivative is as follows: where ϑ_i is the i-th element of ϑ.
By (A.1.1), the second partial derivative of l(x, z; ϑ) is given by

A.2 Proofs of the Main Results
Proof of Proposition 2.1.
By construction, we immediately obtain where the second inequality follows from Assumption 1.
According to the above development, we are ready to conclude that x_{p,t}(τ) → x_t(τ) as p → ∞ in the space L_r = {x : ‖x‖_r < ∞}. As the limit of strictly stationary processes in L_r, x_t(τ) is a stationary process and sup_{τ∈[0,1]} ‖x_t(τ)‖_r < ∞.
(2). Let {ε*_t} be an independent copy of {ε_t}. Similar to (A.2.1), we define the process {x*_{p,t}(τ)}, the difference being that we use ε_t when t ≠ 0, and ε*_t when t = 0. In addition, define the process {x*_t(τ)} as {x_t(τ)}, where again the difference is that we use ε_t when t ≠ 0, and ε*_t when t = 0. Further define u_t = ‖x*_{p,t}(τ) − x_{p,t}(τ)‖_r.
Now, let v t = max k≥t u k . Using (A.2.3) and the fact that v t is a nonincreasing sequence, we The proof of the first result gives The same bound holds for the quantity x * t (τ ) − x * p,t (τ ) r . Thus, which completes the proof.
Proof of Proposition 2.2.
(1). Write where the second inequality follows from Assumption 1 and Assumption 2. In view of the stationarity of x t (τ ), rearranging the terms in the above inequality yields that (2). Write by the first result of this proposition, we have In addition, as The proof is now complete.
Proof of Theorem 2.1.
(1). First, we introduce a few notations to facilitate the development. Let η(τ ) : Recall that we have defined ∇ ϑ , and let ∇ η be defined similarly with respect to the elements of η.
The proof of the first result is now complete.
Then we have Ω(τ ) = −Σ(τ ) if ε t is normal distributed. The proof is now complete.
Proof of Corollary 2.1.
For Ω(τ ), as ∇ θ l(∇ θ l) ∈ H(6, χ, M ), here we use a different argument to prove the result, which leads to weaker moment conditions. By Lemma B.3.4 and Lemma B.8.2, we have By partial summation, we have Hence, we have sup τ ∈[0,1] |S T (τ )| ≤ M max t |S t,T |. Note that {P t−j g( y t (τ t ), ϑ)} t forms a sequence of martingale differences. By the Doob's L q maximal inequality, Burkholder's inequality and the elementary inequality ( i |a i |) q ≤ i |a i | q for 0 < q ≤ 1, we obtain that as T h 7 log T → 0 and T h 2 /(log T ) 4 → ∞. In addition, by Lemmas B.8, B.4.2 and B.9, we have Hence, we have Combining the above analyses, we then complete the proof.
Hence, we only need the innovation process to have 4 + s moments for some s > 0, compared to the 6 + s moments needed in Theorem 2.2.
However, by using techniques which are more specific to VARMA models, the condition Σ_{j=1}^∞ |Γ_j(τ)| < 1 can be weakened to det(A_τ(L)) ≠ 0 for all |L| ≤ 1. Similar to the above, for the identification conditions stated in Assumption 3, it is well known that the final equations form or echelon form is enough to ensure the uniqueness of the VARMA representation.
The proof is now complete.

Proof of Proposition 2.4.
Since H(z; θ) is a positive diagonal matrix, we have Then Assumption 1 is automatically met if ‖η_t(τ)‖_r Σ_{j=1}^∞ |Ψ_j(τ)|^{1/2} < 1. In addition, |Ψ_j(τ)| converges to zero at an exponential rate. For the identification conditions of the GARCH process, we refer readers to Proposition 3.4 of Jeantheau (1998), who proves that assuming the minimal representation is enough to ensure that Assumption 3 holds.
However, by using techniques which are more specific to GARCH models, the condition can be weakened. We first prove the existence of ‖y_t(τ)‖_{r/2} (which implies the existence of ‖x_t(τ)‖_r) as well as its weak dependence property by means of a chaotic expansion. Since y_t(τ) = V_t(τ)α(τ) + Σ_{j=1}^∞ V_t(τ)Ψ_j(τ) y_{t−j}(τ), by substituting y_{t−j}(τ) recursively, we have To prove the boundedness of ‖y_t(τ)‖_{r/2}, since {V_t(τ)} are independent random variables, it suffices to show that Hence, we have ‖x_t(τ)‖_r < ∞. Next, we show the weak dependence property. By using the same arguments as in the proof of Proposition 2.1, we have Σ_{j=1}^∞ |Ψ_j(τ)| < 1 and |Ψ_j(τ)| = O(ρ^j). Since |a − b| ≤ |a² − b²|^{1/2} for a ≥ 0, b ≥ 0, and H(·) is a positive diagonal matrix, for t ≥ 1, we have for some 0 < ρ < 1.
The proof is now complete.

B.2 Secondary Lemmas
Before proceeding further, we introduce some extra notations. Assume that there exists some measurable function H(·, ·) such that for ∀τ ∈ Assume that Σ h (τ ) is Lipschitz continuous and its smallest eigenvalue is bounded away from 0 uniformly over τ ∈ [0,1]. In what follows, we let h 0,i (τ ) stand for the i th component of h t (τ ).
Then, we have Let S h (t) = t s=1 h s (τ s ). Then on a richer probability space, there exists i.i.d. k-dimensional standard normal variables v 1 , v 2 , . . . and a process S 0 Lemma B.10 is from Theorem 1 and Corollary 2 of Wu & Zhou (2011).

B.3 Proofs of Preliminary Lemmas
Proof of Lemma B.1.

Proof of Lemma B.2.
We first consider l(·). Write where the definitions of I 1 and I 2 should be obvious, and M(z; ϑ) = H(z; ϑ)H(z; ϑ) .
For I 2 , we have where the second inequality follows from the facts that by using Assumption 1.2 twice.
By Assumption 3, it is easy to see that det(M(z; ϑ)) ≥ H > 0, which, in connection with the fact that log(·) is Lipschitz continuous on [H, ∞), yields that
Similar to the development for l, we can show ∇l, ∇ 2 l ∈ H(3, χ, M ) and ∇l, ∇ 2 l ∈ The proof is now complete.
Proof of Lemma B.3.
In addition, by using the Lipschitz property of K(·), we have Combining the above analyses, the first result follows.
(2). By Lemma B.8.1 and Proposition 2.2, for |τ t − τ | ≤ h, we have Hence, we have by the definition of Riemann integral and the stationarity of y t (τ ).
(4). By Propositions 2.1.1 and 2.2.2, we have sup τ By Lemma B.8 and the definitions of y t and y c t , we have In addition, by Proposition 2.2.1 and χ j = O(j −(2+s) ) for some s > 0, we have The proof of the fourth result is now complete.
Proof of Lemma B.4.
(1). Let y * t (τ ) be a coupled version of y t (τ ) with ε 0 replaced by ε * 0 . By Lemma B.8, we have By Proposition 2.1.2 and the conditions on α j (θ(τ )) and β j (θ(τ )) in the body of this lemma, we have δ The proof of the first result of this lemma is now complete.
(2)-(3). The second result follows directly from the first result.
where m (2) t,i (τ, u) is yielded by the coupled version.
The proof is now complete.
Proof of Lemma B.5.
(1). Note that is a sequence of martingale differences.
Similarly, for q ≥ 2, by Burkholder's inequality and the Minkowski inequality, we have The proof of the first result is now complete.
Proof of Lemma B.6.
The proof of the first result is now complete.
The proof is now complete.

B.4 Computation of the Local Linear ML Estimates
In our numerical studies, we use the function fminunc in MATLAB to minimize the negative log-likelihood function. The initial guess matters when using such optimization routines because they search for a local minimum, i.e., the one closest to the initial guess that can be reached using derivative information. In this section, we give a possible choice of initial estimates.
We could estimate the coefficients of the time-varying VARMA(p, q) model by the kernel-weighted least squares method if the lagged η_t were given. To obtain a preliminary estimator, we first fit a long VAR model and then use the estimated residuals in place of the true residuals.
Consider the VAR(p_T) model, where p_T is set to 2(Th)^{1/3} in our numerical studies. Then, we compute η̂_t = x_t − Σ_{j=1}^{p_T} Γ̂_j(τ_t) x_{t−j}, where {Γ̂_j(τ)} are the local linear least squares estimators. Given η̂_t, we are able to estimate {A_j(τ)}, {B_j(τ)} and Ω(τ), as well as their derivatives, by the local linear least squares method.
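The long-VAR residual step above can be sketched as follows. For brevity this is a constant-coefficient least squares fit (our simplification); the paper uses local linear, kernel-weighted least squares so that the Γ_j can vary with τ_t, and sets p_T = 2(Th)^{1/3}.

```python
import numpy as np

def var_residuals(x, p):
    """Fit a VAR(p) by least squares and return the residuals
    eta_t = x_t - sum_{j=1}^p Gamma_j x_{t-j}.
    x: (T, m) array of observations. A constant-coefficient stand-in
    for the kernel-weighted preliminary step described in the text."""
    T, m = x.shape
    Y = x[p:]                                                        # (T - p, m)
    Z = np.column_stack([x[p - j: T - j] for j in range(1, p + 1)])  # stacked lags
    Gamma, *_ = np.linalg.lstsq(Z, Y, rcond=None)                    # LS coefficients
    return Y - Z @ Gamma                                             # estimated residuals
```

On simulated VAR data, the residuals recover (approximately) the innovations, so their variance is smaller than that of the raw series.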