Spectral Subsampling MCMC for Stationary Multivariate Time Series with Applications to Vector ARTFIMA Processes

Spectral subsampling MCMC was recently proposed to speed up Markov chain Monte Carlo (MCMC) for long stationary univariate time series by subsampling periodogram observations in the frequency domain. This article extends the approach to multivariate time series using a multivariate generalisation of the Whittle likelihood. To assess the computational gains from spectral subsampling in challenging problems, a multivariate generalisation of the autoregressive tempered fractionally integrated moving average model (ARTFIMA) is introduced and some of its properties derived. Bayesian inference based on the Whittle likelihood is demonstrated to be a fast and accurate alternative to the exact time domain likelihood. Spectral subsampling is shown to provide up to two orders of magnitude additional speed-up, while retaining MCMC sampling efficiency and accuracy, compared to spectral methods using the full dataset. Keywords: Bayesian, Markov chain Monte Carlo, Semi-long memory, Spectral analysis, Whittle likelihood.


INTRODUCTION
Recent technological developments in sensors, data storage and computing power make it possible to collect high frequency time series data at low cost; some examples are financial transaction data (Mykland and Zhang, 2012), neuroimaging data with high temporal resolution (Chen et al., 2019), sensor data from robots (Deisenroth et al., 2013) or meteorological weather stations, and GPS and smart card data used in transportation (Welch and Widita, 2019).
However, statistical analysis of time series with tens of thousands, hundreds of thousands, or even millions of data points is computationally challenging, especially when inferences are obtained with iterative methods such as Markov chain Monte Carlo (MCMC) simulation or stochastic optimisation algorithms, where the likelihood is evaluated a large number of times.It is therefore common to only use a portion of the data for inference, for example only the most recent observations, or by systematically selecting every kth data point over the study period.Such downsampling wastes valuable data, gives less precise inferences, and is not even an option when predictions are required at the original sampling frequency.Salomone et al. (2020) propose spectral subsampling MCMC to accelerate MCMC for long stationary univariate time series.They use the asymptotically motivated Whittle likelihood (Whittle, 1953) to approximate the likelihood of a stationary time series.The Whittle likelihood is based on the discrete Fourier transform with the important property of transforming a time series with dependent observations to asymptotically independent periodogram observations in the frequency domain.The key insight in Salomone et al. ( 2020) is that such independence makes it possible to extend subsampling MCMC approaches for independent data (Quiroz et al., 2018a(Quiroz et al., , 2019(Quiroz et al., , 2021;;Dang et al., 2019) to univariate stationary time series by systematic subsampling of periodogram observations.
All of the applications mentioned above are naturally analysed in a multivariate setting: financial portfolios consisting of many assets, neuroimaging data simultaneously measured at multiple brain locations, meteorological data collected at several spatial locations, and so on.The multivariate aspect naturally leads to even more demanding computations in a high frequency setting.
The main contribution in this article is extending the spectral subsampling MCMC methodology to stationary vector-valued time series by using a multivariate version of the Whittle likelihood based on the asymptotic properties of the matrix-valued periodogram.The proposed multivariate spectral subsampling MCMC algorithm is evaluated on three challenging large-scale multivariate time series applications from meteorology, hydrology and environmental science.
The multivariate Whittle likelihood is of general interest for large-scale likelihood and Bayesian inference beyond subsampling MCMC.Even without subsampling, the Whittle likelihood is substantially more scalable to large data as it sidesteps the costly matrix inversions needed for the exact likelihood in the time domain.But the independence of periodogram observations also makes the multivariate Whittle likelihood directly useful for computing the unbiased gradient estimates from random subsets/batches of frequencies needed for large-scale variational inference (Tran et al., 2017) or maximum likelihood estimates from stochastic gradient descent algorithms (Goodfellow et al., 2016).An additional contribution of our article is that our empirical results clearly demonstrate that the posterior based on the multivariate Whittle likelihood gives an excellent approximation to the exact time domain posterior, even in complex models such as the VARMA model.
A challenge for subsampling MCMC methods is keeping the variance of the likelihood estimator small enough for the MCMC chain to mix well.Previous literature is therefore typically restricted to applications using models with a very small number of parameters, or models with a moderate number of parameters with a simple structure, such as logistic regression.To test our multivariate spectral subsampling MCMC methodology on a challenging set of problems, we introduce a new multivariate model with semi-long range dependence that extends the autoregressive tempered fractionally integrated moving average (ARTFIMA, Sabzikar et al. (2019)) to the multivariate setting.Several properties of the vector ARTFIMA model are derived, including the spectral density matrix needed for the Whittle likelihood.For all examples considered, the tempered fractional differencing is shown to improve upon the standard multivariate vector autoregressive integrated moving average (VARIMA) model in terms of the Bayesian information criterion.Our empirical examples show that spectral subsampling works well and gives a very large speed-up for time series that are long enough to make the variance reducing control variates effective.We also highlight the limitation of subsampling MCMC by showing empirically that spectral subsampling MCMC can get stuck when estimating complex multivariate models on shorter time series.
The rest of the article is organised as follows.Section 2 presents the Whittle likelihood for multivariate stationary time series.Section 3 outlines the subsampling MCMC methodology.Section 4 presents the models considered in our applications and establishes properties needed for the implementation of spectral methods.Section 5 demonstrates the methodology and evaluates the efficiency of the proposed spectral subsampling algorithm on real data.Section 6 compares our proposed vector ARTIFIMA to the alternative of separately estimating univariate ARTFIMA models for each series (Sabzikar et al., 2019), both in terms of estimated parameters and in forecasting.Section 7 concludes and discusses future research.Appendix A derives some properties of the vector ARTFIMA model and Appendix B contains additional empirical results.

THE WHITTLE LIKELIHOOD FOR MULTIVARIATE TIME SERIES
Let X t ∈ R r be an r-variate zero mean stationary time series with absolutely summable autocovariance matrix function where Z is the set of integers.The spectral density matrix is with the diagonal elements being the usual spectral density for each univariate time series and the off-diagonal elements are the cross-spectral densities Since the elements of γ X (τ) are real, f X (ω) is Hermitian, i.e. f X (ω) H = f X (ω), where A H = (A) is the conjugate transpose of a matrix and A is the matrix of complex conjugates of the elements of A. Brillinger (2001)[Theorem 2.5.1]proves that f X (ω) is also non-negative definite.
The periodogram ordinate at frequency ω is defined as The periodogram ordinates are therefore asymptotically independent complex Wishart distributed with one degree of freedom I T (ω) ∼ CW r (1, f X (ω)) (Brillinger, 2001, Theorem 7.2.4),except for the frequencies ω k = 0 and ω k = π where the I T (•) instead follow a (real) Wishart distribution.The periodogram ordinates I T (ω) are singular matrices for r > 1.The density function of this singular Wishart distribution is derived for the real case by Uhlig (1994) with respect to the Hausdorff measure, and by Srivastava (2003) with respect the Lebesgue measure for the functionally independent elements of the matrix.The density for the complex singular Wishart distribution W ∼ CW r (ν, Σ) for ν < r over the space of r × r positive semidefinite matrices of rank ν is derived in Ratnarajah and Vaillancourt (2005)[Theorem 3] as where where E 1 is an r × ν complex orthogonal matrix and Λ = Diag( 1 , . . ., ν ).
Consider now inference for a parametric model with parameter vector θ and spectral density matrix f X,θ (ω k ).The Whittle log-likelihood exploits the asymptotic result for the periodogram and is defined using the complex singular Wishart distribution as where ΩT is the set of Fourier frequencies with the omission of ω k = 0 and ω k = π.The term for ω k = 0 is not included when the time series is demeaned since then J T (0) = 0 by construction; the term for ω k = π is removed for simplicity since it has a different distribution than the other frequencies and its influence is negligible asymptotically.Note that since f X,θ (ω k ) is Hermitian for an absolutely summable stationary process (Brillinger, 2001) are real-valued.

SUBSAMPLING MARKOV CHAIN MONTE CARLO
Subsampling MCMC uses the framework of pseudo-marginal MCMC (Andrieu and Roberts, 2009), in which an estimator of the likelihood is used within a Metropolis-Hastings algorithm.This section gives a briefly reviews subsampling MCMC; see Quiroz et al. (2019) and Salomone et al. (2020) for details on how pseudo-marginal methods are used for subsampling problems.
The cost of computing the likelihood L n (θ) in the acceptance probability (3.1) is a major concern when n is large.Quiroz et al. (2019) propose speeding up MCMC for large n by replacing L n (θ) with an estimate L(θ, u) based on a small random subsample of m n observations, where u = (u 1 , ..., u m ) indexes the selected observations.Their algorithm samples θ and u jointly from an extended target distribution π(θ, u) ∝ L(θ, u)p(θ)p(u).Andrieu and Roberts (2009) show that such pseudomarginal MCMC algorithms sample from the full-data posterior π(θ) if L(θ, u) is an unbiased and almost surely positive estimator of L n (θ), where the unbiasedness condition is E u L(θ, u) = L(θ, u)p(u)du = L n (θ).Quiroz et al. (2019) use an unbiased estimator of the log-likelihood (θ, u) and then debias exp( (θ, u)) to estimate the fulldata likelihood.This debiasing approach does not remove all bias and the marginal distribution of θ from their pseudo-marginal sampler is the slightly perturbed posterior The perturbation error of η(θ) in (3.2) is within O(n −1 m −2 ) distance in total variation norm of the true posterior π(θ), and the applications in Quiroz et al. (2019Quiroz et al. ( , 2021) ) and Dang et al. (2019) show negligible bias.Salomone et al. (2020) show that these results extend to subsampling for the Whittle likelihood, with the true posterior in this case based on the Whittle likelihood for the full data.
To apply subsampling MCMC, the log-likelihood needs to decompose as a sum (θ) = ∑ n k=1 k (θ); either by assuming independent data or by using the Whittle likelihood in the frequency domain for temporally dependent data as in (2.6).Estimating the log-likelihood is analogous to the problem of estimating a population total in survey sampling (Quiroz et al., 2018b).
It is by now well-known that subsampling MCMC requires a likelihood estimator with small variance, otherwise the sampler tends to get stuck (Quiroz et al., 2019(Quiroz et al., , 2018b)).Quiroz et al. (2019) propose using the difference estimator with control variates q k (θ) to reduce the variance.The second term in the first equation is an unbiased estimate of (θ) − q(θ) and is efficient if The control variates homogenise the log-density terms in the estimator so that the observations can be sampled by simple random sampling (Quiroz et al., 2019).Bardenet et al. (2017) propose setting q k (θ) equal to a second order Taylor expansion around θ (e.g. the posterior mode).An important property of this control variate is that the q(θ) term can be computed in O(1) time.
The control variate in Bardenet et al. (2017) works well when the log-density k (θ) of each observation is approximately quadratic in the neighbourhood of θ explored by the MCMC.There is, however, no guarantee that the individual log-densities k (θ) are close to quadratic in complex models, particularly in large parameter spaces where the posterior may not be highly concentrated around θ .Salomone et al. (2020) therefore propose a grouped quadratic control variate, in which observations are divided into G groups and the log-likelihood contribution for each group is approximated by a quadratic function.The idea is that the Bernstein-von Mises theorem (asymptotic normality of the posterior) suggests an approximately quadratic log-likelihood for the group, given that the number of observations in the group is large enough.G should be chosen such that there are enough observations in each group for the approximation to be accurate.Note that the sampling units in this approach are the groups, and hence there are G group log-likelihoods to subsample.For the Whittle likelihood, the order of the dataset corresponds to the order of the frequencies in the DFT (from low to high).The groups should then be formed via systematic sampling, so that each group contains periodogram ordinates over the whole frequency spectrum; Figure 1 graphically illustrates this and the block setup which is now described.
Pseudo-marginal methods can be made much more efficient by correlating the estimators used at the numerator and denominator of the Metropolis-Hastings acceptance Algorithm 1: Subsampling MCMC with a random walk Metropolis proposal for θ and a block proposal for u.
Input: data y, unbiased log-likelihood estimator (θ, u) based on Taylor control variates, variance estimator σ 2 (θ, u) of (θ, u), prior density p(θ), initial value θ (0) , initial subsample else set u (j) , θ (j) ← u (j−1) , θ (j−1)  end Output: autocorrelated random draws θ (1) , . . ., θ (N) from η(θ) in (3.2). ratio (Deligiannidis et al., 2018;Tran et al., 2016;Quiroz et al., 2021).We use the Tran et al. (2016) approach and divide the random numbers u (here the set of subsampled observations indices) into K blocks.By updating the u only within one of the blocks at each MCMC iteration, Tran et al. (2016) show that the correlation between the logs of the estimators in the numerator and denominator is approximately 1 − 1/K, and setting K large makes the pseudo-marginal method less sensitive to the variance of the log of the likelihood estimator; see Figure 1 again for the relation between the grouping described in the previous paragraph and the block setup.
Algorithm 1 sketches the implementation of the algorithm; see Quiroz et al. (2019) for details.Salomone et al. (2020) demonstrate that spectral subsampling MCMC can be successfully applied in univariate time series models with likelihood functions that are known to be non-Gaussian in the parameters due to local non-identification, such as ARMA and ARTFIMA.We explore the performance of spectral subsampling MCMC for multivariate versions of ARMA and ARTFIMA, which pose an even greater challenge since the number of parameters in multivariate models typically increase quadratically with the number of time series.4.1.Vector ARMA.The vector autoregressive moving average VARMA(p, q) model is

THE VECTOR ARTFIMA PROCESS
are the AR and MA lag polynomials, respectively.We assume that the usual conditions for stationarity and invertibility of the VARMA process hold: Assumption 1.The matrix polynomials Φ(z) and Θ(z) share no common zeros and |Φ(z The spectral density matrix of the VARMA(p, q) model is (Brockwell and Davis, 1991, Ch. 11 where A −H denotes the inverse of the conjugate transpose of the matrix A. The Whittle likelihood is valid if the process is stationary, i.e. if all roots of Φ(z) are outside of the unit circle.We therefore use the reparametrisation in Ansley and Kohn (1986) to map a set of unconstrained real-valued AR coefficient matrices to the set of stationary parameters.The same reparametrisation is also used on the MA parameter matrices to ensure invertibility.These reparametrisations are an additional source of non-linearity/non-Gaussianity; even a plain vector AR (VAR) model is no longer linear in the parameters.4.2.Vector ARTFIMA.Sabzikar et al. (2019) define the univariate ARTFIMA(p, d, λ, q) process for Y t as where {ε t } t∈Z is an iid sequence of zero mean random variables with variance , are the autoregressive and moving average lag polynomials, and L is the lag operator, i.e.L k Y t ≡ Y t−k .The tempered fractional differencing operator ∆ d,λ , where d / ∈ Z is the fractional differencing parameter and λ ≥ 0 is the tempering parameter, is defined by the generalised binomial theorem as We follow the convention in time series of not explicitly writing out the lag operator L in differencing operators unless needed for clarity, i.e., ∆ d,λ ≡ ∆ d,λ (L).
To explain the role of the parameters d and λ, note that for λ = 0 and d a nonnegative integer, ∆ d,λ Y t reduces to simple differencing of order d and the ARTFIMA model in (4.3) reduces to the autoregressive integrated moving average (ARIMA) process.For λ = 0 and fractional d we obtain the autoregressive fractionally integrated moving average (ARFIMA) model (Granger and Joyeux, 1980).The ARFIMA process is stationary and invertible for −0.5 < d < 0.5, and has long-range or long-memory dependence with an autocovariance function dying off so slowly that it is not absolutely summable.The tempering parameter λ > 0 in ARTFIMA allows for semi-long range dependence, i.e.ARFIMA-like long range dependence for a number of lags beyond which the autocovariances decay exponentially fast.Sabzikar et al. (2019) prove that the ARTFIMA process is stationary for any d / ∈ Z if λ > 0, provided that the univariate case of Assumption 1 holds.
The univariate ARTFIMA process is now extended to the multivariate case, where Y t is an r-dimensional vector-valued time series.We define the vector ARTFIMA (VART-FIMA) process as where ∆ d,λ is the multivariate tempered fractional differencing operator defined by hence allowing for different fractional differences and temporal differencing for the r time series.Theorem 1 generalises the spectral density result in Sabzikar et al. (2019) to the vector case.Its proof is in Appendix A.
Theorem 1.The multivariate ARTFIMA process is causal and stationary for all d / ∈ Z and all

APPLICATIONS
We now illustrate the proposed subsampling MCMC methodology on three long multivariate time series datasets of varying dimensions.
5.1.Datasets.The first dataset contains observations on mean water velocity from two measurement stations at St Clair River and Detroit River, located on opposite sides of Lake St Clair forming a connecting channel between Lake Huron and Lake Erie.The measurements are recorded every 12th minute.The final dataset contains 130,001 observations from Jan 3, 2016 at 00:00 hours until Dec 21, 2018 at 08:00 hours for each of the two locations.Missing observations are imputed using the na.interp function in the R package forecast.Figure 2 plots the data.
The second dataset contains 124,879 observations of Swedish temperatures at each of three airport locations (Arlanda, Bromma and Landvetter), giving a three-dimensional time series.The data are measured in an hourly scale for the time period February 1, 2008 until May 1, 2022, and are processed as follows.Each univariate series is preprocessed separately: missing observations are imputed with the value at a nearby location if available or otherwise imputed with the na.interp function in the R package forecast.Trend and seasonal components are then removed using the mstl function in the R package stats removing both a daily and an annual seasonal cycle.Figure 3 plots the processed data.The last 878 observations are used as a test set for evaluating prediction performance in Section 6 and are therefore not used for the posterior inference.The raw data before pre-processing are plotted in Figure 17 in Appendix B.
The third dataset contains 50,001 observations of nitrogen dioxide (NO2) and particulate matter (PM10) pollution at two streets in central Stockholm, giving a fourdimensional time series.The data are measured hourly for the time period February 16, 2010 until October 31, 2015.The missing values are imputed with the na.interp function for seasonal data in R. The series are analyzed in logs through the transformation log(x t − min{x t } T t=1 + 1) to make the series more normally distributed; we subtract the  for a demonstration on how varying G affects the variance of the estimator.Figure 1 illustrates how the gth group is chosen by including the gth lowest frequency and subsequently systematically sampling every Gth frequency, ensuring that each group contains periodogram ordinates over the whole frequency spectrum.We use 10 randomly selected groups (1% of the data) to estimate the full-data likelihood in each MCMC iteration.For the block pseudo-marginal, we divide the random numbers into K = 10 blocks (one group per block), resulting in a correlation between the log of the estimators of approximately 0.9 as discussed in Section 3.For all examples, we sample 55,000 draws from the posterior distribution and discard 5,000 draws as burn-in.
For the ARTFIMA models, we allow for different differencing parameters for each time series.The tempering parameters are restricted to be equal for each time series as the more general model with different λ gave estimates of λ that are very close across all series in a given dataset.
We use a Minnesota-style prior (Doan et al., 1984) for the autoregressive and moving average coefficients, which is a normal prior with diagonal covariance matrix with elements All prior means are set to zero.The hyper-parameter λ 0 controls how tightly the coefficient of the first lag is concentrated around 0; we set λ 0 = 1.Note that the prior variance decreases with increasing lag length l, allowing more shrinkage of coefficients corresponding to lags further back in time.The hyper-parameter θ 0 accounts for the belief that most of the variation in each variable is accounted for by its own lags and is hence set to a value in the range [0, 1).We set θ 0 = 0.2.The ratio σ 2 i /σ 2 j accounts for the difference in the variability of the variables.Following standard practice, σ 2 i is set to the residual variance computed by fitting a univariate AR model to the ith series.Note, however, that we use the above Minnesota prior on the unrestricted parameters in the Ansley and Kohn (1986) reparametrization so the prior is acting on (a rescaled version of) the partial autocorrelation matrices instead of the original AR and MA coefficients.We parameterise the covariance matrix Σ ε as a Cholesky factor with a logarithm transform on the diagonals and assign independent N(0, 0.1) priors for all elements in this parameterisation.We also use a log-transformation for the single λ and assign N(0, 0.1).Finally, we assign independent priors d k ∼ N(0, 1), for k = 1, . . ., r.We have verified that our results are robust to the choice of prior.

Model comparison and fit.
We perform model selection using the BIC approximation of the log marginal likelihood (Kass and Raftery, 1995) and Raftery (1995).
where k is the number of estimated parameters, n is the length of the time series and θ is the maximum likelihood estimate obtained by numerical optimisation.We fit all combinations of models for p + q ≤ 2 and Table 1 shows the log marginal likelihood for all models considered for the three datasets introduced in Section 5.1.The results show that models with tempered fractional differencing give a better model fit for all datasets and AR/MA orders.The two pure MA models, VARMA(0, 1) and VARMA(0, 2), perform very poorly on the temperature data, but improve dramatically when tempered fractional differencing is added.According to the BIC approximation of the marginal likelihood, we conclude the following: • VARTFIMA(0, 2) (14 parameters) is best for the water velocity dataset.
We implemented spectral subsampling MCMC successfully in all cases except for • Water velocity: VARMA(0,2).
The reason for the occasional failure of spectral subsampling MCMC for some models is that the control variates do not reduce the variance of the likelihood estimator sufficiently, even after grouping.The situation improves for longer time series, since the control variates typically improve with more data (Quiroz et al., 2019).To illustrate this, we consider the VARTFIMA(0,2) model for the Swedish temperature data, where spectral subsampling was successful.The top left graph in Figure 5 shows the standard deviation of the log-likelihood estimator over the MCMC iterations, which is the key quantity determining the efficiency of pseudo-marginal algorithms (Pitt et al., 2012); the optimal standard deviation from Lemma S8 in Quiroz et al. (2021) is also indicated in the graph.The variability of the estimator does not exceed the optimal value and spectral subsampling works well, as exemplified by the good mixing of the MCMC chain in the bottom left graph of Figure 5 for the AR parameter of the first series on its own first lag, Φ (1) 11 .The top right graph in Figure 5 shows the same quantity, but from spectral subsampling on a posterior based on only the last 60001 data points.The variability of the estimator is initially below the optimal value and the sampler mixes well, but the algorithm eventually moves to a region of the parameter space where the estimator is often much too variable, leading to low acceptance rates and the MCMC chain getting stuck for long spells (bottom right figure).
Since the control variates are based on a Taylor expansion around the posterior mode, they will work poorly when the posterior has multiple modes or a long ridge of fairly constant density.Such shapes are typical in models with local non-identification.VARMA models are, for example, known to have local non-identification problems resulting from near cancellation of roots in the AR and MA polynomials (Chib and Greenberg, 1994), i.e. the likelihood function is flat in the direction of certain linear combination of the parameters.There are also potential identification issues in the MA part of VARMA models (Lütkepohl, 2013, Ch. 7).Adding tempered fractional differencing provides more flexibility, but also additional local identification problems since the fractional differencing parameter d becomes non-identified when λ → ∞.Prior distributions alleviate these identification issues to some extent, but the posteriors in the models considered here are nevertheless very challenging.Luckily, subsampling MCMC fails on the worst fitting models in the Water velocity and Swedish temperature data, but unfortunately also gets stuck on the two best fitting models for the Pollution data.For the Pollution dataset we will therefore only present results using MCMC on the full data without subsampling, but using both the exact time domain likelihood as well as the approximate Whittle likelihood.
Figure 6 assesses the fit of the VARTFIMA(0,2) for the Water velocity data by how well the predictive distribution captures the univariate periodogram data.The predictive distribution is computed by, for each posterior draw θ, simulating 100 periodogram observations using the Whittle approximation where I j,T is the periodogram data for the jth time series.A nonparametric multitaper estimate (Barbour and Parker, 2014) is also shown as a grey line in the figures.The predictive distribution in Figure 6 captures the periodogram data quite well.The multitaper estimate suggest that there are some peaks in the spectral density not captured by the VARTFIMA(0,2) model.A VARTFIMA with higher AR and MA orders would be able to also fit those spectral bumps, but the peaks occur at different periods in the two series and may just be an artefact of the large number of observations in the dataset.Figures 19 and 20 in Appendix B present the same results for the two other datasets.The figures show that some seasonality seems to have survived the pre-processing of the datasets.An alternative is fitting a multivariate seasonal ARTFIMA model to the original data, but that is not pursued here.

Efficiency of spectral subsampling MCMC.
To measure the computational advantage of subsampling we use a performance measure that takes into account both where IACT ≡ 1 + 2 ∑ ∞ k=1 ρ k is the integrated autocorrelation time of the MCMC chain, and ρ k is the autocorrelation at lag k of the posterior draws.CT measures the execution time for obtaining the equivalent of a single iid draw from the posterior.To obtain an implementation independent measure, the computing time is set proportional to the number of density evaluations in a run of the algorithm, including the evaluations needed to construct the control variate for subsampling.We use the coda package (Plummer et al., 2006) in R to estimate the integrated auto-correlation time; see Quiroz et al. (2019) for more details.
Table 2 shows the relative computational time (RCT) of MCMC on the full datasets in relation to spectral subsampling MCMC.The RCT is defined as the ratio between the CT of full-data MCMC and that of spectral subsampling MCMC.Hence, values larger than one mean that spectral subsampling is more efficient when taking into account both the computing cost and the sampling efficiency.The results show that our subsampling algorithm is between 68-125 times faster than MCMC on the full dataset when RCT is the measure of computational efficiency.Similar speed-ups are observed on the Pollution dataset for the models where subsampling MCMC did not get stuck.2. Relative computational time (CT) of comparing MCMC using the full dataset to spectral subsampling MCMC.The value 1 indicates that spectral subsampling MCMC and MCMC are equally efficient, and values larger than 1 indicate that spectral subsampling MCMC is the better algorithm.The results for the pollution dataset are not shown as subsampling MCMC got stuck for the best models for that dataset (VARTFIMA(2, 0) and VARTFIMA(1, 1), see Table 1).FIGURE 7. Kernel density estimates of a subset of the marginal posterior densities for the VARTFIMA(0,2) model fitted to the Water velocity data.The solid densities are from MCMC using the Whittle posterior on whole dataset, and the dashed densities are obtained from spectral subsampling MCMC.

Accuracy of spectral subsampling for the Whittle posterior.
Whittle likelihood for the full data.The next section investigates how well the Whittle likelihood for the full data approximates the time domain likelihood in finite samples.
Figure 7 and 8 compare kernel density estimates of the posterior distribution of a subset of the parameters using subsampling MCMC and full-data MCMC.We conclude that the subsampling MCMC algorithm provides very similar answers, and Table 2 shows it is up to two orders of magnitude faster.FIGURE 8. Kernel density estimates of a subset of the marginal posterior densities for the VARTFIMA(2,0) model fitted to the Swedish temperature data.The solid densities are from MCMC using the Whittle posterior on the whole dataset, and the dashed densities are obtained from spectral subsampling MCMC.
The individual parameters in time series models are seldom of practical interest.We will therefore now explore how well our method approximates the posterior of summary quantities based on the spectral density matrix that are often used in practical work.The coherence between two time series X it and X jt at frequency ω is defined as measures the linear correlation between the pair of time series X it and X jt at frequency ω.Expressing the complex-valued cross spectral density f ij (ω) in polar form as where the phase spectrum ), measures the time shift of the signal at frequency ω, the time delay from variable X it to variable X jt is measured by −ϕ ij (ω)/ω (Wei, 1990).See Geweke (1982) and Ashby (2019) for connections between these frequency based measures of linear association FIGURE 9. Posterior for the spectral density matrix in the VART-FIMA(0,2) model fitted to the Water velocity data.The plots on the diagonal are the marginal spectral densities for each time series.The plots above the diagonal are the squared coherence, and the plots below the diagonal are the time delays from the phase spectrum.The dashed lines in each subplot displays the posterior median and 95% credible intervals from spectral subsampling MCMC.The shaded regions are the 95% credible intervals from the Whittle posterior on the whole dataset.and direction, and cross-correlation functions and Granger causality measures in the time domain.
Figure 9 plots the posterior distribution of these spectral density matrix quantities for the Water velocity dataset.The figure shows the marginal spectral densities (graphs on the diagonal) of each time series, squared coherence (above diagonal) and phase/time delay (below diagonal).Each figure displays the posterior mean and 95% credible intervals from spectral subsampling MCMC as lines.The 95% credible intervals from MCMC on the Whittle posterior based on the full dataset is shown as shaded regions.As expected from the previously shown parameter posteriors, we again see that subsampling gives virtually no distortion with respect to the Whittle posterior on the full dataset.The same is true for the temperature data; see Figure 21 in Appendix B.
The interpretation of the lag delay is clearest in systems without feedback loops.This is likely to be the case only in the Water velocity data where the Denver River location (x 2 ) is located downstreams of the St Clair River location (x 1 ).The top right graph in Figure 9 shows, however, that there is essentially no coherence between the FIGURE 10.Kernel density estimates of a subset of the marginal posterior densities for the VARMA(1,1) model fitted to the Water velocity data.The solid orange densities are from MCMC on the exact time domain posterior, the solid blue densities are from MCMC using the Whittle posterior on the whole dataset, and the dashed densities are obtained from spectral subsampling MCMC.two locations at any frequency, so the lag delay is insignificant.This is most probably because the two locations are quite far apart (approx 100 kilometers) and separated by Lake St Clair.The VARMA(1,1) model fitted in the next subsection picks up a sizeable coherence only at the very lowest frequencies and a positive delay at those frequencies (see Figure 11), which is the expected sign since Denver River is downstream of St Clair River.5.6.Accuracy of the Whittle posterior for the time domain posterior.The previous subsection shows that spectral subsampling MCMC gives very accurate approximations to the Whittle posteriors based on the full datasets.The Whittle likelihood can, however, be a poor approximation to the exact time domain likelihood (Contreras-Cristán et al., 2006), at least for shorter time series.However, the original motivation for spectral subsampling MCMC is settings where the time series are very long, and this section demonstrates that the Whittle approximation is excellent in the three datasets in Section 5.1.FIGURE 11.Posterior for the spectral density matrix in the VARMA(1,1) model fitted to the Water velocity data.The plots on the diagonal are the marginal spectral densities for each time series.The plot above the diagonal is the squared coherence, and the plot below the diagonal is the phase spectrums.The dashed lines in each subplot displays the posterior median and 95% credible intervals from the Whittle posterior on the whole dataset.The shaded regions are the 95% credible intervals from MCMC on the exact time domain posterior on the whole dataset.
The time domain likelihood is only computationally feasible for VARMA models, and hence we illustrate these results using a VARMA(1, 1) model for each dataset.Figure 10 shows the kernel density estimates of the posterior distribution of a subset of the parameters.Figure 11 plots the time domain and Whittle posteriors of the spectral density matrix based on the full Water velocity dataset.The results from the Whittle posterior are nearly indistinguishable from the exact time domain posteriors.Also, there are virtually no differences between the time domain posteriors and the whittle posteriors in the two other datasets; see Figure 12 and 13 for the temperature data and Figure 23 in Appendix B for the pollution data.In summary, we conclude from this and the previous subsection that spectral subsampling MCMC gives an accurate approximation of the posterior based on the exact time domain likelihood.
We have implemented the exact time domain likelihood for VARMA models using the state space representation in the Python package statsmodels and fair timing comparisons with spectral subsampling MCMC are therefore difficult, but it is well known that time domain likelihoods are substantially more costly than their frequency domain (Whittle) counterparts, even without the extra speed-up from subsampling.FIGURE 12. Kernel density estimates of a subset of the marginal posterior densities for the VARMA(1,1) model fitted to the Swedish temperature data.The solid orange densities are from MCMC on the exact time domain posterior, the solid blue densities are from MCMC using the Whittle posterior on the whole dataset, and the dashed densities are obtained from spectral subsampling MCMC.

COMPARING ARTFIMA AND VARTFIMA
Our paper focuses on the computational performance of spectral subsampling MCMC on challenging multivariate processes, and we leave a detailed analysis of the empirical performance of the VARTFIMA model to a separate paper.However, a reviewer suggested comparing the proposed multivariate VARTFIMA model with univariate ART-FIMA models (Sabzikar et al., 2019) for each series separately, and we present some results on this aspect below using the Swedish temperature data.
First, we compute the BIC approximation of the log marginal likelihood for a univariate version of the preferred VARTFIMA(2,0) model by restricting the diagonals of all Φ and Θ and Σ to zero in the estimation, but allowing for different tempering parameter and fractional differencing for each series.This corresponds to fitting ARTFIMA models to each series separately, but still obtaining a single BIC that approximates the marginal likelihood of all time series; these BIC values are therefore comparable to the ones from the multivariate VARTFIMA models.The BIC for this univariate version is 312824 which should be compared to the BIC of 335757 for the multivariate VART-FIMA(2,0) in Table 1.The massive improvement in BIC from jointly modeling the time FIGURE 13.Posterior for the spectral density matrix in the VARMA(1,1) model fitted to the Swedish temperature data.The plots on the diagonal are the marginal spectral densities for each time series.The plot above the diagonal is the squared coherence, and the plot below the diagonal is the phase spectrums.The dashed lines in each subplot displays the posterior median and 95% credible intervals from the Whittle posterior on the whole dataset.The shaded regions are the 95% credible intervals from MCMC on the exact time domain posterior on the whole dataset.series in the VARTFIMA(2,0) model is expected since the series are cross-correlated; see Figure 21.
Furthermore, the fractional differencing parameters can be quite different when modeling the time series jointly as in VARTFIMA compared to univariate ARTFIMA models.Figure 14 compares the marginal posteriors of the fractional differencing and tempering parameters in the multivariate VARTFIMA(2,0) model to univariate ARTFIMA(2,0) models in the Swedish temperature data.The VARTFIMA model in Figure 14 is restricted to have the same λ in the three series; the results are quite similar when separate λ are used.While the tempering parameters are rather similar for the univariate and multivariate models, the fractional differencing is substantially reduced for all three series when the multivariate model is used.
Finally, we compare the forecasting performance of VARTFIMA(2,0) model to the univariate ARTFIMA(2,0).The models are estimated on the same data as used above, i.e. the period from February 1, 2008 at 00:00 hours to March 25, 2022 at 16:00 hours, and the forecasts are evaluated over subsequent period from March 25, 2022 at 17:00 hours FIGURE 14. Comparing the marginal posteriors for the fractional differencing and tempering parameters in the multivariate VARTFIMA(2,0) model (blue) to univariate ARTFIMA models (orange) fitted to the Swedish temperature data.The multivariate model is restricted to have the same λ for each time series and the posterior density for this common λ is therefore repeated in each graph on the second row in the figure.Spectral subsampling MCMC is used for both models.
to May 1, 2022 at 6:00 hours, making up a test set with 878 number of observations.Since the training data is very large and the estimates are precise, we do not update the estimates as we move across the test data.
There are several ways of computing the forecasts from the VARTFIMA model.One way is by computing the autocovariance matrix function from an inverse FFT of the spectral density matrix at the estimated VARTFIMA parameter and then using the conditioning properties of the multivariate normal distribution to get the forecast, see Hamilton (1995) for the theory and McLeod et al. (2008) for an R package implementation.This is an elegant and general approach, but does not scale well to large data sets: the artfima package in R, which uses the McLeod et al. ( 2008) R package for forecasting, crashes when using more than 20% of the Swedish temperature data on a 32 GB RAM Linux machine.
We take a more direct and computationally faster approach here by approximating the VARTFIMA(p, q) with a VARMA(p , q) for a sufficiently large p .The VARTFIMA(p, q) model is (6.1)where from the definition of ∆ d,λ we can express (6.2) ) and b d k ,λ k j ≡ (−1) j ( d k j )e −λ k j for j = 1, 2, . ... We can therefore write (6.3) where the Π k can be found by equating the coefficients term by term for each L k to obtain (6.4) by defining B 0 = I r and B j = 0 for j < 0. Since b d k ,λ k j → 0 as j → ∞ exponentially fast eventually, if λ k > 0 (Sabzikar et al., 2019) we have that the elements of Π k goes to zero as k increases.We can therefore truncate this infinite order lag matrix polynomial at some lag p to get a VARMA(p , q) approximation (6.5) where I r − Π 1 L − . . .− Π p L p .Forecasts k steps ahead can now be obtained with established techniques for VARMA models (Tsay, 2013); we use the MTS package in R. Figure 15 shows that p = 10 is more than sufficient for approximating the VARTFIMA(2, 0) model used in the forecast evaluation below on the Swedish temperature data in that the forecast do not change beyond p = 10.
Figure 16 shows that the multivariate VARTFIMA(2,0) produces more accurate forecasts at all horizons than univariate ARTFIMA(2,0) models fitted to each time series FIGURE 16.Comparing the out-of-sample RMSE forecasting performance of the VARTFIMA(2,0) model (solid blue) to univariate ART-FIMA(2,0) models fitted to each series (dashed orange) in the Swedish temperature data.Both models were fitted to the training data (Jan 3, 2016 -to Dec 21, 2018) and the estimates were then kept fixed throughout the forecasting period (Dec 21, 2018-Dec 29, 2018).
separately, particularly at the longer horizons; note that a forecast horizon with h = 50 corresponds to 10 hours ahead.

CONCLUSIONS
Our paper proposes a subsampling MCMC approach for stationary multivariate time series models.Using a measure which takes into account both computing cost and the statistical inefficiency of likelihood estimators, we demonstrate a speed-up factor of up to two orders of magnitude on three datasets compared to MCMC using the full dataset.
To test the proposed spectral subsampling MCMC in challenging problems, we propose a new multivariate time series model by extending the univariate ARTFIMA model to a multivariate setting.Some properties of this new model are derived, including its spectral density matrix, and our results show that the vector ARTFIMA model outperforms VARMA models in all three datasets.This suggest the VARTFIMA model as a useful model for modelling a wide range of multivariate time series and future work should investigate this more fully.
Our work demonstrates that MCMC sampling to explore the posterior based on the Whittle likelihood scales well to very long multivariate time series, and that the approximation to the exact time domain posterior is excellent.We further show that spectral subsampling can give an additional speed-up of two orders of magnitude on large time series in rather complex multivariate time series models with semi-long range dependence.The biggest challenge is to obtain good control variates in high-dimensional parameter spaces for models with potentially very non-Gaussian likelihoods.We are unable to obtain good enough control variates for the best fitting models on the shortest of the three dataset.This is probably the result of the VARMA and VARTFIMA class of models having challenging likelihood functions, especially the moving average (MA) part of the model.We demonstrate that longer time series are more amenable to subsampling than shorter ones, so spectral subsampling MCMC is a method most suitable to large scale problems, at least when it comes to VARMA/VARTFIMA models.Future research will extend the methodology to high-dimensional complex models by developing better control variates in high-dimensions that consider the local nonidentification aspects, and exploring the potential of alternative inference algorithms such as variational inference (Blei et al., 2017).Variational inference with an estimated likelihood is less sensitive to the variability in the estimator (Tran et al., 2017), and is thus an appealing alternative to explore.
and A −H the inverse of the conjugate transpose of the matrix A.

FIGURE 3 .
FIGURE 3. Swedish temperature data.The final 878 time points in red are used for testing the forecast accuracy of the model.

FIGURE 5 .
FIGURE 5. Standard deviation of the log-likelihood estimator over the MCMC iterations for the VARTFIMA(2,0) model fitted to the Swedish termperature dataset using all the data (top left) versus when only the last 60001 observations was used in the estimation (top right).The optimal standard deviation (Quiroz et al., 2021, Lemma S8) is marked out with an orange line.The graphs in the bottom row of the figure show the MCMC chains for the AR parameter of the first series on its own first lag, Φ (1) 11 , for the respective model.

FIGURE 6 .
FIGURE 6. Posterior predictive fit of the univariate periodogram data for the VARTFIMA(1,1) model fitted to the Water velocity data.The predictive intervals are obtained by simulation from the asymptotic Whittle distribution I(ω) ∼ Expon( f θ (ω)) with parameters θ drawn from the VARTFIMA(1,1) posterior.
FIGURE 15.Investigating the convergence of forecasts as p increases in the VARMA(p , 0) approximation of the VARTFIMA(2, 0) model.The forecasts are all produced standing at the last time point in the training data.
FIGURE 17. Swedish temperature data after interpolation, but before deseasoning.

FIGURE 18 .
FIGURE 18. Stockholm air pollution data after interpolation and logarithmic transform, but before deseasoning.

FIGURE 19 .
FIGURE 19.Posterior predictive fit of the univariate periodogram data for the VARTFIMA(2,0) model fitted to the Swedish temperature data.The predictive intervals are obtained by simulation from the asymptotic Whittle distribution I(ω) ∼ Expon( f θ (ω)) with parameters θ drawn from the VARTFIMA(1,1) posterior.

FIGURE 20 .
FIGURE 20.Posterior predictive fit of the univariate periodogram data for the VARTFIMA(2,0) model fitted to the Stockholm pollution data.The predictive intervals are obtained by simulation from the asymptotic Whittle distribution I(ω) ∼ Expon( f θ (ω)) with parameters θ drawn from the VARTFIMA(1,1) posterior.

FIGURE 21 .
FIGURE 21.Posterior for the spectral density matrix in the VART-FIMA(2,0) model fitted to the Swedish temperature data.The plots on the diagonal are the marginal spectral densities for each time series.The plots above the diagonal are the squared coherences, and the plots below the diagonal are the time delays from the phase spectrum.The dashed lines in each subplot display the posterior median and 95% credible intervals from spectral subsampling MCMC.The shaded regions are the 95% credible intervals from the Whittle posterior on the whole dataset.

FIGURE 22 .
FIGURE 22. Posterior for the spectral density matrix in the VART-FIMA(2,0) model fitted to the Stockholm pollution data using the Whittle posterior on the whole dataset.The plots on the diagonal are the marginal spectral densities for each time series.The plots above the diagonal are the squared coherence, and the plots below the diagonal are the time delays from the phase spectrum.The posterior mean is marked out with a solid line and the shaded regions are the 95% credible intervals.

TABLE 1 .
BIC approximation of the log marginal likelihood for different models for each of the three datasets in Section 5.1.A higher value indicates a better model fit.The AR and MA columns indicate the lag order in the AR and MA component.The No TFI and TFI columns indicate if the process has tempered fractional differencing.The model with the largest marginal likelihood for each dataset is marked in bold font.Both the VARTFIMA(2,0) and VARTFIMA(1,1) are in bold font for the Pollution data since the difference between them is 'not worth more than a bare mention' on the modified Jeffreys' scale of evidence in Kass This subsection explores how well spectral subsampling MCMC approximates the posterior based on the