Continuous-time multivariate analysis



1 Introduction
The generic mathematical object that underpins much of multivariate analysis (MVA) is an n × p data matrix X whose n rows represent observations and whose p columns represent variables. While classical MVA typically assumes the n observations to be independent, there is a long history of applications of MVA to multivariate time series, notably in econometrics and atmospheric science (Jolliffe, 2002). In this paper we argue that some time-indexed multivariate data sets may be best conceptualized not as a time series of n discrete p-variate observations, but as p curves or functions defined on a common time interval; and we introduce a framework for straightforwardly extending several techniques of MVA to this setting.
Heuristically, our proposed framework, henceforth continuous-time multivariate analysis (CTMVA), is the "transpose" of functional data analysis (FDA; Ramsay and Silverman, 2005), in the following sense. In a programmatic address that introduced the term "functional data" and foreshadowed the FDA paradigm, Ramsay (1982) argued that some modern data sets call for replacing the traditional p variables with infinitely many points on a continuum, as in the upper right picture in Figure 1, and suggested that spline methodology could play a key role in handling such infinite-dimensional data. He briefly noted that instead of taking p = ∞ as in FDA, one might consider situations in which n = ∞, as in the lower left picture in Figure 1; but he did not develop the n = ∞ case. The CTMVA framework is an attempt to do so, while maintaining an emphasis on the role of spline and other basis-function representations as in Ramsay (1982) and Ramsay and Silverman (2005).
To illustrate the contrast between CTMVA and FDA it may be instructive to reconsider what is perhaps the most famous example of functional data: the Canadian temperature data of Ramsay and Silverman (2005), consisting of mean temperatures on each day of the year at 35 weather stations in Canada. In FDA, these are viewed not as n = 35 observations of dimension p = 365, but as smooth curves x_1(t), . . ., x_35(t) on the time interval I = [0, 365], so that in effect p = ∞ (uncountably infinite). Figure 2 displays both the raw data (upper panel) and the curves following roughness-penalty smoothing with respect to a 45-element Fourier basis (middle panel), with the smoothing parameter chosen by restricted maximum likelihood (Ruppert et al., 2003; Reiss and Ogden, 2009; Wood, 2011). In CTMVA we again view the data as smooth curves x_1(t), . . ., x_35(t), but these are now understood as n = ∞ observations of p = 35 variables.

Figure 1: Schematic diagram presented by Ramsay (1982) to explain "the concept of a functional datum." The diagram appeared as Figure 2 of that paper above the caption: "Possible domains for statistical observations. These domains depend on whether the number of replications or cases (n) is finite or infinite and whether the number of variables or points of observation (p) is finite or corresponds to the points on a continuum." The upper right represents functional data, whereas the lower left scenario, mentioned but not developed by Ramsay (1982), is the continuous-time multivariate analysis setup of the present paper. Figure reproduced with permission from Springer Nature.
In FDA, given a sample of curves arising from a stochastic process on a finite interval I ⊂ R, much interest focuses on Γ(s, t), the covariance function of the underlying process.
In particular, in functional principal component analysis one estimates the eigenfunctions of the associated covariance operator. In CTMVA, by contrast, the covariances or correlations of interest are those among the curves, representing, in this example, relationships among the various weather stations. In §4 below we discuss the matrices shown at the bottom of Figure 2, representing correlations among the 35 stations' temperature curves, and in Appendix A we consider correlations among precipitation curves from the same data set.
The practical impetus for CTMVA is twofold. First, the strategy we present below, of representing a multivariate time series as a set of smooth curves rather than a discrete set of MV observations, can sometimes improve performance of MVA techniques. We demonstrate this below for correlation estimation, in the simulation study of §8, and for clustering, with the environmetric data analyzed in §9. Second, CTMVA can sometimes be applied when classical MVA cannot; for example, if the p variables are not measured at a uniform set of time points (as in §3.2 below).
After presenting the basic setup and notation in §2, in §3 we define continuous-time versions of the sample mean vector and the sample covariance and correlation matrices. As part of this development, §4 clarifies the basic but possibly confusing role of centering, which can have an important effect on correlation estimation. In §5-§7 we develop continuous-time (CT) versions of three key MVA techniques, namely principal component analysis, Fisher's linear discriminant analysis and k-means clustering. A simulation study and an application to air pollution data are presented in §8 and §9, respectively. The paper concludes with a brief discussion in §10.

2 Setup
In what follows we assume x(t) ≡ [x_1(t), . . ., x_p(t)]^T to be a realization of a stochastic vector process X : I → R^p. Depending on the application, the process X(t) = [X_1(t), . . ., X_p(t)]^T may or may not be stationary. If it is, the mean vector and covariance matrix do not depend on t and thus may be denoted by µ and Σ. In some applications the p functions exhibit a common trend, i.e., a time-varying smooth function m(t) = E[p^{-1} Σ_{u=1}^p X_u(t)], an estimate of which should be subtracted from each component of x(t) before proceeding with data analysis. We explain this with reference to an example in §4.
Following Ramsay and Silverman (2005), we make the fundamental simplifying assumption that, for u = 1, . . ., p, the uth function can be represented by means of an expansion x_u(t) = c_u^T φ(t), where φ(t) = [φ_1(t), . . ., φ_K(t)]^T is a vector of a priori smooth basis functions on I. Thus, if C is the K × p coefficient matrix with uth column c_u, then

x(t) = C^T φ(t). (1)

Advantages of basis functions over other smoothing techniques include flexibility, compact representation of even large MV data sets, and not requiring equally spaced time points or a uniform set of times for the component functions.
The observed data from which this basis expansion is derived may be a traditional n × p data matrix, representing p variables observed at a common set of n time points, with smoothing applied to each of the p columns (see §3.3 below). But in some applications (e.g., in §3.2), the variables may be observed at different time points. The only requirement here is that the same basis is employed for all p curves. For curves that are cyclic it is natural for φ to be a Fourier basis: for the weather data, for example, this ensures fits that transition smoothly from December 31 to January 1. For other applications, B-spline bases, most often cubic B-splines, are a very popular choice due to their computational efficiency and favorable approximation properties. In practice, the representation is typically obtained by penalized smoothing (see §3.3), and this renders the results fairly insensitive to the choice of K (e.g., Ruppert et al., 2003).
Our implementation of CTMVA in the R environment (R Core Team, 2023) depends on the B-spline and Fourier basis functions from the fda package for functional data analysis (Ramsay et al., 2009, 2023). In some applications it might be desirable to employ data-driven basis functions such as functional principal components (Silverman, 1996).
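As a concrete illustration, the following minimal R sketch smooths the Canadian temperature data with respect to a 45-element Fourier basis, as in Figure 2. The fda objects used here (CanadianWeather, day.5, create.fourier.basis, Data2fd) are real, but this is our own sketch, not code from the ctmva package, and Data2fd's simple default smoothing stands in for the REML-based choice described above.

library(fda)

## 365 x 35 matrix of mean daily temperatures at the 35 weather stations
temp <- CanadianWeather$dailyAv[, , "Temperature.C"]

## 45-element Fourier basis on [0, 365], ensuring periodic (cyclic) fits
fbasis <- create.fourier.basis(rangeval = c(0, 365), nbasis = 45)

## Coefficient matrix C (45 x 35): each column represents one station's curve
tempfd <- Data2fd(argvals = day.5, y = temp, basisobj = fbasis)
C <- tempfd$coefs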
If S is a subinterval, or finite union of subintervals, of I, we define

φ̄_S = |S|^{-1} ∫_S φ(t) dt and Q_S = |S|^{-1} ∫_S [φ(t) − φ̄_S][φ(t) − φ̄_S]^T dt. (2)

Let φ̄ = φ̄_I and Q = Q_I. These expressions will be used repeatedly in what follows.
Computation of φ̄ and Q is discussed in Appendix B.
3 Continuous-time sample mean vector and covariance and correlation matrices

3.1 Definitions
In what follows we motivate the continuous-time sample mean vector and sample covariance matrix as analogues, or limits, of their classical counterparts. These will be seen, moreover, to be natural estimates of corresponding stochastic process parameters (see §3.3 below).
In classical multivariate analysis, given a data matrix X = (x_ij)_{1≤i≤n,1≤j≤p} representing n observations of a p-dimensional random vector, the p-dimensional sample mean vector and p × p covariance matrix are given, respectively, by x̄ = (x̄_1, . . ., x̄_p)^T, where x̄_j = n^{-1} Σ_{i=1}^n x_ij, and S = n^{-1} Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T, where x_i^T denotes the ith row of X [we follow Mardia et al. (1979) in dividing by n rather than by n − 1].
The CT (sample) mean vector x* and covariance matrix S* (here and below we use * to signify CT analogues of classical MVA estimators) are simply the limits, as n → ∞, of their classical counterparts x̄, S, when the rows of X are x(t_1)^T, . . ., x(t_n)^T, where t_1, . . ., t_n form a grid of points spanning I at equal intervals of |I|/n (e.g., t_i = i/n for each i, for I = [0, 1]). In that case |I| x̄ = (|I|/n) Σ_{i=1}^n x(t_i) is a Riemann sum converging to ∫_I x(t) dt as n → ∞; dividing by |I| yields x̄ → x*, where x* = (x*_1, . . ., x*_p)^T with

x*_u = |I|^{-1} ∫_I x_u(t) dt, (3)

and similarly S → S*, where S* has (u, v) entry

s*_uv = |I|^{-1} ∫_I [x_u(t) − x*_u][x_v(t) − x*_v] dt. (4)

The basis representation (1) makes the integrals in (3), (4) computable even though x(t) is observed at only finitely many times t, and (2) yields convenient and succinct expressions for the CT mean and covariance matrix: x* = C^T φ̄ and S* = C^T Q C.
The CT (sample) correlation between the uth and vth functions is similarly defined as

r*_uv = s*_uv / (s*_uu s*_vv)^{1/2}, (5)

where s*_uv is as in (4). As in classical MVA, the matrix of CT correlations equals R* = D*^{-1/2} S* D*^{-1/2}, where D* is the diagonal matrix with the same main diagonal as S*.
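To make these expressions concrete, the following R sketch computes x*, S* and R* from a coefficient matrix C and an fda basis object. The function name ct_moments and the crude Riemann-sum quadrature are our own choices; the ctmva package and Appendix B use an exact Newton-Cotes rule instead.

library(fda)

ct_moments <- function(C, basis, ngrid = 1001) {
  rng <- basis$rangeval
  tt  <- seq(rng[1], rng[2], length.out = ngrid)
  Phi <- eval.basis(tt, basis)           # ngrid x K matrix of phi_k(t_q)
  w   <- rep(diff(rng) / ngrid, ngrid)   # crude Riemann quadrature weights
  phibar <- colSums(w * Phi) / diff(rng)            # K-vector, as in (2)
  Phic   <- sweep(Phi, 2, phibar)                   # centered basis values
  Q      <- crossprod(Phic, w * Phic) / diff(rng)   # K x K matrix, as in (2)
  xstar  <- drop(crossprod(C, phibar))              # CT mean, C^T phibar
  Sstar  <- crossprod(C, Q %*% C)                   # CT covariance, C^T Q C
  list(mean = xstar, cov = Sstar, cor = cov2cor(Sstar))
}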

3.2 An example: World Development Indicators
To illustrate the use of CT correlation for time series with irregularly spaced times and substantial missingness, we consider several of the World Development Indicators (WDI; https://datacatalog.worldbank.org), variables that are estimated annually for each nation in the world. Quantitative political scientists use such national time series to investigate relationships among different indicators. Missing values are common, and may be addressed with ad hoc fixes (e.g., Baum and Lake, 2003) or complex imputation schemes (e.g., Honaker and King, 2010). CT correlation offers a simple and fast way to assess relationships between a potentially large number of indicators, while allowing for substantial missing data. Such applications of CT correlation may prove highly useful for exploratory data analysis and hypothesis generation, which are increasingly important goals in light of the popularity of online resources such as Gapminder and Our World in Data, and the large number of available candidate variables.
We focus here on five variables taken from Baum and Lake (2003): fertility, gross domestic product (GDP) per capita, gross domestic fixed investment, female life expectancy, and female secondary school enrollment. At left in Figure 3 are plots of investment and female enrollment in the Democratic Republic of the Congo, which illustrate the difficulty of assessing correlation between highly incomplete time series. In only 12 years are both variables estimated, so that ordinary correlation can be computed with only 12 observations. But the CT correlation is obtained by using all observations for 1991-2021, the overlap of the two variables' time ranges, to estimate the trend in each variable.
The resulting correlation estimate of 0.977 is not very precise, with a 95% confidence interval from 0.500 to 0.997 based on 999 bootstrap replicates. (In §10 we mention a weighting approach that might help to improve the precision of CT correlation.) But while individual correlation estimates from this data set lack precision, it is possible to discern interesting patterns from a collection of such estimates. We computed CT correlation for each pair of variables, for each of the 216 countries in the data set, subject to the requirements that (i) each variable have at least 8 observations and (ii) the overlap of the two time ranges be at least 12 years. These requirements were met for 85% of the country-level variable pairs; histograms of the resulting CT correlations, for each pair of variables, appear at right in Figure 3.

3.3 Estimators or estimands?
In classical MVA, the sample mean x̄ is an estimate, based on observations x_1, . . ., x_n from a common multivariate distribution, of the population mean µ = E(X) of a random vector X arising from that distribution, while the sample covariance matrix S is an estimate of the population covariance matrix Σ. But in many applications, interest centers not on the underlying process but on the specific instance x(·). It is then more appropriate to regard x*, S*, as well as R*, not as estimators but as estimands, since in practice the function x(·) on which they are based is not observed perfectly, for three reasons: (i) the p component functions are observed only at finitely many time points, (ii) they are generally observed with noise and/or measurement error, and (iii) the basis representation (1) introduces approximation error.
Formally, for u = 1, . . ., p, at finitely many times t_iu we observe not the true x_u(t_iu) but

z_iu = x_u(t_iu) + ε_iu, (6)

where ε_iu represents zero-mean error. We apply penalized smoothing (Ruppert et al., 2003; Wood, 2017) to the z_iu's for each u, to obtain estimates x̂(t) = Ĉ^T φ(t) ∈ R^p for the p functions. We note that for given u, the ε_iu's may not be independent. The penalized spline literature includes methods to take residual dependence into account, as well as some evidence that likelihood-based smoothness selection is robust to such dependence (Krivobokova and Kauermann, 2007). In this setup, x*, S*, R* are unobservable but are estimated, respectively, by x̂* = Ĉ^T φ̄, Ŝ* = Ĉ^T Q Ĉ, and R̂* = D̂*^{-1/2} Ŝ* D̂*^{-1/2}, where D̂* is the diagonal matrix with the same main diagonal as Ŝ*. For simplicity, in what follows we omit the "hats" and treat x̂, and consequently x̂*, Ŝ*, R̂*, as given, except when presenting the simulation study of §8, which shows the advantage of CT over ordinary correlation for stochastic processes observed with noise.
To restate the above argument more succinctly and concretely, let us consider a bivariate process that is stationary, so that the covariance and hence the correlation ρ between the two components are time-independent. Classical MVA distinguishes the population parameter ρ from the sample quantity r. But in CTMVA there are not two but three quantities: the process correlation ρ gives rise to the CT correlation r* between a pair of functions drawn from the process, which is estimated from data by r̂*; or schematically,

ρ →^{(i)} r* →^{(ii)} r̂* (7)

(where the arrows informally denote "gives rise to" rather than denoting convergence). Depending on the context we may be more interested in step (i) or in step (ii) of (7).

4 Two forms of centering
The mean-centered variable vector x_c(t) = x(t) − x* appeared in §3.1 and has a number of other uses, such as for CT principal component analysis (see §5). By (1), x_c(t) = C^T[φ(t) − φ̄]; i.e., the centered variables are linear combinations of φ_1(t) − φ̄_1, . . ., φ_K(t) − φ̄_K, but it would be more convenient if we could represent them as linear combinations of the original (not centered) basis functions φ_1(t), . . ., φ_K(t). It turns out that we can do so, under the simple condition that there exists w ∈ R^K such that

w^T φ(t) = 1 for all t. (8)

Under (8), the centered curves have a basis representation as in (1), but with coefficient matrix (I_K − w φ̄^T)C; see the sketch at the end of this section. Condition (8) holds for B-spline bases, with w = (1, . . ., 1)^T, and for orthonormal Fourier bases, with w = (|I|^{1/2}, 0, . . ., 0)^T.

Centering in the above sense is the continuous-time version of column-centering an n × p data matrix. As alluded to in §2, in some applications, a CT analogue of row-centering is warranted, i.e., an estimated common trend m̂(t) = p^{-1} Σ_{u=1}^p x_u(t) should be subtracted from each of the p curves before computing the covariance or correlation. Consider, for example, the Canadian temperature curves in Figure 2. Without removal of the common trend, the CT correlation matrix R* for this example has all elements above 0.9, since the temperatures at all stations are relatively high in the summer and low in the winter, yielding the uninformative correlation matrix in the lower left panel of Figure 2. But with detrending applied, the result is the more interesting correlation matrix in the lower right panel. The 35 stations are listed geographically, roughly from east to west, except for the last six stations, which are located in the Canadian North. The top and middle panels of Figure 2 suggest that in many cases, stations with nearly the same longitude have similar temperature curves. This explains why the CT correlation matrix has several blocks of very high positive correlations along the main diagonal: for instance, the first six stations are those located in the Atlantic provinces, and the correlations among them are all above 0.78 and mostly above 0.93. In Appendix C we propose a simple diagnostic to help determine, for a given data set, whether detrending is advisable.
Since the Canadian weather curves are spatially indexed, Scheipl et al. (2015) modeled the spatial correlation to improve precision in an FDA context (see also Liu et al., 2017).In the above analysis, however, the correlations among stations are the estimands of interest, rather than a means to a further inferential end as in spatial FDA.
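As a quick numerical check of condition (8) and of the centered coefficient matrix derived above, consider the following R sketch; the helper center_coefs is our own, not part of the ctmva package.

library(fda)

basis <- create.bspline.basis(rangeval = c(0, 1), nbasis = 12)
tt    <- seq(0, 1, length.out = 200)
Phi   <- eval.basis(tt, basis)
range(rowSums(Phi))   # B-splines form a partition of unity: all row sums are 1,
                      # so w = (1, ..., 1)^T satisfies w^T phi(t) = 1

## Given w, the CT basis mean phibar, and a coefficient matrix C:
center_coefs <- function(C, w, phibar) {
  (diag(length(w)) - tcrossprod(w, phibar)) %*% C   # (I_K - w phibar^T) C
}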

5 CT principal component analysis
In classical MVA, the (sample) principal components are linear combinations v_1^T x, . . ., v_p^T x that have maximal variance, subject to mutual orthonormality of the v_j's. If the sample covariance matrix S has orthonormal eigenvectors e_1, . . ., e_p with corresponding eigenvalues λ_1 ≥ . . . ≥ λ_p, then by a standard result on Rayleigh quotient maximization, the solution to this iterative optimization problem is v_m = e_m for m = 1, . . ., p.
Continuous-time PCA seeks linear combination functions v_m^T x(t) that achieve maximal variance in the above sense, but with the variance of v^T x(t) defined as in (4) with respect to time, as v^T S* v. By the same argument as in the classical case, but with S* in place of S, the solution is given by the eigenvectors of S*: that is, v_m = e*_m for m = 1, . . ., p, where e*_1, . . ., e*_p are orthonormal eigenvectors of S* in descending order of the corresponding eigenvalues λ*_1 ≥ . . . ≥ λ*_p.
Similarly to the classical setting, we then have the expansion x_c(t) = Σ_{j=1}^p e*_j s_j(t), where s_j(t) is the jth score function, defined by s_j(t) = e*_j^T x_c(t), and it follows straightforwardly from this definition that the CT covariance matrix (4) of [s_1(t), . . ., s_p(t)]^T is Diag(λ*_1, . . ., λ*_p). Assuming (8), since each component of x_c(t) is a linear combination of the basis functions φ_1(t), . . ., φ_K(t), the same is true for each of the scores s_1(t), . . ., s_p(t).
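A minimal R sketch of these computations, assuming S* and the centered coefficient matrix Cc of §4 are available (the function name ct_pca is ours, not the package's):

## CT PCA: eigendecompose the CT covariance matrix S*; the score function
## s_j(t) = e*_j^T x_c(t) has basis coefficient vector Cc %*% e*_j.
ct_pca <- function(Sstar, Cc) {
  eig <- eigen(Sstar, symmetric = TRUE)
  list(loadings    = eig$vectors,                  # e*_1, ..., e*_p
       varexp      = eig$values / sum(eig$values), # proportion of CT variance
       score_coefs = Cc %*% eig$vectors)           # K x p coefficient matrix
}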
Appendix D presents an application of CT PCA to electroencephalography data.
Canonical correlation analysis (CCA) is another classical MVA method that is somewhat related to PCA. Here one has two sets of variables and seeks pairs of linear combinations, one for each set, having maximal correlation in an iterative sense. A continuous-time version of CCA is presented in Appendix E.

6 CT Fisher's linear discriminant analysis
In its classical form, Fisher's linear discriminant analysis (LDA; see Appendix F) seeks linear combinations v^T x that optimally separate G ≥ 2 sets of observations X_1, . . ., X_G into which the data set is divided a priori. Like PCA and CCA, Fisher's LDA finds the desired linear combinations of the variables by solving an iterative optimization problem, in this case based on a partition of the total sums of squares and cross-products matrix T = nS into within-groups and between-groups components, T = W + B.
We propose a continuous-time variant of Fisher's LDA that optimally discriminates among G ≥ 2 subintervals I_1, . . ., I_G into which I is partitioned. For g ∈ {1, . . ., G}, let x*_g = |I_g|^{-1} ∫_{I_g} x(t) dt denote the CT mean vector over I_g, and let T* = ∫_I [x(t) − x*][x(t) − x*]^T dt = |I| S*. Analogously to the classical setting, we have the decomposition T* = W* + B*, where

W* = Σ_{g=1}^G ∫_{I_g} [x(t) − x*_g][x(t) − x*_g]^T dt and B* = Σ_{g=1}^G |I_g| (x*_g − x*)(x*_g − x*)^T

(see Appendix F for a derivation). The CT Fisher's linear discriminants are then functions v_1^T x(t), . . ., v_s^T x(t) that optimally separate the subintervals in the sense of iteratively maximizing v^T B* v / v^T W* v. By the same argument as in the classical setting, the optimally separating vectors v_1, . . ., v_s are eigenvectors of W*^{-1} B* corresponding to the positive eigenvalues in descending order. Once again, our implementation relies on the basis representation (1) along with (2): we have x*_g = C^T φ̄_{I_g} and W* = Σ_{g=1}^G |I_g| C^T Q_{I_g} C, whence B* = T* − W*.
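Given W* and B* computed as above, the discriminant vectors can be obtained as in the following R sketch (ct_lda is our name, not the package's):

ct_lda <- function(Wstar, Bstar, tol = 1e-10) {
  eig  <- eigen(solve(Wstar, Bstar))   # eigendecomposition of W*^{-1} B*
  keep <- Re(eig$values) > tol         # retain the positive eigenvalues only
  list(vectors = Re(eig$vectors[, keep, drop = FALSE]),  # v_1, ..., v_s
       values  = Re(eig$values[keep]))
}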

7 CT k-means clustering
Continuous-time k-means clustering partitions I into temporal clusters, each of which is a subinterval, or a finite union of subintervals, of I. Our algorithm is a CT extension of the basic k-means procedure introduced in a 1957 manuscript later published as Lloyd (1982):

1. Randomly select centers m_1, . . ., m_k ∈ R^p, by evaluating x(t) at k randomly chosen time points.

2. Until a convergence criterion is met:
(a) assign each time point t ∈ I to the cluster whose center is nearest to x(t);
(b) update each center m_i to the CT mean of x(t) over the ith cluster.

Since ‖x(t) − m_i‖^2 = ‖x(t)‖^2 − 2m_i^T x(t) + ‖m_i‖^2, whose first term does not depend on i, it follows that finding the cluster center nearest to x(t) reduces to finding the i ∈ {1, . . ., k} that minimizes A_i(t) = ‖m_i‖^2 − 2m_i^T x(t). Suppose for simplicity that k = 2. Then each of C_1 = {t ∈ I : A_1(t) < A_2(t)} and C_2 = {t ∈ I : A_2(t) < A_1(t)} is a union of subintervals of I, where the boundaries between these subintervals are the points t at which A_1(t) and A_2(t) cross, i.e., the zero crossings of

A_1(t) − A_2(t) = ‖m_1‖^2 − ‖m_2‖^2 − 2(m_1 − m_2)^T C^T φ(t). (10)

If the basis functions are cubic splines then this expression is a cubic polynomial in t on each of the intervals between knots. Thus finding the transition points between the two clusters reduces to finding, for the cubic polynomial on each of these intervals, any zero crossings that occur within that interval. In this way the problem of partitioning infinitely many time points among k clusters is reduced to identifying the zeros of a finite set of polynomials. Appendix G provides a formula for the polynomial coefficients, and discusses how this CT approach to k-means clustering may improve computational efficiency. The k > 2 case is similar to the above, but not every crossing point of any pair of A_i(t)'s is then a transition point.
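For concreteness, the quantities A_i(t) can be computed for all centers at once as in the following R sketch (the names A_values, M and phit are our own):

## A_i(t) = ||m_i||^2 - 2 m_i^T x(t) for centers m_1, ..., m_k, with
## M a p x k matrix of centers and phit = phi(t) a K-vector of basis values.
A_values <- function(M, C, phit) {
  xt <- drop(crossprod(C, phit))              # x(t) = C^T phi(t)
  colSums(M^2) - 2 * drop(crossprod(M, xt))   # vector (A_1(t), ..., A_k(t))
}
## which.min(A_values(M, C, phit)) then gives the cluster assignment at time t.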
Analogously to ordinary k-means, a CT k-means clustering decomposes the total integrated sum of squared distances from the overall CT mean x* into between-cluster and within-cluster components. The latter component, more precisely the total within-cluster integrated squared distance from the cluster mean, is the objective function of CT k-means and is readily computed. Like ordinary k-means, the above CT algorithm is not guaranteed to converge to the global minimum of this objective function.
It is thus recommended to run the algorithm repeatedly with different starting values, and choose the solution that minimizes the objective.
In Appendix H we present a continuous-time variant of the silhouette width (Rousseeuw, 1987), which indicates how clearly a given time point belongs to its assigned cluster.The silhouette width can be plotted for each of a grid of time points, and its average value provides a simple index of how tight the clusters are.As with ordinary k-means clustering, the average silhouette width may be used to guide the choice of k.

8 A simulation study
We conducted a simulation study to assess the performance of CT correlation estimation relative to classical (discrete-observation) correlation between p = 2 variables. True curves x(t) were generated from a bivariate Gaussian process on [0, 1] (see Appendix I) with between-curve covariance matrix Σ = (1, 0.5; 0.5, 1), thus correlation ρ = 0.5, and within-curve covariance function Γ having the squared exponential (radial basis function) form

Γ(s, t) = exp{−(s − t)^2 / (2ℓ^2)}, (11)

where the length parameter ℓ > 0 determines the wiggliness of the curves, with higher ℓ implying greater smoothness. For each pair of curves we obtain noisy observations (6) at each of 500 common observation times, with noise standard deviation σ = 0.5. For a given pair of curves x(·), the true CT correlation between them is r* = r*_12 [see (5)]; the CT correlation estimate, based on an estimate x̂(·) of the pair of curves, is denoted by r̂* (see §3.3). We compare the ordinary correlation r, based on n = 500 bivariate observations, versus CT correlation r̂*, based on treating the number of observations n as infinite and smoothing both curves with respect to a B-spline basis of dimension K = 40.
Figure 4 demonstrates how the CT correlation estimate r̂* outperforms the ordinary sample correlation r. The top row displays example curve pairs x_1(t), x_2(t) generated from a bivariate Gaussian process on [0, 1] with underlying between-curve correlation 0.5 as above, and within-curve (auto)correlation (11) for length parameter ℓ = 0.02, 0.1, 0.3. For each of the above three values of ℓ, we generated 50 pairs of curves x_1, x_2, like those shown in the top row of Figure 4, and for each pair we sampled noisy observations (as described in the previous paragraph), as displayed in the middle row. In the bottom row we have plotted the realized correlations r* of the noiseless curves x_1, x_2, against the corresponding correlation estimates, r (ordinary) and r̂* (continuous-time); each of the 50 points in each panel represents a pair of curves like those in the top row.
The horizontal spread of the points in the bottom row represents variation of the realized r* about the underlying process correlation 0.5 [step (i) of (7)], which is seen to increase as the length parameter ℓ increases. Deviation of the points from the line of identity represents error in the estimates with respect to the realized r* [step (ii) of (7)]. This error also increases with ℓ, in particular for ordinary correlation, which is markedly biased toward zero. This bias is easy to explain: the noise term in (6) inflates variance without affecting between-curve covariance, resulting in diminished correlation. Such attenuation of correlation due to noise is a long-established phenomenon (e.g., Spearman, 1910) and is sometimes corrected by multiplying the estimated correlation by the geometric mean of the two variables' reliabilities (Cohen et al., 2013). For time series as opposed to independent observations, such a correction is not straightforward to implement, but the CT correlation approach proceeds by instead denoising the observed curves, resulting here in consistently accurate estimation of the between-curve correlation r*. This simulation study, then, offers clear and concrete evidence for the benefits of CTMVA, even when ordinary MVA is feasible.
Appendix J displays the results of (i) simulations as above but with underlying correlation 0.2 and 0.8, and (ii) simulations in which we vary the noise standard deviation σ as well as n, the number of points at which the curves are observed. The overall conclusion is that unless n is small and/or the curves are very bumpy, the CT correlation outperforms ordinary correlation, in some cases dramatically.

9 Example: Temperature and air pollution in Chicago

Wood (2017) performed several analyses relating air pollution to death rates, using a data set of daily indicators for Chicago for the years 1987 to 2000 made available by Peng and Welty (2004). Here we focus on the four predictors in these analyses, namely PM_10 (median 2.5- to 10-µm particles per m^3), ozone (in parts per billion), SO_2 (median sulfur dioxide measurement), and temperature. Our goal here is exploratory analysis of these four metrics using the CTMVA methods of §5-§7. All four variables were standardized and were smoothed using a 200-dimensional B-spline basis with an AR(1) residual structure (Wood, 2017).

Figure 5 displays results from continuous-time 3-means clustering. The top panel shows the four standardized variables' time series with vertical lines indicating transitions among the three clusters. As can be discerned more clearly in the middle panel, which shows the clustering separately by year, there is a fairly consistent seasonal pattern: cluster 1 corresponds roughly to the winter, cluster 2 to the summer, and cluster 3 to the spring and autumn. Thus each calendar year is divided into portions belonging to clusters 1, 3, 2, 3, 1, in that order, with one exception: the winter of 1991-92, in which cluster 1 does not appear. The clear seasonal pattern in the clusters seems to be driven largely by temperature variations, along with the high positive correlations of PM_10 and ozone with temperature. SO_2, on the other hand, is negatively correlated with temperature. In the winter of 1991-92, however, SO_2 uncharacteristically dropped along with the other three variables, and this may help explain the non-occurrence of cluster 1. Appendix H provides further details of this analysis, in particular the silhouette width and the stability of the 3-means solution. We note that when ordinary 3-means clustering is applied to the raw time series, there are over 1100 transitions among the resulting three clusters, making it difficult at best to detect the cyclic seasonal pattern that emerges clearly from the continuous-time analysis.
The above results are complemented by CT principal component analysis, which found that most of the variation is explained by two PCs: 77.3% by the first PC, with loadings 0.524, 0.505, -0.417, 0.544 for PM_10, ozone, SO_2 and temperature, respectively, and 15.0% by the second PC, with loadings -0.351, -0.347, -0.870, -0.006. In view of the correlations noted above, the first PC is expected to be highest in the summer and lowest in the winter. This is confirmed by the bottom panel of Figure 5, which displays the trajectory of the first two PCs for the 14-year period, with labels (1987, . . ., 2000) marking the end of each year.
The colors are lightest at the beginning and end of each year and darkest in the middle of the year, and this makes clear the annual oscillation of the first PC.Both PCs' year-end scores are highest for 1991, suggesting again that the winter of 1991-92 was atypical.
Continuous-time Fisher's linear discriminant analysis, with the G = 14 calendar years as subintervals among which we wish to discriminate, offers a way to compare and contrast the different years. The coefficients of PM_10, ozone, SO_2 and temperature, respectively, are given by v_1 = (0.0302, −0.0037, −0.0186, −0.0380)^T for the first discriminant and v_2 = (−0.0235, 0.0134, −0.0105, 0.0034)^T for the second. The plot of the time series of the first two discriminants in Figure 6 suggests that, as in the cluster analysis, the year 1992 stands out: the value of LD_1 tends to be very high during that year. This impression is confirmed by the middle panel, which shows the average values of the first two discriminants in each year and reveals that 1992 is a clear outlier for LD_1. This finding is explained by the very high mean PM_10 and extremely low mean SO_2 for 1992, as seen in the right panel.

10 Discussion
Our simulation study in §8 and 3-means cluster analysis in §9 suggest that the denoised, smooth representation in CTMVA can, at least in some cases, lead to improved performance over standard MVA with the raw data. In addition, as illustrated by the WDI example of §3.2, CTMVA can be applied in certain settings for which classical MVA is unavailable, such as variables observed at disparate times. Another such setting is correlated point processes, such as inhomogeneous Poisson processes. Ramsay and Silverman (2005) present a smooth basis function approach to estimating a time-varying intensity function ι(t).
Given p inhomogeneous Poisson processes with intensities ι 1 (t), . . ., ι p (t) on a common time interval, one can compute CT correlations among estimates of these intensities.However, more work is required on reliable smoothing parameter estimation for intensity functions.
While the applicability of CT covariance and correlation even for pairs of curves whose observation times have little or no overlap (as in the WDI example of §3.2) is a clear advantage of the CT approach, there is an accompanying pitfall: as seen in the left panel of Figure 3, curves are estimated less precisely within time ranges with fewer observations, and intuitively these times should be weighted less when estimating covariance. Formally, the covariance between the uth and vth curves would then be defined as

s*_{uv,w} = [∫_I w(t) dt]^{-1} ∫_I w(t)[x_u(t) − x*_u][x_v(t) − x*_v] dt

for a nonnegative weight function w(t). Inverse variance is a standard choice for such a weight function, but it is not obvious how to incorporate the inverse variances of two function estimates into a single function w(t). Minimizing the coefficient of variation, as in Chen et al. (2014), is a possible way forward.
A key objective of ongoing work is interval estimation for CT correlation, via analytic and resampling-based approaches.Optimal weighting, as described in the previous paragraph, may sometimes yield narrower confidence intervals.
Another aim of future work is to develop multilevel CTMVA for multiple instances of continuous-time multivariate data, such as the full WDI and EEG data sets of §3.2 and Appendix D, respectively. This may be contrasted with two complex data structures extending the n × ∞ setup of traditional FDA: multilevel and multivariate FDA. As a crude oversimplification, we may say that multilevel FDA (e.g. Crainiceanu et al., 2009) considers (n × m) × ∞ data, in which a given function is observed m times in each of n individuals, whereas multivariate FDA (e.g. Happ and Greven, 2018) considers n × (∞ × v) data, with n instances of a collection of v functions. The multilevel CTMVA setup, of multiple instances of ∞ × p data, differs from either multilevel or multivariate FDA and may be more similar to the setup of Dubin and Müller (2005). But whereas these authors estimate population dynamical correlation from a sample of ∞ × p-type observations, the aim of multilevel CT correlation would be to decompose the correlations among p variables into "within" and "between" components. Multilevel data will also allow for assessing the reliability of CTMVA results, via extensions of intraclass correlation (Xu et al., 2021).
Most of the methods of this paper are CT extensions of MVA methods involving p × p matrices. Certain other MVA methods, such as multidimensional scaling, begin with an n × n matrix, often representing dissimilarities or distances among n observations, or alternatively similarities among them. Continuous-time (n = ∞) extensions are then more challenging than for methods based on p × p matrices, since instead of n × n matrices we must work with their infinite-dimensional counterparts, namely bounded linear operators on functions defined on I. We are pursuing CT methods of this type in ongoing work.
Our methods are implemented in the R package ctmva (Paul and Reiss, 2024), and code for our real-data analyses can be found at https://github.com/reissphil/ctmva.

APPENDICES

A Daily average precipitation in Canada
The simulation study of §8 demonstrated the contrast between ordinary and CT correlations. As a real-data illustration of this phenomenon, we consider the Canadian precipitation data of Ramsay and Silverman (2005). As in Figure 2, the data consist of daily values for the 35 Canadian weather stations, but here the variable of interest is base-10 log of the average precipitation in millimeters (see Ramsay et al., 2023). Figure A7 displays the data in raw form (upper panel) and after smoothing (middle panel), using the same strategy as for the temperature data in Figure 2. The lower left panel of Figure A7 presents the ordinary Pearson correlation among the 35 stations' precipitation levels, when the data are treated as n = 365 discrete observations of p = 35 variables. The lower right panel shows the CT correlations based on the smooth curves. The CT correlations are much stronger than the raw correlations, and the simulation results of §8 support viewing the former as more accurate than the latter. For each correlation matrix, the stations are ordered by the angles formed by the leading two eigenvectors; this ordering helps to reveal patterns and groupings among the correlations (Friendly and Kwan, 2003).
In both correlation matrices the reordering puts the same nine stations at the end: the first six of the original ordering, which are located on the Atlantic coast, along with Vancouver, Victoria and Prince Rupert, on the Pacific coast.The CT correlation matrix suggests that these nine stations form a tight cluster, with positive correlations among them and negative correlations between them and the inland stations.This pattern is much less clear in the ordinary correlation matrix, due perhaps to the attenuation phenomenon discussed in §8.
As can be seen in the middle panel of Figure A7, the Atlantic and Pacific stations tend to have the least precipitation in the summer, while the inland stations have an opposite pattern.
The CT correlation matrix of the log mean precipitation values for the 35 Canadian weather stations, displayed in Figure A7, reveals that the nine coastal stations (six on the Atlantic and three on the Pacific coast) tend to be highly correlated with each other and anticorrelated with the 26 inland stations. Figure A8 provides a map of this division of the 35 weather stations into three subsets, using the same color-coding as in Figure A7. We remark that Ramsay and Silverman (2005), as well as R package fda (Ramsay et al., 2023) and a number of subsequent analyses of the Canadian weather data, divide the stations into four regions, including 15 "Atlantic" and 5 "Pacific" stations. The nine stations referred to in the previous paragraph, and indicated in Figure A8, meet a stricter criterion of proximity to the Atlantic or Pacific.

B Computing (2)
For simplicity we consider only S = I. It is readily shown that the matrix Q in (2) can be written as

Q = |I|^{-1} G − φ̄ φ̄^T, where G = ∫_I φ(t) φ(t)^T dt.

If φ is an orthonormal Fourier basis on I then G = I_K and φ̄ = (|I|^{-1/2}, 0, . . ., 0)^T, so that Q is simply the diagonal matrix with main diagonal (0, |I|^{-1}, . . ., |I|^{-1})^T.
For a B-spline basis, φ̄ and G are obtained by numerical integration, with quadrature points t_1, . . ., t_Q and corresponding weights w_1, . . ., w_Q. Let Φ be the Q × K matrix of evaluations of the basis functions at the quadrature points, i.e., the (q, k) entry of Φ is φ_k(t_q). For basis objects in the R package fda (Ramsay et al., 2023), this matrix can be generated with a call to function eval.basis from that package. Let W be the Q × Q diagonal matrix whose diagonal elements are the quadrature weights. Then for k = 1, . . ., K, the kth element of Φ^T W 1_Q is Σ_{q=1}^Q w_q φ_k(t_q) ≈ ∫_I φ_k(t) dt, so that φ̄ ≈ |I|^{-1} Φ^T W 1_Q. Similarly, Φ^T W Φ has (k, ℓ) element Σ_{q=1}^Q w_q φ_k(t_q) φ_ℓ(t_q) ≈ ∫_I φ_k(t) φ_ℓ(t) dt, and thus G ≈ Φ^T W Φ. Our implementation uses the Newton-Cotes 7-point rule for numerical quadrature, as suggested by Wand and Ormerod (2008). For cubic or lower-order B-splines, the Newton-Cotes 7-point rule makes the above integrals exact rather than approximate, so that Q is obtained exactly.
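Alternatively, assuming the accuracy of fda's own numerical integration suffices, G, φ̄ and hence Q can be obtained directly with fda's inprod function, as in this sketch:

library(fda)

basis  <- create.bspline.basis(rangeval = c(0, 1), nbasis = 20)
lenI   <- diff(basis$rangeval)               # |I|
G      <- inprod(basis, basis)               # K x K matrix of int phi_k phi_l dt
cbasis <- create.constant.basis(basis$rangeval)
phibar <- drop(inprod(basis, cbasis)) / lenI # |I|^{-1} int_I phi(t) dt
Q      <- G / lenI - tcrossprod(phibar)      # Q = |I|^{-1} G - phibar phibar^T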

C Diagnosing the need for detrending
We present here a simple diagnostic to indicate, for a given data set, whether removing a common trend before computing the CT covariance or correlation (see §4) would be beneficial. If we adopt an FDA viewpoint toward the p functions x_1(t), . . ., x_p(t), a strong common trend implies that all p functions are quite similar to an underlying mean function µ_x(t). This similarity can be quantified by using the function pffr (Ivanescu et al., 2015; Scheipl et al., 2015) from the R package refund (Goldsmith et al., 2023) to fit the simplest possible functional-response model, namely x_i(t) = µ_x(t) + e_i(t), where µ_x(t) is a "functional intercept," in line with the syntax for this model, pffr(X ~ 1), where X is a p-row matrix containing the functions. The (adjusted) R^2 outputted by pffr for this model (based on the underlying model fitted by the R package mgcv of Wood, 2017) gives the variance explained, for the collection of all points x_i(t) in all p functions, by the fitted values x̂_i(t) = μ̂_x(t). This "pointwise" R^2 (as opposed to the functional R^2 of Müller and Yao, 2008, which equals zero for this model) provides a measure of the strength of the common trend. For the Canadian weather data set, R^2 is found to be 71% for the temperature curves, but only 4.9% for the log precipitation curves; hence our decision to detrend the former (see §4) but not the latter (see Appendix A).
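In code, the diagnostic amounts to something like the following sketch, assuming X is a p-row matrix of the functions evaluated on a common grid:

library(refund)

## Intercept-only functional-response model: x_i(t) = mu_x(t) + e_i(t)
fit <- pffr(X ~ 1)

## Adjusted R^2 of the underlying mgcv fit: strength of the common trend
summary(fit)$r.sq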

D Electroencephalography data
We applied continuous-time PCA to a portion of the Begleiter electroencephalography (EEG) data set from the UCI Machine Learning Repository, made available in R by Helwig (2018). The data consist of multiple EEG trials for each of a set of individuals, with each trial consisting of electrical signals at p = 64 channels with 256 time points within 1 second. Loadings for the first 3 PCs for a randomly selected trial are shown in Figure A9, and these have reasonably clear interpretations: PC1 roughly captures the balance between signal in brain regions that are more central versus more peripheral (as viewed from above), PC2 represents anterior versus posterior regions, and PC3 represents left versus right hemisphere. As suggested in §10, a natural next step would be multilevel CT-PCA to analyze multiple trials nested within individuals, as are available with this data set. The aim would be similar to the multilevel, longitudinal and structured variants of functional PCA (Di et al., 2009; Greven et al., 2010; Shou et al., 2015; Cui et al., 2023), but rather than decomposing a covariance operator, the structure being decomposed is a p × p covariance or correlation matrix as in Timmerman and Kiers (2002). We leave this development to future work.

E Continuous-time canonical correlation analysis
In classical canonical correlation analysis (CCA) we are given an n × p matrix X and an n × q matrix Y, and we seek pairs of unit vectors a ∈ R^p, b ∈ R^q such that the sample correlation of a^T x, b^T y is maximized in an iterative sense:

1. a_1^T x, b_1^T y attain the maximal sample correlation among all pairs (a, b) of unit vectors a ∈ R^p, b ∈ R^q.

2. For m = 2, . . ., min{p, q}, a_m^T x, b_m^T y attain the maximal sample correlation among all pairs (a, b) of unit vectors such that a^T x is uncorrelated with a_1^T x, . . ., a_{m−1}^T x and b^T y is uncorrelated with b_1^T y, . . ., b_{m−1}^T y.

Then (a_m^T x, b_m^T y) are the mth pair of canonical variables, and their correlation is the mth canonical correlation.
The solution to the iterative maximization problem is given in terms of the covariance matrices of the two sets of variables, as well as their cross-covariance matrix: S_xx = n^{-1} X_c^T X_c, S_yy = n^{-1} Y_c^T Y_c and S_xy = S_yx^T = n^{-1} X_c^T Y_c, where X_c, Y_c are column-centered versions of the two matrices and S_xx, S_yy are assumed to be nonsingular. It can be shown (e.g., Rencher and Christensen, 2012) that

M_1 = S_xx^{-1} S_xy S_yy^{-1} S_yx and M_2 = S_yy^{-1} S_yx S_xx^{-1} S_xy

have the same positive eigenvalues r_1^2 ≥ r_2^2 ≥ . . . ≥ r_{min{p,q}}^2, and that if a_i, b_i are eigenvectors of M_1, M_2 respectively corresponding to the ith largest eigenvalue, then a_i^T x, b_i^T y are the ith canonical variates, with canonical correlation r_i, which can be forced to be positive by multiplying either a_i or b_i by −1 if needed.
To develop a continuous-time version of CCA, suppose that two sets of functions, x(t) = [x_1(t), . . ., x_p(t)]^T and y(t) = [y_1(t), . . ., y_q(t)]^T, are defined on I. We now seek pairs (a, b) of unit vectors defining linear combinations a^T x(t), b^T y(t) of the two sets of functions that have maximal correlation in the above iterative sense, but here, generalizing (4), the covariance of a^T x(t), b^T y(t) is given by

|I|^{-1} ∫_I [a^T x(t) − a^T x*][b^T y(t) − b^T y*] dt = a^T S*_xy b, where S*_xy = |I|^{-1} ∫_I [x(t) − x*][y(t) − y*]^T dt,

and their correlation is then given by

a^T S*_xy b / [(a^T S*_xx a)(b^T S*_yy b)]^{1/2},

with S*_xx, S*_yy the CT covariance matrices (4) of x(t) and y(t). These pairs of linear combinations may be called the canonical functions.
We assume for simplicity that both sets of functions are represented with respect to a common K-dimensional basis φ, i.e., x(t) = C_x^T φ(t) and y(t) = C_y^T φ(t) for some matrices C_x, C_y of dimension K × p, K × q respectively. As with PCA, the continuous-time solution to the iterative maximization problem ensues upon "translating" the relevant matrices to the continuous-time setting. We do so by defining

S*_xx = C_x^T Q C_x, S*_yy = C_y^T Q C_y, S*_xy = S*_yx^T = C_x^T Q C_y,

and M*_1, M*_2 as M_1, M_2 above but with these CT matrices, with Q defined as in (2). Then, as in the classical case, M*_1, M*_2 have the same positive eigenvalues; and if a_i, b_i are eigenvectors of M*_1, M*_2 respectively corresponding to the ith largest eigenvalue, then a_i^T x(t), b_i^T y(t) are the ith canonical functions.

F Classical Fisher's linear discriminant analysis, and the CT within-between decomposition

As noted in §6, classical Fisher's linear discriminant analysis (LDA) seeks linear combinations v^T x that optimally separate G ≥ 2 a priori subsets X_1, . . ., X_G of the data set.
Assume that, for g = 1, . . ., G, X_g is n_g × p with ith row x_gi^T; let X = [X_1^T, . . ., X_G^T]^T be the n × p combined data set, where n = n_1 + . . . + n_G; and let x̄ = n^{-1} X^T 1_n and x̄_g = n_g^{-1} X_g^T 1_{n_g} denote the overall and group-g mean vectors. We proceed by partitioning the total sums of squares and cross-products matrix T = nS = Σ_{g=1}^G Σ_{i=1}^{n_g} (x_gi − x̄)(x_gi − x̄)^T into within-groups and between-groups components, T = W + B, where

W = Σ_{g=1}^G Σ_{i=1}^{n_g} (x_gi − x̄_g)(x_gi − x̄_g)^T and B = Σ_{g=1}^G n_g (x̄_g − x̄)(x̄_g − x̄)^T.

With these definitions, v^T W v is the sum of squared deviations of the linear combinations v^T x_gi from their within-group means, while v^T B v is a weighted sum of squared deviations of these within-group means from the overall mean, with weight n_g for group g. (Some authors formulate Fisher's LDA differently, with these weights omitted. Moreover, some authors divide W and B by constants that allow them to be interpreted as within- and between-group covariance matrices, but this does not change the solution.) The linear combination v_1^T x, where v_1 = arg max_v v^T B v / v^T W v, can be said to maximally separate the groups, relative to the within-group variation. It can be shown that v_1 is the eigenvector of W^{-1} B corresponding to its largest eigenvalue. If s (≤ min{G − 1, p}) is the number of positive eigenvalues of W^{-1} B, then for m = 2, . . ., s, v_m, the maximizer of v^T B v / v^T W v subject to v^T W v_1 = . . . = v^T W v_{m−1} = 0, is given by the eigenvector of W^{-1} B corresponding to its mth-largest eigenvalue. The linear combinations v_1^T x, . . ., v_s^T x, which optimally separate the groups in the above iterative sense, are known as Fisher's linear discriminants.
Similarly to the derivations of x*, S* in §3.1, the matrices T*, W*, B* in §6 can be derived as limits of their classical counterparts T, W, B times |I|/n for observations on a uniform grid of n time points, as n → ∞. Thus the CT decomposition T* = W* + B* can be inferred from the classical decomposition T = W + B. A more direct proof of the CT decomposition is as follows:

T* = Σ_{g=1}^G ∫_{I_g} [x(t) − x*][x(t) − x*]^T dt
   = Σ_{g=1}^G ∫_{I_g} [(x(t) − x*_g) + (x*_g − x*)][(x(t) − x*_g) + (x*_g − x*)]^T dt
   = W* + B* + Σ_{g=1}^G { [∫_{I_g} (x(t) − x*_g) dt] (x*_g − x*)^T + (x*_g − x*) [∫_{I_g} (x(t) − x*_g) dt]^T }.

Since the integral in the last line equals zero by the definition of x*_g, we conclude that T* = W* + B* as claimed.

G Computational aspects of CT k-means clustering
If the basis φ(t) consists of cubic splines then, within the jth inter-knot interval (for j = 1, . . ., K − 3; see Ramsay and Silverman, 2005), φ(t)^T = (1 t t^2 t^3) L_j for a 4 × K matrix L_j of polynomial coefficients. Substituting this into (10), we conclude that on this interval A_1(t) − A_2(t) = v_j^T (1, t, t^2, t^3)^T, where

v_j = (‖m_1‖^2 − ‖m_2‖^2, 0, 0, 0)^T − 2 L_j C (m_1 − m_2).

Consequently, the zeroes of the cubic polynomial (10) in the jth inter-knot interval can be found by inputting this vector v_j to R function polyroot, which implements the Jenkins-Traub zero-finding algorithm.
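A sketch of this step in R (the helper cross_times and its arguments are our own names):

## Zero crossings of A_1 - A_2 within the jth inter-knot interval
## [knots_j[1], knots_j[2]], given Lj (4 x K), C (K x p) and centers m1, m2.
cross_times <- function(Lj, C, m1, m2, knots_j) {
  vj <- c(sum(m1^2) - sum(m2^2), 0, 0, 0) - 2 * drop(Lj %*% C %*% (m1 - m2))
  z  <- polyroot(vj)                        # roots of the cubic (Jenkins-Traub)
  re <- Re(z)[abs(Im(z)) < 1e-8]            # keep numerically real roots
  re[re >= knots_j[1] & re <= knots_j[2]]   # keep those inside the interval
}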
Once the data have been reduced to a spline representation, the computing time for CT k-means is insensitive to the number n of observations in the raw data.In experiments with 20 spline basis functions and p = 4 (using the same laptop that was used for the WDI analysis in §3.2), it took about 5 seconds to run CT k-means for k = 2, 3, . . ., 15.
By contrast, for ordinary k-means with the same range of values of k, as implemented by the standard R function kmeans, we found the computing time for the default algorithm of Hartigan and Wong (1979) to increase slightly faster than linearly with n, reaching about 80 seconds when n ≈ 3 × 10 6 .Moreover, in many instances, convergence was not attained at the "quick-transfer" stage (step 6 in Hartigan and Wong, 1979).Computing times were slightly lower with two less refined algorithms implemented in kmeans, one due to Lloyd (1982) and Forgy (1965) and the other due to MacQueen (1967), but these often struggled to converge.It must be acknowledged that for large n the time required to smooth the data with respect to the basis may at least partly offset the computation-time advantage of the CT method.This depends, however, on how one chooses the smoothing parameter in penalized basis smoothing.The R package fda (Ramsay et al., 2023) offers the function Data2fd with a simple default choice of the smoothing parameter.Wood (2017) summarizes some techniques for optimal smoothness selection with large n, while Li et al. (2024) propose a faster technique suitable for even larger n.

H Silhouette width and stability of the 3-means analysis
Given a clustering of n discrete observations into k > 1 clusters, Rousseeuw (1987) defines the silhouette, or silhouette width, of the ith observation as

s(i) = [b(i) − a(i)] / max{a(i), b(i)},

where a(i) is the mean distance from the given observation to other elements of the same cluster, and b(i) is the minimum, over the other k − 1 clusters, of the mean distance from the ith observation to elements of the cluster. When b(i) > a(i), as is usually but not always the case, the silhouette reduces to 1 − a(i)/b(i), an intuitive measure of how clearly the ith observation belongs to its assigned cluster. An obvious CT extension is to define, for t ∈ I,

s(t) = [b(t) − a(t)] / max{a(t), b(t)},

where the mean distances a(t), b(t) are defined as above but as integrals over clusters, divided by the length of the relevant cluster. It is straightforward to approximate these integrals by the trapezoidal rule, and thereby obtain s(t), at each of a grid of equally spaced time points spanning I. The left panel of Figure A10 displays s(t) for the CT 3-means clustering of the Chicago air pollution data of §9. Within most of the subintervals into which each cluster is divided, the silhouette width is low near the left boundary (as one would expect near a transition between clusters), rises to a peak, then falls again near the right boundary. For the long 1991-92 subinterval of cluster 3 discussed in the main text, there are two peaks with a trough in between. This trough occurs around December 1991 and represents times that are on the borderline between the assigned cluster (3) and the cluster to which the winter is ordinarily assigned (1).
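A grid-based sketch of the CT silhouette, with simple means over grid points standing in for the trapezoidal-rule integrals, and with names of our own choosing:

## xg: p x n_grid matrix of x(t) on an equally spaced grid; cl: cluster labels
## of the grid points. Returns the approximate s(t) at each grid point.
ct_silhouette <- function(xg, cl) {
  k <- max(cl)
  sapply(seq_len(ncol(xg)), function(i) {
    d  <- sqrt(colSums((xg - xg[, i])^2))                 # distances from x(t_i)
    md <- sapply(seq_len(k), function(g) mean(d[cl == g])) # mean distance per cluster
    a  <- md[cl[i]]                                       # own cluster
    b  <- min(md[-cl[i]])                                 # nearest other cluster
    (b - a) / max(a, b)
  })
}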
The right panel of Figure A10 shows the average of s(t), over 6000 grid points, for CT k-means clustering of the air pollution data with k ranging from 2 to 9. Rousseeuw (1987) suggests that "one way to choose k 'appropriately' is to select that value of k for which s̄(k) [mean silhouette width] is as large as possible." According to this widely adopted proposal, the plot indicates that k = 3, while preferable to k = 4, . . ., 9, gives a less successful clustering of this data set than k = 2 (due, apparently, to the relatively low s(t) within the spring/autumn cluster). On the other hand, Rousseeuw (1987) did not advocate a rigid policy of treating the mean-silhouette-maximizing k as the true value to the exclusion of all others; he left open the possibility that different numbers of clusters might provide alternative informative views of the data. This is implied both by the scare quotes in the above quotation, and the fact that Rousseeuw (1987) himself presented both 2- and 3-cluster partitions of an example data set from political science. Adopting a similarly flexible approach, we felt that notwithstanding the mean silhouette results, k = 3 gives a more interesting cluster partition than k = 2 for the air pollution data, and we therefore chose to present the 3-means clustering in the text.
The 3-means clustering of this data set was run 50 times. In 46 instances, there were 54 cluster transitions, exactly as in the solution described in §9. The timings of each of these transitions never differed by more than one day between any two of these 46 clusterings. In the other 4 instances, there were 40 cluster transitions, whose timings were again very consistent, with discrepancies that were uniformly less than one day. The 54-transition solution that was usually chosen had a lower objective function (total within-cluster integrated squared distance) than the 40-transition solution; the ratio of between to total integrated squared distance, a metric that is analogous to a standard measure of clustering quality, is about 88% for the former solution and about 80% for the latter. These results suggest that for this application, 3-means clustering is quite stable, with about 90% of starting values leading to an optimal partition. At the same time, the suboptimal solution attained in several instances underscores the importance of trying multiple starting values.

I Multivariate Gaussian processes
The true functions x(·) in the simulation study of §8 arise from a multivariate Gaussian process, which can be defined as follows (Chen et al., 2020). Given a mean function µ : I → R^p, a between-curve covariance matrix Σ ∈ R^{p×p} and a within-curve covariance function Γ on I × I, we say that X = [X_1(t), . . ., X_p(t)]^T is a multivariate Gaussian process with these parameters if, for any finite set of time points, the values X_u(t_i) are jointly Gaussian with E[X_u(t)] = µ_u(t) and Cov[X_u(s), X_v(t)] = σ_uv Γ(s, t), where σ_uv is the (u, v) entry of Σ.
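As an illustration, a bivariate draw from this process, plus observation noise as in §8, can be simulated as in the following sketch; the function name sim_bivariate_gp and the jitter term are our own choices.

## One pair of noisy curves from a bivariate Gaussian process on [0, 1] with
## correlation rho and squared exponential length parameter ell, as in (11).
sim_bivariate_gp <- function(n = 500, rho = 0.5, ell = 0.1, sigma = 0.5) {
  tt    <- seq(0, 1, length.out = n)
  Gamma <- exp(-outer(tt, tt, "-")^2 / (2 * ell^2))    # within-curve covariance
  Sigma <- matrix(c(1, rho, rho, 1), 2, 2)             # between-curve covariance
  K     <- kronecker(Sigma, Gamma) + diag(1e-8, 2 * n) # jitter for stability
  x     <- drop(crossprod(chol(K), rnorm(2 * n)))      # draw from N(0, K)
  cbind(x1 = x[1:n], x2 = x[(n + 1):(2 * n)]) +
    matrix(rnorm(2 * n, sd = sigma), n, 2)             # add observation noise
}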
J Additional simulation results

Figure A11 displays results analogous to those in the bottom row of Figure 4, but with underlying between-curve correlation 0.2 and 0.8. Figure A12 summarizes the simulations in which the noise standard deviation σ and the number of observation times n were varied. In the bumpiest setting (ℓ = 0.02), the CT correlation estimate is relatively inaccurate at n = 50, but as n increases its median error drops well below that for ordinary correlation. The lower part of Figure A12 shows the ratio of the root mean square error (RMSE) for CT versus ordinary correlation. For ℓ = 0.02, regardless of σ, CT correlation performs much worse with n = 50 and somewhat worse with n = 100 in terms of RMSE, but outperforms ordinary correlation for n ≥ 200. For ℓ = 0.1, 0.3, i.e. for smoother curves, CT correlation has lower RMSE for all values of n. In summary, CT correlation is more accurate than ordinary correlation except when the function is too bumpy and/or is observed too sparsely.
Computing time for the CT correlation estimate r * , including B-spline smoothing of the two functions, depended primarily on n.On an HP Pavilion x360 with 8 GB of RAM, with the other settings fixed at ℓ = 0.02, σ = 0.5, the computing time in seconds was 0.35, 0.53, 0.97, 1.42 for n = 100, 500, 1000, 2000, respectively.

Figure 2 :
Figure 2: Top: Mean daily temperatures at 35 Canadian weather stations (Ramsay and Silverman, 2005). Middle: The same data after smoothing with respect to a Fourier basis. Below: Continuous-time correlations among daily mean temperatures in the 35 weather stations, visualized by the corrgram method (Friendly, 2002) implemented in Wei and Simko (2021), before (left) and after (right) removing the common seasonal trend.

Figure 3 :
Figure 3: Left: Domestic fixed investment (above) and female secondary school enrollment (below) for DR Congo, with spline smooths ± 2 standard errors. CT correlation is computed for 1991-2021, the overlap of the two variables' time ranges, resulting in an estimate of 0.977. Right: Histograms of country-level CT correlations for each pair of variables among the five WDIs considered.

Figure 4 :
Figure 4: Top row: Examples of bivariate Gaussian processes on [0, 1] with underlying between-curve correlation 0.5 and squared exponential within-curve correlation (11), with length parameter ℓ = 0.02, 0.1, 0.3. Middle row: Noisily observed versions of the same example pairs of curves. Bottom row: Realized correlations r*, and estimates r, r̂* by ordinary and continuous-time correlation, for simulated bivariate processes, with each panel displaying results for 50 simulated noisy curve pairs akin to that shown in the panel directly above it. Solid line is the line of identity; dashed vertical line indicates the process correlation 0.5.


Figure 5 :
Figure 5: Top: The four standardized and smoothed variables from the Chicago air pollution data. Vertical lines indicate transitions among the clusters found by CT 3-means clustering, with cluster numbers shown along the top of the plot. Middle: The same cluster transitions displayed separately by year. Bottom: Trajectory of the first two PCs of the Chicago air pollution data. Each label (1987, . . ., 2000) marks the end of the year indicated.

Figure 6 :
Figure 6: Left: The first two linear discriminants for the Chicago air pollution data, displayed as functions of date. Center: Mean values of the first two discriminants for each of the 14 years. Right: Mean PM_10 and mean SO_2 for each year.

Figure A7 :
Figure A7: Base-10 log mean daily precipitation in mm for the 35 Canadian weather stations of Ramsay and Silverman (2005), in raw form (above) and after smoothing (middle). Below: Correlations among the 35 weather stations, obtained by treating the data as n = 365 discrete observations (left) and by the continuous-time approach (right).

Figure A8 :
Figure A8: The 35 Canadian weather stations, color-coded by location on the Atlantic or Pacific coast or inland.

Figure A9 :
Figure A9: Loadings of the first three continuous-time principal components for an example EEG trial. Panel titles give the percent variance explained by each PC.


Figure A10 :
Figure A10: At left, a plot of the silhouette s(t) for the CT 3-means clustering of the Chicago air pollution data presented in §9.At right, average silhouette width of k-means clusterings of the data, for k = 2, 3, . . ., 9.

Figure A11 :
Figure A11: Simulation results, analogous to those in the bottom row of Figure 4, but with underlying between-curve correlation 0.2 (above) and 0.8 (below).