Matrix factorization for multivariate time series analysis

Matrix factorization is a powerful data analysis tool. It has been used in multivariate time series analysis, leading to the decomposition of the series in a small set of latent factors. However, little is known on the statistical performances of matrix factorization for time series. In this paper, we extend the results known for matrix estimation in the i.i.d setting to time series. Moreover, we prove that when the series exhibit some additional structure like periodicity or smoothness, it is possible to improve on the classical rates of convergence.


Introduction
Matrix factorization is a very powerful tool in statistics and data analysis. It was used as early as in the 70's in econometrics in reduced-rank regression [27,22,29]. There, matrix factorization is mainly a tool to estimate a coefficient matrix under a low-rank constraint. There was recently a renewed interest in matrix factorization as a data analysis tool for huge datasets. Nonnegative matrix factorization (NMF) was introduced by [35] as a tool to represent a huge number of objects as linear combinations of elements of "parts" of objects. The method was indeed applied to large facial image datasets and the dictionary indeed contained typical parts of faces. Since then, various methods of matrix factorization were successfully applied such various fields as collaborative filtering and recommender systems on the Web [34,50], document clustering [45], separation of sources in audio processing [42], missing data imputation [26], quantum tomography [23,25,52,7,39], medical image processing [21] topics extraction in texts [43] or transports data analysis [12]. Very often, matrix factorization provides interpretable and accurate representations of the data matrix as the product of two much smaller matrices. The theoretical performances of matrix completion were studied in a series of papers by Candès with many co-authors [10,11,9]. Minimax rates for matrix completion and more general matrix estimation problems were derived in [32,8,30,31,41]. Bayesian estimators and aggregation procedures were studied in [1,46,38,2,37,4,17,36,16,15].
To apply matrix factorization techniques to multivariate time series is a very natural idea. First, the low-rank structure induced by the factorization leads to high correlations that are indeed observed in some applications (this structure is actually at the core of cointegration models in econometrics [20,28,5]). Moreover, the factorization provides a decomposition of each series in a dictionary which member that can be interpreted as latent factors used for example in state-space models, see e.g. Chapter 3 in [33]. For this reasons, matrix factorization was used in multivariate time series analysis beyond econometrics: electricity consumptions forecasting [18,40], failure detection in transports systems [47], collaborative filtering [24], social media analysis [44] to name a few.
It is likely that the temporal structure in the data can be exploited to obtain an accurate and sensible factorization: autocorrelation, smoothness, periodicity... Indeed, while some authors use matrix factorization as a black box for data analysis, others propose in a way or another to adapt the algorithm to the temporal structure of the data [53,44,14,24]. However, there is no theoretical guarantee that this leads to better predictions or better rates of convergence. Even worse, all the aforementionned theoretical studies are only valid for i.i.d observations. The objective of this paper is to address both issues.
Consider for example that one observes a d series (x i,t ) T t=1 = X and assume that X = M + ε where M is a rank k matrix and ε is some noise. Theorem 3 in [32] implies that there is an estimatorM 1 of M, such that 1 ), up to log terms, with large probability, under the assumption that the noise is i.i.d. Moreover, Theorem 5 in the same paper shows that this rate cannot be improved.
) even when each noise series (ε i,t ) T t=1 is auto-correlated. Moreover, we design estimators taking into account a possible periodicity or smoothness of the series. This is done by rewritingŴ =VΛ where Λ is a τ × T matrix encoding the temporal structure.
We obtain the following rates: All the results are first stated under a known structure, that is, we assume that we know the rank k, the period τ or the smoothness β of the series. We provide at the end of the paper a model selection procedure that allows to obtain the same rates of convergence without assuming this prior knowledge.
The paper is organized as follows. In Section 2 we introduce the notations that will be used in all the paper. In Subection 3.1 we study matrix factorization without additional temporal structure. In Section 3.2, we study the estimatorM =ÛVΛ in the general case, and show how it improves the rates of convergence for a well chosen matrix Λ for periodic and/or smooth series. Finally, adaptation to unknown rank, periodicity and/or smoothness is tackled in Section 4. The proofs are given in Section 5.

Setting of the problem and notations
Assume that we observe a multivariate series where d ∈ N * and T ∈ N\{0, 1}. This multivariate series is modelled as a stochastic process. We actually assume that where ε is a noise and M is a matrix of rank k ∈ 1, T . Then, there exist U ∈ M d,k (R) and W ∈ M k,T (R) such that M = UW. We will refer to W as the dictionary or as the latent series.
We also want to model more structure in M. This is done by rewritting W = VΛ, where τ ∈ N \ {0}, V ∈ M k,τ and Λ ∈ M τ,T , where Λ is a known matrix. The matrix Λ depends on the structure assumed on M.
Example 2.1 (Periodic series). Assume that T = pτ with p ∈ N * for the sake of simplicity. To assume that the latent series in the dictionary W are τ -periodic is exactly equivalent to writing W = VΛ where V ∈ M k,τ (R) and Λ = (I τ | . . . |I τ ) ∈ M τ,T (R) is defined by blocks, I τ being the indentity matrix in M τ,τ (R).

Example 2.2 (Smooth series).
We can assume that the series in W are smooth. For example, say that they belong a a Sobolev space with smoothness β, we have where (e n ) n∈N is the Fourier basis (the definition of a Sobolev space is reminded is Section 3.2 below). Of course, there are infinitely many coefficients U i,n and to estimate them all is not feasible, however, for τ large enough, the approximation will be suitable, and can be rewritten as W = UΛ where Λ i,t = e i (t/T ). More details will be given in Section 3.2, where we actually cover more general basis of functions.
So our complete model will finally be written as such that rank(Λ) = τ are known (note that the unstructured case corresponds to τ = T and Λ = I T ). Note that more constraint can be imposed on the estimator. For example, in nonnegative matrix factorization [35], one imposes that all the entries inÛ andŴ are nonnegative. Here, we will more generally assume thatÛV belongs to some prescribed subset S ⊆ M d,T (R).
In what follows, we will consider two norms on M d,T . For a matrix A, the Frobenius norms is given by

and the operator norm by
Ax where · is the Euclidean norm on R T .
2.1. Estimation by empirical risk minimization. By multiplying both sides in (2) by the pseudo-inverse Λ + = Λ * (ΛΛ * ) −1 , we obtain the "simplified model" In this model, the estimation of M can be done by empirical risk minimization: where Therefore, we can define the estimator M S = M S Λ of M.
In Section 3, we study the statistical performances of this estimator. The first step is done in Subsection 3.1, where we derive upper bounds on

Oracle inequalities
Throughout this section, assume that ε fulfills the following..
We remind (see e.g Chapter 1 in [13]) that when X ∼ N (0, I n ), for some universal constant C > 0 (that is, C does not depend on n). Thus, for Gaussian noise, Assumption 3.1 is satisfied and K ε = C does not depend on the dimension T .
As a consequence, if we have indeed M ∈ S, then with large probability, Thus, we recover the rate O( k(d+T ) dT ) claimed in the introduction, up to the constant Σ ε op . Remark 3.3. Since the bound relies on the constant Σ ε op , let us provide its value in some special cases: More generally, when ε 1,1 , . . . , ε 1,T are uncorrelated, (2) Let (η t ) t∈Z be a white noise of standard deviation σ > 0 and assume that there exists θ ∈ R * such that ε 1,t = η t − θη t−1 for every t ∈ 1, T . In other words, (ε 1,t ) t=1,...,T is the restriction of a MA(1) process to 1, T . So, and then (3) Let (η t ) t∈Z be a white noise of standard deviation σ > 0 and assume that there is a ρ with |ρ| < 1 such that ε 1,t = ρε 1,t−1 + η t . So (ε 1,t ) t=1,...,T is the restriction of a AR(1) process to 1, T . So, where 3.2. The general case. Let us now come back to the general case. An application of Theorem 3.2 to the "simplified model" (2.1) shows that for any λ ∈]0, 1[ and s ∈ R + , with probability larger than 1 − 2e −s .
In order to obtain the desired bound on M S − M 2 F , we must now understand the behaviour of Σ εΛ + op and K ε .
The situation regarding Kε is different, we are not aware of a general simple upper bound on Kε = K εΛ + in terms of K ε and Λ + . Still, there are two cases where we actually have Kε = K ε . Indeed, in the Gaussian case, Kε = K ε = C, see (4) above. For non Gaussian noise, we have the following result.

Note that the assumption on Λ is fullfilled by the examples covered in Subsections 3.3 and (3.4).
The previous discussion legitimates the following assumption.
Finally, note that and in the same way By Inequality (5) together with Lemmas 3.4 and 3.5, we obtain the following result.
We will now apply this corollary to various examples.
3.3. Application: periodic time series. In the case of τ -periodic time series, remind that we assumed for simplicity that there is an integer p such that τ p = T and we defined Λ = (I τ | . . . |I τ ) ∈ M τ,T (R). Then Therefore, by Corollary 3.7, for every λ ∈]0, 1[ and s ∈ R + , under Assumptions 3.1 and 3.6, with probability larger than 1 − 2e −s . Now, define S = {A ∈ M n,T (R): rank(A) k and ∀i, ∀t, A i,t+τ = A i,t } and assume that M ∈ S. Then: which is indeed an improvement with respect to the rate obtained without taking the periodicity into account, that is O( k(d+T ) dT ).

3.4.
Application: time series with smooth trend. Assume we are given a dictionary of functions (e n ) |n| N for some finite N ∈ N. This dictionary can for example be a finite subset of a basis of an Hilbert space (e n ) n∈Z , like the Fourier basis or a wavelet basis. Define Note that Λ N is a τ × T matrix where τ = 2N + 1.

Assume that
This implies that (Λ N Λ * N ) −1 op = 1/T and Λ N Λ * N op = T . This can be the case for a well-chosen basis, otherwise, we can apply the Gram-Schmidt to the dictionary of functions. Therefore, by Corollary 3.7, for every λ ∈]0, 1[ and s ∈ R + , under Assumptions 3.1 and 3.6, with probability larger than 1 − 2e −s . We will now show the consequences of these results when the rows of M are smooth in the sense that they belong to a given Sobolev ellipsoid. In this case, we will not have a A such that AΛ − M 2 F = 0, but this quantity will be small and can be controlled as a function of N . We introduce a few definitions. From now, we assume that e n (x) = e 2iπnx is the Fourier basis. It is well-known from Chapter 1 in [49] that any f ∈ W (β, L) and x ∈ [0, 1], and that there is a (known) constant C(β, L) > 0 such that 1, (2) for any ℓ ∈ 1, k and t ∈ 1, T , W ℓ,t = f ℓ t T for some f ℓ ∈ W (β, L). Denote V N,W = (c n (f ℓ )) ℓ k,|n| N .
Pluging this into (6) gives If β is known, an adequate optimization with respect to N gives the following result.
However, in practice, β is not known -nor the rank k. This problem is tackled in the next section.
Consider s ∈ R + and the penalized estimator M s = M S τs , ks with with probability larger than 1 − 2e −s .
Remark 4.2. The reader might feel uncomfortable with the fact that the model selection procedure leads tok s andτ s that depend on the prescribed confidence level s. Note that if k = k 0 is known, that is K = {k 0 }, then it is clear from the definition thatτ s actually does not depend on s.
As an application, assume that M ∈ S(k, β, L) where k is known, but β is unknown. Then the model selection procedure is feasible as it does not depend on β, and it satisfies exactly the same rate asM S(k,β,L) in Corollary 3.11.

Additional notations. Let us first introduce a few additional notations.
First, for the sake of shortness, we introduce the estimation risk R and the empirical risk r. These notations also make clear the fact that our estimator can be seen as an empirical risk minimizer.

Some lemmas.
Let us now state the key lemmas for the proof of our results.
The first one will be used to estimate how far from the minimizer of R is the minimizer of r.
Then, for any λ ∈]0, 1[, On the one hand, by Equation (10) together with the classic inequality 2ab a 2 +b 2 for every a, b ∈ R,

So, Inequality (8) it true.
On the other hand, by Equation (10) together with the classic inequality −2ab a 2 + b 2 for every a, b ∈ R, So, Inequality (9) it true.
In the proof of the theorems, A will be replaced by an estimator of M that will be data dependent. Thus, it is now crucial to obtain uniform bounds on the scalar product in Lemma 5.1. In machine learning theory, concentration inequalities are the standard tools to derive such a uniform bound, see [6] for a comprehensive introduction to concentration inequalities for independent observations, and their applications to statistics. Some inequalities for time series can be found for example in [19], and were applied to machine learning in [3]. Here, we require more specifically a concentration inequality on random matrices. Such inequalities can be found in [48,51]. We will actually use the following result (Theorem 5.39 and Remark 5.40.2 from [51]). As the proof can be found in [51], we don't reproduce it here.
Proposition 5.2. Under Assumption 3.1, there exists a deterministic constant m > 1, not depending on ε, d and T , such that for every s ∈ R + , with probability larger than 1 − 2 exp(−s/K 4 ε ). We are now in position to provide a uniform bound on the scalar product in Lemma 5.1.
Finally, follow the proof of Corollary 3.7 to obtain, on the same event with probability at least 1 − 2e −s , for any τ and k, Plugging this into (14) gives, with probability at least 1 − 2e −s , Finally, note that k d so This ends the proof. *CREST, ENSAE ParisTech, Université Paris Saclay, Palaiseau, France E-mail address: pierre.alquier@ensae.fr