Randomized CP tensor decomposition

The CANDECOMP/PARAFAC (CP) tensor decomposition is a popular dimensionality-reduction method for multiway data. Dimensionality reduction is often sought after since many high-dimensional tensors have low intrinsic rank relative to the dimension of the ambient measurement space. However, the emergence of ‘big data’ poses significant computational challenges for computing this fundamental tensor decomposition. By leveraging modern randomized algorithms, we demonstrate that coherent structures can be learned from a smaller representation of the tensor in a fraction of the time. Thus, this simple but powerful algorithm enables one to compute the approximate CP decomposition even for massive tensors. The approximation error can thereby be controlled via oversampling and the computation of power iterations. In addition to theoretical results, several empirical results demonstrate the performance of the proposed algorithm.


Introduction
Advances in data acquisition and storage technology have enabled the acquisition of massive amounts of data in a wide range of emerging applications. In particular, numerous applications across the physical, biological, social and engineering sciences generate large multidimensional, multi-relational and/or multi-modal data. Efficient analysis of this data requires dimensionality reduction techniques. However, traditionally employed matrix decompositions techniques such as the singular value decomposition (SVD) and principal component analysis (PCA) can become inadequate when dealing with multidimensional data. This is because reshaping multi-modal data into matrices, or data flattening, can fail to reveal important structures in the data.
Tensor decompositions overcome the information loss from flattening. The canonical CP tensor decomposition expresses an N-way tensor as a sum of rank-one tensors to extract multi-modal structure. It is particularly suitable for data-driven discovery, as shown by Hong et al (2020) for various learning tasks on real world data. The idea of tensor compression, for instance, eases computational bottlenecks by constructing smaller (compressed) tensors, which are then used as a proxy to efficiently compute CP decompositions. Compressed tensors may be obtained, for example, using the Tucker decomposition of a tensor into one small core tensor and N unitary matrices However, this approach requires the expensive computation of the left singular vectors for each mode.
Related work. This computational challenge can be tackled using modern randomized techniques developed to compute the SVD. Tsourakakis (2010) presents a randomized Tucker decomposition algorithm based on the random sparsification idea of Achlioptas and McSherry (2007) for computing the SVD. Zhou et al (2014) proposed an accelerated randomized CP algorithm using the ideas of Halko et al (2011) without the use of power iterations. An alternative randomized tensor algorithm proposed by Drineas and Mahoney (2007) uses random column selection. Related work by Vervliet et al (2014) proposed a sparsity-promoting algorithm for incomplete tensors using compressed sensing.
Another efficient approach to building large-scale tensor decompositions is the subdivision of a tensor into a set of blocks. These smaller blocks can then be used to approximate the CP decomposition of the full tensor in a parallelized or distributed fashion (Phan and Cichocki 2011). Sidiropoulos et al (2014) fused random projection and blocking into a highly computationally efficient algorithm. More recently, Vervliet and Lathauwer (2016) proposed a block sampling CP decomposition method for the analysis of large-scale tensors using randomization, demonstrating significant computational savings while attaining near optimal accuracy. These block based algorithms are particularly relevant if the tensor is too large to fit into fast memory.

Contributions.
We present the randomized CP algorithm, which is closely related to the randomized methods of Halko et al (2011). Our method proceeds in two stages. The first stage uses random projections with power iterations to obtain a compressed tensor (Algorithm 1). The second stage applies either alternating least squares (ALS) or block coordinate descent (BCD) to the compressed tensor (Algorithms 2 and 3), at significantly reduced cost. Finally, the CP decomposition of the original tensor is obtained by projecting the compressed factor matrices back using the basis derived in Algorithm 1.
Randomized algorithms for CP decomposition have been proposed (Zhou et al 2014, Battaglino et al 2018, albeit without incorporating power iterations. The power iteration concept is fundamental for modern high-performance randomized matrix decompositions, but to the best of our knowledge, has not been applied in the context of tensors. Without power iterations, the performance of randomized algorithms based on random projections can suffer considerably in the presence of white noise. We combine power iterations with oversampling to further control the error of the decomposition.
Embedding the CP decomposition into this probabilistic framework allows us to achieve significant computational savings, while producing near-optimal approximation quality. For motivation, figure 1 shows the computational savings of a rank k = 20 approximation for varying tensor dimensions. Here we compare the ALS and BCD solver for computing the deterministic and randomized CP decomposition. The computational advantage of the randomized algorithms is significant, while sacrificing a negligible amount of accuracy.
Organization. The paper is organized as follows: section 2 briefly reviews the CP decomposition and randomized matrix decompositions. Section 3 introduces the randomized CP tensor decomposition algorithm. Section 4 presents the evaluation of the computational performance, and examples. Finally, section 5 summarizes the research findings.

Background
Ideas for multi-way factor analysis emerged in the 1920s with the formulation of the polyadic decomposition by Hitchcock (1927). However, the polyadic tensor decomposition only achieved popularity much later in the 1970s with the canonical decomposition (CANDECOMP) in psychometrics, proposed by Carroll and Chang (1970). Concurrently, the method of parallel factors (PARAFAC) was introduced in chemometrics by Harshman (1970). Hence, this method became known as the CP (CANDECOMP/PARAFAC) decomposition. However, in the past, computation was severely inhibited by available computational power. Today, tensor decompositions enjoy increasing popularity, yet runtime bottlenecks persist.

Notation
Scalars are denoted by lower case letters x, vectors as bold lower case letters x, and matrices by bold capitals X. Tensors are denoted by calligraphic letters X . The mode-n unfolding of a tensor is expressed as X (n) , while the mode-n folding of a matrix is defined as X (n) . The vector outer product, the Kronecker product, the Khatri-Rao product and the Hadamard product are denoted by • , ⊗, ⊙, and * , respectively. The inner product of two tensors is expressed as ⟨·, ·⟩, and ∥ · ∥ F denotes the Frobenius norm for both matrices and tensors.

CP decomposition
The CP decomposition is the tensor equivalent of the SVD since it approximates a tensor by a sum of rank-one tensors. Here, tensor rank is defined as the smallest sum of rank-one tensors required to generate the tensor (Hitchcock 1927). The CP decomposition approximates these rank-one tensors. Given a third order tensor X ∈ R I×J×K , the rank-R CP decomposition is expressed as where • denotes the outer product. Specifically, each rank-one tensor is formulated as the outer product of the rank-one components a r ∈ R I , b r ∈ R J , and c r ∈ R K . Components are often constrained to unit length with the weights absorbed into the vector λ = [λ 1 , ..., λ R ] ∈ R R . Equation (1) can then be re-expressed as (See figure 2) More compactly the components can be expressed as factor matrices, i.e. A = [a 1 , a 2 , ..., a R ], Using the Kruskal operator as defined by Kolda and Bader (2009)

Randomized matrix algorithms
The efficient computation of low rank matrix approximations is a ubiquitous problem in machine learning and data mining. Randomized matrix algorithms have been demonstrated to be highly competitive and robust when compared to traditional deterministic methods. Randomized algorithms aim to construct a smaller matrix (henceforth called sketch) designed to capture the essential information of the source matrix (Mahoney 2011, Drineas andMahoney 2016). The sketch can then be used for various learning tasks. There exist several strategies for obtaining such a sketch, with random projections being the most robust off-the-shelf approach. Randomized algorithms have been in particular studied for computing the near-optimal low-rank SVD (Frieze et al 2004, Liberty et al 2007, Woolfe et al 2008. Following the seminal work by Halko et al (2011), a randomized algorithm computes the low-rank matrix approximation where the target rank is denoted as k, and assumed to be k ≪ min{m, n}. Intuitively, we construct a sketch Y ∈ R m×k by forming a random linear weighted combination of the columns of the source matrix A. More concisely, we construct the sketch as where ∈ R m×k is a random test matrix with entries drawn from, for example, a Gaussian distribution. If A has exact rank k, then the sketch Y spans with high probability a basis for the column space of the source matrix. However, most data matrices do not feature exact rank (i.e. the singular values {σ i } n i=k+1 are non-zero). Thus, instead of just using k samples, it is favorable to construct a slightly oversampled sketch Y ∈ R m×l which has l = k + p columns, were p denotes the number of additional samples. In most situations small values p = {10, 20} are sufficient to obtain a good basis that is comparable to the best possible basis (Martinsson 2016). An orthonormal basis Q ∈ R m×l is then obtained via the QR-decomposition Y = QR, such that where B ∈ R l×n . (Note, that equation (4) requires a second pass over the source matrix.) The matrix B can then be used to efficiently compute the matrix decomposition of interest, e.g. the SVD. The approximation error can be controlled by a combination of oversampling and power iteration (Rokhlin et al 2009, Halko et al 2011, Gu 2015. Randomized matrix algorithms are not only pass efficient, but they also have the ability to exploit modern computational parallelized and distributed computing architectures. Implementations in MATLAB, C and R are provided by Szlam et al (2014), Voronin and Martinsson (2015), and Erichson et al (2016).

Randomized CP decomposition
Given a third order tensor X ∈ R I×J×K , the objective of the CP decomposition is to find a set of R normalized rank-one tensors {a r • b r • c r } R r=1 which best approximates X , i.e. minimizes the Frobenius norm The challenge is that this problem is highly nonconvex and unlike PCA there is no closed-form solution. Solution methods for this optimization problem therefore rely on iterative schemes (e.g. alternating least squares). These solvers are not guaranteed to find a global minimum, yet for many practical problems, they can find high-quality solutions. Iterative schemes, however, can be computationally demanding if the dimensions of X are large. Fortunately, only the column spaces of the modes are of importance for obtaining the factor matrices A, B, C, rather then the individual columns of the mode matricizations X (1) , X (2) , X (3) of the tensor X . This is because the CP decomposition learns the components based on proportional variations in inter-point distances between the components. Therefore, a compressed tensor B ∈ R k×k×k must preserve pairwise Euclidean distances, where k ≥ R. This in turn requires that column spaces, and thus pairwise distances, are approximately preserved by compression -this can be achieved by generalizing the concepts of randomized matrix algorithms to tensors. We build upon the methods introduced by Martinsson Figure 3. Schematic of the randomized CP decomposition architecture. First, a degree of randomness is employed to derive a smaller tensor B from the big tensor X . Then, the CP decomposition is performed on B. Finally, the near-optimal high-dimensional factor matrices A, B and C are recovered from the approximate factor matricesÃ,B andC.
et al (2011), as well as related work on randomized tensors by Drineas and Mahoney (2007), who proposed a randomized algorithm based on random column selection. Figure 3 shows the schematic of the randomized CP decomposition architecture. Our approach proceeds in two stages. The first stage, detailed in section 3.1, applies random projections with power iteration to obtain a compressed tensor B, with expressivity analysis in section 3.1.1. In section 3.2, we describe two algorithms for performing CP on the compressed tensor B and approximating the original CP factor matrices. Section 3.3 provides additional details about our implementation in Python.

Randomized tensor algorithm
The aim is to use randomization as a computational resource to efficiently build a suitable basis that captures the action of the tensor X . Assuming an N-way tensor X ∈ R I1×···×IN , the aim is to obtain a smaller compressed tensor B ∈ R k×···×k , so that its N tensor modes capture the action of the input tensor modes. Hence, we seek a natural basis in the form of a set of orthonormal matrices {Q n ∈ R In×Rn } N n=1 , so that Here the operator × n denotes tensor-matrix multiplication defined as follows.
Definition 1. The n-mode matrix product X × n Q n Q ⊤ n multiplies a tensor by the matrices Q n Q ⊤ n in mode n, i.e. each mode-n fiber is multiplied by Given a fixed target rank k, these basis matrices can be efficiently obtained using a randomized algorithm. First, the approximate basis for the nth tensor mode is obtained by drawing k random vectors ω 1 , . . . , ω k ∈ R ∏ i̸ =n I i from a Gaussian distribution. (Note that if N is large, it might be favorable to draw the entries of ω from a quasi random sequence, e.g. Halton or Sobol sequence. These so-called quasi-random numbers are known to be less correlated in high-dimensions.) Stacked together, the k random vectors ω form the random test matrix n ∈ R ∏ i̸ =n I i ×k used to sample the column space of where Y n ∈ R In×k is the sketch. The sketch serves as an approximate basis for the range of the nth tensor mode. Probability theory guarantees that the set of random vectors {ω i } k i=1 are linearly independent with high probability. Hence, the corresponding random projections y 1 , . . . , y k efficiently sample the range of a rank deficient tensor mode X (n) . The economic QR decomposition of the sketch Y n = Q n R n is then used to obtain a natural basis, so that Q n ∈ R In×k is orthonormal and has the same column space as Y n . The final step restricts the tensor mode to this low-dimensional subspace Thus, after N iterations a compressed tensor B and a set of orthonormal matrices is obtained. Since this is an iterative algorithm, we set X ← B n after each iteration. The number of columns of the basis matrices form a trade-off between accuracy and computational performance. We aim to use as few columns as possible, yet allow an accurate approximation of the input tensor. Assuming that the tensor X exhibits low-rank structure, or equivalently, the rank R is much smaller than the ambient dimensions of the tensor, the basis matrices will be an efficient representation. In practice, the compression performance is improved through using oversampling, i.e. drawing l = k + p random vectors where k is the target rank and p the oversampling parameter.
The randomized algorithm as presented requires that the mode-n unfolding of the tensor has a rapidly decaying spectrum in order to achieve good performance. However, this assumption is often not suitable, and the spectrum exhibits slow decay if the tensor is compressed several times. To overcome this issue, the algorithm's performance can be substantially improved using power iterations (Rokhlin et al 2009, Halko et al 2011, Gu 2015. Power iterations turn a slowly decaying spectrum into a rapidly decaying one by taking powers of the tensor modes. Thus, instead of sampling X (n) we sample from the following tensor mode where q denotes a small integer. This power operation enforces that the singular values σ j of X q (n) are {σ 2q+1 j } j . Instead of using (7), an improved sketch is computed However, if (9) is implemented in this form the basis may be distorted due to round-off errors. Therefore in practice, normalized subspace iterations are used to form the sketch, meaning that the sketch is orthornormalized between each power iteration in order to stabilize the algorithm. For implementation, see Voronin and Martinsson (2015) and Szlam et al (2014). The combination of oversampling and additional power iterations can be used to control the trade-off between approximation quality and computational efficiency of the randomized tensor algorithm. Our results, for example, show that just q = 2 subspace iterations and an oversampling parameter of about p = 10 achieves near-optimal results. Algorithm 1 summarizes the computational steps.

Expressivity analysis.
The average behavior of the randomized tensor algorithm is characterized using the expected residual error Theorem 1 generalizes the matrix version of Theorem 10.5 formulated by Halko et al (2011) to the tensor case.
Theorem 1. Consider a low-rank real N-way tensor X ∈ R I1×···×IN . Then the expected approximation error, given a target rank k ≥ 2 and an oversampling parameter p ≥ 2 for each mode, is where σ nj denotes the j singular value of the mode-n unfolding of the source tensor X .
For the proof see appendix A. Intuitively, the projection of each tensor mode onto a low-dimensional space introduces an additional residual. This is expressed by the double sum on the right hand side. If the low-rank approximation captures the column space of each mode accurately, then the singular values j > k for each mode n are small. Moreover, the error can be improved using the oversampling parameter. The computation of additional power (subspace) iterations can improve the error further. This result again follows by generalizing the results of Halko et al (2011) to tensors. Sharper performance bounds for both oversampling and additional power iterations can be derived following, for instance, the results by Witten and Candes (2015).

Optimization strategies
There are several optimization strategies for minimizing the objective function defined in (12), of which we consider, alternating least squares (ALS) and block coordinate descent (BCD). Both methods are suitable to operate on a compressed tensor B ∈ R k×···×k , where k ≥ R. The optimization problem (5) is reformulated Algorithm 1 A prototype randomized tensor compression algorithm.
Require: An N-way tensor X , and a desired target rank k. Optional: Parameters p and q to control oversampling, and power iterations.
(1) B = X initialize compressed tensor (2) forn = 1, . . . , N iterateoveralltensormodes (3) l = k + p slight over sampling (4) I, J = dim(B (n) ) dimension of the n−th tensor mode (5) = rand(J, l) generate random test matrix (6) Y = B (n) compute sampling matrix (7) for j = 1, . . . , q normalized power iterations (optional) orthonormalize sampling matrix (13) B = B ×n Q ⊤ n project tensor to smaller space (14) end for Return: Compressed tensor B of dimension l × · · · × l, and a set of orthonormal basis matrices {Qn ∈ R In×l } N n=1 . Remark 1 Due to the convenient mathematical properties of the normal distribution, a Gaussian random test matrix is often assumed in theoretical results, however, in practice a uniform distributed random test matrix is sufficient. The performance can be further improved using structured random matrices (Woolfe et al 2008). If information is uniformly distributed across the data, randomly selected columns can also be used to build a suitable basis as well, which avoids the matrix multiplication in step (6). Remark 2 For numerical stability, normalized power iterations using the pivoted LU decomposition are computed in step 7-11. We recommend a default value of q = 2. Remark 3 In practice, the user can decide which modes to compress and specify different oversampling parameters for these modes.
Once the compressed factor matricesÃ ∈ R k×R ,B ∈ R k×R ,C ∈ R k×R are estimated, the full factor matrices can be recovered where Q 1 ∈ R I×k , Q 2 ∈ R J×k , Q 3 ∈ R K×k denote the orthonormal basis matrices. For simplicity we focus on third order tensors, but the result generalizes to N-way tensors.

ALS algorithm.
Due to its simplicity and efficiency, ALS is the most popular method for computing the CP decomposition (Comon et al 2009, Kolda andBader 2009). We note that the optimization (12) is equivalent to with respect to the factor matricesÃ,B andC. The tensor B can further be expressed in matricized form where ⊙ denotes the Khatri-Rao product. The optimization problem in this form is non-convex. However, an estimate for the factor matrices can be obtained using the least-squares method as follows. The ALS algorithm updates one component, while holding the other two components fixed, in an alternating fashion until convergence. It iterates over the following subproblems Algorithm 2 A prototype randomized CP algorithm using ALS.
Require: An I × J × K tensor X , and a desired target rank R. Optional: Parameters p and q to control oversampling, and power iterations.
(1) B, Q A , Q B , Q C = compress(X , R, p, q) compress tensor using algorithm recover factor matrices (13) re-normalize the factor matrices and update the scaling vector λ Return: Normalized factor matrices A, B, C and the scaling vector λ.
Each step therefore involves a least-squares problem which can be solved using the Khatri-Rao product pseudo-inverse. Algorithm 2 summarizes the computational steps.

Definition 2. The Khatri-Rao product pseudo-inverse is defined as
where the operator * denotes the Hadamard product, i.e. the elementwise multiplication of two equal sized matrices.
There exist few general convergence guarantees for the ALS algorithm (Uschmajew 2012, Wang andChu 2014). Moreover, the final solution tends to depend on the initial guessÃ 0 ,B 0 andC 0 . A standard initial guess uses the eigenvectors of (Bader and Kolda 2015). Further, it is important to note that normalization of the factor matrices is necessary after each iteration to achieve good convergence. This prevents singularities of the Khatri-Rao product pseudo-inverse Kolda and Bader (2009). The algorithm can be further improved by reformulating the above subproblems as regularized least-squares problems, see for instance Li et al (2013) for technical details and convergence results. The structure imposed by the ALS algorithm on the factor matrices permits the formulation of non-negative, or sparsity-constrained tensor decompositions .

BCD algorithm.
While ALS is the most popular algorithm for computing the CP decomposition, many alternative algorithms have been developed. One alternate approach is based on block coordinate descent, BCD (Xu and Yin 2013).  first proposed this approach for computing non-negative tensor factorizations. The BCD algorithm is based on the idea of successive rank-one deflation. Unlike ALS, which updates the entire factor matrix at each step, BCD computes the rank-1 tensors in a hierarchical fashion. Therefore, the algorithm treats each component A r , b r , c r as a block. First, the most correlated rank-1 tensor is computed; then the second most correlated rank-1 tensor is learned on the residual tensor, and so on. Assuming that R = r − 1 components have been computed, then the r-th compressed residual tensor Y res is defined The algorithm then iterates over the following subproblems Algorithm 3 A prototype randomized CP algorithm using BCD.
Require: An I × J × K tensor X , and a desired target rank R. Optional: Parameters p and q to control oversampling, and power iterations.
(1) B, Q A , Q B , Q C = compress(X , R, p, q) compress tensor using Algorithm 1 (2) Note that the computation can be more efficiently evaluated without explicitly constructing the residual tensor Y res (Kim et al 2014). Algorithm 3 summarizes the computation.

Implementation details
The algorithms we present are implemented in the programming language Python, using numerical linear algebra tools provided by the SciPy (Open Source Library of Scientific Tools) package (Jones et al 2001). SciPy provides MKL (Math Kernel Library) accelerated high performance implementations of BLAS and LAPACK routines. Thus, all linear algebra operations are threaded and highly optimized on Intel processors. The implementation of the CP decomposition follows the MATLAB Tensor Toolbox implementation (Bader and Kolda 2015). This implementation normalizes the components after each step to achieve better convergence. Furthermore, we use eigenvectors (see above) to initialize the factor matrices. Interestingly, randomly initialized factor matrices have the ability to achieve slightly better approximation errors, but re-running the algorithms several times with different random seeds can display significant variance in the results. Hence only the former approach is used for initialization. We note that the randomized algorithm introduces some randomness and slight variations into the CP decompositions as well. However, randomization can also act as an implicit regularization on the CP decomposition (Mahoney 2011), meaning that the results of the randomized algorithm can be in some cases even 'better' than the results of the corresponding deterministic implementation. Finally, the convergence criterion is defined as the change in fit, following Bader and Kolda (2015). The algorithm therefore stops when the improvement of the fit ρ is less then a predefined threshold, where the fit is computed using

Numerical Results
The randomized CP algorithm is evaluated on a number of examples where the near optimal approximation of massive tensors can be achieved in a fraction of the time using the randomized algorithm. Approximation accuracy is computed with the relative error ∥X −X ∥ F /∥X ∥ F , whereX denotes the approximated tensor.

Computational performance
The robustness of the randomized CP algorithm is first assessed on random low-rank tensors. Here we illustrate the approximation quality in the presence of additive white noise. Figure 4 shows the average relative errors over 100 runs for varying signal-to-noise ratios (SNR). In the case of high SNRs, all algorithms (a) Tensor of dimension 100 × 100 × 100.
(c) Tensor of dimension 100 × 100 × 100 × 100. Figure 5. Random tensor approximation and performance for rank R = 50 tensors: rCP methods achieve speedups by 1-2 orders of magnitude and the same accuracy as their deterministic counterpart.
converge towards the same relative error. However, at excessive levels of noise (i.e. SNR< 4) the deterministic CP algorithms exhibit small gains in accuracy over the randomized algorithms using q = 2 power iterations, with which both the ALS and BCD algorithm show the same performance. The performance of the randomized algorithm without power iterations (q = 0) is, however, poor, and stresses the importance of the power operation for real applications. The oversampling parameter for the randomized algorithms is set to p = 10. Increasing p can further improve accuracy. Figure 6. Illustration of the multiscale toy video. The system is governed by four spatial modes experiencing intermittent oscillations in the temporal direction. The top row shows the clean system, while the bottom row shows the noisy system with a signal-to-noise ratio of 2.
Next, the reconstruction errors and runtimes for tensors of varying dimensions are compared. Figure 5 shows the average evaluation results over 100 runs for random low-rank tensors of different dimensions, and for varying target ranks k. The randomized algorithms achieve near optimal approximation accuracy while demonstrating substantial computational savings. Interestingly, the relative error achieved by the BCD decreases sharply by about one order of magnitude when the target rank k matches the actual tensor rank (here R = 50).
The computational advantage becomes more pronounced with increasing tensor dimensions, and as the number of iterations required for convergence increase. Using random tensors as presented here, all algorithms rapidly converge after about 4 to 6 iterations. However, it is evident that the computational cost per iteration of the randomized algorithm is substantially lower. Thus, the computational savings can be even more substantial in real world applications that may require several hundred iterations to converge. Overall, the ALS algorithm is computationally more efficient than BCD, i.e. the deterministic ALS algorithm is faster than the BCD by nearly one order of magnitude, while the randomized algorithms exhibit similar computational timings.
Similar performance results are achieved for higher order tensors. Figure 5(c) shows the computational performance for a 4-way tensor of dimension 100 × 100 × 100 × 100. Again, the randomized algorithms achieve speedups of 1-2 orders of magnitude, while attaining good approximation errors. Once more, the BCD algorithm achieves better approximation at the true tensor rank.

Examples
The examples demonstrate the advantages of the randomized CP decomposition. The first is a multiscale toy video example, and the second is a simulated flow field behind a stationary cylinder. Due to the better and more natural interpretability of the BCD algorithm, only this algorithm is considered in subsequent sections.

Multiscale toy video example.
The approximation of the underlying spatial modes and temporal dynamics of a system is a common problem in signal processing. In the following, we consider a toy example presenting multiscale intermittent dynamics in the time direction. The data consists of four Gaussian modes, each undergoing different frequencies of intermittent oscillation, for 215 times steps in the temporal direction on a two-dimensional spatial grid (200 × 200). The resulting tensor is of dimension 200 × 200 × 215. Figure 6 shows the corresponding modes and the time dynamics. This problem becomes even more challenging when the underlying structure needs to be reconstructed from noisy measurements. Traditional matrix decomposition techniques such as the SVD are known to face difficulties capturing such intermittent, multiscale structure. Figure 7 displays the decomposition results of the noisy (SNR = 2) toy video for both the randomized CP decomposition and the SVD. The first subplot shows the results of a rank k = 4 approximation computed using the rCP algorithm with q = 2 power iterations, and a small oversampling parameter p = 10. The

Parameters
Time ( method faithfully captures the underlying spatial modes and the time dynamics. For illustration, the second subplot shows the decomposition results without additional power iterations. It can clearly be seen that this approach introduces distinct artifacts, and the approximation quality is relatively poor. The SVD, shown in the last subplot, demonstrates poor performance at separating the modes and mixes the spatiotemporal dynamics of modes 2 & 3. Table 1 further quantifies the observed results. Interestingly, the relative error using the randomized algorithm with q = 2 power iterations is slightly better than the deterministic algorithm. This is due to the intrinsic regularizing behavior of randomization. However, the reconstruction error without power iterations is large, as is the error resulting from the SVD.
The achieved speedup of randomized CP is substantial, with a speedup factor of about 15.
Note, the SVD requires the tensor to be reshaped in some direction. The comparison illustrates the striking difference between compression ratios. It is evident that the CP decomposition requires computing many fewer coefficients in order to approximate the tensor. Thus, less storage is required to approximate the data. This can be of importance if the bandwidth is constrained and only a limited amount of information can be transmitted, in which case the CP decomposition may be advantageous. While the CP decomposition  yields a parsimonious approximation that is more robust to noise, the advantage of the SVD is that data can be approximated with an accuracy as low as machine precision.

Flow behind a cylinder.
Extracting the dominant coherent structures from fluid flows helps to better characterize them for modeling and control (Brunton and Noack 2015). The workhorse algorithm in fluid dynamics and for model reduction is the SVD. However, fluid simulations generate high-resolution spatiotemporal grids of data which naturally manifest as tensors. In the following we examine the suitability of the CP decomposition for decomposing flow data, and compare the results to those of the SVD.
The fluid flow behind a cylinder, a canonical example in fluid dynamics, is presented here. The data are a time-series generated from a simulation of fluid vorticity behind a stationary cylinder (Colonius and Taira 2008). The corresponding flow tensor has dimension 199 × 449 × 151, consisting of 151 snapshots on a 449 × 199 spatial grid. Figure 8 shows a few example snapshots of the fluid flow.  Figure 9 shows both the approximated spatial modes and the temporal dynamics for the randomized CP decomposition and the SVD. Observe that both methods extract similar patterns, although unlike SVD, rCP spatial modes are sparse on the domain. The reason is two-fold: (i) CP does not impose orthogonality constraints (which in general reveal dense structure), and (ii) CP imposes rank-one outer product structure in the x and y directions via the columns of the factor matrices. In doing so, CP isolates the contributions of single wavenumbers (spatial frequencies) to the steady-state vortex shedding regime. These are commonly analyzed to study energy transfer between scales in complex flows and turbulence (Meneveau andKatz 2000, Sharma andMcKeon 2013), hence CP can help reveal new physically meaningful insights. In particular, randomization enables tensor decomposition of high-resolution vector fields that otherwise may not be tractable using deterministic methods. Thus rCP provides a novel decomposition of multimodal fields for downstream tasks in model reduction, pattern extraction and control. Note, that for a fixed target rank of k = 30 across all methods, the SVD achieves a substantially lower reconstruction error (see table 2).
However, the compression ratios for the CP and SVD methods are c CP ≈ 562.17 and c SVD ≈ 5.02, i.e. the CP compresses the data nearly two orders of magnitude more.

Results in presence of white noise.
Next, the analysis of the same flow is repeated in the presence of additive white noise. While this is not of concern when dealing with flow simulations, it is realistic when dealing with flows obtained from measurement. We chose a signal-to-noise ratio of 2 to demonstrate the robustness of the CP decomposition to noise.  Figure 10 shows again the corresponding dominant spatial modes and temporal dynamics. Both the SVD and the CP decomposition faithfully capture the temporal dynamics. However, comparing the modes of the SVD to figure 9, it is apparent that the spatial modes contain a large amount of noise. The spatial modes revealed by the CP decomposition provide a significantly better approximation. Again, it is crucial to use power iterations to achieve a good approximation quality (see table 3). By inspection, the relative reconstruction error using the SVD is poor compared to the error achieved using the CP decomposition. Here, we show the error for a rank k = 30 and k = 6 approximation. The target rank was determined using the optimal hard threshold for singular values (Gavish and Donoho 2014).
The CP decomposition overcomes this disadvantage, and is able to approximate the first k = 30 modes with only a slight loss of accuracy. Note that here the randomized CP decomposition performs better than the deterministic algorithm. We assume that this is due to the favorable intrinsic regularization effect of randomized methods.

Conclusion
The emergence of massive tensors require efficient algorithms for obtaining tensor decompositions.
To address this challenge, we have presented a randomized algorithm which substantially reduces the computational demands of the CP decomposition.
Indeed, randomized algorithms have established themselves as highly competitive methods for computing traditional matrix decompositions. A key advantage of the randomized algorithm is that modern computational architectures are fully exploited. Thus, the algorithm benefits substantially from multithreading in a multi-core processor. In contrast to previously proposed high-performance tensor algorithms which are based on computational concepts such as distributed computing, our proposed randomized algorithm provides substantial computational speedups even on standard desktop computers. Our proposed algorithm achieves these speedups by reducing the computational costs per iteration, which enables the user to decompose real world examples that typically require a large number of iterations to converge.
In addition to computational savings, the randomized CP decomposition demonstrates outstanding performance on several examples using artificial and real-world data, including decompositions of high-resolution flow fields that may not be tractable with deterministic methods. Moreover, our experiments show that the power iteration concept is crucial in order to achieve a robust tensor decomposition. Thus, our algorithm has a practical advantage over previous randomized tensor algorithms, at a slightly increased computational cost due to additional power iterations.

Appendix A. Proof of theorem 1
In the following, we sketch a proof for Theorem 1, which yields an upper bound for the approximate basis for the range of a tensor.
To assess the quality of the basis matrices {Q n } N n=1 , we first show that the problem can be expressed as a sum of subproblems. Defining the residual error Note that the Frobenius norm of a tensor and its matricized forms are equivalent. Defining the orthogonal projector P n ≡ Q n Q ⊤ n , we can reformulate (A1) compactly as ∥E∥ F = ∥X − X × 1 P 1 × 2 · · · × N P N ∥ F .
Proof. Assuming that P n yields an exact projection onto the column space of the matrix Q n , we need to show first that the error can be expressed as a sum of the errors of the n projections where I denotes the identity matrix. Following Drineas and Mahoney (2007), let us add and subtract the term X × N P N in equation (A2) so that we obtain ∥E∥ F = ∥X − X × N P N + X × N P N − X × 1 P 1 × 1 · · · × N P N ∥ F (A4) ≤ ∥X − X × N P N ∥ F + ∥X × N P N − X × 1 P 1 × 1 · · · × N P N ∥ F (A5) = ∥X − X × N P N ∥ F + ∥(X − X × 1 P 1 × 1 · · · × N−1 P N−1 ) × N P N ∥ F (A6) The bound (A5) follows from the triangular inequality for a norm. Next, the common term P N is factored out in equation (A6). Then, the bound (A7) follows from the properties of orthogonal projectors. This is because the range(X × 1 P 1 × 1 · · · × N−1 P N−1 ) ⊂ range(X × 1 P 1 × 1 · · · × N P N ), and then it holds that ∥X − X × 1 P 1 × 1 · · · × N P N ∥ F ≤ ∥X − X × 1 P 1 × 1 · · · × N−1 P N−1 ∥ F . See Proposition 8.5 by Halko et al (2011) for a proof using matrices. Subsequently the residual error E N−1 can be bounded From this inequality, equation (A3) follows. We take the expectation of equation (A3) Recalling that Theorem 10.5 formulated by Halko et al (2011) states the following expected approximation error (formulated here using tensor notation) assuming that the sketch in equation (7) is constructed using a standard Gaussian matrix . Here σ j denotes the singular values of the matricized tensor X (n) greater then the chosen target rank k. Combining Equations (A9) and (A10) then yields the results of the theorem (11).
Figures A1 evaluates the theoretical upper bound over 100 runs for both a third and fourth order random low-rank R = 25 tensor. Here, we use a fixed oversampling parameter p = 2. The results show that the empirical error is faithfully bounded by the theoretical upper bound for varying target ranks.