Cross Tensor Approximation Methods for Compression and Dimensionality Reduction

Cross Tensor Approximation (CTA) is a generalization of Cross/skeleton matrix and CUR Matrix Approximation (CMA) and is a suitable tool for fast low-rank tensor approximation. It facilitates interpreting the underlying data tensors and decomposing/compressing tensors so that their structures, such as nonnegativity, smoothness, or sparsity, can be potentially preserved. This paper reviews and extends state-of-the-art deterministic and randomized algorithms for CTA with intuitive graphical illustrations. We discuss several possible generalizations of the CMA to tensors, including CTAs: based on fiber selection, slice-tube selection, and lateral-horizontal slice selection. The main focus is on the CTA algorithms using Tucker and tubal SVD (t-SVD) models while we provide references to other decompositions such as Tensor Train (TT), Hierarchical Tucker (HT), and Canonical Polyadic (CP) decompositions. We evaluate the performance of the CTA algorithms by extensive computer simulations to compress color and medical images and compare their performance.


I. INTRODUCTION
Tensor decompositions are efficient and widely used tools for multi-way data processing (analysis) and, in particular, they can be utilized to compress the data tensors without destroying their intrinsic multidimensional structure. In the past few decades, several types of tensor decompositions have been introduced, such as CANDECOMP/PARAFAC also called Canonical Polyadic Decomposition (CPD) [1], [2], Tucker decomposition [3]- [5] and its special case, Higher-Order SVD (HOSVD) [6], Block Term decomposition (BTD) [7]- [9], Hierarchical Tucker (HT) decomposition [10], [11], Tensor Train/Tensor Chain (TT-TC) decomposition [12]- [14], tubal SVD (t-SVD) [15]- [17], each of which generalizes the notion of matrix rank to tensors in an efficient way. They have been successfully applied in many applications such as signal processing [18], machine learning [19], The associate editor coordinating the review of this manuscript and approving it for publication was Manuel Rosa-Zurera. blind source separation [18], [20], chemometrics, [21], feature extraction/selection [22]. For a comprehensive study on tensors and their applications, we refer to [23]- [25]. It is of interest to decompose a data tensor in such a way that some parameters of the underlying tensor decomposition, e.g., the factor matrices or the core tensor of the Tucker decomposition, preserve the structure of the original data tensor, e.g., smoothness, nonnegativity, or sparsity. The main reasons for such interests are 1-fast low tensor rank approximation, 2-achieving higher compression ratio, and 3-data interpretation issues. Skeleton or Cross tensor/matrix approximation (CTA/CMA) or equivalently tensor/matrix CUR approximation is an efficient framework for attaining the mentioned goals. The main characteristic of the CTA algorithms, which makes them useful for handling very large-scale data tensors, is to occupy less memory and reduce computational complexity. Regarding the higher compression capability, for example, in the case of matrices (second-order tensors), the SVD of sparse matrices does not provide sparse factor matrices while VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ the cross-skeleton approximation achieves this goal, leading to more compact data representation. The CTA methods have similar properties for compressing structured data tensors. Besides, regarding the interpretation issue, for example, as mentioned in [26], a vector [1/2] age − (1/ √ 2) height + (1/2) income being as one of the significant uncorrelated factors for a data set of people's features is difficult to be interpreted. Kuruvilla et al. in [27] also claimed: ''it would be interesting to try to find basis vectors for all experiment vectors, using actual experiment vectors and not artificial bases that offer little insight.'' This motivates computing a low-rank approximation as close as SVD but with individual columns/rows of the original data matrix; see [26], [27], for a detailed discussion on this topic.
In most types of tensor decompositions, there are no guarantees that the factor matrices or core tensors involved in the decompositions preserve the structure of the underlying data tensors unless additional constraints are imposed. As a result, the decomposition procedure should be formulated as constrained optimization problems. For instance, the l 1 -norm as the tightest convex surrogate of l 0 norm imposes sparsity constraint on factor matrices and/or core tensor [28]. An alternative to the optimization approach is the Cross/Skeleton technique which uses a part of the data tensor to compute a tensor decomposition. The CMA methods compute a low-rank approximation based on a part of individual columns and rows. They have found applications in deep learning [29], signal processing [30], [31], scientific computing [32]- [34] and machine learning [35]- [37]. A closely related matrix approximation is called matrix column selection or interpolative matrix decomposition. Here, just a part of columns are sampled [38]- [44], and it can be considered as a special CMA. The CMA/CTA can be used similarly to approximate continuous functions; however, throughout the paper we only focus on the discrete variables (for details, please see [45]- [47]).
For the CMA methods, the problem is not formulated as a constrained optimization problem; but instead, a matrix is represented based on a part of its columns and rows. Generally, this can be done based on a part of its components which includes rows and columns as special cases. The sampled matrix is also called sketched matrix. Four main algorithms for the CMA are, (see Figure 1): • Maxvol algorithm [48], [49] • Cross2D algorithm [50], [51] • Discrete Empirical Interpolation Method (DEIM) [52], [53] • Random sampling algorithms [54]. The main difference between these algorithms is the procedure of column or row selection. The Maxvol, Cross2D, and DEIM algorithms are deterministic, while the random sampling algorithms are probabilistic (e.g., using uniform distribution). The DEIM algorithm needs estimations of the top left/right singular vectors to proceed, and this is also required in a kind of probabilistic sampling technique, i.e., leverage scores [54]. The Maxvol and Cross2D algorithms do not need the top singular vectors, and they heuristically look for optimal or the most representative columns and rows.
The main building block of all algorithms for tensor decomposition is computing low-rank approximation of unfolding matrices [23], [24]. Here, of course, the CMA algorithms (either deterministic or randomized algorithms) can be utilized instead of expensive alternative techniques, e.g., SVD. For a comprehensive study on randomized algorithms for the Tucker decomposition and TT-TR decomposition, we refer to [55]- [57]. Motivated by the limitation of algorithms for handling big data tensors and the need for fast low-rank approximation methods, the classical CMA was generalized to tensors in [58]- [63]. The work [60] is related to the TT decomposition, whereas the works [59], [61], [63] are for the Tucker decomposition and [62] is associated with the t-SVD. These generalizations have been made from different perspectives and to the best of our knowledge, there is no survey paper discussing their similarities and differences. In particular, it seems that it is rather difficult to distinguish one CTA from another. This paper attempts to achieve this goal where different CTA algorithms and their properties are unified and discussed in detail. Moreover, there is a lack of graphical illustrations in the literature for describing the CTA algorithms making the topic difficult to understand, especially for those who have an engineering background. Here, we provide some useful and intuitive illustrations with unifying notations. We mainly discuss three possible generalizations of the CMA to tensors including, see Figure 2: • Fiber selection • Slice and tube selection • Lateral and horizontal slices selection where the last one deals with tubal product (t-product) [15]- [17], [64].
In general, fiber or slice selection can be performed in either a deterministic or randomized way. If fibers or slices are sampled based on prior probability distributions such as uniform, length-squared, or leverage score probabilities [54], then the algorithms are referred to as randomized CTA; otherwise, they are called deterministic CTA. However, for the brevity of the presentation, in the rest of the paper, we only use the CTA term and highlight when the underlying selection is randomized. In the randomized CMA case, the accuracy of obtained approximations highly depends on the prior probability distributions [54]. Actually, it turns out that this is also true for slice-fiber selection. Please note that some properties of the underlying data matrix determine which sampling strategy should be used. For example, uniform sampling works well for matrices with low coherence [54]. The best existing random sampling algorithms [54] use leverage score probabilities, for which the relative-error approximation is achieved. However, they require the computation of top singular values, which is computationally expensive. It is known that randomized algorithms facilitate the tensor decomposition procedure not only by reducing the computational complexity of deterministic algorithms but also by reducing the communication among different levels of memories, which is the main bottleneck in modern computing environments and architectures for handling large-scale data tensors.
The structure of this paper is organized as follows. In Section II, we introduce basic definitions and concepts used throughout the paper. In Section III, we discuss the CTA algorithms and several generalizations of the CMA to tensors. Section IV, describes the tubal CTA, which is defined based on t-product. The number of parameters of the CTA models and the space/time complexity of the CTA algorithms are discussed in Section V. Section VI, compares the CTA algorithms in the sense of the number of passes to compute a low-rank tensor approximation. The main challenges and future research directions are discussed in Section VIII. Simulations are provided in Section IX to validate and evaluate the performance of the presented algorithms. Finally, a conclusion is drawn in Section X.

II. PRELIMINARIES
In this section, we present notations and concepts used throughout the paper. Tensors, matrices, and vectors are denoted by underlined bold upper case letters, e.g., X, bold upper case letters, e.g., X, and bold lower case letters, e.g., x, respectively. Fibers are first-order tensors produced by fixing all modes except one, while slices are second-order tensors generated by fixing all modes except two of them. For a third-order tensor X, the slices X(:, :, k), X(:, j, :), X(i, :, :) are called frontal, lateral and horizontal slices, respectively. Fibers X(i, :, j), X(:, j, k) and X(i, j, :) are called rows, columns, and tubes, respectively [23], see Figure 3 for graphical illustration of slice and fiber concepts for a 3rd order tensor. The rows, columns, and tubes are also called mode-1, mode-2, and mode-3 fibers. In general, for an N order tensor, N types of fibers can be defined, i.e., n-mode fibers for n = 1, 2, . . . , N . The notations X + and X T denote the Moore-Penrose pseudoinverse (MP 1 ) and the transpose of matrix X, respectively. The Frobenius and Chebyshev (infinity) norms of tensors/matrices are denoted by . F and . ∞ . The first norm is the square root of the sum of the square of tensor components, while the second is the maximum of the absolute value of tensor components. We use |X| to indicate a matrix whose components are the absolute value of the components of the data matrix X, i.e., y i,j = x i,j where Y = |X|. The notation [N ] denotes the set of natural numbers [1, 2, . . . , N ] and n(A) means the number of elements of the set A. For the simplicity of the presentation, all notations used in the paper are summarized in Table 1.
Definition 1: n-unfolding 2 matrices: Given an N th-order tensor X ∈ R I 1 ×I 2 ×···×I N , then the n-unfolding of the sketched data tensor X is denoted by X (n) ∈ R I n ×I 1 ···I n−1 I n+1 ···I N , whose components are This is an element-wise representation of the n-unfolding transformation.
The n-unfolding operator can be introduced in a more understandable way by stacking the mode-n fibers of a tensor. See Figure 4 for a graphical illustration for this concept for a 3rd order tensor. The reverse of this procedure is called 1 The MP concept is also defined for tensors based on the t-product. (see Section IV.) 2 It is also called matricization or flattening. n-folding which transforms a matrix into a tensor. There are MATLAB libraries to perform these operations, such as tensorlab [65] and tensor toolbox [66]. For a given data tensor X ∈ R I 1 ×I 2 ×···×I N , an intersection subtensor is produced by considering only a part of indices in the modes of the original data tensor X. For example, if I n ⊂ [I n ], n = 1, 2, . . . , N , then the subtensor X(I 1 , I 2 , . . . , I N ) ∈ R n(I 1 )×n(I 2 )×···×(I N ) can be generated.
Tensors and matrices can be multiplied in modes that have the same dimensions. This is called tensor mode-n product and is a generalization of matrix-matrix multiplication. To be precise, let X ∈ R I 1 ×I 2 ×···×I N and B ∈ R J ×I n , then the tensor-matrix multiplication along mode n is denoted by and defined as follows The tensor-matrix multiplication can be done in the unfolding form. Given a tensor X ∈ R I 1 ×I 2 ×···×I N and a matrix A ∈ R K ×I n , then we have Let X ∈ R I 1 ×I 2 ×···×I N be a given data tensor, then the Tucker decomposition model is defined as follows: [3]- [5] where S ∈ R R 1 ×R 2 ×···×R N is called the core tensor, and the matrices A n ∈ R I n ×R n , R n ≤ I n , n = 1, 2, . . . , N are called factor matrices. A shorthand notation for the Tucker decomposition is The N -tuple (R 1 , R 2 , . . . , R N ) is called multilinear or Tucker rank where R n is the rank of the mode-n unfolding matrix X (n) , n = 1, 2, . . . , N . For a matrix X as a secondorder tensor, the rank of X (1) and X (2) are the same because X 1 = X T 2 . For tensors of order higher than 2, the components of the multilinear rank can be different [6]. Higher-order SVD (HOSVD) [6] is a special Tucker decomposition in which the factor matrices are orthogonal, and the core tensor S satisfies two properties called all-orthogonality and pseudodiagonality [6]. It is not difficult to see that for the case of matrices, the SVD of a matrix X can be regarded as a special case of the HOSVD, where U R and V R are the matrices containing the top left and right singular vectors, respectively, and is a diagonal matrix containing the top singular values.

A. CROSS MATRIX APPROXIMATION (CMA) AND MATRIX COLUMN SELECTION
Theory and algorithms for low-rank matrix approximation based on a part of its individual columns and rows were first developed in [67]. This framework for low-rank approximation of matrices is known as skeleton/cross matrix approximation. It is called cross approximation because the matrix intersecting the columns and rows is used within the approximation procedure. As we will discuss later, the procedure of column or row selection highly affects the approximation error. In the cross/skeleton approximation, the columns and rows are selected in heuristic ways and using deterministic algorithms. If the columns or rows are selected based on random sampling techniques, this framework is often known as randomized CUR decomposition.
Given a data matrix X ∈ R I ×J , let us select R 1 columns and R 2 rows from X with corresponding indices I and J ,    respectively. Assume that the selected columns and rows are stored as C ∈ R I ×R 1 , R ∈ R J ×R 2 respectively, then the intersection submatrix is W = X(I, J ) ∈ R R 1 ×R 2 , see Figure 5. The low-rank CMA is computed as follows: where U ∈ R R 1 ×R 2 should be computed to yield the smallest error. Two main motivations for using such an approximation are • Fast low-rank approximation • Compression and interpretability issues in which the factors C and R should have the same structure as the original data matrix X. Concerning the second motivation, note that the middle matrix U does not necessarily preserve the structure of the data matrix, and only C and R have this property. As a result, for nonnegative matrices, one should distinguish this with the nonnegative matrix factorization in which everything should be nonnegative. However, sometimes it is desirable that only C and R have the same structure as the original data matrix and not the middle matrix U. For example, for sparse data matrices, the factor matrices C and R are sparse and a higher compression rate than SVD can be achieved. The interpretability issue is another application of VOLUME 9, 2021 these techniques where the data matrix is represented as a linear combination of individual columns, and it can be better interpreted. The best middle matrix U in the least-squares sense is [67] for which the approximation is exact if rank (X) ≤ min {R 1 , R 2 }. However, this is of less practical interest because we need to access the whole data matrix X. This causes expensive communication costs, especially when the data matrix is stored out-of-core and the cost of communication may exceed our main computations. An alternative and more efficient method is to use the intersection submatrix W and compute U = W + . The two mentioned techniques are not necessarily equivalent, but for a data matrix with exact rank (noiseless) in [68], the conditions under which they will be the same are discussed. For the noisy data matrices, Formula (5) provides a better approximation. When the intersection submatrix is square and nonsingular, the approximation is known as skeleton, but when it is singular, the algorithm is called pseudo-skeleton approximation. Here, the CMA X ∼ = CW + R is used, which is computationally less expensive than the approach using Formula (5). The first approximation is called a two-passes algorithm, while the second is called a single-pass algorithm. Computing the Moore-Penrose (MP) pseudoinverse of matrices incurs instability issues, especially when they are ill-conditioned. Because of this, a wellconditioned intersection submatrix should be selected. It is known that the quality of the intersection submatrix quite depends on the module of the determinant of the intersection submatrix, which is called matrix volume. 3 More precisely, a set of columns and rows should be selected whose intersection submatrix has as much volume as possible. Clearly, this is an NP-hard problem because we need to check the volume of all possible intersection matrices produced by different selections of columns on rows. However, heuristic algorithms do exist for computing suboptimal solutions [69]- [72]. Three known heuristic CMA algorithms are: • Maxvol-based algorithm [48], [49] • Cross2D algorithm [50], [51] • Discrete Empirical Interpolatory Method (DEIM) [52], [53].
As mentioned earlier, these selection techniques are deterministic, and one should distinguish them from the random sampling approaches [54], [73], [74] or the random projection approaches [75], which basically are randomized algorithms. The DEIM algorithm was first introduced in model order reduction [76], and deals only with column selection in which an estimation of the top right singular vectors are required, while in the Cross2D and the maxvol-based low-rank approximations, both columns, and rows are involved. Remark 1: If a given matrix X is positive semi-definite, then the CMA approach is called the Nyström method, i.e., C = R, and has several applications in machine learning and data science [77]- [80].
A special case of the CMA in which only columns are sampled is called matrix column selection, interpolative matrix decompositions, and in some contexts, it is also referred to as CX decomposition. This problem is known as column (feature) selection in the field of machine learning and data analysis. Let X ∈ R I ×J is a given data matrix and C ∈ R I ×R is a matrix containing the R selected columns from the matrix X. Then the matrix column selection is formulated as follows where C ∈ R I ×R is a matrix containing the selected columns and X ∈ R R×J should be computed in such a way that the approximate error should be as small as possible. The best solution to the problem (6) in the least-squares (LS) sense is X = C + X, and if rank(X) = R, then this approximation is exact, i.e., X = CC + X.

III. CROSS TENSOR APPROXIMATION (CTA)
Existing tensor decompositions such as HOSVD, CPD, TT/TR decompositions without constraints do not preserve the natural structure of the original data tensors, such as nonnegativity or sparsity. To solve this problem, we can impose additional constraints or use cross-types tensor decompositions, which are based on a part of its fibers or slices (or both of them). Here, the latter approach is taken. In the next subsequent sections, we explain how the CMA can be generalized to tensors.

A. CTA BASED ON FIBER SELECTION
A possible generalization of the CMA to tensors in the sense of selecting only columns and rows can be performed based on the Tucker-2 decomposition in which the compression is performed only in the first and second modes. Given a 3rd order data tensor X ∈ R I ×J ×K , here, the factor matrices C ∈ R I ×R 1 and R ∈ R J ×R 2 contain the selected columns and rows and the middle core tensor U ∈ R R 1 ×R 2 ×K should be computed to yield the smallest error (see Figure 6). The Tucker-2 rank is accordingly defined as (R 1 , R 2 ). The optimal middle core tensor can be computed as follows Natural and straightforward generalizations of the CMA to tensors are proposed in [59], [81], and [61]. Quite similar to the CMA in which parts of columns and rows of a given matrix are sampled, here a set of fibers (along different modes) are selected, and the goal is computing a Tucker approximation based on these sampled fibers. For instance, for 3rd-order tensors, we should sample columns, rows, and tubes. The approach proposed in [81] is randomized, while those proposed in [59], [61] are deterministic. These works can be considered as the starting points of generalization of the CMA to tensors, and in the sequel, we explain the idea behind each of these Tucker approximations.
For noiseless data tensor, the existence of an exact Tucker model whose factor matrices are taken from the fibers of the original data tensor is straightforward. To be more precise, let X be an N th-order tensor of size I 1 × I 2 × · · · × I N and of Tucker rank (R 1 , R 2 , . . . , R N ). Now, if we generate any full-rank factor matrices A n ∈ R I n ×R n , n = 1, 2, . . . , N , by sampling fibers in each mode and computing the core tensor as then the obtained Tucker decomposition has the exact Tucker rank (R 1 , R 2 , . . . , R N ). So an exact Tucker decomposition of the tensor X whose factor matrices were taken from the original data tensor is computed. For noisy tensors, similar to the CMA, the accuracy of approximation depends on the number of selected fibers and the list of sampled fibers. The idea of sampling fibers and considering them as the factor matrices, first was proposed in [81]. In the first step, the factor matrices are generated, after which the core tensor is computed through (7) as described above. This is summarized in Algorithm 1. It is possible to sample fibers only in some modes and not in all of them. Motivated by the matrix case where it is possible to only select columns and not rows, in [82] it is suggested to sample fibers only in some of the modes and not all of them. This is considered as a generalization of the CX matrix approximation to tensors.

B. LOW-RANK APPROXIMATIONS BASED ON MULTILINEAR PROJECTIONS
If one needs the Tucker approximation of a data tensor for several multilinear ranks, then formulation (7) needs passing the original data tensor X multiple times. This burdens high communication costs. To resolve this problem, consider the following relation where n ∈ R S n ×I n , n = 1, 2, . . . , N are random matrices preserving the structure of the data tensor X and the tensor W ∈ R S 1 ×S 2 ×···×S N is the compressed or sketched tensor. For example, for non-negative data tensors, we can use uniform random matrices. The reduction dimension, S n should be selected carefully to capture the range of the unfolding matrices. From the theoretical point of view, it is required to have S n > 2R n , for a detailed discussion on this, see [83], [84].

Algorithm 1:
The Sampling Tucker (STucker) Algorithm [81] Input : A data tensor X ∈ R I 1 ×I 2 ×···×I N , positive integer numbers R n , n = 1, 2, . . . , N Output: Tucker approximation X ∼ = S; A 1 , A 2 , . . . , A N 1 for n = 1, 2, . . . , N do 2 Sample R n Mode-n Fibers and Generate Approximate Factor Matrix A n ∈ R I n ×R n 3 end 4 Compute the Core Tensor S ∈ R R 1 ×R 2 ×···×R N as in (7) Algorithm 2: The Sampling Single-Pass Tucker (SPSTucker) Algorithm [84] Input : A data tensor X ∈ R I 1 ×I 2 ×···×I N , positive integer numbers R n , n = 1, 2, . . . , N Output: Tucker approximation Sample R n Mode-n Fibers in n-Th Mode and Generate Approximate Factor Matrix A n ∈ R I n ×R n 3 end 4 Compute the Core Tensor S ∈ R R 1 ×R 2 ×···×R N as in (10) The size of the tensor W is smaller than the original data tensor X and is easier to be handled. By substituting (2) in (8) we have From (9), the core tensor of the Tucker decomposition can be computed by [83], [84] which is independent of the original data tensor X, i.e., it is a single-pass algorithm. This procedure is summarized in Algorithm 2. The most expensive part of Algorithm 2 is computing the sketching tensor W in (9), which can be accelerated by exploiting the structured random matrices such as sparse random matrices [85], [86], subsampled random Fourier transform (or SRFT) [87], [88] subsampled Hadamard transforms, and sequence of Givens rotations [89]. Algorithm 2 combines sampling and multilinear random projection techniques for tensor decomposition since in the first step, fibers are selected based on some probability distribution (sampling step) while the compression step (8) is performed via multilinear projection technique (random projection step). It is also possible to perform the first step by using randomized QR decomposition [90], [91].
Another approach is based on multilinear projections, either random or deterministic, was proposed in [92], which generalizes the idea of Compressed Sensing (CS) [93], [94] for data tensors. This approach, is called Multi-Linear Projection (MLProj) method, and allows one to recover some VOLUME 9, 2021 potentially big data tensor from a few multilinear projection measurements when the original data tensor has a low Tuckerrank (R 1 , R 2 , . . . , R N ) structure. More specifically, in this method, it is assumed that the following multilinear projection measurement are available: for n = 1, 2, . . . , N , where n ∈ R R n ×I n . Then, an approximation of the original tensor X can be obtained by the following formula: where Z n = (Z (n) ) (n) and W was defined in (8). This low Tucker-rank reconstruction is proved to be numerically stable and robust [92]. Matrices n can be random (e.g., Gaussian distributed) or deterministic. For example, by choosing n as a subset of columns in the identity matrix, the obtained projections in eq. (11) are actual fibers in mode-n of the original data tensor. Another option is to construct matrices n composed by subsets of columns in the Fourier transform matrix. In this case, the obtained measurements are fibers in the multidimensional Fourier domain. Section IX, illustrates how to apply the MLProj method to reconstruct cardiac fMRI images by sub-sampling the Fourier domain, which allows a significant speedup in signal acquisition with negligible loss of information.
The first question regarding the sampling approaches is: what will be the approximation error by fiber sampling algorithms? On one side, the method based on multilinear projections [92] is equipped with error bounds for the 2D and 3D cases. On the other side, using the existing theory of randomized sampling techniques on unfolding matrices, an additive-error bound on the accuracy of the solution is proven in [81]. It is known that the relative error accuracy is also achievable if the leverage score probabilities are used [54]. This idea is used in [91]. The computation of leverage scores is expensive because of the requirement of computation of the SVD, but fast algorithms for their computations are proposed in [95]. The procedure of column selection can be performed using the leverage scores sampling or the DEIM algorithm. These techniques were studied in [91].

C. THE CROSS3D ALGORITHM
In [59], the Tucker decomposition is tackled by applying the Cross2D algorithms to unfolding matrices, and also efficient variants with linear complexity are developed. They use the theory of cross-approximation of matrices to tensors, and in particular, it is proven that if a data tensor X ∈ R I 1 ×I 2 ×I 3 admits the following Tucker model , and A n ∈ R I n ×R n , n = 1, 2, 3. Then it is possible to find a new Tucker approximation whose factor matrices A n , n = 1, 2, 3 are taken from some fibers of the original data tensor X and satisfying The above result can be easily generalized to higher-order tensors. Later on, tighter error bounds compared to (13) were presented in [96]. Depending on the factor matrices A i , i = 1, 2, 3 and the core tensor S used for the approximation, the error term E F (expressed as the Frobenious norm), is different. For example, in the case of matrices for which the HOSVD is reduced to the SVD, it is known that the truncated SVD, X ∼ = U R R V T R , provides the best rank R matrix approximation, where U R and V R are the top left and right singular vectors and R is a diagonal matrix containing the R largest singular values σ r , r = 1, 2, . . . , R. To be more precise, let X ∈ R I ×J be a given data matrix with rank R < min(I , J ), . . , min(I , J ) are the last (smallest) min(I , J )−R nonzero singular values of the matrix X. It is clear that the error term achieved by the truncated SVD provides a lower bound for the CMA (4) obtained by sampling columns and rows. However, in the case of tensors, the truncated HOSVD does not provide the best multilinear rank approximation in the least-squares sense while a quasi-best approximation can be achieved [6] as follows where X best is the best multilinear rank approximation of the tensor X. The error term E F can be represented in terms of the singular values of the unfolding matrices; see [6], [97] for details. The proposed algorithm in [59] for computing the Tucker approximation with error accuracy (13) is constructive and the idea is applying the Cross2D algorithm to unfolding matrices sequentially (to be discussed later) and this algorithm is called Cross3D. The main superiority of the Cross3D algorithm over fiber sampling technique [81] is that it can find a set of good fibers in a heuristic and efficient way. This is crucial, especially when the data tensor does not have exact Tucker rank, i.e., the data tensor is corrupted by noise and the selection of columns affects the approximation error. The cross approximation is able to handle this issue by heuristically looking for good columns and rows. In the sequel, we describe the Cross3D algorithm. For the first unfolding matrix X (1) , the following cross approximation can be computed by applying the Cross2D algorithm to the first unfolding matrix as where C ∈ R I 1 ×R , U ∈ R R×R , R ∈ R R×I 2 I 3 and the error term E 1 ∈ R I 1 ×I 2 I 3 . It is not difficult to see that each row of the matrix R is a vectorization of a horizontal slice of the Algorithm 3: The Cross3D Algorithm [59] Input : A data tensor X ∈ R I 1 ×I 2 ×···×I N , a Tucker rank original tensor X. Applying the n-folding operator on both sides of (14) and using the identity (1), we have In the next step, a cross approximation of the unfolding matrix R (2) ∈ R I 2 ×RI 3 is computed as where C ∈ R I 2 ×R , U ∈ R R×R , R ∈ R R×RI 3 and the error term E 2 ∈ R I 2 ×RI 3 or equivalently Ignoring the error terms in (14)- (16), and combing all terms, we have Next, the core tensor can be computed as This procedure is summarized in Algorithm 3. In Algorithm 3, at each iteration, the size of unfolding matrices to which the Cross2D algorithm is applied is reduced. This is the same as the Sequentially Truncated HOSVD (ST-HOSVD) algorithm [98] except that instead of the SVD, the cross approximation is used. Please note that unlike (7) where we need to access the whole data tensor, the core tensor in the Cross3D is computed automatically in the final step via (17). The computational complexity of the Cross3D algorithm is O(I 2 R 2 ) for I 1 = I 2 = I 3 = I because the size of rows in the unfolding matrices are very large. However, each row represents a horizontal slice of the original tensor and it can be approximated by the cross approximation again. An efficient algorithm based on this with computational complexity O(IR 4 ) (for Tucker rank (R, R, R)) is developed in [59], where each long row is treated as a matrix and cross approximation of it is computed. A multilevel version of Algorithm 3 is developed in [99]. The CMA approaches can also be combined with other types of tensor decompositions. For example, the so-called TT-cross approximation developed in [60] is for computation of the TT decomposition, while the algorithms developed in [100], [101] are for the HT decomposition, which includes TT decomposition as a special case. The TR-ALS algorithm [13] and the CP-ALS algorithm [25] are known algorithms for computation of the TR [13] and the CP decompositions [2], respectively, where each iteration solves an over-determined least-squares problem. The idea of solving the underlying over-determined least-squares problem by sampling a part of rows of the coefficient matrix and solving the smaller least-squares problems, was proposed in [102] and [103], respectively.

D. FAST SAMPLING TUCKER DECOMPOSITION (FSTD) ALGORITHM
Now we introduce the third kind of CTA methods. For the simplicity of presentation, we first study the 3rd-order tensors and then outline generalization to higher-order tensors. Unlike the matrices where each column and row always intersect, columns, rows, and tubes for a 3-th order tensor may not intersect each other. As a result, an analogous intersection subtensor like what we have in the matrix case may not arise. The idea in [61] is to first produce an intersection subtensor W by sampling some indices in different modes and then selecting some fibers. Let X ∈ R I 1 ×I 2 ×I 3 be a given 3rd-order tensor and I 1 ⊆ I 1 , I 2 ⊆ I 2 , I 3 ⊆ I 3 , be subsets of indices The question is, how to compute an approximate Tucker decomposition for the tensor X based on the intersection subtensor W? Motivated by the fact that which is used as the middle matrix in the CMA, it is suggested [61] to compute the approximate core tensor in the Tucker decomposition as VOLUME 9, 2021 FIGURE 7. Illustration of the FSTD algorithm for a 3rd-order low-rank tensor. For simplicity of presentation, we assume that all fibers build up to block sub-tensors, [23].

Algorithm 4: Fast Sampling Tucker Decomposition (FSTD) Algorithm for 3rd-Order Tensors [61]
Input : A data tensor X ∈ R I 1 ×I 2 ×I 3 , indices I n ⊆ [I n ], n = 1, 2, 3 Output: Tucker approximation of the tensor X 1 Generate the Intersection Subtensor W = U (I 1 , This is a direct generalization of (18) to 3rd-order tensors. In view of (19), the core tensor U of the Tucker approximation is of size P 2 P 3 × P 1 P 3 × P 1 P 2 , and as a result, we need to sample P 2 P 3 columns, P 1 P 3 rows, and P 1 P 2 tubes, see Figure 7. They should be selected in an appropriate way. It is shown in [61] that the corresponding factor matrices A 1 ∈ R I 1 ×P 2 P 3 , A 2 ∈ R I 2 ×P 1 P 3 , A 3 ∈ R I 3 ×P 1 P 2 are the subsampled matrices from the unfolding matrices X (1) (:, I 2 , I 3 ), X (2) (I 1 , :, I 3 ) and X (3) (I 1 , I 2 , :) respectively and CTA can be found as The procedure of this approach is summarized as follows • Consider indices I n ∈ [I n ], n = 1, 2, 3 and produce the intersection subtensor W and corresponding sampled columns, rows, and tubes A 1 , A 2 and A 3 .
• Compute the Tucker approximation (20). We refer to this algorithm as Fast Sampling Tucker Decomposition (FSTD) and summarize it in Algorithm 4 [61]. It is interesting to note that the FSTD is obtained as a particular case of the MLProj method, eq. (12), when projection matrices n are built upon subsets of columns of the identity matrix. A similar approach is developed in [104] in the sense of the number of selected fibers in each mode.
This approach can be straightforwardly generalized to higher-order tensors. The main difference between formulas (7) and (20) is that components of factor matrices A (n) in (7) are taken from the original data tensor X, while this is the case for the core tensor W in (20). A drawback of this approach is that the number of sampling fibers i.e., A n , n = 1, 2, 3 required in each mode is relatively high, and the matrix-matrix multiplications in (20) may be expensive. As a matter of fact, the CTA for N -th order tensors needs P N −1 fibers in each mode which still increases exponentially with the tensor order. To resolve this issue, an adaptive algorithm was suggested in [61], which is a generalization of the Cross2D algorithm [50], [51] to tensors in the sense of just selecting maximum absolute values of fibers while in the Cross3D algorithm both fibers and slices are involved. This procedure for a 3rd order tensor is described in Algorithm 5.
Note that Algorithm 5 is not a randomized algorithm because, except Line 1, all computations are performed deterministically. In contrary to other algorithms, the indices are not given to the algorithm in advance, and they are selected adaptively. To be more precise, let X ∈ R I ×J ×K be a given data tensor and be the indices which have been already selected. Based on the already selected indices (I p , J q , K r ), let us choose a new index to be added, for example, to the set I p . To this end, first the (p, q, r)-approximation is computed using the FSTD as follows where the core tensor U (p,q,r) is computed from the intersection subtensor W (p,q,r) = X(I p , J q , K r ) using (19) and C and the index of the residual fiber E (p,q,r) (:, j q , k r ), with maximum absolute value, e.g. i p+1 , is selected and added to the index set, i.e., Similarly new indices can be added to the index sets J q , K r for which the residual fibers E (p,q,r) (i p , :, k r ) and E (p,q,r) (i p , j q , :) should be considered, respectively. Note that the whole residual E (p,q,r) is not necessary to be computed and only a fiber to which a maximum component is required is calculated; see [61] for details. Algorithm 5 is fast for tensors of very low Tucker ranks and its complexity increased exponentially if the Tucker rank is large, see Table 2. It is shown in [61] that the linear complexity is achievable and the Tucker approximation of an N th-order tensor of size X ∈ R I 1 ×I 2 ×···×I N with an exact Tucker rank (R, R, . . . , R) can be computed just by taking R mode-n fibers in each mode for n = 1, 2, . . . , N . This result is based on a modification of the (20). In this modified version, the indices I n ⊂ [I n ], n = 1, 2, . . . , N under which the intersection core tensor W is constructed are selected in a special way and not randomly. More precisely, in the first step, from each of the n-unfolding matrices (n = 1, 2, . . . , N ), R fibers Algorithm 5: Adaptive Fiber Selection (AFS) Algorithm [61] Input : Initial column fiber selection (j 1 , k 1 ) and the number of fibers to be selected P Output: Indices of selected P rows/columns/tubes I P , J P and K P in mode n (mode-n fibers) are selected and they are stored as matrices C n ∈ R I n ×R . As R ≤ I n , and in the next step, a subset of indices I n , n = 1, 2, . . . , N , (I n ⊂ [I n ]), is selected for which the intersection submatrix W n = C n (I n , :) ∈ R R×R , is nonsigular. Then it is shown that the following exact approximation can be obtained and W = X(I 1 , I 2 , . . . , I N ) ∈ R R×R×R is the intersection subtensor. Note that formulation (22) differs from the formulation (19) because in the former, the unfolding matrices W (n) , n = 1, 2, 3 are used, while in the latter, the submatrices W n = C n (I n , :), n = 1, 2, . . . , N . The fibers can be sampled using the pivoted QR decomposition applied to the unfolding matrices X (n) . More precisely, inspired by the matrix case [90], Higher-Order Interpolatory Decomposition (HOID) is proposed in [91], and this will be discussed in detail in the next Section.

E. RANDOMIZED HIGHER-ORDER INTERPOLATORY DECOMPOSITION (HOID)
The fibers can be sampled using the pivoted QR decomposition applied to the unfolding matrices X (n) . The Higher-Order Interpolatory Decomposition (HOID) is developed in [91] and is a generalization of the DEIM algorithm [90]. Given an N th-order tensor X ∈ R I 1 ×I 2 ×···×I N , the pivoted QR factorization is applied to the unfolding matrix X (n) ∈ R I n ×J , where ∈ R J ×J is a permutation matrix, and Q ∈ R I n ×I n is an orthogonal matrix. The pivoted QR decomposition can be computed using the strong rank-revealing QR (RRQR) algorithm [105]. The permutation matrix and the corresponding orthogonal matrix Q are partitioned as follows Here Equation (23) can be rewritten as where From (25), by straightforward computations, we have if R 22 2 is small enough. It can be seen that and substituting Q 1 = Q (n) R −1 11 in (26), we have which is a low-rank approximation for the matrix X (n) . The matrix Q (n) is of full-rank because it is a multiplication of nonsingular and orthogonal matrices and can be used as an approximation for the basis of the range of X (n) . Let denote the indices of the columns of the matrix X (n) which are selected based on the permutation matrix 4 1 as p. Then, we have Q (n) = X (n) (:, p) and as a result, it may not be necessarily orthonormal. The HOID algorithm applies the earlier procedure to all unfolding matrices X (n) and computes the basis matrices Q (n) . Afterward, the core tensor S can be computed as follows Algorithm 6 summarizes a deterministic algorithm that computes the QR decomposition of unfolding matrices. A Algorithm 6: HOID Algorithm [91] Input : A data tensor X ∈ R I 1 ×I 2 ×···×I N , and a multilinear rank (R 1 , R 2 , . . . , R N ) Output: Tucker approximation X ∼ = S; Q (1) , Q (2) , . . . , Q (N ) 1 for n = 1, 2, . . . , N do 2 Compute an Interpolatory Decomposition of randomized variant of this algorithm replaces randomized QR decomposition instead of the QR decomposition. To this end, we should first make a reduction in the first mode of the unfolding matrices using random projection as Y = X (n) where ∈ R R+P×I n (P is the oversampling parameter) is a random matrix after which the procedure described above is applied to the matrix Y for column the selection. It is shown in [106] that the selected column indices of the matrix Y can be used for the original data matrix X. More precisely, if Y ∼ = Y(:, p)F T then the indices p of the selected columns can be used for the matrix X (n) , i.e., X (n) ∼ = X (n) (:, p)F T . In Algorithm 6, firstly, the interpolatory decompositions of all unfolding matrices X (n) are computed, and then the core tensor is constructed. It is possible to accelerate this algorithm using the idea of a Sequentially truncated HOSVD algorithm [98] where at each iteration, the size of the unfolding matrices is reduced. The accelerated version of Algorithm 6 based on the mentioned idea is developed in [91].

F. CTA BASED ON SLICE-FIBER SELECTION
Motivated by some applications in hyperspectral medical image analysis and consumer recommender system analysis where one of the modes is qualitatively different from others, an alternative CTA is proposed in [58]. Some examples of such datasets which have a ''qualitatively different'' or ''distinguished'' mode, are time-evolving internet graph or a set of hyperspectrally-resolved biopsy images or user product-product preference data for consumers [58]. The distinguished modes in the mentioned data sets are temporal evolution of the graph, the frequency or spectral variation in the images, and the users, respectively.
Here, the procedure is based on slice-tube selection, see Figure 8 for an illustrative explanation. We first briefly describe this idea for 3rd-order tensors. Let X ∈ R I 1 ×I 2 ×I 3 be a given data tensor, and without loss of generality, we assume that the last mode is qualitatively different from the others.
Given prior probability distributions for sampling frontal slices as {p i } and tubes as {q j } I 1 I 2 j=1 , in the first step, some frontal slices, say L 1 , are sampled, and they are stored in C ∈ R I 1 ×I 2 ×L 1 . In the second step, we sample some tubes, say L 2 = R 1 R 2 and store them in R ∈ R R 1 ×R 2 ×I 3 , or a matrix R ∈ R L 2 ×I 3 , (see Figure 9 (a)). The CTA is then defined as (see Figure 9 (b)) where the tensor C ∈ R I 1 ×I 2 ×L 1 and the matrix R ∈ R L 2 ×I 3 contain the sampled frontal slices and tubes respectively. The matrix U ∈ R L 1 ×L 2 is defined as where D 1 ∈ R L 1 ×L 1 and D 2 ∈ R L 2 ×L 2 are scaling diagonal matrices corresponding to the slice and fiber sampling respectively and defined as follows are probability distributions under which the frontal slices and fibers are sampled. This procedure is summarized in Algorithm 7. The length-squared probability distributions defined as follows where J 1 and J 2 are subsets of the indices I 1 and I 2 , are used in [58] for selecting the slices/tubes. Remark 2: As discussed in [63], the model (27) can be considered as a special case of (20) when W = C and A i = W (i) , i = 1, 2. Here, W is the intersection subtensor in model (20), and C is a tensor containing the selected frontal slices.  A generalized version of Algorithm 7 to higher-order tensors is proposed in [58]. Here, the notion of the slab is used which means that all modes of a given tensor are free except one and this definition for 3rd-order tensors reduces to the slices. Let X ∈ R I 1 ×I 2 ×···×I N be a given data tensor whose n-th mode is the corresponding qualitatively different mode. We first sample L 1 slabs in the mode n from the original data tensor X denoted by the tensor C ∈ R I 1 ×···×I n−1 ×L 1 ×I n+1 ×···×I N . In the next step, we sample L 2 fibers in mode n from the original data tensor X and also L 2 fibers from the sampled tensor C denoted by R ∈ R L 2 ×I n and ∈ R L 2 ×L 1 respectively. Similar to the 3rd-order tensors, the diagonal scaling matrices D 1 ∈ R L 1 ×L 1 and D 2 ∈ R L 2 ×L 2 are defined as follows are probability distributions under which the slabs and fibers are sampled. Then the CTA is defined as follows and ∈ R L 1 ×L 1 is the best rank-L 1 approximation of the MP of the matrix H (n) H T (n) where H = C × n D 1 .
For detailed theoretical theorems and results, we refer to [58].

IV. CTA BASED ON TUBAL PRODUCT (T-producT)
The CMA and matrix column selection can be generalized to tensors based on the concept of t-product. We first introduce preliminary concepts and operations related to the t-product and then explain how to define the cross approximation using the t-product.
, a probability distribution q j I 1 I 2 j=1 and positive integers L 1 and L 2 Output: A tensor C of size I 1 × I 2 × L 1 , a matrix U of size L 1 × L 2 and a matrix R of size L 2 × I 3 1 Select L 1 Frontal Slices of Tensor X i.i.d. Trials According to {p i } I 3 i=1 and Produce Tensor

A. TUBAL SVD DECOMPOSITION
Tubal SVD (t-SVD) is a special kind of tensor decomposition representing a 3rd-order tensor as a product of three 3rd-order tensors where the middle tensor has nonzero tubes located only in the main diagonal [15]- [17], [64], see Figure 10. The t-SVD has found many applications in deep learning [107], [108], tensor completion [109], [110], numerical analysis [111]- [114], image reconstruction [115]. The generalization of the t-SVD to higher-order tensors is given in [116]. Throughout this paper for t-SVD, we only focus on 3rd-order tensors.
The number of nonzero tubes is called tubal rank. The truncated t-SVD gives the best approximation in the least-squares sense for any unitary invariant tensor norm, unlike other tensor decompositions. In order to introduce the t-SVD, we first need to present some basic definitions, such as the t-product operation and f-diagonal tensors. Definition 2 (t-Product): Let X ∈ R I 1 ×I 2 ×I 3 and Y ∈ R I 2 ×I 4 ×I 3 , the t-product X * Y ∈ R I 1 ×I 4 ×I 3 is defined as follows where circ X = In view of (29), it is seen that the t-product operation is the circular convolution operator, and because of this, it can be easily computed through Fast Fourier Transform (FFT) transform. More precisely, we first transform all tubes of two tensors X, Y, into the frequency domain, and construct two new tensors X and Y which are called spectral tensors. Then, the frontal slice of the spectral tensors X and Y are multiplied. Finally, we apply the Inverse FFT (IFFT) transform to all tubes of the last tensor. This procedure is summarized in Algorithm 8. Note that other types of tensor decompositions in the t-product format can be computed in a similar manner. For example, to compute the tubal QR computation of a tensor X ∈ R I 1 ×I 2 ×I 3 , i.e., X = Q * R, we first compute the FFT of the tensor X as and then the QR decomposition of all frontal slices of the tensor X is computed as follows In [117], the unitary transform matrices are used instead of discrete Fourier transform, and it is shown that it can provide Algorithm 8: t-Product in the Fourier Domain [15] Input : Two data tensors Z ∈ R I 1 ×I 2 ×I 3 , X ∈ R I 2 ×I 4 ×I 3 Output: t-product C = Z * X ∈ R I 1 ×I 4 ×I Input : A data tensor X ∈ R I 1 ×I 2 ×I 3 and target tubal rank R Output: decomposition with a lower tubal rank. Note that (30) is equivalent to computing the FFT of all tubes of the tensor X. Finally, the IFFT operator is applied to the tensors Q and R, to compute the tensors Q and R.
Definition 3 (Transpose): Let X ∈ R I 1 ×I 2 ×I 3 be a given data tensor. Then the conjugate transpose of tensor X is denoted by X T ∈ R I 2 ×I 1 ×I 3 , which is constructed by applying transpose to all its frontal slices and reversing the order of second till last transposed frontal slices.
Definition 4 (Identity Tensor): Identity tensor I ∈ R I 1 ×I 1 ×I 3 is a tensor whose first frontal slice is an identity matrix of size I 1 × I 1 and all other frontal slices are zero.

Definition 5 (Orthogonal Tensor):
A tensor X ∈ R I 1 ×I 1 ×I 3 is orthogonal (under t-product operator) if X T * X = X * X T = I. Definition 6 (f-Diagonal Tensor): If all frontal slices of a tensor are diagonal then the tensor is called f-diagonal. Assume X ∈ R I 1 ×I 2 ×I 3 , then it can be decomposed as where U ∈ R I 1 ×I 1 ×I 3 , V ∈ R I 2 ×I 2 ×I 3 are orthogonal tensors and tensor S ∈ R I 1 ×I 2 ×I 3 is f-diagonal.
The procedure of computation of the t-SVD for tensors in the Fourier domain is outlined in Algorithm 9. In the next section, we briefly introduce the matrix column selection and CMA approaches. They are used in our subsequent cross tensor analysis.
The MP of tensors can be defined based on the t-product as follows, which will be used in tubal CTA. Definition 7: Assume X ∈ R I 1 ×I 2 ×I 3 , then the MP of the tensor X is denoted by X + ∈ R I 2 ×I 1 ×I 3 and defined as a unique tensor satisfying the next four relations X * X + * X = X, X + * X * X + = X + , X * X + T = X * X + , X + * X T = X + * X. Similar to other operations, the MP pseudoinverse of tensors can be computed through FFT and this is described in Algorithm 10.

B. TENSOR LATERAL SLICE SELECTION AND CTA BASED ON TUBAL PRODUCT (T-producT)
In the framework of tubal SVD, a 3rd order data tensor in R I 1 ×I 2 ×I 3 is viewed as a ''matrix of tubes'' also known as components of the ring R I 3 with the addition and multiplication defined as the vector addition and circular convolution. From this perspective, the lateral and horizontal slices are considered two-dimensional columns and rows of a 3rd-order tensor [62]. The ''matrix of tubes'' viewpoint leads to the tubal SVD and corresponding tubal rank concept. Here, the tensor variants of the CMA and matrix column selection are called the tubal CTA and lateral slice selection, respectively. To be precise, let X be a given 3rd-order tensor. The tubal CTA is formulated based on t-product as follows where C ∈ R I 1 ×L 1 ×I 3 and R ∈ R L 2 ×I 2 ×I 3 are some sampled lateral and horizontal slices of the original tensor X respectively and the middle tensor U ∈ R L 1 ×L 2 ×I 3 is computed in such a way that the approximation (31) should be as small as possible, see Figures 11 and 12, for graphical illustration concerning lateral and horizontal slice selections and also tubal CTA, respectively. Note that similar to other CTA algorithms discussed so far, the procedure of both lateral and horizontal slices sampling can be performed based on prior probability distributions. 5 The probability distributions used in [62] are uniform and nonuniform (length-squared and leverage scores) distributions. Let us first consider the tubal CTA. Similar to the CMA, the best solution for the middle tensor U in the least-squares sense is 5 Here, various probability distributions (with/without replacement) can be used but we will not go through the theoretical details.  which is a straightforward generalization of the CMA [67] to tensors. Formula (32) can be computed in the Fourier domain and these computations are summarized in Algorithm 11. However, it is clear that (32) needs to pass the data tensor X once again, and this is of less practical interest for very large-scale data tensors, especially when the data tensors do not fit into the memory and communication between memory and disk is expensive [75]. To solve this problem, the MP pseudoinverse of the intersection subtensor W ∈ R L 2 ×L 1 ×I 3 , which is obtained based on intersecting the sampled horizontal and lateral slices, should be approximated as It is not difficult to see that the tensor W consists of some tubes of the original data tensor X. A similar algorithm for computation of the tensor W, is proposed in [62] which is a tensor generalization of that proposed for matrices in [26] but as mentioned before the algorithms are rather sophisticated. Algorithm 12 describes this procedure.
Remark 3: For data tensors with missing entries, an algorithm is proposed in [118] for the computation of the tubal CTA. It is a generalization of the algorithm developed in [119] to compute the CMA of data matrices with missing entries.
Remark 4: In Algorithm 9, in Line 3, the truncated SVD can be replaced by cross algorithms such as Cross2D, Maxvol and sampling techniques and the resulting algorithms can be considered as new generalizations of CTA. The idea of sampling applied on lateral slices was used in [118].

Algorithm 11: Two-Passes Tubal CTA (TP-TCTA) Algorithm
Input : A data tensor X ∈ R I 1 ×I 2 ×I 3 , positive numbers L 1 , L 2 , probability distributions p i , i = 1, 2, . . . , I 2 and p i , i = 1, 2, . . . , I 1 Output: Tubal CTA X ∼ = Z = C * U * R 1 Select L 1 lateral slices of the tensor X based on probability distributions p i , i = 1, 2, . . . , I 2 and set tensor C 2 Select L 2 horizontal slices of the tensor X based on probability distributions p i , i = 1, 2, . . . , I 2 and set tensor Also, the tensor lateral slice selection based on the t-product is formulated as follows where C ∈ R I 1 ×L×I 3 is a part of lateral slices of the tensor X and the tensor X ∈ R L×I 2 ×I 3 is computed in such a way that the reconstruction error (33) is as small as possible [62], (see Figure 13). The best option for the tensor X is which provides the best approximation in a least-squares sense, that is through a projection approximation as X ∼ = C * C + * X. This procedure is summarized in Algorithm 13. Please note that this algorithm needs to pass the original whole data tensor X but a single-pass variant can be found in [120], which is a direct generalization of the matrix case developed in [83]. The span of the lateral slices can be captured via random projection [121], i.e., multiplication with random tensors, which is equal to the linear combination of the lateral slices. The span of the lateral slices can also be captured via the count-sketch idea [122]. To the best of our knowledge, this idea has not been investigated yet. Besides, all results and algorithms discussed so far for the tubal CTA are for 3rd order tensors and generalization of these results to N th order tensors is possible using the theory presented in [116].

V. NUMBER OF PARAMETERS AND COMPUTATIONAL COMPLEXITY
In this section, we compare the number of parameters of the CTA algorithms and the associated computational complexity required for computing them. We evaluate the performance of the algorithms in Section IX, experimentally. For the simplicity of presentation, we consider a 3rd order tensor X ∈ R I ×I ×I and depending on the CTA model of interest, the following ranks and parameters are considered In the Tucker based model, we assume that R rows, R columns and R tubes are selected. In the tubal CTA model, R lateral and R horizontal slices are selected, and for the case of lateral slice selection (model (33)) which only deals with the lateral slice selection, R lateral slices are sampled. In model (27), where both frontal slices and tubes should be selected, the number of sampled slices and tubes are the same and equal R. For the SP-STucker algorithm, the sketching size S = (S 1 , S 2 , . . . , S N ) is used where S n > 2R n and R n is the nth component of the multilinear rank. 6 TABLE 2. The number of parameters and computational complexity of different CTA methods for low multilinear rank approximation (R, R, R) approximation, tubal rank R of a 3rd order tensor of size I × I × I. For the CTA-FS Algorithm, R frontal slices and R tubes are selected.

Algorithm 13: Lateral Slice Selection (LSS) Algorithm
Input : A data tensor X ∈ R I 1 ×I 2 ×I 3 , a positive numbers L, probability distribution p i , i = 1, 2, . . . , I 2 Output: Low tubal-rank approximation X ∼ = Z = C * C + * X 1 Select L lateral slices of the tensor X based on probability distributions p i , i = 1, 2, . . . , I 2 and set tensor The number of parameters and also computational complexity of the CTA algorithms are outlined in Table 2. Note that in order to achieve the same level of accuracy for different CTA algorithms, the number of fibers, slices, and tubes of the algorithms may not be the same, so the comparison made in Table 2 is true only for the above-mentioned special case. The complexity of SP-STucker algorithm can be higher than the STucker algorithm but as discussed in the paper, the former is pass-efficient while the latter needs to pass the original data tensor for every multilinear rank. Here the cost of communication may exceed our main computations.

VI. PASS-EFFICIENT ALGORITHMS
In the randomized framework for low-rank tensor/matrix approximation, the whole dataset should be passed as few times as possible and according to this, the randomized algorithms are categorized to the single-pass and the multi-pass randomized algorithms. The pass-efficient algorithms refer to algorithms that need to access the data only a few times. The elements of the data tensors are used in the sketching procedure where a summary of the original large dataset is produced to be used in the subsequent computations. More precisely, the elements of the dataset are used to capture the range of unfolding matrices [55], computing the probability distribution under which the slices/fibers are selected [54], [58]. The question is that is it possible to compute a low-rank tensor approximation without even passing the data tensor one time? Clearly based on the algorithms we discussed so far, the answer is yes, and actually, in Table 3, we have categorized the algorithms in the sense of the number of passes they need to compute a low-rank approximation. Note that some CTA algorithms require at least one time to access the whole dataset, but some only need a part of the data. The latter category is of more interest, especially when the whole data tensor is extremely large and cannot be handled in a single machine. It is worth mentioning that if the uniform sampling is used for slice/tube selection within the CTA-FS algorithm, it does not need to pass the whole data tensor, while if the length squared probability distribution in (28) is used, then it becomes a single-pass algorithm.
In Table 3, we have not explicitly mentioned the number of slices and tubes to be selected in the CTA-FS algorithm because there is not a specific definition of rank for this decomposition.

VII. APPLICATIONS OF CTA MODELS
CTA can be potentially utilized in several applications in which the low-rank tensor approximation is required. Such applications include but not limited to The idea of replacing the deterministic low-rank tensor decomposition with the CMA counterparts reduces the computational complexity of the algorithms and leads to faster algorithms. In this section, we discuss the possibility of using CTA methods for the three above-mentioned problems.

A. TENSOR COMPLETION
Tensor completion is the problem of recovering an incomplete data tensor from partially observed components and has found several applications in recommendation systems [123], traffic data prediction [124], [125], image processing [126], etc. The most useful and widely used technique for matrix completion is through the nuclear norm minimization [127], [128], which is the tightest convex surrogate of the matrix TABLE 3. The number of passes required in different CTA methods for low multilinear rank approximation (R, R, R) approximation, tubal rank R of a 3rd order tensor of size I × I × I. rank. As mentioned before, the notion of rank is not unique for tensors. In fact, in tensor completion problems, we use different types of tensor decomposition and the corresponding tensor ranks [129]. However, in almost all of these formulations, the key operation is the computation of the SVD of some matrices which are computationally prohibitive. In [130]- [133], the SVD computation is replaced by the randomized SVD, while in [119], the CMA decomposition is used. For the tensor case, the same idea can be exploited. For example, tubal CTA is used in [62] and [118] for tensor completion in two different ways. In [118], the CMA was used for low-rank approximation of the underlying matrices, while in [62] some lateral and horizontal slices are selected. The idea of incorporating the CMA within the framework of different tensor completion models such as Tucker, TT/TR, and CP decompositions is a potential topic that needs to be further investigated.

B. ROBUST PCA
As there exist different kinds of tensor decompositions, several different tensor RPCA (TRPCA) problems can be formulated naturally. The formulation of TRPCA based on the Tucker and the tubal tensor decomposition are introduced in [62] and [134], [135] respectively. The main computationally expensive operation involved in these formulations is either computing the SVD or tensor decomposition of some underlying matrices or tensors. Motivated by this bottleneck, for the matrix RPCA, in [136], [137], the randomized SVD for low-rank approximation is used while in [31] the CMA is utilized where the computation of the SVD is replaced by the CMA.
In [62], the CMA is used to find a low tubal rank approximation of the underlying data tensors.
The idea of exploiting the CMA approaches for low-rank approximation can be analogously used in robust TT/TR decompositions [138], and faster algorithms can be developed.

C. DENOISING
The truncated tensor decomposition algorithms can be naturally used for denoising the data tensors in a similar way as the SVD is used for denoising the data matrices. The deterministic algorithms for tensor decompositions can be replaced by the randomized tensor decomposition variants, e.g., CTA algorithms. For example, in [139], the randomized CPD decomposition is used, while in [62], the tubal CTA algorithm is exploited. The application of the CTA algorithms for tensor denoising is an interesting research topic that needs to be further investigated.

D. COMPRESSED SENSING (CS)
In recent years, Compressed Sensing (CS) technology has accelerated signal acquisition by restricting the number of measurements in, for example, the Fourier domain [93], [94], [140]. This is the case of Magnetic Resonance Imaging (MRI) systems for medical applications where the measurements are taken explicitly in the Fourier domain, also known as the k-space. Here, the goal of CS is to reconstruct an image from incomplete measurements taken in the Fourier domain. It is interesting to note that in the MLProj reconstruction formula (eq. (12)), when the projection matrices n are constructed as collections of selected columns in the Fourier transform matrix, then taking measurements Z (n) correspond to subsampling fibers in the Fourier domain. In Section IX, we illustrate how to apply the MLProj method to recover cardiac cine MRI data from incomplete measurements in the Fourier domain.

VIII. DISCUSSION AND FUTURE CHALLENGES
Recently many approaches and results, i,g., perturbation analysis, have been investigated in [68], [141], for CMA and it is of intense interest to generalize these results to tensors. For example, some of these results are generalized to tensors in [63], [142]. The current tubal CTA algorithms are randomized where the lateral and horizontal slices are selected randomly. The generalization of the max-volume and adaptive cross concepts to this type of decomposition is a challenging topic. The CMA is used in [29] for compressing fully connected layers of deep neural networks (DNNs). Closely related works can be found in [143], where instead of the CMA, the randomization technique is used. As the intrinsic structure of the layers in the DNNs are in the tensor forms, many tensor decomposition models have been used in which the tensor decomposition methods are utilized to compress either fully connected or convolutional layers, for example, see [107], [144]- [146]. Exploiting the randomized algorithms, particularly the CTA algorithms, for compressing layers in DNNs seems to be a perspective research direction.
In the next section, we discuss the challenges and methods for tensor rank selection which plays a key role in the tensor analysis.

A. TENSOR RANK SELECTION METHODS
The concept of rank in the context of tensor decomposition is different to the one used for matrices and can take different forms depending on the type of the used tensor decomposition. For example, when using the CPD, the tensor rank is defined as the minimum number of rank-1 tensor terms needed to exactly represent a given data tensor. Tensor rank selection is challenging for CPD; in fact, it is known that its determination is NP-hard [147]. Moreover, in contrast to the matrix case, given a CPD approximation of a data tensor, it is not guaranteed that a reduction of the approximation error can be achieved by increasing the number of rank-1 tensors in the CPD. However, some pragmatic procedures have been proposed in the past, such as the core consistency diagnostic (CORCONDIA) algorithm [21] and other alternatives [148]- [150]. Fortunately, the previous difficulties are not present when working with a Tucker approximation. In this case, the Tucker-rank or multilinear rank is the tuple (R 1 , R 2 , . . . , R N ) that determines the size of the core tensor. It is essential to highlight that the Tucker-rank can be, in theory, computed by finding the matrix-rank of each unfolding matrix X (n) obtained from the data tensor X. More importantly, in this case, it is guaranteed that, given a data tensor for which we don't know in advance its multilinearrank, we can select some initial Tucker-rank and reduce the approximation error by successively increasing the multilinear rank [151]. Although the multilinear rank of a given tensor can be computed exactly or approximated, it would require accessing the full data tensor and involve high computational cost. In practice, our methods use a greedy approach; in other words, they usually start at some initial multilinear rank and successively increase it until the approximation error is less than some threshold . The tubal SVD is another decomposition defined based on the concept of tubal tensor-tensor multiplication. Here, the tubal rank is defined. Similar to the Tucker model, computing the tubal rank is possible. The interesting property of the truncated t-SVD is that it provides the best rank approximation under any unitary invariant tensor norm. In this paper, we studied the CTA algorithms mostly based on the Tucker model and the tubal SVD decomposition.

IX. SIMULATIONS
In this section, we experimentally evaluate the validity and performance of the CTA algorithms discussed in the papers. The MATLAB implementations of the algorithms are accessible at https://github.com/SalmanAhmadi-Asl/Cross_Tensor_Approximation. All numerical simulations were performed on a cluster computer 120Gb RAM and 48 CPUs 1.2GHz. The efficiency and performance of algorithms were compared in terms of compression ratio, sampling rate, relative error and running time on real datasets.
The compression ratio is defined as where C 1 and C 2 are, respectively, the number of components of the original data tensor and the number of the parameters to represent a data tensor in the decomposition format. The sampling ratio is defined as where C 3 is the number of observed (sampled) entries in the original data tensor. The sampling ratio is used to compare the CTA algorithms which do not need to pass the whole data tensor, e.g., FSTD Algorithm. The relative error is defined as where X and X are approximate and exact data tensors, respectively. Note that in all our sampling algorithms, we use sampling without replacement. The PSNRs is used to assess the quality of the images reconstructed by the algorithms, defined as PSNR = 10 log 10 255 2 /MSE , where MSE = X − X 2 F /num X , and ''num'' denotes the number of parameters of a given data tensor.

Example 4.1 (Real Data Compression):
In this experiment, we applied the CTA algorithms to compress COIL-100 dataset [152]. This data tensor consists of 7200 color images (100 objects and 72 rotations per object, see Figure 14 for some samples of this data tensor). The size of each image is 128 × 128 × 3, and the whole data is represented as a fifth-order tensor of size 128 × 128 × 3 × 100 × 72. Since some of the CTA algorithms are applicable only for 3rd order tensors, we first reshaped the tensor to a 3rd order tensor of size 768 × 768 × 600, then applied the CTA algorithms to compress it. Finally, the approximate tensors were reshaped to the original size of the dataset. We first considered the algorithms which only need to access a part of the data tensor, i.e., the FSTD algorithm, the Cross3D algorithm, the AFS algorithm, the CTA-FS algorithm, and the F-TCTA algorithm. The results achieved by the naive Cross3D algorithm 7 and the AFS algorithm were almost the same, and because of this we only report the AFS algorithm.
For the FSTD algorithm, we assumed that the number of indices that are selected in each mode are the same, and were chosen in the range, 40, 80, 120, . . . , 360. For the AFS algorithm, we assumed that the number of fibers to be selected in each mode are the same and they were selected in the range, 50, 100, 150, . . . , 550. For the CTA-FS algorithm, we considered the uniform sampling without replacement and the number of slices to be selected were 50, 100, 150, . . . , 500 and also, we assumed that the number of tubes is twice the number of slices, i.e., 100, 200, 300, . . . , 1000. We found that sampling without replacement is more efficient than sampling with replacement for the slice selection case while we did not 7 Applying the Cross2D directly on the unfolding without taking into account the second cross approximation of the reshaped form of the long rows. see any significant difference for the various fiber selection scenarios.
For the F-TCTA, we found that the algorithm is sensitive to the number of selected lateral/horizontal slices. Our experiments show that when the number of lateral/horizontal are the same or close to each other, the approximation is relatively poor. However, as they are different (far from each other), the results are considerably improved. In our experiments, we used (3,8), (5,10), (5,15), (10,30), (60,20), (40,90), (160, 100), (260, 200), (290, 360) where the first/second component denotes the number of selected lateral/horizontal slices. For each of the above assumptions, we computed the sampling ratio, running time, and relative errors. The relative error and running time comparisons against sampling ratio are reported in Figures 15 (a)-(b), respectively. From the results, we found that the AFS /Cross3D algorithms are the most promising and accurate CTA algorithms for computing low tensor approximation but only when the multilinear rank is relatively small. Nonetheless, when the multilinear rank is large, their computational complexity is exponentially increased. Note that in the case of the CTA-FS algorithm, there is no any criterion guiding how many slices and fibers should be selected and how their relationship should be to achieve the optimal compression. In our experiments, we first considered the case that the number of fibers is twice the number of slices, which might not always be an optimal criterion. Due to this fact, we performed a new experiment where the number of tubes to be selected was five times larger than the number of slices. The results of this experiment are reported in Figure 15 (c), where Type I/II refers to the case that the number of tubes is two/five times larger than the frontal slices. From Figure 15, it is seen that the CTA-FS algorithm using Type II parameters achieves a lower approximation error than Type I.
To compare the reconstruction properties and performances of the algorithms, we first consider the CTA-FS algorithm and select 400 frontal slices and 2000 tubes. The sampling ratio, in this case, is 0.6701, with the corresponding relative error 0.1683. We found that the CTA-FS algorithm has a different feature than other CTA algorithms.
More precisely, it reconstructs some images of the data set perfectly 8 while the other images might be good or bad approximated. In our experiment, a total of 4800 images were recovered by inf PSNR. The PSNRs of the rest 2400 images reconstructed by CTA-FS algorithm can be seen in Figure 16 (a).
For the FSTD algorithm, we assumed that the same number of indices in each mode (equal to 320) is selected. For this case, the sampling ratio is 0.7106 with the corresponding relative error 0.2283. The quality of images reconstructed by this algorithm is displayed in Figure 16 (b).
For the F-TCTA algorithm, we assumed that 270 horizontal slices and 200 lateral slices were selected. Here, we have a sampling ratio 0.6120 and a relative error 0.1168. The quality of reconstructed images by this algorithm are displayed in Figure 16 (c).
For the AFS algorithm, we assumed that 550 fibers are selected in each mode. Here, we have a sampling ratio 0.4734 and a relative error 0.0868. The PSNRs of the reconstructed images using this algorithm are displayed in Figure 16 (d).
The recovered images of some samples corresponding to the above-mentioned cases are visualized in Figure 17.
In the next step, we compared the efficiency of the CTA algorithms, which need at least one pass to all parts of the data tensor to compute a low tensor approximation, i.e., the STucker algorithm, the SP-STucker algorithm, the HOID algorithm, the TP-TCTA algorithm, and the LSS algorithm. For the STucker and HOID algorithms, we assumed that the number of fibers to be selected in each mode was the same and in the range, R = 50, 150, . . . , 600. For the LSS algorithm, we assumed that the number of lateral slices are uniformly sampled and are in the range, R = 10, 40, 70, . . . , 240, and for the TP-TCTA algorithm, we assumed that the number of lateral/horizontal slices to be selected are the same and in the range, R = 10, 40, 70, . . . , 240. We also used the CTA-FS algorithm with length-squared probability distribution in (28) as a single-pass algorithm in our experiments. We assumed that the number of slices to be selected were in the range, R = 50, 100, 150, . . . , 500, and the number of selected tubes were five times larger than the number of frontal slices. Figures 18 (a) and (b), respectively, show the probability distribution under which the frontal slices and tubes were selected. Note that since all the mentioned algorithms pass the whole data tensor at least once, we compared the algorithms in terms of running time and compression ratio, not  For this dataset, the LSS algorithms achieved the best results in terms of relative error and compression ratio, while its computational complexity was higher. We remark that for this dataset, the FCD-FS algorithm with length-squared probability distribution provides almost the same performance as the FCD-FS algorithm with the uniform sampling because from Figures 18 (a)-(b), it is seen that the probability distribution of slice/tube selection is approximately uniform. The potentials of uniform sampling for image compression/approximation was also confirmed in [153] although we confirmed that uniform sampling without replacement works better than the one with replacement. In general, uniform sampling works quite well for data with low coherence [54]. Our computer experiments also confirm this. Nonetheless, the length-squared requires a high pass cost over the data and has higher computational complexity for computing the length-probability distributions in (28).
Here, we compare the quality of the reconstructed images by the pass-efficient algorithms. For the STucker algorithm, we assumed that the number of fibers to be selected is the same and equal to 600. The compression ratio was 1.6287 with the corresponding relative error 0.0740. In Figure 20 (a), the reconstructed images (object and rotations) are displayed. For the LSS algorithm, we sample 238 lateral slices for which the compression ratio 1.6134 and the relative error 0.0443 were achieved. The quality of the images reconstructed by this algorithm are displayed in Figure 20 (b). Also for the TP-TCTA algorithm, we sampled 205 lateral slices and 205 horizontal slices for which the compression ratio 1.6526 and relative error 0.0799 were archived. The reconstructed images obtained by this algorithm are displayed in Figure 20(c). For the CTA-FS algorithm, we selected 400 slices and 2000 tubes using length-squared probability distribution (without replacement) for which the compression ratio 1.4874 and relative error 0.1651 were achieved.   The reconstructed images of this experiment are displayed in Figure 20 (d). The recovered images of some samples corresponding to the above-mentioned cases are visualized in Figure 17.
Algorithm 2 (SP-STucker) is applicable only for relatively small multilinear rank because as we mentioned earlier, for theoretical reasons, the sketching parameters are required to satisfy S n > 2R n which means that the Tucker rank (R 1 , R 2 , . . . , R n ) should be small enough. We assumed that the number of sampled fibers in each mode were the same and in the range, R = 50, 100, . . . , 250, and the sketching parameters were S 1 = S 2 = S 3 = 550. We compared the relative error of the SP-STucker and the STucker algorithms for different numbers of sampled fibers in Figure 21 (a). From this figure, it is visible that as soon as the number of sampled fibers is increased, the quality of obtained results is significantly decreased. For completeness of our simulations, we compare the running time of SP-STucker and STucker algorithms. To this end, note that the algorithm SP-STucker is more efficient than the STucker algorithm in the following two main scenarios • The data tensor is extremely large which burdens high communication costs.
• The low multilinear rank approximation of a tensor for several multilinear rank is required. Since we were able to store the data in a single machine, we only considered the second above-mentioned scenario and made a comparison between the SP-STucker algorithm 2 and the STucker algorithm. To this end, we considered again the sketching parameters S 1 = S 2 = S 3 = 500 and the uniform multilinear rank (R, R, R) was used where R = 40, 60, 80, . . . , 260. We utilized Gaussian random matrices with zero mean and variance 1 for computing the sketching tensor W while they are not optimized for this computation. This operation was the most expensive part of the SP-STucker algorithm and took approximately 29 seconds. However, we do not compute it again through all of our computations for different multilinear ranks. The running time comparison of these algorithms are reported in Figure 21 (b). Also the running times which are required to compute the multilinear rank approximation of the data tensor for 13 experiments with R = 20, 40, 60, . . . , 260, based on the STucker and the SP-STucker algorithms were 79.34 and 62.61 seconds, respectively. The results clearly indicate that when multiple low multilinear rank approximation of a data is required, the SP-STucker algorithm 2 can be considerably faster than the STucker algorithm.
Example 4.2 Reconstruction of Cardiac MRI Images by Sub-Sampling in the Fourier Domain): Cardiac cine Magnetic Resonance Imaging (MRI) allows to obtain a temporal sequence of images within a single heart slice. At a given time, an MRI equipment provides measurements in the 2D-Fourier domain (k-space) of that single heart slice so the images are reconstructed by applying the inverse Fourier transform for each time. Thus, time resolution is constrained by the scanning of the full k-space. To improve time resolution, it is necessary to reduce the number of measurements and develop advanced processing techniques that could allow reconstructing images from incomplete measurements in the Fourier domain, which is the main motivation of the theory of Compressed Sensing (CS) developed in recent years [93], [94], [140]. CS techniques specialized in cine cardiac imaging were then developed by designing subsampling schemes using random Cartesian or radial trajectories in the (k x , k y )-space [154], [155].
Here, we show how to reconstruct cine cardiac MRI images by applying fixed masks to frontal slices in the data tensor as shown in Figure 22(a)-left. As a baseline method, we consider the Minimum Energy reconstruction, which consists in filling all unavailable measurements with zeros in the (k x , k y )-space and apply the inverse Fourier transform to each frontal slice. We can also apply the MLProj in eq. (12) by choosing projection matrices 1 and 2 as composed of the columns corresponding to low-frequencies in the Fourier transform matrix; and 3 as composed by a subset of columns in the identity matrix, e.g. equally spaced. By doing so, it is easy to see that the multilinear projection tensors Z (n) , n = 1, 2, 3 and W of eqs. (11) and (8), respectively, can be computed from the available measurements as described in Fig. 22(b). Once the projection measurements are obtained, we can approximate the cine cardiac MRI by applying the reconstruction formula of eq. (12) as illustrated in Fig. 22(c). Since, the obtained reconstructed tensor has Tucker-rank (R, R, R 3 ), we also use a reference the best low Tucker-rank approximation which is obtained by applying the classical High Order Orthogonal Iteration (HOOI) algorithm [25] to the original tensor data.
In our experiments, we used a cardiac MRI data tensor (256 × 256 × 25) as provided within the kt-FOCUSS demo. 9 In order to have a larger tensor, we have concatenated four replicates of this dataset along its 3rd mode thus, producing a (256 × 256 × 100) data tensor. We computed the relative error with sampling rates in the range, [5%, 45%], 9 http://bispl.weebly.com/k-t-focuss.html for the Min. Energy and the MLProj reconstructions using deterministic and random masks. Deterministic masks were generated as shown in Fig. 22(a)-left, i.e., having a cross-type pattern every D = 4 frontal slices and a simple square covering center frequencies for the rest of the slices. We also randomly sampled indices around center frequencies using a double-side exponential law in order to evaluate random masks. The Tucker-rank of the obtained reconstruction is (R, R, R 3 ), where R 3 = 25 and R was computed according to the desired sampling rate. In Fig. 23(a) the obtained relative errors versus the sampling rate are shown for all the methods and the relative error of a Tucker approximation computed through the HOOI algorithm on the full original data tensor is also shown as a lower bound. It is remarkable that the MLProj method provides very excellent results which are close to the optimal low Tucker-rank decomposition. In Fig. 23(b), a visual comparison of the results obtained for a sampling rate equal to 15%, is shown. The best result was obtained with the MLPRoj (random) reconstruction method (relative error = 1.3%), which provides images almost visually identical to the original ones. A remarkable characteristic of this method is that it is very fast as it is obtained by a simple closed formula (see Fig. 22(c))), which took only 0.3s to compute. We also applied a state-of-the-art CS method, namely the FOCUSS algorithm [155], using random Cartesian masks at a sampling rate also equal to 15% but using a different random mask for each slice in each data tensor. The obtained reconstruction was slightly better than MLProj (relative error = 0.6%) but visually almost indistinguishable. Such a good result is because in FOCUSS it is assumed that we are provided with random samples of every frontal slice of the tensor, which could be technologically more sophisticated. Moreover, as an iterative algorithm, FOCUSS has a considerably increased computational cost. It took about 124s to compute the reconstruction, which is about 3 orders of magnitude slower than the MLProj method.

X. CONCLUSION
In this paper, various deterministic and randomized algorithms for computation of Cross Tensor Approximations (CTAs) were reviewed and extended. Three possible generalizations of the matrix cross decomposition to tensors including: CTA based on fiber selection, slice-tube selection, and lateral-horizontal slices selection were discussed and reviewed. Interesting graphical illustrations are exploited to make the presentation much easier. Extensive simulations on image compression datasets were conducted to support and verify the validity and performance of some selected algorithms.