S3CMTF: Fast, accurate, and scalable method for incomplete coupled matrix-tensor factorization

How can we extract hidden relations from a tensor and a matrix data simultaneously in a fast, accurate, and scalable way? Coupled matrix-tensor factorization (CMTF) is an important tool for this purpose. Designing an accurate and efficient CMTF method has become more crucial as the size and dimension of real-world data are growing explosively. However, existing methods for CMTF suffer from lack of accuracy, slow running time, and limited scalability. In this paper, we propose S3CMTF, a fast, accurate, and scalable CMTF method. In contrast to previous methods which do not handle large sparse tensors and are not parallelizable, S3CMTF provides parallel sparse CMTF by carefully deriving gradient update rules. S3CMTF asynchronously updates partial gradients without expensive locking. We show that our method is guaranteed to converge to a quality solution theoretically and empirically. S3CMTF further boosts the performance by carefully storing intermediate computation and reusing them. We theoretically and empirically show that S3CMTF is the fastest, outperforming existing methods. Experimental results show that S3CMTF is up to 930× faster than existing methods while providing the best accuracy. S3CMTF shows linear scalability on the number of data entries and the number of cores. In addition, we apply S3CMTF to Yelp rating tensor data coupled with 3 additional matrices to discover interesting patterns.


Introduction
Given a tensor data, and related matrix data, how can we analyze them efficiently? Tensors (i.e., multi-dimensional arrays) and matrices are natural representations for various real world high-order data [1,2,3]. For instance, an online review site Yelp provides rich information about users (name, friends, reviews, etc.), or businesses (name, city, Wi-Fi, etc.). One popular representation of such data includes a 3-way rating tensor with (user ID, business ID, time) triplets and an additional friendship matrix with (user ID, user ID) pairs. Coupled matrix-tensor factorization (CMTF) is an effective tool for joint analysis of coupled matrices and a tensor. The main purpose of CMTF is to integrate matrix factorization [4] and tensor factorization [5] a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 to efficiently extract the factor matrices of each mode. The extracted factors have many useful applications such as latent semantic analysis [6,7,8], recommendation systems [9,10], network traffic analysis [11], and completion of missing values [12,13,14].
However, existing CMTF methods do not provide good performance in terms of time, accuracy, and scalability. CMTF-Tucker-ALS [15], a method based on Tucker decomposition [16], has a limitation that it is only applicable for dense data and not parallelizable. For sparse real-world data, it assumes empty entries as zero and outputs highly skewed results which lead to high reconstruction error. Moreover, CMTF-Tucker-ALS does not scale to large data because it suffers from high memory requirement caused by M-bottleneck problem [17]. CMTF-OPT [12] is a CMTF method based on CANDECOMP/PARAFAC (CP) decomposition [18]. SDF [19] provided Quasi-Newton and nonlinear least squares optimization techniques for general coupled factorization problems where factors may have certain structures as Toeplitz, orthogonal and nonnegative. CMTF-Tucker-ALS and CMTF-OPT undergo high reconstruction error since the former is not applicable to sparse data, and the latter focuses only on CP model and thus cannot be generalized to the Tucker model. Furthermore, both methods are sequential and hard to take benefit of multi-core parallelization.
In this paper, we propose S 3 CMTF (Sparse, lock-free SGD based, and Scalable CMTF), a CMTF method which resolves the problems of previous methods. S 3 CMTF provides parallel, sparse CMTF based on Tucker factorization unlike previous methods which do not support sparse tensors or cannot be parallelized. We also show that asynchronously parallel stochastic gradient descent (SGD) is useful for S 3 CMTF in multi-core shared memory systems without expensive locking. S 3 CMTF further boosts the performance by storing intermediate computation and reusing them. Table 1 shows the comparison of S 3 CMTF and other existing methods. The main contributions of our study are as follows: • Algorithm: We propose S 3 CMTF, a coupled tensor-matrix factorization algorithm for matrix-tensor joint datasets. S 3 CMTF is designed to efficiently extract factors from the joint datasets by taking advantage of sparsity, exploiting intermediate data. We propose a method which resolves conflicts of parallelization and leads to a solution with guaranteed convergence.
• Performance: S 3 CMTF shows the best performance on accuracy, speed, and scalability. S 3 CMTF runs up to 930× faster and is more scalable than existing methods as shown in Fig  1A. For real-world datasets, S 3 CMTF converges faster to the better optimum as shown in Fig 1B. • Discovery: Applying S 3 CMTF on Yelp review dataset with a 3-mode tensor (user, business, time) coupled with 3 additional matrices ((user, user), (business, category), and (business, city)), we observe interesting patterns and clusters of businesses and suggest a process for personal recommendation.

Preliminaries and related works
In this section, we describe preliminaries for tensor and coupled matrix-tensor factorization. We list all symbols used in this paper in Table 2.

Tensor
A tensor is a multi-dimensional array. Each 'dimension' of a tensor is called mode or way. The length of each mode is called 'dimensionality' and denoted by I 1 , � � �, I N . In this paper, an Nmode or N-way tensor is denoted by the boldface Euler script capital (e.g. X 2 R I 1 �I 2 �...�I N ), and matrices are denoted by boldface capitals (e.g. A). x α and a β denote the entry of X and A with indices α and β, respectively. We describe tensor operations used in this paper. A mode-n fiber is a vector which has fixed indices except for the n-th index in a tensor. The mode-n matrix product of a tensor X 2 R I 1 �I 2 �...�I N with a matrix A 2 R J�I n is denoted by X� n A and has the size of I 1 ×� � �I n−1 ×J×I n+1 � � � × I N . It is defined as: where a ji n is the (j, i n )-th entry of A. For brevity, we use the following shorthand notation for multiplication on every mode as in [20]: where {A} denotes the ordered set {A (1) , A (2) , � � �, A (N) }. We use the following notation for multiplication on every mode except n-th mode.
We examine the case that an ordered set of row vectors {a (1) , a (2) , � � �, a (N) }, denoted by {a}, is multiplied to a tensor X. First, consider the multiplication for every corresponding mode. By Eq (1), where a ðmÞ k denotes the k-th element of a (m) . Then, consider the multiplication for every mode except n-th mode. Such multiplication results in a vector of length I n . The k-th entry of the vector is where O n;k X denotes the index set of X having its n-th index as k. α = (i 1 i 2 � � �i N ) denotes the index for an entry.

Tucker decomposition
Tucker decomposition is one of the most popular tensor factorization models. Tucker decomposition factorizes an N-mode tensor X 2 R I 1 �I 2 �...�I N into a core tensor G 2 R J 1 �J 2 �...�J N and factor matrices U ð1Þ 2 R I 1 �J 1 ; U ð2Þ 2 R I 2 �J 2 ; . . . ; U ðNÞ 2 R I N �J N satisfying  Element-wise formulation of Tucker model is where α is a tensor index (i 1 i 2 � � �i N ), and u ðnÞ i n denotes the i n -th row of factor matrix U (n) . {u} α denotes the set of factor rows fu ð1Þ i 1 ; u ð2Þ i 2 ; . . . ; u ðNÞ i N g. The core tensor G indicates the relation between the factors in Tucker formulation. When the core tensor size is restricted as J 1 = J 2 = � � � = J N and the core tensor structure is hyper-diagonal, it is equivalent to CANDECOMP/ PARAFAC (CP) decomposition. Orthogonality constraint can optionally be imposed to the Tucker decomposition by forcing the factor matrices to have orthonormal columns (e.g. U (n)T U (n) = I for n = 1, � � �, N where I is an identity matrix).

Coupled matrix-tensor factorization
Coupled matrix-tensor factorization (CMTF) is proposed for joint factorization of a tensor and matrices. CMTF integrates matrix factorization and tensor factorization.

Definition 1. (Coupled Matrix-Tensor Factorization)
Given an N-mode tensor X 2 R I 1 �...�I N and a matrix Y 2 R I c �K where c is the coupled mode, X �X ¼ G � fUg, and Y � Y ¼ U ðcÞ V > are the coupled matrix-tensor factorization. U ðcÞ 2 R I c �J c is the c-th mode factor matrix, and V 2 R K�J c denotes the factor matrix for the coupled matrix. Finding the factor matrices and core tensor for CMTF is equivalent to solving arg min where k • k denotes the Frobenius norm. Various methods have been proposed to efficiently solve the CMTF problem. An alternating least squares (ALS) method CMTF-Tucker-ALS [15] was proposed. CMTF-Tucker-ALS is based on Tucker-ALS (HOOI) [21] which is a popular method for fitting the Tucker model. Tucker-ALS suffers from a crucial intermediate memory-bottleneck problem known as M-bottleneck problem [17] that arises from materialization of a large dense tensor X� À n fUg > as intermediate data where fUg > ¼ fU ð1Þ> ; U ð2Þ> ; . . . ; U ðNÞ> g. Generalized coupled tensor factorization frameworks [22,23] have been proposed, and they propose multiplicative methods for non-negative factorization. SDF [19] provided Quasi-Newton and nonlinear least squares optimization techniques for general coupled factorization problems where factors may have certain structures as Toeplitz, orthogonal and nonnegative. A Bayesian method [24] has been proposed. It suggests a generative model for tensor factorization and gets parameters with Gibbs sampling method. Most methods for CMTF use CP decomposition model forX where J 1 = J 2 = � � � = J N and the core tensor G is hyper-diagonal [12,25,26,27,28,19]. CMTF-OPT [12] is a representative algorithm for this problem which uses nonlinear conjugate gradient descent method to find factors. HaTen2 [26,29], and SCouT [25] propose distributed methods for CMTF using CP decomposition model based on the MAPREDUCE framework. Turbo-SMT [27] provides a time-boosting technique for CP-based CMTF methods. Note that Eq (5) requires all data entries of X and Y to be observed. Unobserved values are set to zeros when X and Y are sparse, which results in low accuracy. However, most real world data set shows high sparsity. For example, the density of real world tensor we use for experiments vary from 10 −7 to 10 −4 . For this reason above methods show low accuracy for real-world sparse data; what we focus on this paper is solving CMTF for sparse data. Definition 2. (Sparse CMTF) When X and Y are sparse, sparse CMTF aims to find factors only considering the observed entries. Let W ð1Þ indicates the observed entries of X such that Let W (2) indicates the observed entries of Y analogously. We modify Eq (5) as arg min where � denotes the Hadamard product (element-wise product). CMTF-Tucker-ALS does not support sparse CMTF since it calculates a singular vector of full and dense matrix. CMTF-OPT provides single machine approach for sparse CMTF for CP model, and CDTF [30] and FlexiFaCT [28] provide distributed methods for sparse CMTF for CP model. Note that all existing methods are based on CP model. Our method is for more general setting, Tucker decomposition, and also easily applied to CP model.

Proposed method
Overview S 3 CMTF provides an algorithm for the joint factorization of Tucker decomposition. The major challenge of parallel Tucker decomposition is to avoid the race condition, and design an efficient algorithm for updating factors.
In this section, we describe S 3 CMTF (Sparse, lock-free SGD based, and Scalable CMTF), our proposed method for fast, accurate, and scalable CMTF. Our purpose is to minimize the number of race conditions with probabilistic guarantee by exploiting problem characteristic and minimize calculations by exploiting intermediate data.
We first propose a lock-free parallel method S 3 CMTF-base; then, we propose a timeimproved version S 3 CMTF-opt. Fig 2 shows the overall scheme of S 3 CMTF. S 3 CMTF-base employs asynchronous parallel SGD for the parallel update with proper workload distribution, and S 3 CMTF-opt further improves the speed of S 3 CMTF-base by exploiting intermediate data and reusing them.

Objective function & gradient
We discuss the improved formulation of the sparse CMTF problem defined in Definition 2. For simplicity, we consider the case that one matrix Y 2 R I c �K is coupled to the c-th mode of a tensor X 2 R I 1 �...�I N . Naive calculation of Eq (6) takes excessive time and memory since it includes materialization of dense tensor G � fUg. Therefore, we re-formulate the new CMTF objective function f to exploit the sparsity of data and add regularization. f is the weighted sum of two functions f t and f m which are element-wise sums of squared reconstruction error and regularization terms of tensor X and matrix Y, respectively.
where λ m is a balancing factor of the two functions.
, O X is the observable index set of X, and λ reg denotes the regularization parameter for factors. We rewrite the equation so that it is amenable to SGD update: where α = (i 1 � � �i N ). Note that O n;i n X is the subset of O X having i n as the n-th index. Now we formulate f m , the sum of squared errors of coupled matrix and regularization term corresponding to the coupled matrix.
We calculate the gradient of f (Eq (7)) with respect to factors and core for stochastic gradient descent update. Consider that we pick one index a ¼ ði 1 We calculate the corresponding partial derivatives of f with respect to the factors and the core tensor as follows.
@f @u ðnÞ @f @G @f @u ðcÞ Note that our formulated coupled matrix-tensor factorization model is easily generalized to the case that multiple matrices are coupled to a tensor. We couple multiple matrices to a tensor for experiments in Sections for experiments and discovery.

Multi-core parallelization
How can we parallelize the SGD updates for CMTF in multiple cores? In CMTF, SGD is hard to be parallelized without conflicts since each update may suffer from memory conflicts by attempting to write the core tensor G to memory concurrently [31]. One solution for this problem is memory locking and synchronization. However, there are lots of overhead associated with locking. Therefore, we use lock-free strategy to parallelize S 3 CMTF. We develop a parallel update scheme for S 3 CMTF by adopting HOGWILD! update scheme [32]. For any SGD problem, a hypergraph is induced where its nodes represent parameters and edges represent the set of parameters related to a data point.

Definition 3. (Induced Hypergraph)
The objective function in Eq (7) induces a hypergraph G = (V, E) whose nodes represent factor rows and the core tensor. Each entry of X and Y induces a hyperedge e 2 E consisting of corresponding factor rows or core tensor. Fig 3A shows an example induced graph of S 3 CMTF.
Lock-free parallel updates often converge nearly linearly for a sparse SGD problem in which conflicts between different updates rarely occur [32]. However, in CMTF with Tucker formulation, every update of tensor entries includes the core tensor G as shown in Fig 3A. We allocate the update of the core tensor G to one dedicated CPU core and increase the step size by the number to keep the expected step size unchanged, which leads to line 7 of Algorithm 1 described in the next section. Then we obtain a new induced hypergraph in Fig 3B. Previous induced hypergraph (Fig 3A) implies that every factor update (red, blue, and orange hyperedges) is in conflict with each other on updating the core tensor, resulting to unexpected behaviors. In contrast, the new induced hypergraph shows that the update of factors is independent of that of the core tensor.
Note that our problem with this induced hypergraph is a general case of matrix completion problem in [32] which provides convergence guarantee of lock-free parallelism; each edge in our hypergraph entails N vertices, while that in [32] entails only 2 vertices. Algorithm 1 S 3 CMTF-base 3: if α is picked then 5: ( @f if β is picked then end for 15: η t = η 0 (1 + μt) −1 16: until convergence conditions are satisfied 17: for n = 1, . . ., N do 18:

S 3 CMTF-base
We present our method, S 3 CMTF-base, combination of the aforementioned techniques. S 3 CMTF-base solves the sparse CMTF problem by parallel SGD techniques explained above. Algorithm 1 shows the procedure of S 3 CMTF-base. In the beginning, S 3 CMTF-base initializes factor matrices and the core tensor randomly (line 1 of Algorithm 1). The outer loop (lines 2-16) repeats until the factor variables converge. The inner loop (lines [3][4][5][6][7][8][9][10][11][12][13][14] is performed by several cores in parallel. In each inner loop, S 3 CMTF-base selects an index which belongs to O X or O Y in random order (line 3). If a tensor index α is picked, then the algorithm calculates the partial gradients of corresponding factor rows using compute_gradient (Algorithm 2) in line 5, and updates factor row vectors (line 6). Core tensor G is updated by one dedicated CPU core (line 7). Note that if line 7 is run by multiple cores, a core may interrupt another core's update of G by overwriting the gradient @f @G , which leads to unexpected update of G and hinders convergence; thus, we eliminate the possibility of such conflict by allocating update of G to the dedicated CPU core. The update of line 7 is done independently by the dedicated CPU core, but concurrently with gradient calculation (line 5) and factor updates (line 6) of other CPU cores. The number P of cores is multiplied to the gradient to compensate for the one-core update so that SGD uses the same expected learning rate for all the parameters. If a coupled matrix index β is picked, then the gradient update is performed on corresponding factor row vectors (lines 9-13). At the end of the outer loop, the learning rate η t of the t-th iteration is monotonically decreased [33]. (line 15). QR decomposition is applied on factors to satisfy orthogonality constraint of factor matrices (lines [17][18][19][20]. QR decomposition of U (n) generates Q (n) , an orthogonal matrix of the same size as U (n) , and a square matrix R ðnÞ 2 R J n �J n . Substituting U (n) by Q (n) (line 19) and G by G� 1 R ð1Þ . . . � N R ðNÞ (after N-th execution of line 19) result in orthogonal factors with equivalent factorization quality [5]. In the same manner, we substi-

S 3 CMTF-opt
There is much room for improvement in calculations of S 3 CMTF-base. The computational bottleneck of S 3 CMTF-base is compute_gradient. There are implicitly redundant calculations during multiple tensor-matrix products. For example, calculation of G� À n fug a is repeated N times for every execution of compute_gradient (Algorithm 2) in line 5 of Algorithm 1. The calculation of G� À n fug a for the n-th mode is equivalent to a special case of a well-studied operation, matricized tensor times Khatri-Rao product (MTTKRP). MTTKRP is an operation to compute X (n) � 8k 6 ¼ n A (k) where X (n) is a matricized tensor along the n-th mode, and � denotes the Khatri-Rao product [34]. G� À n fug a is equivalent to an MTTKRP G (n) � 8k 6 ¼ n u (k) where the matrix A (k) is replaced by the vector u (k) . Calculating MTTKRP along all modes is known as the CP gradient problem. In compute_gradient, we need to calculate G� À n fug a for all N modes (line 3 of Algorithm 2), raising the special case of the CP gradient problem. To solve the particular CP gradient problem faster, we propose a method to avoid redundant computations by reusing the intermediate calculations in previous steps. Calculation of G� À n fug a is equivalent to a summation of which is a product of the core value g j 1 j 2 ...j N and N − 1 related factor values. Before the calculation of the CP gradient,x a ¼ G � fug a is calculated in line 1 of Algorithm 2. We exploit the fact that G � fug a is the summation of the product P J 1 j 1 ¼1 . . .
There is no extra time required for calculating S because S is generated while calculating x a . Lemma 1 shows thatx a is calculated by summing all entries of S. Lemma 1. For a given tensor index α, the estimated tensor entrỹ Proof. The proof is straightforward by Eq (4). We use S with following Collapse operation to calculate gradients efficiently.

Definition 5. (Collapse) The Collapse operation of the intermediate tensor S on the n-th mode outputs a row vector defined as:
CollapseðS; nÞ ¼ ½ P Collapse operation aggregates the elements of intermediate tensor S with respect to a fixed mode. We re-express the calculation of gradients for tensor factors in Eqs (8a)-(8d) in an efficient manner.

Analysis
We analyze the proposed method in terms of time complexity per iteration. For simplicity, we assume that I 1 = I 2 = � � � = I N = I, and J 1 = J 2 = � � � = J N = J. Table 3 summarizes the time complexity (per iteration) and memory usage of S 3 CMTF and other methods. Note that the memory usage refers to the auxiliary space for temporary variables used by a method. Lemma 4. The time complexity (per iteration) of S 3 CMTF-base is OðjOjN 2 J N =P þ jO Y jJ=PÞ and the time complexity (per iteration) of S 3 CMTF-opt is OðjOjNJ N =P þ jO Y jJ=PÞ where P denotes the number of parallel cores.
Proof. First, we check the time complexity of S 3 CMTF-base. When a tensor index α is picked in the inner loop (line 4 of Algorithm 1), calculating gradients with respect to tensor factors (line 5) takes OðN 2 J N Þ as shown in Lemma 3. Updating factor rows (line 6) takes Table 3. Comparison of time complexity (per iteration) and memory usage of our proposed S 3 CMTF and other CMTF algorithms. S 3 CMTF-opt shows the lowest time complexity and S 3 CMTF-base shows the lowest memory usage. For simplicity, we assume that all modes are of size I, of rank J, and an I × K matrix is coupled to one mode. P is the number of parallel cores. ( � indicates the lowest time or memory). OðNJÞ, and updating core tensor (line 7) takes OðJ N Þ. If a coupled matrix index β is picked (line 9), calculatingỹ b (line 10) takes OðJÞ. Calculating and updating the factor rows corresponding to coupled matrix entry (lines 10-12) take OðJÞ. All calculations except updating core tensor (line 7) are conducted in parallel. Finally, for all a 2 O X and β 2 O Y , S 3 CMTF-base takes OðjO X jN 2 J N =P þ jO Y jJ=PÞ for one iteration. S 3 CMTF-opt uses compute_gradient_opt instead of compute_gradient in line 5 of Algorithm 1, whose time complexity is shown in Lemma 3. Overall running time per iteration for S 3 CMTF-opt is OðjO X jNJ N =P þ jO Y jJ=PÞ.

Experiments
In this and the next sections, we experimentally evaluate S 3 CMTF. Especially, we answer the following questions. Q1: Performance How accurate and fast is S 3 CMTF compared to competitors? Q2: Scalability How do S 3 CMTF and other methods scale in terms of dimensionality, the number of observed entries, and the number of cores?
Q3: Discovery What are the discoveries of applying S 3 CMTF on real-world data?
The source codes of our method and datasets used in this paper are available at https:// datalab.snu.ac.kr/S3CMTF.

Experimental settings
Data. Table 4 shows the data we used in our experiments. We use three real-world datasets, MovieLens (http://grouplens.org/datasets/movielens/10m), Netflix (http://www. netflixprize.com), and Yelp (http://www.yelp.com/dataset_challenge), as well as synthetic data to evaluate S 3 CMTF. Each entry of the real-world datasets represents a rating, which consists of (user, 'item', time; rating) where 'item' indicates 'movie' for MovieLens and Netflix, and 'business' for Yelp. We use (movie, genre) and (movie, year) as coupled matrices for Movie-Lens and Netflix, respectively. We use (user, user) friendship matrix, (business, category) and (business, city) matrices for Yelp. Particularly for scalability experiments, we generate 3-mode synthetic random tensors with dimensionality I and corresponding coupled matrices to observe speed property while size is varying. We vary I in the range of 1K*100M and the number of tensor entries in the range of 1K*100M. We set the number of entries as jO Y j ¼ 1 10 jO X j for synthetic coupled matrices. We generated observed indices randomly, and their entries to follow uniform random distribution between 0 and 1. where O test is the index set of the test data tensor, x α stands for each test tensor entry, andx a is the corresponding reconstructed value.
Methods. For fair comparison, we compare single core run of S 3 CMTF-base and S 3 CMTF-opt with other single machine CMTF methods: CMTF-Tucker-ALS and CMTF-OPT (described in Section). To examine multi-core performance, we run two versions of S 3 CMTFopt: S 3 CMTF-opt1 (1 core), and S 3 CMTF-opt20 (20 cores). We exclude distributed CMTF methods [25,26,28] since they are designed for Hadoop with multiple machines, and thus take too much time for single machine environment. For example, [17] reported that HaTen2 [26] takes 10,700s to decompose 4-way tensor with I = 10K and jO X j ¼ 100K, which is almost 7,000× slower than our single machine implementation of S 3 CMTF-opt. For CMTF-Tucker-ALS, we implemented a C++ version based on Tucker-MET [20], and for CMTF-OPT, we implemented a C++ version of CMTF-OPT [12]. Our implementation for CMTF-OPT solves Eq (6) by sparse matrix operations. We implement S 3 CMTF with C++. For all of our C++ implementations, we used C++11 with O2 flag. We used Armadillo 7.700 with LAPACK 3.7.0 and BLAS 3.7.0 for matrix operations such as eigenvector calculations. We used OpenMP 4.0 library for multi-core parallelization of S 3 CMTF.
We conduct all experiments on a machine equipped with Intel Xeon E5-2630 v4 2.2GHz CPU and 256GB RAM. We mark out-of-memory (O.O.M.) error when the memory usage exceeds the limit.

Performance of S 3 CMTF
We observe the performance of S 3 CMTF to answer Q1. As seen in Figs 1B and 4, S 3 CMTF converges faster to the optimum with the lowest test error than existing methods with the following details.
Accuracy. We divide each data tensor into 80%/20% for train/test sets. Specifically, 80% of the tensor entries are regarded as the train set and remaining 20% as the test set. The lower error for a same elapsed time implies the better accuracy and faster convergence. Figs 1B and 4 show the changes of test RMSE of each method on three datasets over elapsed time which are the answers for Q1. S 3 CMTF achieves the lowest error compared to others for the same elapsed time. For Yelp, CMTF-Tucker-ALS yielded an O.O.M. error. S 3 CMTF-opt20 achieves the lowest error 1.253, 0.9147, and 0.8037 while the best competing method, CMFT-OPT, gives the error 1.370, 1.018, and 0.8125 for Yelp, Netflix, and MovieLens datasets, respectively. Note that the competing method CMFT-Tucker-ALS gives either an out of memory error or results in the highest error rate.
Running time. We compare our method with the multi-core version of SALS-single [30], a parallel CP decomposition algorithm, to demonstrate the high performance of S 3 CMTF compared to the state-of-the-art decomposition algorithms. We used non-coupled CP version of our method, S 3 CMTF-CP-opt, by setting G to be hyper-diagonal and not coupling any matrices. Fig 5 shows that S 3 CMTF is better than SALS-single in terms of both error and time for MovieLens dataset. S 3 CMTF-TUCKER explicitly denotes S 3 CMTF-opt for Tucker model.

Scalability analysis
We present scalability of our proposed S 3 CMTF and competitors to answer Q2, in terms of two aspects: data scalability and parallel scalability. We use synthetic data of varying size for evaluation. As a result, we show the running time (for one iteration) of S 3 CMTF follows our theoretical analysis in Section.
Data scalability. The time complexity of CMTF-Tucker-ALS and CMTF-OPT have OðNI NÀ 1 J 2 Þ and OðNI NÀ 1 JÞ as their dominant terms, respectively. In contrast, S 3 CMTF exploits the sparsity of input data, and has the time complexity linear to the number of entries (jO X j, |O Y |) and is independent of the dimensionality (I) as shown in Lemma 4. Figs 1A and 6A show that the running time (for one iteration) of S 3 CMTF on real world data sets follows our theoretical analysis in Section. First, we fix jO X j to 1M and |O Y | to 100K, and vary dimensionality I from 1K to 100M. Fig 1A shows the running time (for one iteration) of all methods with J = 10. Note that all of our proposed methods achieve constant running time as  Parallel scalability. We conduct experiments to examine parallel scalability of S 3 CMTF on shared memory systems. For measurement, we define speed up as (iteration time on 1 core)/ (iteration time). Fig 6B shows the linear speed up of S 3 CMTF-base and S 3 CMTF-opt. The slope of the parallel scalability curve is not one (perfectly parallelizable) since the growing number of cores leads to the concurrent read accesses to memory, which leads to conflicts. S 3 CMTF-opt shows higher speed up than S 3 CMTF-base because it reduces reading accesses for core tensor by utilizing intermediate data.

Discovery
In this section, we use S 3 CMTF for mining real-world data, Yelp, to answer the question Q3 in the beginning of the previous section. First, we demonstrate that S 3 CMTF has better discernment for business entities compared to the naive decomposition method by jointly capturing spatial and categorical prior knowledge. Second, we show how S 3 CMTF is possibly applied to the real recommender systems. It is an open challenge to jointly capture the spatio-temporal context along with user preference data [35]. We exemplify a personal recommendation for a specific user. For discovery, we use the total Yelp data tensor along with coupled matrices as explained in Table 4. For better interpretability, we found a non-negative factorization by applying projected gradient method [36]. An orthogonality condition is not imposed to keep non-negativity, and each column of factors is normalized.

Discovery
First, we compare discernment by S 3 CMTF and the Tucker decomposition. We use the business factor U (2) . Fig 7A shows the gap statistic values of clustering business entities with kmeans clustering algorithm. Gap statistic is a theoretical tool to measure separability between k-means clusters [37]. A higher gap statistic means higher separability between clusters. S 3 CMTF shows higher gap statistic values compared to the Tucker decomposition which means S 3 CMTF outperforms the naive Tucker decomposition for entity clustering with respect to the gap statistic.
As the difference between S 3 CMTF and the Tucker decomposition is in the existence of coupled matrices, the high performance of S 3 CMTF is attributed to the unified factorization using spatial and categorical data as prior knowledge. Table 5 shows the found clusters of business entities. Note that each cluster represents a certain combination of spatial and categorical characteristics of business entities.

User-specific recommendation
Commercial recommendations are one of the most important applications of factorization models [4,9]. Here we illustrate how factor matrices are used for personalized recommendations with a real example. Fig 7B shows the process for recommendation. Below, we illustrate the process in detail. https://doi.org/10.1371/journal.pone.0217316.g007 Table 5. Clustering results on business factor U (2) found by S 3 CMTF. We found dominant spatial and categorical characteristics from each cluster. Businesses in a same cluster tend to be in adjacent cities and are included in similar categories. • An example user Tyler has a factor vector u, namely user profile, which has been calculated by previous review histories.

Cluster Location / Category
• We then calculate the personalized profile matrix R ¼ G� 1 uð2 R J 2 �J 3 Þ. R measures the amount of interaction of user profile with business and time factors.
• Norm values of rows in R indicate the influence of latent business concepts on Tyler. Dominant and weak concepts are found based on the calculated norm values. In the example, B4 is the dominant, and B7 is the weak latent concept.
• We inspect the corresponding columns of business factor matrix U (2) and find relevant business entities with high values for the found concepts (B4 and B7).
We found both strong and weak entities by the above process. The strong and weak entities provide recommendation information by themselves in the sense that the probability of the user to like strong and weak entities are high and low, respectively, and they also give extended user preference information. For example, strong entities for Tyler are related to 'spa & health' and located in neighborhood cities of Arizona, US. Weak entities are related to 'grill & restaurants' and located in Toronto, Canada. The captured user preference information potentially makes commercial recommender systems interpretable with additional user-specific information such as address, current location among others.

Conclusion
We propose S 3 CMTF, a fast, accurate, and scalable CMTF method. S 3 CMTF provides up to 930× faster running times and the best accuracy by sparse CMTF with carefully derived update rules, lock-free parallel SGD, and reusing intermediate computation results. S 3 CMTF shows linear scalability for the number of data entries and parallel cores. Moreover, we show the usefulness of S 3 CMTF for cluster analysis and recommendation by applying S 3 CMTF to realworld Yelp data. For future improvements, applying recent achievements in the literature to improve CP gradient algorithm [38,39] to our method is possible. Also, future works include extending the method to a distributed setting.