Parallel matrix factorization for low-rank tensor completion

Higher-order low-rank tensors naturally arise in many applications including hyperspectral data recovery, video inpainting, seismic data recon- struction, and so on. We propose a new model to recover a low-rank tensor by simultaneously performing low-rank matrix factorizations to the all-mode ma- tricizations of the underlying tensor. An alternating minimization algorithm is applied to solve the model, along with two adaptive rank-adjusting strategies when the exact rank is not known. Phase transition plots reveal that our algorithm can recover a variety of synthetic low-rank tensors from significantly fewer samples than the compared methods, which include a matrix completion method applied to tensor recovery and two state-of-the-art tensor completion methods. Further tests on real- world data show similar advantages. Although our model is non-convex, our algorithm performs consistently throughout the tests and give better results than the compared methods, some of which are based on convex models. In addition, the global convergence of our algorithm can be established in the sense that the gradient of Lagrangian function converges to zero.

1.1. Notation. Following [8], we use bold lower-case letters x, y, . . . for vectors, bold upper-case letters X, Y, . . . for matrices, and bold caligraphic letters X , Y, . . . for tensors. The (i 1 , . . . , i N )-th component of an N -way tensor X is denoted as x i1...i N . For X , Y ∈ R I1×...×I N , we define their inner product in the same way as that for matrices, i.e., The Frobenius norm of X is defined as X F = X , X .
A fiber of X is a vector obtained by fixing all indices of X except one, and a slice of X is a matrix by fixing all indices of X except two. For example, if X ∈ R 2×2×2 has two frontal slices (with the third index fixed) then [1,2] is a mode-1 fiber (with all but the first indices fixed), [1,3] is a mode-2 fiber (with all but the second indices fixed), and [1,5] is a mode-3 fiber (with all but the third indices fixed). Its two horizontal (with the first index fixed) and two lateral slices (with the second index fixed) are The mode-n matricization (also called unfolding) of X ∈ R I1×...×I N is denoted as X (n) ∈ R In×Π j =n Ij , which is a matrix with columns being the mode-n fibers of X in the lexicographical order. Take the tensor in (1.1) for example. Its mode-1 and mode-3 matricizations are respectively Relating to the matricization process, we define unfold n (X ) = X (n) and fold n to reverse the process, i.e., fold n (unfold n (X )) = X . The n-rank of an N -way tensor X , denoted as rank n (X ), is the rank of X (n) , and we define the rank 1 of X as an array: rank(X ) = (rank(X (1) ), . . . , rank(X (N ) )). We say X is (approximately) low-rank if X (n) is (approximately) low-rank for all n.

Problem formulation.
We aim at recovering an (approximately) low-rank tensor M ∈ R I1×...×I N from partial observations B = P Ω (M), where Ω is the index set of observed entries, and P Ω keeps the entries in Ω and zeros out others. We apply low-rank matrix factorization to each mode unfolding of M by finding matrices X n ∈ R In×rn , Y n ∈ R rn×Π j =n Ij such that M (n) ≈ X n Y n for n = 1, . . . , N , where r n is the estimated rank, either fixed or adaptively updated. Introducing one common variable Z to relate these matrix factorizations, we solve the following model to recover M The ranks r 1 , . . . , r N in (1.2) must be specified, yet we do not assume the knowledge of their true values. To address this issue, we dynamically adjust the rank estimates in two schemes. One scheme starts from overestimated ranks and then decreases them by checking the singular values of the factor matrices in each mode. When a large gap between ther n th and (r n + 1)th singular values of the factors is found, r n is reduced tor n . The other scheme starts from underestimated ranks and then gradually increases them if the algorithm detects slow progress.
We try to model (1.2) by cyclically updating X, Y, and Z. Although a global solution is not guaranteed, we demonstrate by numerical experiments that our algorithm can reliably recover a wide variety of low-rank tensors. In addition, a global convergence result is given under boundedness assumptions.
The details will be given in Section 3.

Related work.
Our model (1.2) can be regarded as an extension of the following model [26] from matrix completion to tensor completion  1 Our definition relates to the Tucker decomposition [25]. Another popularly used definition is based on the CAN-DECOMP/PARAFAC (CP) decomposition [6].
significantly better than nuclear norm 2 based convex models such as min Z Z * , subject to P Ω (Z) = B, (1.4) where Z * denotes the nuclear norm of Z, defined as the sum of its singular values.
The work [15] generalizes (1.4) to the tensor case, and to recover the (approximately) low-rank tensor M, it proposes to solve where α n ≥ 0, n = 1, . . . , N are preselected weights satisfying n α n = 1. Different from our model (1.2), the problem (1.5) is convex, and in [15], various methods are applied to solve it such as block coordinate descent method, proximal gradient method, and alternating direction method of multiplier (ADMM). The model (1.5) utilizes low-rankness of all mode unfoldings of the tensor, and as demonstrated in [15], it can significantly improve the solution quality over that obtained by solving (1.4), where the matrix Z corresponds to some mode unfolding of the tensor.
The recent work [17] proposes a more "square" convex model for recovering M as follows: whereẐ ∈ R Ii 1 ×...×Ii N is a tensor by relabeling mode i n of Z to mode n for n = 1, . . . , N , and j and the permutation (i 1 , . . . , i N ) are chosen to make n≤j I in as close as to n>j I in . As the order of M is no more than three, (1.6) is the same as (1.4) with Z corresponding to some mode unfolding of the tensor, and it may not perform as well as (1.5). However, for a low-rank tensor of more than three orders, it is shown in [17] that (1.6) can exactly recover the tensor from far fewer observed entries than those required by (1.5).
There are some other models proposed recently for LRTC. For example, the one in [20] uses, as a regularization term, a tight convex relaxation of the average rank function 1 N n rank n (M) and applies the ADMM method to solve the problem. The work [11] directly constrains the solution in some low-rank manifold and employs the Riemannian optimization to solve the problem. Different from the above discussed models that use tensor n-rank, the model in [30] employs the so-called tubalrank based on the recently proposed tensor singular value decomposition (t-SVD) [7]. For details about these models, we refer the readers to the papers where they are proposed.
1.4. Organization. The rest of the paper is organized as follows. Section 2 shows the phase transition of our proposed method and some existing ones. Section 3 gives our algorithm with two different rank-adjusting strategies, and the convergence result of the algorithm is given in section 4. In section 5, we compare the proposed method with some state-of-the-art methods for tensor completion on both synthetic and real-world data. Finally, section 6 conludes the paper. 2 The matrix nuclear norm is the convex envelope of matrix rank function [19], and the nuclear norm minimization can promote the low-rank structure of the solution.

Phase transition plots.
A phase transition plot uses greyscale colors to depict how likely a certain kind of low-rank tensors can be recovered by an algorithm for a range of different ranks and sample ratios. Phase transition plots are important means to compare the performance of different tensor recovery methods.
We compare our method (called TMac) to the following three methods on random tensors of different kinds. In section 5, we compare them on the real-world data including 3D images and videos.
• Matrix completion method for recovering low-rank tensors: we unfold the underlying Nway tensor M along its N th mode and apply LMaFit [26] to (1.3), where Z corresponds to unfold N (M). If the output is (X,Ỹ), then we use fold N (XỸ) to estimate M.
• Nuclear norm minimization method for tensor completion: we apply FaLRTC [15] to (1.5) and use the output Z to estimate M.
• Square deal method: we apply FPCA [16] to (1.6) and use the output Z to estimate M.
We call the above three methods as MatComp, FaLRTC, and SquareDeal, respectively. We chose these methods due to their popularity and code availability. LMaFit has been demonstrated superior over many other matrix completion solvers such as APGL [23], SVT [5], and FPCA [16]; (1.5) appears to be the first convex model for tensor completion, and FaLRTC is the first efficient and also reliable 3 solver of (1.5); the work [17] about "square deal" appears the first work to give theoretical guarantee for low-rank higher-order tensor completion.
If the relative error the recovery was regarded as successful, where M rec denotes the recovered tensor.
2.1. Gaussian random data. Two Gaussian random datasets were tested. Each tensor in the first dataset was 3-way and had the form where C was generated by MATLAB command randn(r,r,r) and A n by randn(50,r) for n = 1, 2, 3. We generated Ω uniformly at random. The rank r varies from 5 to 35 with increment 3 and the sample ratio SR = |Ω| ΠnIn from 10% to 90% with increment 5%. In the second dataset, each tensor was 4-way and had the form where C was generated by MATLAB command randn(r,r,r,r) and A n by randn(20,r) for n = 1, 2, 3, 4. The rank r varies from 4 to 13 with increment 1 and SR from 10% to 90% with increment 5%. For each setting, 50 independent trials were run. 3 In [15], the ADMM is also coded up for solving (1.5) and claimed to give high accurate solutions. However, we found that it was not as reliable as FaLRTC. In addition, if the smoothing parameter µ for FaLRTC was set small, FaLRTC could also produce solutions of high accuracy. In addition, Figure 2.3 depicts the phase transition plots of TMac utilizing 1, 2, 3, and 4 modes of matricization on the 4-way dataset. We see that TMac can recover more tensors as it uses more modes.

2.2.
Uniformly random data. This section tests the recoverability of TMac, MatComp, FaL-RTC, and SquareDeal on two datasets in which the tensor factors have uniformly random entries.
In the first dataset, each tensor had the form where C was generated by MATLAB command rand(r,r,r)-0.5 and A n by rand(50,r)-0.5. Each tensor in the second where C was generated by MATLAB command rand(r,r,r,r)-0.5 and A n by rand(20,r)-0.5. Figure 2.4 shows the recoverability of each method on the first dataset and Figure 2.5 on the second dataset. We see that TMac with both rank-fixing and rank-increasing strategies performs significantly better than the other compared methods. In the first dataset, each tensor had the form where C was generated by MATLAB command rand(r,r,r) and A n by orth(randn(50,r))*diag([1:r].^(-0.5)).
Note that the core tensor C has nonzero-mean entries and each factor matrix has power-law decaying singular values. This kind of low-rank tensor appears more difficult to recover compared to the previous random low-rank tensors. Each tensor in the second dataset had the form M = where C was generated by MATLAB command rand(r,r,r,r) and A n by orth(randn(20,r))*diag([1:r].^(-0.5)). For these two datasets, TMac with rank-decreasing strategy can never decrease r n to the true rank and thus performs badly. Figure 2.6 shows the recoverability of each method on the first dataset and Figure 2.7 on the second dataset. Again, we see that TMac with both rank-fixing and rank-increasing strategies performs significantly better than the other compared methods.
3. Algorithm. We apply the alternating least squares method to (1.2). Since the model needs an estimate of rank(M), we provide two strategies to dynamically adjust the rank estimates.
be the objective of (1.2). We perform the updates as Note that both (3.2a) and (3.2b) can be decomposed into N independent least squares problems, which can be solved in parallel. The updates in (3.2) can be explicitly written as where A † denotes the Moore-Penrose pseudo-inverse of A, Ω c is the complement of Ω, and we have used the fact that P Ω c = 0 in (3.3c).
No matter how X n is computed, only the products X n Y n , n = 1, . . . , N , affect Z and thus the recovery M. Hence, we shall update X in the following more efficient way which together with (3.3b) gives the same products X k+1 n Y k+1 n , ∀n, as those by (3.3a) and (3.3b) according to the following lemma, which is similar to Lemma 2.1 in [26]. We give a proof here for  Then first line of (3.5) where we have used (GH) † = H † G † for any G, H of appropriate sizes in the second equality. On the = CVΣU UΣV C CVΣU Hence, we have the desired result (3.5).
If they are fixed, the a good estimate is important for (1.2) to perform well. Too small r n 's can cause underfitting and a large recovery error whereas too large r n 's can cause overfitting and large deviation to the underlying tensor M. Since we do not assume the knowledge of rank(M), we provide two schemes to dynamically adjust the rank estimates r 1 , . . . , r N . In our algorithm, we use parameter ξ n to determine which one of the two scheme to apply. If ξ n = −1, the rank-decreasing scheme is applied to r n ; if ξ n = 1, the rank-increasing scheme is applied to r n ; otherwise, r n is fixed to its initial value.
3.2.1. Rank-decreasing scheme. This scheme starts from an input overestimated rank , i.e., r n > rank n (M). Following [12,26], we calculate the eigenvalues of X n X n after each iteration, which proposed in [17]. (b) the matrix completion solver LMaFit [26] solves (2.1), which is a non-convex variant of (1.6). which means a "big" gap between λr n and λr n +1 , then we reduce r n tor n . Assume that the SVD of X n Y n is UΣV . Then we update X n to Ur n Σr n and Y n to V rn , where Ur n is a submatrix of U containingr n columns corresponding to the largestr n singular values, and Σr n and Vr n are obtained accordingly.
We observe in our numerical experiments that this rank-adjusting scheme generally works well for exactly low-rank tensors. Because, for these tensors, gap n can be very large and easy to identify, the true rank is typically obtained after just one rank adjustment. For approximately low-rank tensors, however, a large gap may or may not exist, and when it does not, the rank overestimates will not decrease. For these tensors, the rank-increasing scheme below works better.

Rank-increasing scheme.
This scheme starts an underestimated rank, i.e., r n ≤ rank n (M).
Following [26], we increase r n to min(r n + ∆r n , r max n ) at iteration k + 1 if which means "slow" progress in the r n dimensional space along the n-th mode. Here, ∆r n is a positive integer, and r max n is the maximal rank estimate. Let the economy QR factorization of (Y k+1 n ) be QR. We augment Q ← [Q,Q] whereQ has ∆r n randomly generated columns and then orthonormalize Q.
Next, we update Y k+1 n to Q and X k+1 n ← [X k+1 n , 0], where 0 is an I n × ∆r n zero matrix 5 . Numerically, this scheme works well not only for exactly low-rank tensors but also for approximately low-rank ones. However, for exactly low-rank tensors, this scheme cause the algorithm to run longer than the rank-decreasing scheme. Figure 3.1 shows the performance of our algorithm equipped with the two rank-adjusting schemes. As in section 2.1, we randomly generated M of size 50 × 50 × 50 with rank n (M) = r, ∀n, and we started TMac with 25% overestimated ranks for rank-decreasing scheme and 25% underestimated ranks for rank-increasing scheme. From Figure 3.1, we see that TMac with the rank-decreasing scheme is better when r is small while TMac with the rank-increasing scheme becomes better when r is large. We can also see from the figure that after r n is adjusted to match rank n (M), our algorithm converges linearly.
3.3. Pseudocode. The above discussions are distilled in Algorithm 1. After the algorithm terminates with output (X, Y, Z), we use the weighted sum N n=1 α n fold n (X n Y n ) to generate the tensor M, which is usually better than Z when the underlying M is only approximately low-rank or the observations are contaminated by noise, or both. if stopping criterion is satisfied then Output (X k+1 , Y k+1 , Z k+1 ). Remark 3.1. In Algorithm 1, we can have different rank-adjusting schemes, i.e., different ξ n , for different modes. For simplicity, we set ξ n = −1 or ξ n = 1 uniformly for all n in our experiments. X n (X n Y n − Z (n) ) = 0, n = 1, . . . , N, Our main result is summarized in the following theorem. namely, the KKT conditions in (4.2) are satisfied in the limit.
Our proof of Theorem 4.1 mainly follows [26]. The literature has work that analyzes the convergence of alternating minimization method for non-convex problems such as [4,24,28]. However, to the best of our knowledge, none of them implies our convergence result.
Before giving the proof, we establish some important lemmas that are used to establish our main result. We begin with the following lemma, which is similar to Lemma 3.1 of [26].
We also need the following lemma.
This completes the proof.
According the Lemma 4.3, it is not difficult to get the following corollary. We are now ready to prove Theorem 4.1.
Proof of Theorem 4.1. According to (3.2c), we have P Ω (Z k ) = B, and P Ω c (Z k ) = P Ω c ( n α n · fold n (X k n Y k n )). Hence, (4.2c) and (4.2d) hold as T = T k for all k. From Lemma 4.2, it follows that In addition, it is not difficult to verify Summing up (4.10) and (4.11) and observing that f is lower bounded by zero, we have and thus For each n and k, let the economy SVDs of X k n and Y k n are X k which together with (4.12) gives , then right multiplying (Y k n ) on both sides of (4.13a) yields which indicates that (4.2a) is satisfied in the limit. From (4.12) and (4.13b), we have and thus (4.2b) is satisfied in the limit. This completes the proof.

Numerical experiments.
This section tests Algorithm 1, TMac, for solving (1.2). To demonstrate its effectiveness, we compared it with MatComp and FaLRTC (see section 2) on real-world data.
5.1. Dynamic weights and stopping rules. The parameters α 1 , . . . , α N in (1.2) were uniformly set to 1 N at the beginning of TMac. During the iterations, we either fixed them or dynamically updated them according to the fitting error The smaller fit n (X n Y n ) is, the larger α n should be. Specifically, if the current iterate is (X k , Y k , Z k ), we set As demonstrated below, dynamic updating α n 's can improve the recovery quality for tensors that have better low-rankness in one mode than others. TMac was terminated if one of the following conditions was satisfied for some k where tol is a small positive value specified below. The condition (5.2) checks the relative change of the overall fitting, and (5.3) is satisfied if the weighted fitting is good enough.

MRI data.
This section compares TMac, MatComp, and FaLRTC on a 181 × 217 × 181 brain MRI data, which has been used in [15]. The data is approximately low-rank: for its three mode unfodings, the numbers of singular values larger than 0.1% of the largest one are 28, 33, and 29, respectively. One slice of the data is shown in Figure 5.1. We tested all three methods on both noiseless and noisy data 6 . Specifically, we added scaled Gaussian noise to the original data to have We ran all the algorithms to maximum 1000 iterations. The stopping tolerance was set to tol = 10 −3 for TMac and LMaFit. For FaLRTC, tol = 10 −4 was set since we found 10 −3 was too loose.
Both TMac and LMaFit used the rank-increasing strategy. For TMac, we initialized r n = 5 and set ∆r n = 3, r max n = 50, ∀n, and for LMaFit, we set initial rank K = 5, increment κ = 3, and maximal rank K max = 50. We tested TMac with fixed parameters α n = 1 3 , n = 1, 2, 3, and also dynamically updated ones by (5.1) starting from α 0 n = 1 3 , n = 1, 2, 3. The smoothing parameter for FaLRTC was set to its default value µ = 0.5 and weight parameters set to α n = 1 3 , n = 1, 2, 3. Table 5.1 shows the average relative errors and running times of five independent trials for each setting of σ and SR. Figure   5.1 shows one noisy masked slice and the corresponding recovered slices by different methods with the setting of σ = 0.05 and SR = 10%. From the results, we see that TMac consistently reached lower relative errors than those by FaLRTC and cost less time. MatComp used the least time and could achieve low relative error as SR is large. However, for low SR's (e.g., SR=10%), it performed extremely bad, and even we ran it to more iterations, say 5000, it still performed much worse than TMac and FaLRTC. In addition, TMac using fixed α n 's worked similarly well as that using dynamically updated ones, which should be because the data has similar low-rankness along each mode.

Hyperspectral data.
This section compares TMac, MatComp, and FaLRTC on a 205 × 246 × 96 hyperspectral data, one slice of which is shown in Figure 5.2. This data is also approximately low-rank: for its three mode unfoldings, the numbers of singular values larger than 1% of the largest one are 19,19, and 4, respectively. However, its low-rank property is not as good as that of the 6 For noisy case, it could be better to relax the equality constraints in (1.2), (1.3), and (1.5) to include some information on the noise level. However, we did not assume such information, and these methods could still work reasonably. For each setting of σ and SR, we made 5 independent runs. Table 5.2 reports the average results of each tested method, and Figure 5.2 shows one noisy slice with 90% missing values and 5% Gaussian noise, and the corresponding recovered slices by the compared methods. From the results, we see again that TMac outperformed FaLRTC in both solution quality and running time, and MatComp gave the largest relative errors at the cost of least time. In addition, TMac with dynamically updated α n 's worked better than that with fixed α n 's as SR was relatively large (e.g., SR=30%, 50%), and this should be because the data has much better low-rankness along the third mode than the other two. As SR was low (e.g., SR=10%), TMac with dynamically updated α n 's performed even worse. This could be explained by the case that all slices might have missing values at common locations (i.e., some mode-3 fibers were entirely missing) as SR was low, and in this case, the third mode unfolding had some entire columns missing. In general, it is impossible for any matrix completion solver to recover an entire missing column or row of a matrix, and thus putting more weight on the third mode could worsen the recovery. That also explains why MatComp gave much larger relative errors than those by FaLRTC but the slice recovered by MatComp looks better than that by FaLRTC in Figure 5.2. Note values larger than 1% of the largest one are about 10 50, 50, and 24. During each run, the three channel tensors had the same index set of observed entries, which is the case in practice, and we recovered each channel independently. We set α n = 1 3 , n = 1, 2, 3 for FaLRTC and TMac, while the latter was tested with both fixed α n 's and dynamically updated one by (5.1). All the other parameters of the test methods were set as the same as those in the previous test. The average results of 5 independent runs were reported in Table 5.3 for the grayscale video and Table 5.4 for the color video. Figure 5.3 shows one frame of recovered grayscale video by each method and Figure 5.4 one frame of recovered color video. From the tables, we see that the comparisons are similar to those for the previous hyperspectral data recovery. 7 http://www.ugcs.caltech.edu/∼srbecker/escalator data.mat 8 http://media.xiph.org/video/derf/ The original video has 500 frames, and we used its first 150 frames in our test. 9 One can also treat the color video as a 4th-order tensor. However, we found that recovering a 4th-order tensor cost much more time than recovering three 3rd-order tensors and made no quality improvement. 10 More precisely, the numbers are (51, 53, 24), (48, 51, 24), and (49, 52, 24) respectively for three channel tensors. Grayscale video: one original frame, the corresponding frame with 70% pixels missing and 5% Gaussian noise, and the recovered frames by different tensor completion methods.

Original
70% masked and 5% noise MatComp FaLRTC TMac with dynamic α n 's TMac with fixed α n 's 6. Discussions. We have proposed a new method for low-rank tensor completion. Our model utilizes low-rank matrix factorizations to all-mode unfoldings of the tensor. Synthetic data tests demonstrate that our model can recover significantly more low-rank tensors than two nuclear norm based models and one model that performs low-rank matrix factorization to only one mode unfolding.
In addition, numerical results on 3D images and videos show that our method consistently produces the best solutions among all compared methods and outperforms the nuclear norm minimization method in both solution quality and running time.
Numerically, we have observed that our algorithm converges fast (e.g., linear convergence in Figure   3.1). Papers [14,26] demonstrate that the SOR technique can significantly accelerate the progress of alternating least squares. However, we did not observe any acceleration applying the same technique to our algorithm. In the future, we will explore the reason and try to develop other techniques to accelerate our algorithm. We also plan to incorporate the objective term in (2.1) to enrich (1.2), if the underlying low-rank tensor has more than three orders.