A Non-Convex Optimization Framework for Large-Scale Low-rank Matrix Factorization


Introduction
Matrix completion and factorization algorithms that do not require non-negativity of their components, such as principal component analysis (PCA) and independent component analysis (ICA), often do not provide interpretable decompositions. In contrast, non-negative matrix factorization (NMF) promotes a parts-based representation and has recently gained considerable attention in both the computer science and optimization communities due to its applications in data science problems [1]. By imposing non-negativity simultaneously on the data reconstruction and the basis identification, NMF has shown great potential in unravelling important structures in data across many applications, including data clustering [4] and protein sequence motif discovery [3]. NMF is based on the assumption that an m × n data matrix (dataset) X, with n data points and m features, can be represented by a small set of k basis vectors [2]. Concretely, NMF factorizes the non-negative data matrix X (m features by n data points) into two non-negative sub-matrices W and H such that X ≈ WH, where the rank k is typically much smaller than the dimensions (k ≪ m, n). There exist many different methods to solve the NMF problem. Since (1) is an approximation problem, one should choose a suitable cost function to measure the goodness of fit. The most common loss function in the literature is the squared Frobenius norm loss in (2). While other loss functions are available, the Frobenius norm is well motivated by its compatibility with the singular value decomposition (SVD). Once the cost function is defined as in (2), the next step is choosing the optimization method. Many different algorithms have been suggested; early methods mainly concentrated on multiplicative update rules. For instance, [2] proposed the following algorithm.
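To make the multiplicative update rule concrete, below is a minimal sketch of the classical Lee-Seung updates for the Frobenius loss. The function name, the `eps` guard against division by zero, and the default parameters are our illustrative choices, not the paper's implementation:

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for min ||X - WH||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H); eps avoids division by zero
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # W <- W * (X H^T) / (W H H^T)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because both updates multiply by non-negative ratios, non-negativity of W and H is preserved automatically, and the Frobenius loss is non-increasing across iterations.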
This method updates W and H individually such that the Frobenius norm loss in NMF is guaranteed to decrease at each iteration. However, since multiple matrix products must be computed at each iteration, potential computational savings may be available by modifying the algorithm. More variations of multiplicative update rules can be found in [5]. Since NMF is not jointly convex in W and H, many other optimization algorithms have been proposed, such as block-coordinate descent, projected gradient descent, and alternating non-negative least squares (ANLS). In particular, ANLS splits the problem in (2) into two convex optimization subproblems, which can be solved using standard convex optimization solvers. An important characteristic of this method is that, since the constraints are built directly into the subproblems, the limit points of any algorithm that solves the subproblems are also stationary points of the NMF problem [6]. The proposed algorithms for solving these subproblems are quite diverse. For instance, the active set method [6] and Hierarchical Alternating Least Squares (HALS) simplify the non-negative subproblem by updating each column of W and each row of H individually [7]. Similarly, much prior work has addressed these subproblems [8] by employing variants of the steepest descent algorithm for unconstrained problems. In [9,10], the so-called Rank-one Residue Iteration (RRI) algorithm for solving the NMF problem was studied. This algorithm was independently proposed by Cichocki, Zdunek, and Amari in [11], where it is named the Hierarchical Alternating Least Squares (HALS) algorithm.
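A minimal sketch of one HALS/RRI sweep over the rows of H follows (the column updates of W are symmetric). Each row has the closed-form non-negative update mentioned above; the function name and the `eps` guard are our illustrative choices, not from [11]:

```python
import numpy as np

def hals_update_H(X, W, H, eps=1e-10):
    """One HALS/RRI sweep over the rows of H for min ||X - WH||_F^2, H >= 0.

    Each row update has an explicit formula: the exact non-negative
    minimizer of the loss with all other rows held fixed.
    """
    WtX = W.T @ X          # precompute W^T X once per sweep
    WtW = W.T @ W          # precompute W^T W once per sweep
    k = W.shape[1]
    for t in range(k):
        # numer = w_t^T X - sum_{j != t} (w_t^T w_j) h_j
        numer = WtX[t, :] - WtW[t, :] @ H + WtW[t, t] * H[t, :]
        H[t, :] = np.maximum(0.0, numer / (WtW[t, t] + eps))
    return H
```

Since every row update solves its one-row subproblem exactly, the objective is non-increasing over a full sweep.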
A helpful feature of this algorithm is that the solutions to (5) and (6) have explicit formulas, which makes them easy to implement. Simulation experiments in [10,11] show that the HALS/RRI algorithm is much faster than the Lee-Seung method and also some other ANLS-based methods. Using Newton or quasi-Newton methods to solve the subproblems can yield a faster rate of convergence [12]. However, these methods require determining a suitable active set for the constraints at each iteration [13]. Another approach for solving the NMF problem is to apply the conjugate gradient (CG) method. This strategy is very fast and has global convergence. In this paper, motivated by the success of prior work on the projected gradient method and the good performance of the Barzilai-Borwein (BB) method in optimization, we propose a novel non-monotone CG method that uses the BB step size to overcome the aforementioned problems for NMF. Exploiting the fact that the objective of each NMF subproblem is a convex function whose gradient is Lipschitz continuous, we optimally minimize one matrix factor with the other fixed. In particular, we update two sequences recursively for optimizing each factor: the step size is determined by the BB method, and the search point is constructed by the CG method. This algorithm accelerates the optimization by exploiting the problem structure, and therefore converges much faster than other NMF solvers by alternately performing the optimal gradient method on the two matrix factors.
Preliminary experiments on both synthetic and real-world datasets demonstrate the efficiency of the proposed method. Face recognition experiments on popular real-world and synthetic datasets confirm the effectiveness of PNBB. In addition, we show that PNBB can be naturally extended to handle several versions of NMF that incorporate variants of the BB step size.

BB Step Size and Nonmonotone Algorithm
To begin, in this paper we deal with solving the minimization problem min_x f(x), where f(x) stands for f(W, H) for short. There are various optimization algorithms for solving this problem. In general, they compute a search direction, find a suitable step size, and move to the next point. One of the most popular choices for the search direction is generated by the Newton method or a quasi-Newton method [14]. Although these methods have superlinear convergence, they are not suitable for large-scale optimization problems, since it is difficult to compute and store the (approximate) Hessian of the function at each iteration [15]. To overcome the drawbacks of Newton-type and quasi-Newton-type methods, numerous spectral gradient methods [16] and CG methods [15] have been presented. Numerical experiments are often employed to demonstrate their efficiency in solving large-scale optimization problems. With that in mind, we make use of the CG method. In particular, at the k-th iteration the new point is computed as x_{k+1} = x_k + α_k d_k, where α_k and d_k are the step size and search direction, respectively. The search direction is obtained as d_k = −g_k + β_k d_{k−1}, where β_k is the conjugate gradient parameter and g_k = ∇f(x_k). To find α_k, traditional monotone line search approaches depart from x_k along the search direction d_k. In the exact line search method, the step size is obtained by solving the minimization problem α_k = argmin_{α > 0} f(x_k + α d_k). According to [17], this strategy is expensive and impractical when a minimization problem in multiple variables must be solved. On the other hand, inexact methods such as the Armijo condition, the Wolfe condition, and the Goldstein condition are alternative ways to find an acceptable step size α_k. For instance, the Armijo line search finds a step size α_k such that f(x_k + α_k d_k) ≤ f(x_k) + σ α_k g_k^T d_k, where σ ∈ (0, 1) is a constant. By using the fact that d_k is a descent direction, i.e.,
g_k^T d_k < 0, we can conclude that the traditional Armijo rule guarantees the monotonicity of the sequence {f_k}. While all the above-mentioned algorithms are monotone, Chamberlain et al. [18] suggested an interesting non-monotone line search technique, namely the "watchdog technique", for improving iterative algorithms. Since then, the non-monotone technique has been exploited by many researchers. In detail, their method finds a step size α satisfying the condition in (11), f(x_k + α d_k) ≤ f_{l(k)} + σ α g_k^T d_k, where the non-monotone term f_{l(k)} = max_{0 ≤ j ≤ min(k, N)} f(x_{k−j}) is defined in (12), N ≥ 0 is a fixed number, and f_{l(k)} allows the objective function to increase in some iterations while convergence is still guaranteed. Employing the non-monotone strategy with other approaches, such as Newton's method, the trust region method, and the CG method, has led to significant improvements [19]. Later on, Grippo and Sciandrone [20] proposed a non-monotone line search technique for Newton's method. However, Grippo's non-monotone technique has some drawbacks [21]: for example, the method may generate R-linearly convergent iterates for a strongly convex function that fail to satisfy the condition (11) for k sufficiently large, for any fixed bound N on the memory. Also, in some cases, the performance of the algorithm depends strongly on the choice of N. Further, in their method, the algorithm takes a constant value as the step length in the first iteration, which cannot be adjusted according to the characteristics of the objective function at the current iteration.
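The Armijo backtracking rule discussed above can be sketched as follows. The constants σ, the shrink factor, and the function name are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def armijo_step(f, x, g, d, sigma=1e-4, beta=0.5, alpha0=1.0, max_back=50):
    """Backtracking line search: find alpha such that
    f(x + alpha*d) <= f(x) + sigma*alpha*g^T d, with d a descent direction."""
    fx = f(x)
    gtd = g @ d
    assert gtd < 0, "d must be a descent direction"
    alpha = alpha0
    for _ in range(max_back):
        if f(x + alpha * d) <= fx + sigma * alpha * gtd:
            return alpha
        alpha *= beta  # shrink the trial step and retry
    return alpha
```

A non-monotone variant would simply replace `fx` with the non-monotone term f_{l(k)}, allowing occasional increases of the objective.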
Zhang and Hager [22] showed that the best convergence results are obtained by using a stronger non-monotone strategy when the iterates are far from the minimizer and a weaker non-monotone strategy when they are close to it. In this regard, Ahookhosh et al. [21] suggested a new non-monotone condition in (13), where R_k is defined by (14). The drawback of (14) is that the value of the step size in the initial steps of the inner loop is one, or close to one, which increases the number of iterations of the inner loop. To cure this issue, we apply the efficient Barzilai-Borwein (BB) step size, which is motivated by Newton's method but is Hessian-free, at almost no extra cost over the standard gradient method; the resulting algorithm often significantly outperforms the standard gradient method. In particular, to compute the BB step size we define s_k = x_k − x_{k−1} and y_k = g_k − g_{k−1}, based on which the BB step sizes are α_1 = (s_k^T s_k)/(s_k^T y_k) and α_2 = (s_k^T y_k)/(y_k^T y_k). Inspired by these BB step sizes, the extended BB step size √(α_1 α_2) is also employed in this paper. In the next section we discuss the algorithm in detail.
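The three BB step sizes above can be computed in a few lines. This is a minimal sketch under our own naming, with an `eps` guard against vanishing denominators:

```python
import numpy as np

def bb_step_sizes(x, x_prev, g, g_prev, eps=1e-12):
    """Barzilai-Borwein step sizes from s_k = x_k - x_{k-1}, y_k = g_k - g_{k-1}.

    Returns (alpha1, alpha2, alpha3): the long BB step, the short BB step,
    and the extended geometric-mean step sqrt(alpha1 * alpha2).
    """
    s = x - x_prev
    y = g - g_prev
    alpha1 = (s @ s) / (s @ y + eps)   # BB1: s^T s / s^T y
    alpha2 = (s @ y) / (y @ y + eps)   # BB2: s^T y / y^T y
    alpha3 = np.sqrt(alpha1 * alpha2)  # extended step used in this paper
    return alpha1, alpha2, alpha3
```

For a convex problem with s^T y > 0, the Cauchy-Schwarz inequality gives α_2 ≤ √(α_1 α_2) ≤ α_1, so the extended step always lies between the two classical BB steps.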

The Optimization Algorithm
In this section, we focus on the proposed algorithm for solving problems (5) and (6). As mentioned in [19], to obtain the best convergence results, we need a stronger non-monotone strategy whenever the iterates are far from the minimizer and a weaker non-monotone strategy for iterates that are close enough to it. To this end, we introduce a new parameter η_k that addresses the drawbacks mentioned in the preceding section. The parameter η_k is estimated by the algorithm at each iteration as in (15). Clearly, in the first iterations, which are far from the minimizer, the values of η_k are close to 1; on the other hand, as ‖g_k‖ goes to zero, η_k goes to zero as well. Much faster convergence is usually observed with this scheme compared to its peers, although the cost function does not decline significantly in the initial stages, which is due to the way η_k is defined. To be more concrete, η_k should be defined so that it is contingent upon the behavior of the gradient of the cost function. Therefore, it is advantageous to design the non-monotone parameter to be close to one in the initial stages of the algorithm and to vanish in the final iterations. This is the underpinning motivation of our algorithm design. Another parameter that can influence the performance of the algorithm is N, which was defined in the non-monotone term in (12). It should be noted that N is assumed fixed in almost all non-monotone optimization algorithms. In this paper, we instead consider an adaptive value of N that depends on the value of the gradient at each step; the update of N at each iteration is given in (16). Taking (16) into account, when the sequence of iterates is located in a narrow valley, that is, when the norm of the gradient is large, it is better to increase the value of N_k to avoid creeping along the bottom of a narrow curved valley. Besides, when the norm of the gradient is
moderate and satisfies the second condition, it is better not to change the value of N_k. In the last case, where the norm of the gradient is small, that is, when the current iterate is in a flat area, it is better for the algorithm to shrink the value of N_k in order to further decrease the function value. We are now in a position to focus on the design of the algorithm for updating one of the matrices (W and H); the same process can be applied to update the other matrix. Let the algorithm update the matrix W. As such, we deal with solving the problem in (17). For simplicity, we write f(W) := f_k(W). It is clear that f(W) is a convex function, and therefore the optimal point can be found. Now, in order to obtain the new point, we first need to define a quadratic form of the objective function f(W). For this purpose, we consider the quadratic form φ(U, W) of the function for any fixed U ≥ 0. It is obvious that the function φ(U, W) is strictly convex in W for any fixed U ≥ 0; therefore, its optimal solution can be obtained. At the current iteration, a new point is generated by solving the strongly convex quadratic minimization problem in (18). A short calculation yields the unique solution of (18), given in (19); therefore, the new point is constructed as in (20), where all negative entries are projected to zero by the projection operator P[·]. Finally, using the BB method, the step size α_k is computed by (21). The overall procedure, named the projected non-monotone BB (PNBB) algorithm, is outlined in Algorithm 1:

Algorithm 1 (PNBB; update of W shown, the update of H is analogous):
1: initialize W, α_0, η_0, N_0; set k ← 0
2: while not converged do
3:   compute the new point by (20)
4:   while Eq. (13) is false do
5:     shrink the step size α_k
6:   end
7:   compute α_{k+1} by (21)
8:   calculate η_{k+1} by using (15)
9:   calculate N_{k+1} by using (16)
10:  k ← k + 1
11: end
12: return W
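As a rough illustration of the projected update in (20), and not the paper's exact rule (which additionally uses the CG search point, the non-monotone test, and the BB step), a single projected gradient step on the W-subproblem can be sketched as:

```python
import numpy as np

def projected_grad_step_W(X, W, H, alpha):
    """One projected gradient step on min_{W>=0} 0.5*||X - W H||_F^2.

    The gradient with respect to W is (W H - X) H^T, and the projection
    operator P[.] clips all negative entries to zero.
    """
    grad = (W @ H - X) @ H.T
    return np.maximum(0.0, W - alpha * grad)
```

With any step size alpha ≤ 1/L, where L = ‖HH^T‖_2 is the Lipschitz constant of the gradient, this step is guaranteed not to increase the objective.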

Convergence Analysis
The aim of this section is twofold: first, we prove that the proposed non-monotone strategy yields a descent method, which follows from the structure of the non-monotone term; second, we prove the convergence of the algorithm. In doing so, we obtain a lower bound for the step size and then demonstrate that, as the number of iterations grows, the norm of the search direction goes to zero, implying convergence. To begin, we establish two assumptions under which we investigate the convergence analysis. Afterwards, we delineate some properties of the newly proposed non-monotone term, which lead us to prove that the sequence {W_k} generated by Algorithm 1 converges to a stationary point of (5). The assumptions are as follows: A1: The sequence {W_k} generated by Algorithm 1 is contained in the level set L(W_0) = {W : f(W) ≤ f(W_0)}.

A2: The gradient of f is Lipschitz continuous on the level set, i.e., there exists L > 0 such that ‖∇f(U) − ∇f(V)‖ ≤ L‖U − V‖ for all U, V ∈ L(W_0).

Next, we present some lemmas that delineate properties of the new non-monotone line search proposed in this paper.
Lemma 1. Let {W_k} be the sequence obtained by Algorithm 1. Then {f_{l(k)}} is a non-increasing sequence.
Proof. The proof follows by taking into account the definition of R_k and (12). Since ⟨∇f(W_k), D_k⟩ < 0, we can conclude (24); besides, from (12) we obtain (25).
From (24) and (25) we reach f_{l(k+1)} ≤ f_{l(k)}, which demonstrates that the sequence {f_{l(k)}} is non-increasing.
Lemma 2. For the sequence {W_k} generated by Algorithm 1 and for all k ≥ 0, we have W_k ∈ L(W_0).
Proof. With the definition of f_{l(k+1)} in mind, we have f_{k+1} ≤ f_{l(k+1)} for any k ≥ 0. Now, using the definition of R_k, we conclude that R_0 = f_0. Next, by induction, assuming W_i ∈ L(W_0) for all i = 1, 2, ..., k, we show that W_{k+1} ∈ L(W_0); this follows from Equations (12) and (23) along with Lemma 1. In order to establish convergence, the Karush-Kuhn-Tucker (KKT) optimality conditions must hold for problem (17): W ≥ 0, ∇f(W) ≥ 0, and ∇f(W) ⊙ W = 0, where ⊙ stands for the Hadamard product. Further, before discussing stationary points, we need to define the scaled projected gradient direction as follows:
where α > 0 and W ≥ 0. We are now in a position to investigate the stationary point and some facts about the scaled projected gradient direction via the following lemmas. Lemma 3. Let α ∈ (0, α_max] and W ≥ 0. Then we have: Proof. The proof is provided in [26]. Lemma 4. Assuming the first part of Lemma 3 and (A1) hold, and Algorithm 1 produces the sequence {W_k}, the sequence {f_{l(k)}} is convergent.
Proof. From Lemma 1, along with the fact that f_{l(0)} = f_0, it is straightforward to conclude that the sequence {W_{l(k)}} remains in the level set L(W_0). In addition, f(W_k) ≤ f(W_{l(k)}) shows that the sequence {W_k} remains in L(W_0). Thus, (A1) along with Lemma 1 implies that the sequence {f_{l(k)}} is convergent. Lemma 5. With (A1) and the assumption that the direction D_k satisfies the first item of Lemma 3 in mind, for the sequence {W_k} generated by Algorithm 1 we get: Proof. The proof proceeds in the same fashion as Lemma 2 of [21]. Lemma 6. Assume that (A1) holds and the direction D_k satisfies the first item of Lemma 3. Then, for the sequence {W_k} generated by Algorithm 1, we can show: Proof. The proof follows from (23) and (26), which lead to f_k ≤ R_k ≤ f_{l(k)}; then Lemma 5 completes the argument.
Lemma 7. Assume that W_k is not a stationary point of (17). Then there exists a constant, depending on L and α_max, such that the claimed bound on the search direction holds. Proof. To prove the lemma, we consider two cases. If γ_k ≥ 1, the proof is complete; so let γ_k < 1. Using the definition of γ_k and (26), we argue that the second inequality follows from the bound on the step size. Combining (29) and (28), and using the fact that ⟨∇f(W_k), D_k⟩ < 0, the claim follows. Henceforth, our next goal is to prove that Algorithm 1 is globally convergent; the following theorem summarizes this result. Theorem 8. Under assumption (A1) and the first part of Lemma 3, for the sequence {W_k} returned by Algorithm 1, we have lim_{k→∞} ‖D_k‖ = 0. Proof. To prove the theorem, we need to show that (32) holds, where θ > 0 is given by a minimum of problem-dependent constants. If γ_k ≥ 1, then (32) follows directly; now, assuming γ_k < 1, one can again verify that (32) holds. Since θ is a positive real number, (32) implies that R_k − f_{k+1} ≥ θ‖D_k‖² ≥ 0. This fact, along with Lemma 6, proves that lim_{k→∞} ‖D_k‖ = 0, which completes the proof and shows that Algorithm 1 enjoys global convergence.
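As a sanity check on the KKT conditions above, the norm of the projected gradient is a standard stationarity measure for bound-constrained problems: it vanishes exactly when W ≥ 0, ∇f(W) ≥ 0, and ∇f(W) ⊙ W = 0 all hold. A small sketch for the Frobenius-loss subproblem (the function name is ours):

```python
import numpy as np

def kkt_residual(X, W, H):
    """Projected-gradient norm for min_{W>=0} 0.5*||X - W H||_F^2.

    For entries with W > 0 (free variables) the full gradient must vanish;
    for entries on the boundary (W == 0) only a negative gradient component
    violates stationarity.
    """
    grad = (W @ H - X) @ H.T
    pg = np.where(W > 0, grad, np.minimum(grad, 0.0))
    return np.linalg.norm(pg)
```

In practice such a residual is compared against a tolerance ε to decide when to stop the inner iterations of an NMF subproblem solver.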

Experiment Results
Our aim here is to confirm the theoretical results and investigate the practical behaviour of Algorithm 1. We compare the performance of PNBB (two variants, N.M.BB2 and N.M.BB3) with the monotone Armijo line search, the non-monotone Armijo rule suggested in [19], the non-monotone algorithm suggested by Grippo [20], the monotone BB algorithm, and the non-monotone BB algorithm suggested by Ahookhosh et al. [19], using synthetic and real datasets. In what follows, we evaluate the performance of the proposed PNBB in terms of face recognition ability and the efficiency of the algorithm (error, CPU time, the number of objective function evaluations, and gradient behavior).

Implementation Details
There are several methods to compute the initial point, for example random, stochastic, random Acol, and SVD-based initialization. In order to speed up the convergence of NMF, Boutsidis and Gallopoulos [28] suggested Non-negative Double Singular Value Decomposition (NNDSVD), a method based on two SVD processes: one to approximate the initial data matrix X and the other to approximate the positive sections of the resulting partial SVD factors. However, we found empirically that it was too computationally heavy when the number of components k was fairly large (k > 100). Herein we use random and SVD initialization; the results for the random strategy are obtained by averaging 10 runs of the algorithm. In addition, we measure the accuracy of the algorithm in face recognition as Accuracy := (number of faces recognized correctly) / (total number of test faces). We also use performance profiles, a common tool for graphically comparing the efficiency of algorithms, with two measures: the number of function evaluations and the number of iterations. Hence, we use these two indices for all of the present algorithms separately. Fig. 1 shows the performance of the above algorithms relative to these metrics. We can see that the profile of the PNBB algorithm rises faster than those of the other algorithms.
We ran the next experiment on the ORL image dataset, which contains 400 facial images of 40 different people, with 10 images per person. The size of each image is 92×112 pixels, with 256 grey levels per pixel. The matrix X whose columns represent these images has 92×112 = 10,304 rows and 400 columns. In this experiment, the algorithm uses 280 images for training and 120 images for testing. We ran the algorithms on this dataset and measured the face recognition accuracy. The comparison of the accuracy results is presented in Fig. 2 for two different datasets.
In the second set of experiments, we evaluate the performance of the PNBB algorithm in terms of face recognition accuracy, the number of objective function evaluations (the number of inner iterations), CPU time, and the behavior of the gradient. We run the experiment on the ORL dataset and, to implement the algorithm, we use the SVD strategy to set the starting point and the matrix rank. In Fig. 3, we present the CPU time of PNBB versus different values of the tolerance ε on the full dataset. The results reveal that the CPU time of the Armijo line search grows quickly for small tolerance values compared to our PNBB.
The number of objective function evaluations is another important measure in optimization; indeed, this term denotes the number of inner iterations. Fewer inner iterations is a true index of an algorithm's efficiency, showing that it can reach the solution in less time. In Fig. 4, we plot a comparison of the different algorithms in terms of the number of objective function evaluations versus the tolerance. As is evident from the figure, the number of inner iterations of the proposed PNBB algorithm is less than that of Armijo by several orders of magnitude.

Figure 2: Accuracy results on the ORL and Jaffe datasets, where both the initial point and the matrix rank are obtained by the SVD approach. In this figure, "A", "B", "C", "D", "E" and "F" stand for "M-Ar.", "NM-Ar.", "M-BB2", "NM1-BB2", "NM1-BB3", "NM-BB2", respectively.

Further, in Fig. 5 we plot the gradient behaviour of our non-monotone algorithm compared with that of a monotone one versus the number of iterations. In our studies, the gradient behavior indicates the convergence speed of the algorithm. As can be seen from the figure, PNBB, as a non-monotone method, converges quickly (after around 15 iterations), while the monotone method is still zigzagging after around 30 iterations.
Finally, we turn our attention to Fig. 6, which shows the error versus the number of iterations. Here the error is defined as the difference between the original matrix X and the product WH, which is indeed the value of the objective function. This figure also reveals that the proposed PNBB outperforms the Armijo algorithm.

Conclusion
In this paper, we presented a new, efficient non-monotone BB-CG method that uses ideas from the BB method and some properties of the linear CG method for NMF problems, sequentially optimizing one matrix factor with the other fixed. We also developed a generalized Wolfe line search, which is non-monotone and avoids a numerical drawback of the original Wolfe line search. Experiments on both synthetic and real-world datasets show that PNBB outperforms existing NMF algorithms in terms of efficiency and overcomes their deficiencies. The face recognition experiments on real-world datasets confirm the effectiveness of PNBB. Since PNBB exploits a suitable non-monotone method and a proper step size, it accelerates the NMF optimization by decreasing the number of inner-loop function evaluations.

Figure 1: Performance profile for the number of function evaluations.

Figure 5: Gradient behaviour of our nonmonotone algorithm compared with that of a monotone one versus the number of iterations.

Figure 6: The error (value of the objective function) versus the number of iterations.