Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions

Orthogonal nonnegative matrix tri-factorization (ONMTF) is a biclustering method using a given nonnegative data matrix and has been applied to document-term clustering, collaborative filtering, and so on. In previously proposed ONMTF methods, it is assumed that the error distribution is normal. However, the assumption of normal distribution is not always appropriate for nonnegative data. In this paper, we propose three new ONMTF methods, which respectively employ the following error distributions: normal, Poisson, and compound Poisson. To develop the new methods, we adopt a k-means based algorithm but not a multiplicative updating algorithm, which was the main method used for obtaining estimators in previous methods. A simulation study and an application involving document-term matrices demonstrate that our method can outperform previous methods, in terms of the goodness of clustering and in the estimation of the factor matrix.


Introduction
Nonnegative matrix factorization (NMF), which is a dimension reduction technique for decomposing a data matrix into two factor matrices, in both of which all entries are nonnegative, has been applied to many fields and extended to various forms (Lee andSeung 1999, 2001;Berry et al. 2007;Wang and Zhang 2013).One of best-known extensions is orthogonal NMF (ONMF), which imposes column orthogonality on one side's nonnegative factor matrix (Ding et al. 2006;Yoo and Choi 2008;Choi 2008;Yoo and Choi 2010a;Li et al. 2010;Pompili et al. 2014;Mirzal 2014;Kimura et al. 2014).Because a nonnegative column orthogonal matrix plays a role analogous to an indicator matrix in k-means clustering, and in fact one can obtain the sparse factor matrix from ONMF, it has mainly been adopted for nearest-neighbor clustering tasks such as document and term clustering (Mauthner et al. 2010;Kim et al. 2011;Wang et al. 2016).Another extended version of NMF is nonnegative matrix tri-factorization (NMTF), which decomposes a nonnegative matrix into three nonnegative factor matrices.Because constraint-free NMTF is known to be generally equivalent to NMF, some constraints are often imposed.One popular constraint is column orthogonality of the left-and right-side nonnegative factor matrices, similar to ONMF.This NMTF is referred to as orthogonal NMTF (ONMTF) (Ding et al. 2006;Yoo andChoi 2010b, 2009).Owing to the relationship between column orthogonal nonnegative factor matrices and clustering mentioned above, ONMTF is considered to be a biclustering method.The objective of biclustering is to simultaneously detect any row or column clusters of a data matrix (Govaert and Nadif 2013).The sample objects and variables are classified using data matrix at a time.It has been adopted for use in document-term clustering, collaborative filtering, etc. (Costa and Ortale 2014;Chen et al. 2009).
In ONMF and ONMTF, it is often assumed that the error is normal or Poisson distributed.However, in NMF, various algorithms have been proposed based on various error distributions, including generalized ones.Tweedie family distributions are wellknown generalized distributions for NMF (Févotte and Idier 2011;Nakano et al. 2010).It uses the index parameter β to identify the distribution, and includes normal (β = 2), Poisson (β = 1), compound Poisson (CP) (β ∈ (0, 1)), gamma (β = 0), and inverse normal (β = −1) as special cases (Jørgensen 1997;Dunn and Smyth 2001).The assumption of a Tweedie family distribution as the error distribution implies that the error criterion has the form of a β-divergence (Simsekli et al. 2013;Tan and Févotte 2013).β-divergence is a generalized divergence that includes Euclidean distance (β = 2), KL-divergence (β = 1), and Itakura-Saito divergence (β = 0) (Cichocki and Amari 2010).One of the merits of Tweedie family distributions is their flexibility with real nonnegative data.The assumption of normality for nonnegative data means that the variance is the same whenever the expected value is small or large.However, one can adjust the relationship between expected value and variance in a Tweedie distribution assumption by changing β, because variance is proportional to a power of the expected value, such as in V (y) = φ E(y) 2−β , where V (y) and E(y) are the expected value and variance, respectively, of the random variable y, and φ is the dispersion parameter in the Tweedie family distribution.For example, if we assume a CP distribution (i.e., choose β in (0,1)), it is implicitly suggested that the variance will be large when the expected value is also large.In fact, a CP distribution is defined as a Poisson mixture of gamma distributions; it is absolutely continuous on the positive axis and has a positive mass at zero, from the aspect of the generative model.In other words, a CP distributed random variable is the sum of n independent identically gamma-distributed random variables, and the number of random variables n is Poisson distributed.This assumption is familiar with a gross summation of the nonnegative values and associated with various types of real-world nonnegative data (e.g., precipitation, insurance, and purchase volume data) (Ohnishi and Dunn 2007;Smyth and Jørgensen 2002).From an NMF parameter estimation aspect, this assumption is related to robust estimation in the presence of outliers that have extremely large positive values (Li et al. 2017;Virtanen et al. 2015;Carabias-Orti et al. 2013;Weninger and Schuller 2012;Févotte et al. 2009;Virtanen 2007).
In this paper, we propose a new ONMTF method that generalizes the error distribution to a Tweedie family distribution.Our new method has two advantages.First, it facilitates interpretations from nonnegative data matrices by simultaneously detecting the clusters of row and column objects and the relationships among them.Second, the assumption of the error distribution is plausible with some nonnegative data, which leads to robust estimation against extremely large positive values.
One of the ways to develop the new method is to derive an iterative algorithm for estimating factor matrices using the same techniques as the previous ONMTF methods proposed by Ding et al. (2006) and Yoo and Choi (2010b).In both methods, the factor matrices are estimated using a multiplicative updating algorithm, in which they are iteratively updated by element-wise multiplication.However, this algorithm suffers from two drawbacks.First, column orthogonality is approximately (not precisely) obtained despite the column orthogonality constraints.Second, although the objective function value tends to be non-increasing in the early stages, it is not exactly monotonically non-increasing.These problems are caused by the difficulty of obtaining the optimal nonnegative column orthogonal factor matrices by constrained optimization.Mirzal (2014) pointed out the second drawback in the multiplicative algorithm for ONMF, and proposed a new convergent ONMF method using an additive updating algorithm; however, there is no guarantee that an orthogonal factor matrix will be obtained.Kimura et al. (2014) proposed a new ONMF method using a hierarchical alternating leastsquares algorithm; this algorithm is faster than the previous multiplicative algorithms, but still has these two drawbacks.On the other hand, Pompili et al. (2014) proposed a k-means like method for ONMF, which exactly maintains orthogonality and ensures monotonicity for the objective function.Pompili et al. (2014) found that the optimization problem of ONMF is similar to that of spherical k-means introduced by Banerjee et al. (2003), and referred to this problem as a weighted variant of the spherical kmeans (WVSK).This is an ONMF method, not ONMTF, and its error distribution is normal.Thus, we extend it to the ONMTF methods based on normal and Tweedie family distributions.Of course, our ONMTF method guarantees orthogonality and the monotonically non-increasing property of the objective function value.
Several biclustering methods based on k-means have been proposed in the literature as double k-means (Vichi 2001;Wang et al. 2011;Van Mechelen et al. 2004).These methods consider the following hard clustering problem: each object belongs to exactly one cluster, and only belongs or does not belong, like 0 or 1.Although we also consider a hard clustering problem, our proposed methods identify a membership degree.In that sense, our methods are somewhat more flexible than the double k-means.Li and Peng (2005) relaxed the hard double k-means method using orthogonal constrained factor matrices instead of indicator matrices.However, they did not impose nonnegativity constraints on these matrices.Moreover, these former double k-means approaches did not consider non-normal error distributions.The contribution of our work is the proposal of flexible hard bi-clustering approaches that consider the nonnegativity constraints and robust estimation with respect to extremely large positive values.
First, we introduce ONMTF-N, our new ONMTF method based on normal error distribution; this method is an ONMTF version of ONMF by Pompili et al. (2014).We then introduce two other new ONMTF methods (ONMTF-P and ONMTF-CP), which are respectively based on Poisson and CP error distributions.Poisson distribution is often used for integer data, e.g., count data, especially in the linear regression framework on the assumption that the expected value and the variance are equal; thereby, ONMTF-P can also be applied for count data, such as a contingency table.On the other hand, ONMTF-CP can be applied to nonnegative data that are collected as a summation of nonnegative values, such as a matrix containing the purchase volumes of individual customers in a store.To the best of our knowledge, ONMTF based on a CP distribution has not been proposed in any previous studies.Our two simulation studies demonstrate the increased consistency of ONMTF-N (compared to ONMTF based on the multiplicative algorithm), and the robustness of ONMTF-CP.In addition, we apply our ONMTF to document-term matrix to examine its goodness of clustering.
Because the previous methods are not relatively accurete, as shown in a simulation described later on, the estimates obtained by these methods can result in misinterpretations.However, our methods are more reliable because the estimates obtained by our methods are more accurate as compared to the previous methods.Moreover, our methods, especially ONMTF-P and ONMTF-CP, can be applied to real-world data because nonnegative data cannot be assumed to have a normal distribution; for example, purchase volume data contains a few extremely large positive values, which indicates that a large volume was purchased by only a few people.In such situations, a right-tailed distribution is more appropriate than a normal distribution.Therefore, our methods can significantly contribute in solving problems encountered in real-world data.
The notation employed in this paper is as follows.A matrix is represented in uppercase (bold type), e.g., M; its i, j element is represented in lowercase, e.g., m i j .An element of a complicated matrix is expressed as [•] i j .Further, m i and m ( j) are column vectors with elements m i j of the i-th row and the j-th column of M, respectively.We use the prime symbol and '−1' to express a transpose matrix and inverse matrix, e.g., M and M −1 , respectively.The trace and diagonal parts of a square matrix M are denoted by tr(M) and diag(M), respectively.The Euclidean norm of a matrix or vector is represented as M = tr(M M).D M is a diagonal matrix in which each diagonal element is m ( j) , while Δ(M) is the vector with the absolute values of elements in eigenvector v with the largest eigenvalue of square matrix M: Mv = λv.Finally, R n× p + is a set of n × p nonnegative matrices.We introduce herein our bi-clustering problem as a motivation for our ONMTF models, which are presented in later sections, and discuss the connection between the bi-clustering problem and ONMTF.Let y i j be the i, j element of a nonnegative data matrix Y ∈ R n× p + and R = {r 1 , . . ., r n } and C = {c 1 , . . ., c p } be the sets of the n row and p column objects, respectively.One of our aims is to detect a k-partition R = {R 1 , . . ., R k } of R and an -partition C = {C 1 , . . ., C } of C, where R m and C q are the m-and q-th classes of R and C, respectively.Classes R m and C q are defined as sets of row and column objects, respectively, and R and C are sets of k and disjoint non-empty classes that cover R and C, respectively.This definition implies that an object that belongs to one class does not belong to any other classes (i.e., when r i ∈ R m then r i / ∈ R m * for all m * = m and when c j ∈ C q then c j / ∈ C q * for all q * = q).We refer to R m as an "m-th row cluster" and C q as a "q-th column cluster."We consider the so-called membership of objects for a cluster to which each of the objects belong to.Let F = ( f im ) be the n ×k membership matrix of row objects such that f im > 0 when r i ∈ R m , while f im = 0 when r i / ∈ R m .Let A = (a jq ) be the p× membership matrix of column objects such that a jq > 0 when c j ∈ C q , while a jq = 0 when c j / ∈ C q .Note that this definition leads to the orthogonality of F and A: f (m) f (m * ) = 0 for all m * = m and a (q) a (q * ) = 0 for all q * = q.If f (m) and a (q) for all m, q have the unit length, the orthogonality is changed to orthonormality: F F = I k and A A = I .We also consider a relationship between row and column clusters.Let S = (s mq ) be a k × matrix such that s mq > 0 for all entries.This study aims to estimate the best unknown parameters θ = {R, C, F, A, S} by given Y .We consider the following approximation problem for this aim: the best θ is obtained such that for all i, j there exist m : R m r i and q : C q c j such that y i j ≈ x i j = f im s mq a jq .From the definition of the membership matrices F and A, we can rewrite the approximation problem as y i j ≈ x i j = n i=1 p j=1 f im s mq a jq .In matrix forms, we can describe it as Y ≈ X = F S A .Note that F and A have orthogonality, and all entries in all matrices are nonnegative.Hence, the approximation problem is equivalent to the ONMTF problem.We consider herein that the numbers of classes k and are chosen in advance.

Orthogonal NMTF based on normal error distribution
In this section, a new method for NMTF, namely, ONMTF-N, is introduced.This method is based on the WVSK algorithm proposed by Pompili et al. (2014); it is a fundamental algorithm for ONMTF methods described in the following sections, where it is assumed that data follows Poisson and CP distributions.
To formulate the bi-clustering problem as the approximation problem described in Sect. 1, we define the objective function by the following Euclidean distance: and define ONMTF-N as the following optimization problem with respect to θ : We set the two orthogonal constraints for F and A according to the discussion in Sect. 1.This optimization problem is formally derived from a maximum likelihood (ML) problem under the assumption of normality: y i j ∼ N (x i j , σ 2 ) for all i, j independently, where N (μ, σ 2 ) is a normal distribution with mean μ and variance σ 2 , which is why we named the method "ONMTF-N."However, this assumption is incompatible because we consider that the given data matrix only contains nonnegative entries.For this issue, we start with (2) to describe the ONMTF-N problem.The explanation on the other two methods that will be described later starts with the ML problem.
The objective function ( 1) is invariant under the changing of length of each row vector of S because the following is satisfied: where D S is the k × k diagonal matrix in which each diagonal element is s m (m = 1, . . ., k), F * = F D S , and is satisfied and f * (m) (m = 1, . . ., k) no longer has to be a unit length.According to (3) and ( 4), the optimization problem (2) is the same as (5) The parameters are updated using alternating optimization.First, we derive the update equation of F given R, S, and A. Note that (1) can be expanded such that Hence, the minimizer of f im for (6) is Substituting ( 7) into (6) and rearranging the terms proportional to the parameters, we obtain Therefore, the problem of minimizing Q N is the same as the problem of minimizing Q * N .Note that (8) can be observed to be similar to the objective function of WVSK when A y i (i = 1, . . ., n) and s m (m = 1, . . ., k) are considered as n data vectors and k normalized centroids in R + , respectively.In addition, when all n data vectors are normalized (i.e., A y i 2 = 1 (i = 1, . . ., n)), ( 8) is observed to be the same as the objective function of spherical k-means in R + .Given S and A, the minimizers of R for Q * N are derived by, e.g., a k-means algorithm such that Because ( 8) can be rewritten such that where The minimizer of S given R and A can be obtained by an eigenvector of A Y m Y m A as follows: Finally, the minimizer of F given R, S, and A can be expressed from (7) as If we regard the model as Y ≈ AS F , the update equations of C, S, and A can be derived similarly to ( 9), (11), and (12) as follows: where Y (q) (q = 1, . . ., ) is an n × |C q | matrix consisting of the column vectors y ( j) , such that c j ∈ C q .From ( 9), ( 11), ( 12), ( 13), ( 14) and ( 15), we can formulate an alternating optimization algorithm for model ONMTF-N as Algorithm 2.1 given below: Note that lines 7 and 8 mean the normalization of the length of each column vector of F and the modification of the length of each row vector of S maintaining the objective function value, respectively.These two steps are needed because the update of C q in step 9 is under the assumption of F F = I k .This holds for steps 12 and 13.
15: until the difference between the previous and the current Q is less than τ .
In some former double k-means approaches (e.g., Vichi (2001); Wang et al. (2011); Li and Peng (2005)), the middle factor matrix S is updated such that S = F Y A or S = (F F) −1 F Y A( A A) −1 .We can utilize this update in our methods, and it may lead to a better performance of the estimation accuracy or the computation time.However, we adopt the two-time update described in Algorithm 2.1 to emphasize ONMTF-N as an expansion of the WVSK (Pompili et al. 2014).This expansion provides the idea of ONMTF-P and ONMTF-CP introduced in the sections that follow.Approaches for updating S within the algorithm will be investigated in the future.

Orthogonal NMTF based on Poisson error distribution
In this section, we introduce a new ONMTF method, namely, ONMTF-P.It is a modified version of ONMTF-N described in Sect.2, and it assumes that data follow a Poisson distribution.Although a multiplicative updating algorithm for NMTF under this assumption was proposed by Yoo and Choi (2009), the orthogonal constraints were not imposed on it.In contrast, our algorithm is not based on a multiplicative updating algorithm but on the WVSK algorithm, and the orthogonality constraints are imposed on it.Unfortunately, we only provide the model and the algorithm of ONMTF-P because of space limitations.For the derivation of the update equations, please see Abe H, Yadohisa H (2017) Supplementary material to "Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions.".
In formulating the bi-clustering problem described in Sect. 1 , we consider y i j as the independent Poisson distributed random variables with density: where x i j = n i=1 p j=1 f im s mq a jq .The objective function to be minimized is derived from ( 16) by an ML procedure as follows: where d KL (•, •) is KL-divergence, and "cst" is a constant with respect to θ to be estimated.Hence, the optimization problem of ONMTF-P is The two constraints at the bottom are from the bi-clustering problem described in Sect. 1.The procedure to solve the optimization problem (18) of ONMTF-P is completed as Algorithm 3.1.Here, the λ m (m = 1, . . ., k) and λ * q (q = 1, . . ., ) are defined as:

Orthogonal NMTF based on CP error distribution
In this section, we introduce the other new method for ONMTF, namely, ONMTF-CP, where it is assumed that data follows a CP distribution.This method is also based on the WVSK algorithm, as in the case of ONMTF-N and ONMTF-P described in Sects. 2 and 3, respectively.ONMTF-CP has a hyperparameter β that determines the robustness of estimation against the extremely large positive values.It is noteworthy that we derive a new auxiliary function for updating the middle factor matrix S.
In formulating the bi-clustering problem described in Sect. 1, we consider y i j as the independent CP distributed random variables with density: Algorithm 3.1 ONMTF-P 1: Input Y and stopping threshold τ .2: Initialize A and S as satisfying a (q) a (q * ) = 0 (q = q * ).
3: repeat 11: until the difference between the previous and the current Q is less than τ .
where x i j = n i=1 p j=1 f im s mq a jq , φ is a dispersion parameter, and β ∈ (0, 1) is an index parameter that determines the shape of the distribution.The objective function to be minimized is derived from (20) by a ML procedure as follows: where d β (•, •) is a β-divergence with parameter β that corresponding to the index parameter of CP (20).Hence, the optimization problem of ONMTF-CP is The two constraints at the bottom are from the bi-clustering problem described in Sect. 1.The parameters are updated using alternating optimization.First, we derive the update equation of F given R, S, and A. Note that (21) can be rewritten as Hence, the minimizer of f im for ( 23) is Substituting ( 24) into ( 23) and rearranging the terms proportional to the parameters, we obtain Therefore, the problem of minimizing Q CP is the same as the problem of minimizing Q * CP .Given S and A, the minimizers of R for Q * CP are derived by, e.g., a k-means algorithm such that Then, we derive the optimal S for Q * CP (R, S) given R and A. Note that ( 25) is rewritten as It is difficult to directly obtain the minimizer of s mq for ( 27) because the summation of s mq is in the power function.However, we can obtain the optimal s mq using the auxiliary function method, as in the case of ONMTF-P.First, we can find that the following bivariate function is involved in equation ( 27).We need the following corollary to obtain the appropriate inequality for (28): Proof The first-and second-order differentials of f (x, y) are as follows: From ( 31), (32), and (33), we have f x x (x, y) < 0, f yy (x, y) < 0, and f x x (x, y) f yy (x, y) − { f xy (x, y)} = 0. (34) Hence, the function f (x, y) is concave.
Therefore, we have for any λ and η with equality if and only if x = λ and y = η.From this inequality, we obtain the following auxiliary function of (27) for s mq : with where s * mq is the current value of s mq .Let Q * CP (S) be the same function as Q * CP (R, S).We then have: for any S and S * = (s * mq ), and if we can find Ŝ = argmin S Q aux CP (S, S * ), we have Hence, the minimizer of S for ( 36) is at least the optimal S, which is monotonically non-increasing for (27).We obtain the following update equation of s mq as a minimizer for Q aux CP (S, S * ): Finally, the minimizer of F given R, S, and A can be expressed from (24) as follows: If we regard the model as Y ≈ AS F , the update equations of C, S, and A can be derived similarly to (26), (40), and (41) as follows: A novelty of this paper is the derivation of the updating equation for S using the auxiliary function obtained by an inequality of the bivariate concave function in ONMTF-CP.To the best of our knowledge, the corollary on the inequality and its proof have not yet been presented in previous studies.
11: until the difference between previous and current Q is less than τ .

Some issues
The proposed ONMTF methods introduced earlier are based on the k-means algorithm from the row and column sides.In other words, these methods have the disadvantages of double k-means clustering: initialization, local minima, empty clusters.As for the initialization, we randomly assign each column object to a cluster in C and use an exponential random number as an initial value of a jq (c j ∈ C q ) for all j and all entries of S. Note that the initial R and F are not randomly given, but updated by the randomly initialized C, A, and S. Concerning the local minima and empty clusters, it is recommended to run these algorithms with efficiently large random starts.We restart the update iteration from another random start if the empty clusters occur.We select the estimates with the least objective function value from the non-empty cluster estimates.Any initialization with the k-means (e.g., Xue et al. (2008)) is not used in our proposed methods because of the two following reasons: 1) its initialization also needs initialization to compute and 2) its initialization does not always lead to the best estimates; therefore, we use large random starts.

Simulation studies
In this section, we describe two simulation studies.The first study compares ONMTF-N and previous ONMTF methods in terms of estimation accuracy.The second study analyzes the characteristics of the estimates given by ONMTF-N, ONMTF-P, and ONMTF-CP.

Simulation study 1: estimation accuracy of ONMTF-N
We conduct a simulation study to examine the estimation accuracy of ONMT-F-N.We compare ONMTF-N with the previous ONMTF techniques proposed by Ding et al. (2006) and Yoo and Choi (2010b).We refer to their methods as "Ding's method" and "Yoo's method," respectively.The details of the procedure are presented as follows: First, we generate synthetic data matrix as Y = F * S * Ã * + E, where F * , S * , and Ã * denote three nonnormalized true factor matrices generated as described below and E denotes the error matrix whose entries are normal random number from N (0, σ 2 ).The entries are converted to zero if y i j < 0. We generate each entries of these three nonnormalized true factor matrices F * , S * , and Ã * as follows: where E x(•) denotes an exponential distribution, μ denotes the expected value of each entry of Y , and R = { R1 , . . ., Rk } and C = { C1 , . . ., C } denote true partitions of row and column objects.These partitions are determined using discrete uniform random numbers.It is noted that the μ undoubtedly represents the expected value of y i j from the following equation: Here, we get the normalized true factor matrices, F, S, and Ã for the later step as follows: where D F * and D Ã * denote the k × k and × diagonal matrix, in which each diagonal element is f * m (m = 1, . . ., k) and ã * q (q = 1, . . ., ), respectively.Second, we execute our proposed method ONMTF-N and the previous methods (Ding's and Yoo's methods) using given Y , initialized factor matrices, τ , and ν, where τ denotes the threshold for the stopping algorithm and ν denotes the maximum number of iterative cycles.Here, we use the initial inputs R, C, F, S, and Ā for ONMTF-N which is generated in the same manner of generating true ones as described above.In this initialization, instead of the μ, ȳ = n i=1 p j=1 y i j /(np) is used in generating exponential random numbers.For Ding's and Yoo's methods, we generate initial inputs as follows: From this execution, we get the estimated parameters, that is, estimated partitions of row and column objects R = { R1 , . . ., Rk } and Ĉ = { Ĉ1 , . . ., Ĉ }, respectively, and three factor matrices F, Ŝ, and Â.Note that the factor matrices are normalized such as (49).Finally, we measure the similarity (or dissimilarity) between each of the five pairs of true and the estimated parameters.We calculate the following five indices:  1985), which is a similarity measure between two partitions of objects.ARI is 1 if the two partitions are completely the same and is close to 0 if they are different.For the three factor matrices, we calculate the mean square error (MSE).
The parameters for generating the synthetic data are set as follows: (n, p, k, ) = (1000, 600, 5, 3), σ = 1, μ = 10, τ = np × 10 −7 , ν = 1000.For more results of the other conditions, please see Abe H, Yadohisa H (2017) Supplementary material to "Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions."It is noted that the true numbers of row and column clusters, and the estimated ones, are the same as k and , respectively.We generate 100 synthetic data matrices; from among the candidates of the estimates given by 20 executions of each ONMTF, we select the best estimates for which the objective function value is minimized.The results are shown in Fig. 1.ONMTF-N has the highest ARI for row clusters (see (a)), followed by Yoo's and Ding's methods in that order.In contrast, ARIs for the column clusters (see (b)) of both Ding's and Yoo's methods are as large as those of ONMTF-N.This result suggests that all three methods can accurately detect the smaller side clusters (in our simulation, this is the column side), but two previous methods cannot accurately detect the larger side clusters.The results of the MSE for the row and column factor matrices (see the (c) and (d)) are similar to the ARI results.The MSE for the middle factor matrix (see (e)) obtained by our method is smaller than those obtained by the other two methods.Overall, from this simulation study, we can conclude that ONMTF-N provides better estimation consistency than the other two ONMTF methods.

Simulation study 2: robustness of ONMTF-CP against extremely large positive values
We conducted another simulation study to demonstrate the characteristics of the estimates given by ONMTF-N, ONMTF-P, and ONMTF-CP.As mentioned in previous sections, it is assumed that y i j follows normal, Poisson, and CP distributions, respectively, in these three ONMTF methods.These distributions belong to the Tweedie family, which is described by ( 20), and the value of β determines the distribution: it is normal if β = 2, Poisson if β = 1, and CP if β ∈ (0, 1).The index parameter β is related to the robust estimation against extremely large positive values.Figure 2 shows β-divergence, which is derived from the log-likelihood of Tweedie distributions for various β values when y = 10 or y = 100.The β-divergence around a small y is larger than that around a large y, except β = 2.This means that extremely large positive values in the data are not considered for parameter estimation when β is smaller than 2. To examine these characteristics in ONMTF, we measure the estimation accuracy of ONMTF-N, ONMTF-P, and ONMTF-CP for synthetic data matrices generated using normal, Poisson, and CP distributions of data.The accuracy is calculated using the ARI between true clusters and estimated clusters of row and column objects.We now explain how to generate a synthetic data matrix.First, we generate R, F * , C, Ã * , and S * as in Sect.5.1.Next, we generate each element of the synthetic data matrix Y as a random number from y i j ∼ T W (x i j , φ, β), where T W (x, φ, β) denotes a Tweedie distribution, and the mean x i j is the corresponding i, j element of Negative values of y i j can be generated when β = 2.In this case, y i j is converted to zero.The parameters for generating synthetic data are set as follows: (n, p, k, ) = (100, 100, 5, 5), φ = 2, β = {2, 1, 0.8, 0.5, 0.2}, μ = 10, τ = np × 10 −7 (threshold for the stopping algorithm), and ν = 1000 (the maximum number of iterative cycles.)It is noted that the true numbers of row and column clusters, and the estimated ones, are the same as k and , respectively.We generate 100 synthetic data matrices for each of five conditions.Then, from among the estimate candidates given by 20 executions, we select the best estimates, R, for which the objective function value is minimized.
We then calculate ARI( R, R) of each ONMTF.We execute ONMTF-CP for three cases: β = {0.2,0.5, 0.8}.We refer to the procedures as ONMTF-CP2, ONMTF-CP5, and ONMTF-CP8, respectively.The results are shown in Fig. 3.Note that we do not show the result of ARI for column clustering because it is very similar to that for row clustering.When β = 2 (normal), ONMTF-N has the best accuracy, followed by ONMTF-P, ONMTF-CP8, ONMTF-CP5, and ONMTF-CP2, in that order.When β = 0.5, ONMTF-N has the worst accuracy; when β = 0.2, the accuracy deteriorates in the order of ONMTF-N, ONMTF-P, and ONMTF-CP8.Because more extremely large positive values are generated from a CP distribution with small β values, these results imply that ONMTF-N, ONMTF-P, and ONMTF-CP procedures with relatively larger β values do not fit a data matrix containing some extremely large positive values.This does not mean that an ONMTF-CP procedure with a small β value is the best in any case.It may be worse for a data matrix having a normal error, as shown in the case of β = 2 in Fig. 3.However, it may be preferable to use ONMTF-CP because the inaccuracy of ONMTF-CP with small β values is smaller than that of ONMTF-N for a data matrix containing some extremely large positive values.

Applications
In this section, we describe an application involving a document-term data matrix.The data is used to compare the goodness of clustering of ONMTF-N with those of the previous ONMTF methods and other methods used for the document clustering task.In addition, we attempt to apply our methods to point-of-sale data to check the differences among ONMTF-N, ONMTF-P, and ONMTF-CP.However, the application is not described herein because of space limitations.For the application, please see Abe H, Yadohisa H (2017) Supplementary material to "Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions." We describe an application involving a matrix containing document-term data to compare the goodness of clustering of ONMTF-N with previous ONMTF, Ding's and Yoo's methods, and those of methods used previously for the document clustering task.Two reasons are considered to apply the three ONMTF methods for documentterm data: (1) the ONMTF is compatible with the document-term clustering; (2) these two previous ONMTF methods have some problems, as described above, and we are interested in its performance in real data application.In addition to the three ONMTF methods based on normal distribution, we also applied various methods described in Table 1.Double k-means (DK), LP-FNMTF (LP), and Graph modularity maximization (Mod) were applied as the base line biclustering methods.LP was proposed by Wang et al. (2011) as extended double k-means.Mod was proposed by Ailem et al. (2016) for direct maximization of bipartite graph modularity.We applied ONMTF-P (P) and ONMTF-CP (CP8, CP5, and CP2) to confirm the robust performance of them for the document clustering task.SPKM (SP) and WVSPKM (WV) were also applied for comparison with one-side clustering with biclustering.
The data matrices we used were obtained from the open data CLUTO1 website.Table 2 lists the selected data matrices and statistics.The list of datasets in Table 2 are ordered by the number of elements.The tr11, tr12, tr23, tr31, tr41, and tr45 datasets are derived from the TREC2 collections.The true categories of the documents in the tr31 and tr41 datasets are obtained by particular queries.The re0 and re1 datasets are The fbis data set is from the Foreign Broadcast Information Service data of TREC-5.The hitech is a dataset of San Jose Mercury Newspaper articles and contains documents about computers, electronics, health, medicine, research, and technology.The k1a, k1b, and wap datasets are used for the WebACE project (Boley et al. 1999) and contain web pages in various subject directories of Yahoo!. 4 Datasets k1a and k1b contain the same documents, but the true labels are different.
We conduct tf-idf conversion for all data matrices.The intuitive idea for the document clustering is to classify the documents using "unique words" that appear in each of the documents.Although we count the appearance frequency of every word in each of the documents, the counting does not represent the "uniqueness" of the word.Well used words (e.g., "run", "come", and "get") are not important for document clustering.We can add weight to words, which do not appear in many documents, by tf-idf conversion.Using tf-idf values provides a well intuitive document clustering.The row and column objects are assigned to documents and words, respectively; then, F and A can be considered as the factor matrices of documents and words, respectively.The number of document clusters (the number of columns of F, e.g., k) is set to the number of document classes provided in advance, and the number of word clusters (the number of columns of A, e.g., ) is set to 10 for all the data matrices.Although the best number of word cluster must be estimated from the dataset, we set the number to 10 in advance to avoid the time consuming computation.Note that the k= in Mod algorithm, that is , the numbers of word and document clusters are same.The convergence threshold is set as τ = 10 −5 Y 2 /2 for all the methods except for co-clustering based on graph modularity whose computation is repeated until no change is observed in its objective function value.The goodness of clustering is measured using the ARI between the given clusters and estimated clusters of the documents.It is noted that some clusters occasionally become empty in the update iteration for some methods.In this case, we restart the update iteration from another initial parameter.From among the candidates of the estimates given by 10 executions of each method, we select the best estimates for which the objective function value is minimized.The initial inputs of each method is as follows: for Ding, Yoo, N, P, and CP methods, we use initial inputs as described in Sect.5.1.For DK, we generate initial partition of row objects as described in Sect.5.1 and initial middle factor matrix (which means center of cluster) such as s mq ∼ E x( ȳ).
For SP, we generate the initial right side factor matrix (which means the normalized center of cluster) as follows: first we generate random number a jk ∼ E x(1) and then normalize square norm of each row of the A to 1.For WV, we generate the initial right side factor matrix such as a jk ∼ E x( ȳ1/2 ).For LP, we generate initial partition of row and column objects as described in Sect.5.1.For Mod, we use the default initialization procedure of the "CoclustMod" function which is distributed in Python package "coclust." 5We apply the LP in the same way described in Wang et al. (2011).We use nearest neighbor graph as affinity matrices.We try to compute with all patterns of hyperparameters: the neighborhood size k = {1, 2, . . ., 10} and each of the regu- The first, second, and third values are denoted by bold case, italics, and star "*", respectively larization parameters, α and β (which is different from β of Tweedie distribution or β-divergence), is {0.1, 1, 10, 100, 500, 1000}.We select the hyperparameter pattern with the highest average of ARI for relatively small datasets (tr23, tr12, tr11, re0, fbis, and tr45) and then we apply for the remaining datasets with the hyperparameter pattern.The selected pattern is k = 9, α = 100, and β = 500.The result is in Table 3.
Table 3 shows ARI of the 12 methods for each dataset.Yoo's method has four datasets (tr12, re0 k1a, and k1b) with the best ARI and this number is the largest amongst those of the 12 methods.The second best performance is shown in SP, which has three datasets (tr23, tr41, and hitech) with the best ARI and the method has high accuracy in many datasets.The third best performance is shown in N, Mod, and WV, each of which show the best ARI in two datasets.Although Yoo's method seems to show the best performance from the perspective of which method shows the best ARI in any of the datasets, the method has poor ARI on some datasets, e.g., re1, tr41, tr31, and hitech.In fact, Yoo's method shows higher performance than N in no more than 5 datasets of all 13 datasets.On the other hand, N is not poor for almost all datasets in the biclustering methods.However, it is poorer than one-side clustering methods, SP and WV.A possible reason for the result is as follows.Applying one-side clustering approach to document-term clustering task leads to the clustering of documents and terms into the same number of clusters; on the other hand, biclustering approach leads to clustering of the two into a different number of clusters.In the biclustering approach, a number of term clusters may be short or too much for clustering documents, and this mismatch could cause the poorer performance of N in comparison to one-side clustering methods.We also show the worse performance of P and CPs, which are robust methods, in comparison to that of other methods.The cause of this result can be robustness against the large positive values.We use a document-term matrix converted using tf-idf conversion, which strongly weights terms in a few documents.
The entries of such terms have a large positive value.The effect of the weighted entries disappear when these robust ONMTFs are used, implying that an ONMTF based on the Euclidean distance is not used, and interpretable clusters cannot be obtained.Note that good performance of the robust ONMTF can be obtained using new standardizations other than tf-idf conversion.However, we will undertake this challenge in our future works.DK and LP have the poor result in most of the datasets.The accuracy of LP can be improved by selection more appropriate hyperparameters for each dataset.
We obtain the results of the computational time, degree of approximation (measured as Euclidean distance at convergence), and convergence behavior.However, we do not report these herein because of space limitation.Please see Abe H, Yadohisa H (2017) Supplementary material to "Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions for the results." We now show the estimates given by ONMTF-N using the k1a datasets as an example to demonstrate the features from the bi-clustering aspect.The k1a dataset consists of web news documents obtained from the Reuters news service in October, 1997 (Boley 1998).The documents in k1a are labeled in advance under six categories: "business," "entertainment," "health," "politics," "sports," and "tech".Table 4 presents the estimated 10 clusters of the terms given by ONMTF-N.For comparison, we also show them by Yoo's method in Table 5.Table 6 shows the center factor matrices S that enable us to grasp the relationship between the estimated document and the term clusters.We label the document clusters herein as DC 1 to DC 6 and the term clusters as TC 1 to TC 10.
We now focus on the term clustering.Each of the term cluster obtained using the two methods could be interpreted as Table 7.We find that both methods extract similar term clusters from each other (e.g., TC 1, 2, 4, 5, and 6 of both.)However, TC 3, which is strongly related to DC 3 (see Table 6) in both methods, has different terms.TC 3 of ONMTF-N seems to mean the chart of movie by the box office and that of Yoo's method seems to mean sports.Unfortunately, the detail of document clusters cannot be shown due to space limitations, but in fact, DC 3 of ONMTF-N includes some "entertainment" documents, while that of Yoo's method includes almost all of "sports" documents.From the fact that Yoo's TC 1 includes some words related to box office , documents related to box office chart can be integrated into DC 1 in Yoo's estimates.On the other hand, there are no word cluster on sports in ONMTF-N.The "sports" documents are indeed misclassified to DC 1, which includes many entertainment documents, by ONMTF-N.Although this misclassifying of "sports" documents leads to poorer ARI of ONMTF-N than Yoo's method, ONMTF-N detects the meaningful cluster related to box office chart instead of "sports."However, ONMTF-N has irrelevantly estimated the document clusters unfamiliar with the real document labels.Moreover, ONMTF-N has some assorted term clusters that are not affected by document clustering.The difference of the two can be explained from the aspect of their estimating algorithms.In fact, ONMTF-N exhibits less freedom than Yoo's method for estimating factor matrices owing to their orthogonality.Indeed its objective function value at convergence is larger than Yoo's: ONMTF-N has 241.8; and Yoo's method has 241.1.This estimating problem can be solved by more random starts thanks to the higher computational speed of ONMTF-N in comparison to Yoo's method.

Conclusion
In this paper, we proposed a new method for ONMTF, namely, ONMTF-N, in which the objective function value is monotonically non-increasing and the orthogonality of the factor matrices is maintained.In addition, we proposed two other ONMTF methods, namely, ONMTF-P and ONMTF-CP.The main contributions of this paper are as follows.First, our simulation study and an application involving some document-term matrices indicated that ONMTF-N shows higher estimation accuracy than previous methods.Second, we derived a new auxiliary function to optimize the middle factor matrix in ONMTF-CP using an inequality of a bivariate concave function.Third, another simulation study indicated that ONMTF-CP may be robust against the effect of extremely large positive values.NMFs with orthogonality, including our methods, should be used considering the trade-off between its easy-to-understand estimates and its under fitting.An orthogonal constraint simplifies a factor matrix, thereby facilitating result interpretation.However, a factor matrix with a simplified structure leads to a poor approximation to data matrix (e.g., Y .) Two issues remain to be addressed in the future.First, cluster degeneration tends to occur in the three proposed methods.Therefore, further investigation is required to develop an approach for rapidly seeking alternative initial parameters, in order to avoid cluster degeneration.Second, it is necessary to develop an approach for estimating the best number of clusters for both row objects and column objects.

Fig. 1 Fig. 2
Fig. 1 Boxplots of five indices for the 3 methods.Each of the panels represents a ARI( R, R), b ARI( C, Ĉ), c F − F /(nk), d Ã − Â /( p ), and e S − Ŝ /(k ).The left, middle, and right boxplots represent Ding's, Yoo's, and our proposed method, respectively.The vertical axis ranges are not same because the magnitude of f im and a jq differs depending on n and p

Fig. 3
Fig.3 ARI( R, R) obtained by five ONMTF methods for five conditions.The five boxplots for ONMTF-N, ONMTF-P, ONMTF-CP8, ONMTF-CP5, and ONMTF-CP2 are lined up from left to right in that order in each of the panels -21578 text categorization test collection, distribution 1.0. 3 value on the right side of each term is the nonzero value of a jq in each term.All a jq are standardized such that the length of each column vector in A is 1.Only the terms that have top 10 values of nonzero a jq in each cluster are shown Table 5 Terms of 10 clusters in the k1b datasets by Yoo's method.The value of the right side of each of the term is the value of a jq on each of the term.All a jq are standardized such that the length of each column vector in A is 1.Only terms are shown which have top 10 values of a jq in

Table 3
ARI between the given and estimated clusters of the documents

Table 6
Middle factor matrix S of the k1a dataset

Table 7
Our interpretation for term clusters obtained by two methods