Probabilistic Latent Variable Models as Nonnegative Factorizations

This paper presents a family of probabilistic latent variable models that can be used for analysis of nonnegative data. We show that there are strong ties between nonnegative matrix factorization and this family, and provide some straightforward extensions which can help in dealing with shift invariances, higher-order decompositions and sparsity constraints. We argue through these extensions that the use of this approach allows for rapid development of complex statistical models for analyzing nonnegative data.


Introduction
Techniques to analyze nonnegative data are required in several applications such as analysis of images, text corpora and audio spectra to name a few. A variety of techniques have been proposed for the analysis of such data, such as nonnegative PCA [1], nonnegative ICA [2], nonnegative matrix factorization (NMF) [3], and so on. The goal of all of these techniques is to explain the given nonnegative data as a guaranteed nonnegative linear combination of a set of nonnegative "bases" that represents realistic "building blocks" for the data. Of these, probably the most developed is non-negative matrix factorization, with much recent research devoted to the topic [4][5][6]. All of these approaches view each data vector as a point in an N-dimensional space and attempt to identify the bases that best explain the distribution of the data within this space. For the sake of clarity, we will refer to data that represent vectors in any space as point data.
A somewhat related, but separate topic that has garnered much research over the years is the analysis of histograms of multivariate data. Histogram data represent the counts of occurrences of a set of events in a given data set. The aim here is to identify the statistical factors that affect the occurrence of data through the analysis of these counts and appropriate modeling of the distributions underlying them. Such analysis is often required in the analysis of text, behavioral patterns, and so on. A variety of techniques, such as probabilistic latent semantic analysis [7], latent Dirichlet allocation [8], and so on and their derivatives have lately become quite popular. Most, if not all of them, can be related to a class of probabilistic models, known in the behavioral sciences community as latent class models [9][10][11], that attempt to explain the observed histograms as having been drawn from a set of latent classes, each with its own distribution. For clarity, we will refer to histograms and collections of histograms as histogram data.
In this paper, we argue that techniques meant for analysis of histogram data can be equally effectively employed for decomposition of nonnegative point data as well, by interpreting the latter as scaled histograms rather than vectors. Specifically, we show that the algorithms used for estimating the parameters of a latent class model are numerically equivalent to the update rules for one form of NMF. We also propose alternate latent variable models for histogram decomposition that are similar to those commonly employed in the analysis of text, to decompose point data and show that these too are identical to the 2 Computational Intelligence and Neuroscience update rules for NMF. We will generically refer to the application of histogram-decomposition techniques to point data as probabilistic decompositions. (This must not be confused with approaches that model the distribution of the set of vectors. In our approach, the vectors themselves are histograms, or, alternately, scaled probability distributions.) Beyond simple equivalences to NMF, the probabilistic decomposition approach has several advantages, as we explain. Nonnegative PCA/ICA and NMF are primarily intended for matrix-like two-dimensional characterizations of data-the analysis is obtained for matrices that are formed by laying data vectors side-by-side. They do not naturally extend to higher-dimensional tensorial representations, this has been often accomplished by implicit unwrapping the tensors into a matrix. However, the probabilistic decomposition naturally extends from matrices to tensors of arbitrary dimensions.
It is often desired to control the form or structure of the learned bases and their projections. Since the procedure for learning the bases that represent the data is statistical, probabilistic decomposition affords control over the form of the learned bases through the imposition of a priori probabilities, as we will show. Constraints such as sparsity can also be incorporated through these priors.
We also describe extensions to the basic probabilistic decomposition framework that permits shift invariance along one or more of the dimensions (of the data tensor) that can abstract convolutively combined bases from the data.
The rest of the paper is organized as follows. Since, the probabilistic decomposition approach we promote in this paper is most analogous to nonnegative matrix factorization (NMF) among all techniques that analyze nonnegative point data, we begin with a brief discussion of NMF. We present the family of latent variable models in Section 3 that we will employ for probabilistic decompositions. We present tensor generalizations in Section 4.1 and convolutive factorizations in Section 4.2. In Section 4.3, we discuss extensions such as incorporation of sparsity and in Section 4.4, we present aspects of geometric interpretation of these decompositions.

Nonnegative Matrix Factorization
Nonnegative matrix factorization was introduced by [3] to find nonnegative parts-based representation of data. Given an M×N matrix V, where each column corresponds to a data vector, NMF approximates it as a product of nonnegative matrices W and H, that is, V ≈ WH, where W is an M × K matrix and H is a K × N matrix. The above approximation can be written column by column as v n ≈ Wh n , where v n and h n are the nth columns of V and H, respectively. In other words, each data vector v n is approximated by a linear combination of the columns of W, weighted by the entries of h n . The columns of W can be thought of as basis vectors that, when combined with appropriate mixture weights (entries of the columns of H), provide a linear approximation of V.
The optimal choice of matrices W and H are defined by those nonnegative matrices that minimize the reconstruction error between V and WH. Different error functions have been proposed which lead to different update rules (e.g., [3,12]). Shown below are multiplicative update rules derived by [3] using an error measure similar to the Kullback-Leibler divergence: where A i j represents the value at ith row and the jth column of matrix A.

Latent Variable Models
In its simplest form, NMF expresses an M × N data matrix V as the product of non-negative matrices W and H. The idea is to express the data vectors (columns of V) as a combination of a set of basis components or latent factors (columns of W). Below, we show that a class of probabilistic models employing latent variables, known in the field of social and behavioral sciences as latent class models (e.g., [9,11,13]), is equivalent to NMF. Let us represent the two dimensions of the matrix V by x 1 and x 2 , respectively. We can consider the nonnegative entries V x1x2 as having been generated by an underlying probability distribution P(x 1 , x 2 ). Variables x 1 and x 2 are multinomial random variables, where x 1 can take one out of a set of M values in a given draw and x 2 can take one out of a set of N values in a given draw. In other words, one can model V mn , the entry in row m and column n, as the number of times features x 1 = m and x 2 = n were picked in a set of repeated draws from the distribution P(x 1 , x 2 ). Unlike NMF which tries to characterize the observed data directly, latent class models characterize the underlying distribution P(x 1 , x 2 ). This subtle difference of interpretation preserves all the advantages of NMF, while overcoming some of its limitations by providing a framework that is easy to generalize, extend, and interpret.
There are two ways of modeling P(x 1 , x 2 ) and we consider them separately below.

Symmetric Factorization
Latent class models enable one to attribute the observations as being due to hidden or latent factors. The main characteristic of these models is conditional independencemultivariate data are modeled as belonging to latent classes such that the random variables within a latent class are independent of one another. The model expresses a multivariate distribution such as P(x 1 , x 2 ) as a mixture where each component of the mixture is a product of one-dimensional marginal distributions. In the case of two dimensional data such as V, the model can be written mathematically as Computational Intelligence and Neuroscience 3 In (2), z is a latent variable that indexes the hidden components and takes values from the set {1, . . . , K}. This equation assumes the principle of local independence, whereby the latent variable z renders the observed variables x 1 and x 2 independent. This model was presented independently as probabilistic latent component analysis (PLCA) by [14]. The aim of the model is to characterize the distribution underlying the data as shown above by learning the parameters so that hidden structure present in the data becomes explicit.
The model can be expressed as a matrix factorization. Representing the parameters P(x 1 |z), P(x 2 |z), and P(z) as entries of matrices W, G, and S, respectively, where one can write the model of (2) in matrix form as P = WSG, or equivalently, where the entries of matrix P correspond to P(x 1 , x 2 ) and H = SG. Figure 1 illustrates the model schematically.
Parameters can be estimated using EM algorithm. The update equations for the parameters can be written as Writing the above update equations in matrix form using W and H from (3), we obtain The above equations are identical to the NMF update equations of (1) upto a scaling factor in H. This is due to the fact that the probabilistic model decomposes P which is equivalent to a normalized version of the data V. Reference [14] presents detailed derivation of the update algorithms and comparison with NMF update equations. This model has been used in analyzing image and audio data among other applications (e.g., [14][15][16]).

Asymmetric Factorization
The latent class model of (2) considers each dimension symmetrically for factorization. The two dimensional Figure 1: Latent variable model of (2) as matrix factorization. distribution P(x 1 , x 2 ) is expressed as a mixture of twodimensional latent factors where each factor is a product of one-dimensional marginal distributions. Now, consider the following factorization of P(x 1 , x 2 ): where i, j ∈ {1, 2}, i / = j and z is a latent variable. This version of the model with asymmetric factorization is popularly known as probabilistic latent semantic analysis (PLSA) in the topic-modeling literature [7].
Without loss of generality, let j = 1 and i = 2. We can write the above model in matrix form as q n = Wg n , where q n is a column vector indicating P(x 1 |x 2 ), g n is a column vector indicating P(z|x 2 ), and W is a matrix with the (m, k)th element corresponding to P(x 1 = m|z = k). If z takes K values, W is a M × K matrix. Concatenating all column vectors q n and g n as matrices Q and G, respectively, one can write the model as where S is a N × N diagonal matrix whose nth diagonal element is the sum of the entries of v n (the nth column of V ), and H = GS. Figure 2 provides a schematic illustration of the model. Given data matrix V, parameters P(x 1 |z) and P(z|x 2 ) are estimated by iterations of equations derived using the EM algorithm: The above set of equations is exactly identical to the NMF update equations of (1). See [17,18] for detailed derivation of the update equations. The equivalence between NMF and PLSA has also been pointed out by [19]. The model has been used for the analysis of audio spectra (e.g., [20]), images (e.g., [17,21]), and text corpora (e.g., [7]).

Model Extensions
The popularity of NMF comes mainly from its empirical success in finding "useful components" from the data. As pointed out by several researchers, NMF has certain important limitations despite the success. We have presented probabilistic models that are numerically closely related to or identical to one of the widely used NMF update algorithms. Despite the numerical equivalence, the methodological difference in approaches is important. In this section, we outline some advantages of using this alternate probabilistic view of NMF.
The first and most straightforward implication of using a probabilistic approach is that it provides a theoretical basis for the technique. And more importantly, the probabilistic underpinning enables one to utilize all the tools and machinery of statistical inference for estimation. This is crucial for extensions and generalizations of the method. Beyond these obvious advantages, below we discuss some specific examples where utilizing this approach is more useful.

Tensorial Factorization
NMF was introduced to analyze two-dimensional data. However, there are several domains with nonnegative multidimensional data where a multidimensional correlate of NMF could be very useful. This problem has been termed as nonnegative tensor factorization (NTF). Several extensions of NMF have been proposed to handle multi-dimensional data (e.g., [4][5][6]22]). Typically, these methods flatten the tensor into a matrix representation and proceed further with analysis. Conceptually, NTF is a natural generalization of NMF, but the estimation algorithms for learning the parameters, however, do not lend themselves to extensions easily. Several issues contribute to this difficulty. We do not present the reasons here due to lack of space but a detailed discussion can be found in [6]. Now, consider the symmetric factorization case of the latent variable model presented in Section 3.1. This model is naturally suited for generalizations to multiple dimensions. In its general form, the model expresses a K-dimensional distribution as a mixture, where each K-dimensional component of the mixture is a product of one-dimensional marginal distributions. Mathematically, it can be written as where P(x) is a K-dimensional distribution of the random variable x = x 1 , x 2 , . . . , x K . z is the latent variable indexing the mixture components and P(x j |z) are one-dimensional marginal distributions. Parameters are estimated by iterations of equations derived using the EM algorithm and they are In the two-dimensional case, the update equations reduce to (4).
To illustrate the kind of output of this algorithm, consider the following toy example. The input P(x) was the 3-dimensional distribution shown in the upper left plot in Figure 3. This distribution can also be seen as a rank 3 positive tensor. It is clearly composed out of two components, each being an isotropic Gaussian with means at μ 1 = 11, 11, 9 and μ 2 = 14, 14, 16 and variances σ 2 1 = 1 and σ 2 2 = 1/2, respectively. The bottom row of plots shows the derived sets of P(x j |z) using the estimation procedure we just described. We can see that each of them is composed out of a Gaussian at the expected position and with the expected variance. The approximated P(x) using this mode is shown in the top right. Other examples of applications on more complex data and a detailed derivation of the algorithm can be found in [14,23].

Convolutive Decompositions
Given a two-dimensional dataset, NMF finds hidden structure along one dimension (columnwise) that is characteristic to the entire dataset. Consider a scenario where there is localized structure present along both dimensions (rows and columns) that has to be extracted from the data. An example dataset would be an acoustic spectrogram of human speech which has structure along both frequency and time. Traditional NMF is unable to find structure across both dimensions and several extensions have been proposed to handle such datasets (e.g., [24,25]). The latent variable model can be extended for such datasets and the parameter estimation still follows a simple EM algorithm based on the principle of maximum likelihood. The model, known as a shift invariant version of PLCA, can be mathematically written as [23] Computational Intelligence and Neuroscience where the kernel distribution P(w, τ|z) = 0, ∀τ / ∈R where R defines a local convex region along the dimensions of x. Similar to the simple model of (2), the model expresses P(x) as a mixture of latent components. But instead of each component being a simple product of one-dimensional distributions, the components are convolutions between a multidimensional "kernel distribution" and a multidimensional "impulse distribution". The update equations for the parameters are + τ)R(w, h + τ, τ, z)dh dw dτ. (13) Detailed derivation of the algorithm can be found in [14]. The above model is able to deal with tensorial data just as well as matrix data. To illustrate this model, consider the picture in the top left of Figure 4. This particular image is a rank-3 tensor (x, y, color). We wish to discover the underlying components that make up this image. The components are the digits 1, 2, 3 and appear in various spatial locations, thereby necessitating a "shift-invariant" approach. Using the aforementioned algorithm, we obtain the results shown in Figure 4. Other examples of such decompositions on more complex data are shown in [23]. The example above illustrates shift invariance, but it is conceivable that "components" that form the input might occur with transformations such as rotations and/or scaling in addition to translations (shifts). It is possible to extend this model to incorporate invariance to such transformations.  The derivation follows naturally from the approach outlined above, but we omit further discussion here due to space constraints.

Extensions in the Form of Priors
One of the more apparent limitations of NMF is related to the quality of components that are extracted. Researchers have pointed out that NMF, as introduced by Lee and Seung, does not have an explicit way to control the "sparsity" of the desired components [26]. In fact, the inability to impose sparsity is just a specific example of a more general limitation. NMF does not provide a way to impose known or hypothesized structure about the data during estimation.
To elaborate, let us consider the example of sparsity. Several extensions have been proposed to NMF to incorporate sparsity (e.g., [26][27][28]). The general idea in these methods is to impose a cost function during estimation that incorporates an additional constraint that quantifies the sparsity of the obtained factors. While sparsity is usually specified as the L0 norm of the derived factors [29], the actual constraints used consider an L1 norm, since the L0 norm is not amenable to optimization within a procedure that primarily attempts to minimize the L2 norm of the error between the original data and the approximation given by the estimated factors. In the probabilistic formulation, the relationship of the sparsity constraint to the actual objective function optimized is more direct. We characterize sparsity through the entropy of the derived factors, as originally specified in [30]. A sparse code is defined as a set of basis vectors such that any given data point can be largely explained by only a few bases from the set, such that the required contribution of the rest of the bases to the data point is minimal; that is, the entropy of the mixture weights by which the bases are combined to explain the data point is low.
A sparse code can now be obtained by imposing the entropic prior over the mixture weights. For a given distribution θ, the entropic prior is defined as P(θ) ∝ e −βH (θ), where H (θ) is the entropy. Imposition of this prior (with a positive β) on the mixture weights just means that we obtain solutions where mixture weights with low entropy are more likely to occur-a low entropy ensures that few entries of the vector are significant. Sparsity has been imposed in latent variable models by utilizing the entropic prior and has been shown to provide a better characterization of the data [17,18,23,31]. Detailed derivation and estimation algorithms can be found in [17,18]. Notice that priors can be imposed on any set of parameters during estimation.
Information theoretically, entropy is a measure of information content. One can consider the entropic prior as providing an explicit way to control the amount of "information content" desired on the components. We illustrate this idea using a simple shift-invariance case. Consider an image which is composed out of scattered plus sign characters. Upon analysis of that image, we would expect the kernel distribution to be a "+", and the impulse distribution to be a set of delta functions placing it appropriately in space. However, using the entropic prior we can distribute the amount of information from the kernel distribution to the impulse distribution or vice-versa. We show the results from this analysis in Figure 5 in terms of three cases -where no entropic prior is used (left panels), where it is used to make the impulse sparse (mid panels), and where it is used to make the kernel sparse (right panels). In the left panels, information about the data is distributed both in the kernel (top) and in the impulse distribution (bottom). In the other two cases, we were able to concentrate all the information either in the kernel or in the impulse distribution by making use of the entropic prior.
Other prior distributions that have been used in various contexts include the Dirichlet [8,32] and log-normal distributions [33] among others. The ability to utilize prior distributions during estimation provides a way to incorporate information known about the problem. More importantly, the probabilistic framework provides proven methods of statistical inference techniques that one can employ for parameter estimation. We point out that these extensions can work with all the generalizations that were presented in the previous sections.

Geometrical Interpretation
We also want to briefly point out that probabilistic models can sometimes provide insights that are helpful for an intuitive understanding of the workings of the model.
Consider the asymmetric factorization case of the latent variable model as given by (6). Let us refer to the normalized columns of the data matrix V (obtained by scaling the entries of every column to sum to 1), v n , as data distributions. It can be shown that learning the model is equivalent to estimating parameters such that the model P(x 1 |x 2 ) for any data distribution v x2 best approximates it. Notice that the data distributions v x2 , model approximations P(x 1 |x 2 ), and components P(x 1 |z) are all M-dimensional vectors that sum to unity, and hence points in a (M − 1) simplex. The model expresses P(x 1 |x 2 ) as points within the convex hull formed by the components P(x 1 |z). Since it is constrained to lie within this convex hull, P(x 1 |x 2 ) can model v x2 accurately only if the latter also lies within the convex hull. Thus the objective of the model is to estimate P(x 1 |z) as corners of a convex hull such that all the data distributions lie within. This is illustrated in Figure 6 for a toy dataset of 400 threedimensional data distributions.
Not all probabilistic formulations provide such a clean geometric interpretation but in certain cases as outlined above, it can lead to interpretations that are intuitively helpful.

Discussion and Conclusions
In this paper, we presented a family of latent variable models and shown their utility in the analysis of nonnegative data. We show that the latent variable models decompositions are numerically identical to the NMF algorithm that optimizes a Kullback Leibler metric. Unlike previously reported results [34], the proof of equivalence requires no assumption about the distribution of the data, or indeed any assumption about the data besides nonnegativity. The algorithms presented in this paper primarily compute a probabilistic factorization of non-negative data that optimizes the KL distance between the factored approximation and the actual data. We argue that the use of this approach presents a much more straightforward way to make easily extensible models. (It is not clear that the approach can be extended to similarly derive factorizations that optimize other Bregman divergences such as the L2 metric-this is a topic for further investigation.) To demonstrate this, we presented extensions that deal with tensorial data, shift invariances, and use priors on the estimation. The purpose of this paper is not to highlight the use of these approaches nor to present them thoroughly, but rather demonstrate a methodology which allows easier experimentation with nonnegative data analysis and opens up possibilities for more stringent and probabilistic modeling than before. A rich variety of real world applications and derivations of these and other models can be found in the references.