Profile likelihood biclustering

Biclustering, the process of simultaneously clustering observations and variables, is a popular and effective tool for finding structure in a high-dimensional dataset. A variety of biclustering algorithms exist, and they have been applied successfully to data sources ranging from gene expression arrays to review-website data. Currently, while biclustering appears to work well in practice, there have been no theoretical guarantees about its performance. We address this shortcoming with a theorem providing sufficient conditions for asymptotic consistency when both the number of observations and the number of variables in the dataset tend to infinity. This theorem applies to a broad range of data distributions, including Gaussian, Poisson, and Bernoulli. We demonstrate our results through a simulation study and with examples drawn from microarray analysis and collaborative filtering.

1. Introduction. Suppose we are given a data matrix X = [X ij ], and our goal is to cluster the rows and columns of X into meaningful groups. For example, X ij could be the log activation level of gene j in patient i; our goal is to seek groups of patients with similar genetic profiles, while at the same time finding groups of genes with similar activation levels. Alternatively, X ij can indicate whether or not user i reviewed movie j; our goal is to simultaneously cluster the users and the movies. Mirkin (1996) termed the general clustering process "biclustering," but it is also known as direct clustering (Hartigan, 1972), block modeling (Arabie, Boorman and Levitt, 1978), and co-clustering (Dhillon, 2001).
Empirical results from a broad range of disciplines indicate that biclustering data is useful in practice. For example, Ungar and Foster (1998) and Hofmann (1999) found that biclustering helps identify structure in latent class models for collaborative filtering problems where the data is sparse and diverse tastes make it difficult to cluster purely based on purchasing or review habits.
Several biclustering applications exist in the biological sciences. Eisen et al. (1998) was one of the first papers to note the benefits of clustering genes and conditions in microarray data, finding that genes with similar functions cluster together. Recently, Harpaz et al. (2010) applied biclustering methods to a Food and Drug Administration report database, identifying associations between certain active ingredients and adverse medical reactions. Several other applications of biclustering exist; see Cheng and Church (2000), Getz, Levine and Domany (2000), Lazzeroni and Owen (2002), and Kluger et al. (2003), as well as Madeira and Oliveira (2004) for a comprehensive survey.
Although their goals are the same, the references above use a variety of different biclustering algorithms. Clearly, many of these algorithms work well in practice, but they are often ad-hoc, and there are no rigorous guarantees as to their performance. In particular, the lack of any notion of consistency means that practitioners cannot be assured that their discoveries from biclustering will generalize or be reproducible; collecting more data may lead to completely different biclusters.
Our first objective in this report is to establish a probabilistic model for the data matrix, so that biclustering can be formalized as an estimation problem. Once we have done this, we study a class of biclustering algorithms based on profile-likelihood, as proposed by Ungar and Foster (1998). We show that these profile-based procedures are asymptotically consistent as the dimensions of the matrix tend to infinity, under weak assumptions on the elements of X. Notably, our methods can handle both dense and sparse data matrices. To our knowledge, this is the first general consistency result for a biclustering algorithm.
The present treatment was inspired by recent developments in clustering methods for symmetric binary networks. In that context, X is an n-by-n symmetric binary matrix, and the clusters for the rows of X are the same as the clusters for the columns of X. Bickel and Chen (2009) used methods similar to those used for proving consistency of M-estimators to derive results for network clustering when n tends to infinity. This work was later extended by Choi, Wolfe and Airoldi (2012), who allow the number of clusters to increase with n; Zhao, Levina and Zhu (2011), who allow for nodes not belonging to any cluster; and Zhao, Levina and Zhu (2012), who incorporate individual-specific effects. In parallel to this work, Rohe, Chatterjee and Yu (2011) study the performance of spectral clustering for symmetric binary networks, and Rohe and Yu (2012) study spectral methods for asymmetric binary networks.
The main contribution of our work is recognizing that methods originally developed for an extremely specialized context (symmetric binary networks) can be generalized to handle clustering for arbitrary data matrices. Using a Lindeberg-type condition, we are able to generalize the Bickel and Chen (2009) results beyond Bernoulli random variables. We have also managed to simplify the presentation of the main proof. To our knowledge, this is the first time methodologies for binary networks have been used to study general biclustering methods.
The main text of the paper is organized as follows. First Section 2 describes the theoretical setup and Section 3 presents the consistency theorem. Section 4 presents the main steps of the proof. Next, Section 5 corroborates the theoretical findings through a simulation study, and Section 6 presents an application to a microarray dataset. Finally, Section 7 presents some concluding remarks. The appendices include additional technical and empirical results.
2. Block models and profile likelihood. As above, let $X = [X_{ij}] \in \mathbb{R}^{m \times n}$ be a data matrix. One way to formalize the biclustering problem is to follow the network clustering literature and posit the existence of K row classes and L column classes, such that the mean value of entry $X_{ij}$ is determined solely by the classes of row i and column j. That is, there is an unknown row class membership vector $c \in \{1,\dots,K\}^m$, an unknown column class membership vector $d \in \{1,\dots,L\}^n$, and an unknown mean matrix $M = [\mu_{kl}] \in \mathbb{R}^{K \times L}$ such that
\[
  \mathbb{E}[X_{ij} \mid c, d] = \mu_{c_i d_j}.
\]
We refer to this model as a block model, after the related model for undirected networks proposed by Holland, Laskey and Leinhardt (1983). With a block model, our goal is to estimate c and d. We do this by assigning labels to the rows and columns of X, codified in vectors $g \in \{1,\dots,K\}^m$ and $h \in \{1,\dots,L\}^n$. Ideally, g and h match c and d. Note that we assume that the true numbers of row and column clusters, K and L, are fixed and known.
Analogously to Bickel and Chen (2009), we can employ profile likelihood (Murphy and van der Vaart, 2000) for the biclustering task. However, we cannot assume that the elements of X are Bernoulli random variables. Instead, we proceed by first proposing a simple distribution for X and deriving the maximum profile likelihood estimator for c and d under this model. Later, we consider a far more general distribution for X, and we show that the simple profile-likelihood-based estimator is still consistent, even under model misspecification.
Suppose that the elements of X are independent draws whose distribution is specified by a single-parameter exponential family. Conditional on c and d, entry $X_{ij}$ has density $g(x; \eta_{c_i d_j})$ with respect to some $\sigma$-finite measure $\nu$, where $g(x; \eta) = \exp\{x\eta - \psi(\eta)\}$, $\psi(\eta)$ is the cumulant function, and $\eta_{kl} = (\psi')^{-1}(\mu_{kl})$. With labels g and h, the complete data log-likelihood is
\[
  l(g, h; \eta) = \sum_{i=1}^{m} \sum_{j=1}^{n} \big\{ X_{ij}\, \eta_{g_i h_j} - \psi(\eta_{g_i h_j}) \big\}.
\]
Maximizing over $\eta$ profiles out the nuisance parameter. With $\bar X_{kl}(g,h)$ denoting the average of the entries in bicluster $(k,l)$, and $m_k(g)$ and $n_l(h)$ the numbers of rows with label k and columns with label l, the profile log-likelihood is
\[
  \mathrm{pl}(g, h) = \sum_{k=1}^{K} \sum_{l=1}^{L} m_k(g)\, n_l(h)\, \psi^*\big(\bar X_{kl}(g, h)\big),
\]
where $\psi^*(x) = \sup_\eta \{x\eta - \psi(\eta)\}$ is the convex conjugate of $\psi$, known as the rate function in the large deviations literature (Dembo and Zeitouni, 1998). Following the above derivation, it is natural to estimate c and d by the label vectors which maximize $\mathrm{pl}(g, h)$.
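As a concrete instance of the conjugate: for the Poisson family, $\psi(\eta) = e^\eta$, the supremum is attained at $\eta = \log x$, and so $\psi^*(x) = x \log x - x$. A quick numerical sketch (ours, for illustration only) confirms this:

```python
import numpy as np

# Numerical check that the convex conjugate psi*(x) = sup_eta {x*eta - psi(eta)}
# of the Poisson cumulant psi(eta) = exp(eta) equals x*log(x) - x.
def psi_star(x, etas):
    # crude supremum over a fine grid of eta values
    return np.max(x * etas - np.exp(etas))

etas = np.linspace(-10.0, 10.0, 200001)
for x in (0.5, 1.0, 2.0, 5.0):
    closed_form = x * np.log(x) - x
    assert abs(psi_star(x, etas) - closed_form) < 1e-3
```

The same grid computation recovers the Bernoulli and Gaussian conjugates with $\psi(\eta) = \log(1 + e^\eta)$ and $\psi(\eta) = \eta^2/2$, respectively.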
In the sequel, we consider a far more general setting. We consider criterion functions of the form
\[
  F(g, h) = \sum_{k=1}^{K} \sum_{l=1}^{L} \hat p_k(g)\, \hat q_l(h)\, f\big(\bar X_{kl}(g, h) / \rho\big),   (2.1)
\]
where $\hat p_k(g)$ and $\hat q_l(h)$ are the proportions of rows with label k and columns with label l, f is any smooth convex function, and $\rho$ is a scale parameter. Borrowing terminology from the large deviations literature, we refer to f as the rate function of the criterion; though, since we allow f to take negative values, we do not require that f be a rate function in the strictest sense. We permit the elements of X to have different distributions, allowing for heteroscedasticity and model misspecification. We show that under mild technical conditions, the maximizer of F is a consistent estimator of the true row and column classes.
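A criterion of this block-average form can be computed directly; the following sketch takes the weighted form $F(g,h) = \sum_{k,l} \hat p_k \hat q_l f(\bar X_{kl}/\rho)$ as given, which is our reading of (2.1) rather than a verbatim implementation:

```python
import numpy as np

# Sketch of a block-average criterion: a rate function f applied to the
# scaled mean of each bicluster, weighted by the bicluster's relative size.
def criterion(X, g, h, f, K, L, rho=1.0):
    F = 0.0
    for k in range(K):
        for l in range(L):
            rows, cols = g == k, h == l
            if rows.any() and cols.any():
                xbar = X[np.ix_(rows, cols)].mean()
                F += rows.mean() * cols.mean() * f(xbar / rho)
    return F

X = np.kron(np.eye(2), np.ones((2, 2)))   # 4x4 matrix with two clean blocks
g_true = np.array([0, 0, 1, 1])
h_true = np.array([0, 0, 1, 1])
g_bad = np.array([0, 1, 0, 1])            # scrambled row labels
f = lambda x: x ** 2                      # a convex rate function
# By Jensen's inequality, the true partition scores at least as high.
assert criterion(X, g_true, h_true, f, 2, 2) > criterion(X, g_bad, h_true, f, 2, 2)
```

The comparison at the end illustrates why convexity of f matters: averaging entries across heterogeneous blocks can only decrease the criterion.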
3. Consistency results. To prove consistency, we need to work with a sequence $X_n$ of data matrices. Suppose that $X_n \in \mathbb{R}^{m \times n}$ with $m = m(n)$ and $n/m \to \gamma$ for some finite constant $\gamma > 0$. Suppose that for each n there exist a row class membership vector $c_n \in \{1,\dots,K\}^m$ and a column class membership vector $d_n \in \{1,\dots,L\}^n$. Writing $\hat p_k(c)$ for the proportion of rows with class k and $\hat q_l(d)$ for the proportion of columns with class l, we assume that there exist vectors $p \in \mathbb{R}^K$ and $q \in \mathbb{R}^L$ such that $\hat p_k(c) \to p_k$ and $\hat q_l(d) \to q_l$ almost surely as $n \to \infty$, for all k and l. This assumption is satisfied, for example, if the class labels are drawn independently from a multinomial distribution. When there is no ambiguity, we omit the subscript n.
Assume that the mean of element $X_{ij}$ depends only on the row and column memberships $c_i$ and $d_j$, so that $\mathbb{E}[X_{ij} \mid c, d] = \mu_{c_i d_j}$ for some matrix $M = [\mu_{kl}] \in \mathbb{R}^{K \times L}$, possibly varying with n. To model sparsity in X, we allow M to tend to 0. To avoid degeneracy, we suppose that there exist a sequence $\rho = \rho_n$ and a fixed matrix $M_0 \in \mathbb{R}^{K \times L}$ such that $\rho^{-1} M \to M_0$. Define the normalized confusion matrices $C(g) \in \mathbb{R}^{K \times K}$ and $D(h) \in \mathbb{R}^{L \times L}$ by
\[
  C_{ak}(g) = \frac{1}{m} \sum_{i=1}^{m} 1\{c_i = a,\ g_i = k\},
  \qquad
  D_{bl}(h) = \frac{1}{n} \sum_{j=1}^{n} 1\{d_j = b,\ h_j = l\}.
\]
Entry $C_{ak}$ is the proportion of rows with class a and label k; entry $D_{bl}$ is defined similarly. These matrices are normalized so that $C^T 1 = \hat p(g)$ and $D^T 1 = \hat q(h)$. We only consider nontrivial partitions; to this end, for $\varepsilon > 0$, define
\[
  J_\varepsilon = \big\{ (g, h) : \hat p_k(g) > \varepsilon \text{ for all } k, \text{ and } \hat q_l(h) > \varepsilon \text{ for all } l \big\}.
\]
For a fixed convex rate function f, we let F be a criterion function as in (2.1).
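The normalized confusion matrix takes only a few lines to compute; the following is an illustrative sketch (function and variable names are ours):

```python
import numpy as np

# Sketch: entry (a, k) of the normalized confusion matrix is the proportion
# of rows with true class a and assigned label k.
def confusion(true_classes, labels, n_classes, n_labels):
    C = np.zeros((n_classes, n_labels))
    for a, k in zip(true_classes, labels):
        C[a, k] += 1.0
    return C / len(true_classes)

c = np.array([0, 0, 1, 1, 1])   # true row classes
g = np.array([0, 1, 1, 1, 1])   # assigned row labels
C = confusion(c, g, 2, 2)
# Row sums give class proportions; column sums give label proportions.
assert np.allclose(C.sum(axis=1), [0.4, 0.6])
assert np.isclose(C.sum(), 1.0)
```

When g recovers c exactly (up to relabeling), C is a permutation of a diagonal matrix, which is the sense of consistency used below.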
3.1. Assumptions. Denote by $\mathcal{M}_0 \subset \mathbb{R}$ the convex hull of the entries of $M_0$, and let $\mathcal{M}$ be a neighborhood of $\mathcal{M}_0$. We require the following assumptions:

(A1) The biclusters are identifiable: no two rows of $M_0$ are equal, and no two columns of $M_0$ are equal.

(A2) The rate function is locally strictly convex. That is, $f''(\mu) > 0$ when $\mu \in \mathcal{M}$.

(A3) The third derivative of the rate function is locally bounded. That is, $|f'''(\mu)|$ is bounded when $\mu \in \mathcal{M}$.

(A4) The average variance of the elements is of the same order as $\rho$.
(A5) The mean matrix does not converge to zero too quickly. That is, $\limsup_{n \to \infty} n \rho_n = \infty$.
(A6) The elements satisfy a Lindeberg condition: for all $\varepsilon > 0$,
\[
  \frac{1}{m n \rho} \sum_{i=1}^{m} \sum_{j=1}^{n} \mathbb{E}\big[ (X_{ij} - \mathbb{E} X_{ij})^2\, 1\{ |X_{ij} - \mathbb{E} X_{ij}| > \varepsilon \sqrt{m n \rho} \} \big] \to 0.
\]

Assumption (A4) is trivially satisfied for dense data, and it is satisfied for Binomial and Poisson data so long as $\rho^{-1} M \to M_0$. However, this assumption cannot handle arbitrary sparsity. For example, if the elements of X are distributed as Negative Binomial random variables, then assumption (A4) requires that the product of the mean and the dispersion parameter not tend to infinity. In other words, for sparse count data, the counts cannot be too heterogeneous.
Assumption (A5) places a sparsity restriction on the mean matrix; Zhao, Levina and Zhu (2012) used the same assumption to establish weak consistency for network clustering. A variant of Lyapunov's condition (Varadhan, 2001) is sufficient for (A6): if the scaled $(2+\delta)$ absolute central moments of the entries are uniformly bounded for some $\delta > 0$, then (A6) follows. In particular, for dense data ($\rho$ bounded away from zero), uniformly bounded $(2+\delta)$ absolute central moments for any $\delta > 0$ suffice. For certain types of sparse data (Bernoulli or Poisson data with $\rho$ converging to zero), (A5) is a sufficient condition for (A6).

3.2. Main results. Given the above setup and assumptions, the maximizer of F(g, h) is a consistent estimator of the true row and column labels.

Theorem 3.1. Suppose assumptions (A1)-(A6) hold, and let $(\hat g, \hat h)$ maximize F(g, h) over $J_\varepsilon$. Then $C(\hat g)$ and $D(\hat h)$ converge in probability to permutations of diagonal matrices; that is, the proportions of misclassified rows and columns converge to zero.
The statement of the theorem is somewhat awkward, because we can permute the row or column labels and get the same value of the criterion function, F . Thus, F has multiple maximizers, and it is only possible to recover the true classes up to a permutation of labels.
From the discussion in Section 2, it follows that maximizing the profile log-likelihood associated with a single-parameter exponential family is a reasonable biclustering procedure. Furthermore, the proof of this result does not require the distribution of the data to be correctly specified: model misspecification is allowed so long as (A1)-(A6) are satisfied. Consequently, the result applies to binary matrices, count data, and continuous data, which could reasonably be modeled as Bernoulli, Poisson, and Gaussian distributed, respectively, making it suitable for our motivating examples.
Under stronger distributional assumptions, we can use the methods of the proof to establish finite-sample results. For example, if we assume that the elements of X are Gaussian, then a finite-sample tail bound holds (Theorem 3.2). The proof of the finite-sample result follows the same outline as the asymptotic result; see Appendix B.
To prove the main theorem, we first establish that in the limit, F is close to a nonrandom "population version," G. Then, we establish that G is maximized at the true class labels. Finally, we show that outside any neighborhood around the true class labels, G is smaller than at the true values.
As alluded to in the Introduction, the main thrust of the proof is similar to that used in the literature on clustering for symmetric binary networks initiated by Bickel and Chen (2009) and extended by Choi, Wolfe and Airoldi (2012), Zhao, Levina and Zhu (2011) and Zhao, Levina and Zhu (2012). There are three main points of departure from this previous work. First, the literature on clustering for symmetric binary networks works only with Bernoulli random variables, whereas we consider arbitrary distributions that satisfy a Lindeberg-type condition. Second, these papers work with uniformly Lipschitz criterion functions, while we consider potentially unbounded criterion functions. Lastly, the results for symmetric binary networks require an additional assumption: that the criterion function is maximized at the population parameters. We show that this condition is satisfied as long as the parameters are identifiable and the rate function is strictly convex.
4. Proof of consistency theorem. In this section, we continue with the setup and notation of Section 3. To prove Theorem 3.1, we introduce some additional notation. Define
\[
  R_{kl}(g, h) = \rho^{-1} \big( \bar X_{kl}(g, h) - \mathbb{E}\, \bar X_{kl}(g, h) \big),
\]
the normalized deviation of the average over bicluster $(k, l)$ from its mean. The weak law of large numbers establishes that for fixed g and h, the convergence $R_{kl}(g, h) \stackrel{P}{\to} 0$ holds. We can prove a stronger result: this convergence is uniform over all g and h.
Lemma 4.1. Under assumptions (A1)-(A6), for all $\varepsilon > 0$,
\[
  \sup_{(g, h) \in J_\varepsilon} \max_{k, l} |R_{kl}(g, h)| \stackrel{P}{\to} 0.
\]

With Lemma 4.1 (proved in Appendix A), we can establish that, in the limit, F(g, h) is close to its "population version," which depends only on C and D. To define this population version, first, for each integer M, define $S_M$ to be the set of $M \times M$ matrices with nonnegative entries summing to one:
\[
  S_M = \big\{ S \in \mathbb{R}^{M \times M} : S_{ab} \ge 0,\ \textstyle\sum_{a,b} S_{ab} = 1 \big\},
\]
so that $C(g) \in S_K$ and $D(h) \in S_L$. Then, for $C \in S_K$ and $D \in S_L$, set
\[
  G_{M_0}(C, D) = \sum_{k=1}^{K} \sum_{l=1}^{L} (C^T 1)_k (D^T 1)_l\, f\!\left( \frac{(C^T M_0 D)_{kl}}{(C^T 1)_k (D^T 1)_l} \right).
\]

Lemma 4.2. F is close to its population version in the sense that, for all $\varepsilon > 0$,
\[
  \sup_{(g, h) \in J_\varepsilon} \big| F(g, h) - G_{M_0}(C(g), D(h)) \big| \stackrel{P}{\to} 0.
\]
This is a direct consequence of Lemma 4.1 and assumption (A3), which implies that f is locally Lipschitz; see Appendix A for details.
Once we have established that F is close to its population version, our next task is to show that the population version is maximized at the true class labels. To do this, for $\delta > 0$, first define
\[
  P_\delta = \big\{ C \in S_K : C_{ak} C_{a'k} \le \delta \text{ for all } k \text{ and all } a \ne a' \big\},
  \qquad
  Q_\delta = \big\{ D \in S_L : D_{bl} D_{b'l} \le \delta \text{ for all } l \text{ and all } b \ne b' \big\}.   (4.1)
\]
A permutation of a diagonal matrix has only one non-zero entry in each column, so taking $\delta$ close to zero restricts the confusion matrices to be close to permutations of diagonal matrices.
Lemma 4.3. Suppose $[C 1]_a \ge \eta$ for all a, $[D 1]_b \ge \eta$ for all b, and $(C, D) \notin P_\delta \times Q_\delta$. Then $G_{M_0}(C, D)$ falls below the value of $G_{M_0}$ at the true labels by at least $\kappa \eta^2 \delta$, where $\kappa$ is a constant independent of $\delta$ and $\eta$. We prove Lemma 4.3 in Appendix A.
Next, we prove the consistency theorem.
Proof of Theorem 3.1. Fix $\delta > 0$ and define $P_\delta$ and $Q_\delta$ as in (4.1). We will show that if $(g, h) \in J_\varepsilon$ and $(C(g), D(h)) \notin P_\delta \times Q_\delta$, then F(g, h) < F(c, d) with probability approaching one, uniformly over all such choices of (g, h). Since $\delta$ is arbitrary, this implies that $C(\hat g)$ and $D(\hat h)$ converge to permutations of diagonal matrices, i.e. the proportions of misclassified rows and columns converge to zero. Set
\[
  r_n = \sup_{(g, h) \in J_\varepsilon} \big| F(g, h) - G_{M_0}(C(g), D(h)) \big|,
\]
so that for any $(g, h) \in J_\varepsilon$,
\[
  F(g, h) - F(c, d) \le G_{M_0}(C(g), D(h)) - G_{M_0}(C(c), D(d)) + 2 r_n.
\]
Pick $\eta > 0$ smaller than $\min_a \{p_a\}$ and $\min_b \{q_b\}$. By assumption, the true row and column class proportions converge to p and q. Thus, for all $g \in \{1,\dots,K\}^m$ and $h \in \{1,\dots,L\}^n$ and n large enough, $[C(g) 1]_a \ge \eta$ and $[D(h) 1]_b \ge \eta$; this holds uniformly over all choices of (g, h).
Applying Lemma 4.3 to the first term on the right-hand side, we get that $F(g, h) - F(c, d) \le -\kappa \eta^2 \delta + 2 r_n$ whenever $(C(g), D(h)) \notin P_\delta \times Q_\delta$. By Lemma 4.2, $r_n \stackrel{P}{\to} 0$, so with probability approaching one the right-hand side is negative. Thus, with probability approaching one, $(C(\hat g), D(\hat h)) \in P_\delta \times Q_\delta$. Since this result holds for all $\delta$, all limit points of $C(\hat g)$ and $D(\hat h)$ must be permutations of diagonal matrices.

5. Empirical evaluation.
We study the performance of profile log-likelihood biclustering for Bernoulli, Poisson, Gaussian, and non-standardized Student's t data. We consider the following three rate functions:
\[
  f_{\mathrm{Bernoulli}}(x) = x \log x + (1 - x) \log(1 - x),   (5.1a)
\]
\[
  f_{\mathrm{Poisson}}(x) = x \log x - x,   (5.1b)
\]
\[
  f_{\mathrm{Gaussian}}(x) = x^2 / 2.   (5.1c)
\]
We use the appropriate rate function for the Bernoulli and Gaussian simulations. To study performance under model misspecification, we compare the Poisson and Gaussian rate functions on the Poisson simulations and use the Gaussian rate function for the t simulations.
For the Poisson and Gaussian rate functions (5.1b) and (5.1c), the maximizer of the criterion function (2.1) does not depend on the scale factor $\rho$. This is immediately obvious in the Gaussian case. For the Poisson case,
\[
  f_{\mathrm{Poisson}}(\mu / \rho) = \tfrac{1}{\rho} \mu \log \mu - \tfrac{1}{\rho} \mu (1 + \log \rho);
\]
when summed over all biclusters, the second term is constant in the labels, so the maximizing partition does not depend on the value of $\rho$. This is not the case for the Bernoulli rate function (5.1a), but the parameter $\rho$ is not identifiable in practice, so it does not make sense to try to estimate it. For our simulations we use $\rho = 1$ in the fitting procedure, regardless of the true scale factor for the mean matrix M. Even though assumption (A1) does not hold in the simulations for this choice of $\rho$, we still get consistency, because the maximizer of the criterion with $f_{\mathrm{Bernoulli}}(\mu)$ is close to the maximizer with $f_{\mathrm{Poisson}}(\mu)$. See Perry and Wolfe (2012) for discussion of a related phenomenon.
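The three rate functions and the Poisson scale identity above can be checked in a few lines; the forms below are the standard exponential-family conjugates, which we take to be the intended (5.1a)-(5.1c):

```python
import numpy as np

# Candidate rate functions (standard exponential-family conjugates).
def f_bernoulli(x):
    return x * np.log(x) + (1 - x) * np.log(1 - x)

def f_poisson(x):
    return x * np.log(x) - x

def f_gaussian(x):
    return 0.5 * x ** 2

# Check the algebraic identity used in the text:
#   f_poisson(mu/rho) = (1/rho) mu log(mu) - (1/rho) mu (1 + log(rho))
for mu in (0.5, 2.0, 7.0):
    for rho in (0.1, 1.0, 3.0):
        lhs = f_poisson(mu / rho)
        rhs = mu * np.log(mu) / rho - mu * (1 + np.log(rho)) / rho
        assert abs(lhs - rhs) < 1e-9
```

Since the second term of the expansion is constant across partitions, rescaling the data only shifts the Poisson criterion; the argmax is unchanged.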
To maximize the profile log-likelihood, we compute initial partitions for the rows and columns. Then, we iteratively update the cluster assignments in a greedy manner, based on the Kernighan-Lin heuristic (Kernighan and Lin, 1970) employed by Newman (2006):

1. For each row and column, we compute the optimal label assignment while keeping the labels of all other rows and columns fixed; we also record the improvement made by this assignment.

2. In order of the local improvements recorded in step 1, we perform the label reassignments determined in step 1. Note that these assignments are no longer locally optimal, since the labels of many rows and columns change during this step.

3. Out of the label configurations visited in step 2, we choose the one with the highest profile likelihood.
We iteratively perform steps 1-3 until the profile likelihood converges. The algorithm is not guaranteed to converge to a global optimum, but it performs well in practice. If implemented efficiently, each update to the log-likelihood can be done in a constant number of operations, so the most expensive part of the algorithm is the ordering of the local improvements, which on average requires O(n log n) operations with a conventional sorting algorithm. For comparison, the costs of spectral-based biclustering algorithms are at least O(n^2), the cost of computing the top singular vectors of the data matrix using an indirect method (Golub and Van Loan, 1996).

In our simulations, we report the proportion of misclassified rows and columns for the profile-likelihood-based method (PL), which Theorem 3.1 guarantees to be consistent. For the Poisson simulations we report the profile-likelihood-based method with the Poisson rate function (PL-Pois) and with the Gaussian rate function (PL-Gaus). We also report misclassification errors for k-means applied separately to the rows and columns (KM) and for the DI-SIM biclustering algorithm (DS) of Rohe and Yu (2012). For the profile-likelihood-based method, we compute initial partitions by applying k-means separately to the rows and the columns; in principle, any initialization can be used. The algorithm is run once for each simulation.
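A simplified sketch of this procedure follows. It performs plain greedy sweeps, omitting the improvement-ordered Kernighan-Lin pass of steps 1-3 for brevity, and uses the block-average criterion form that is our reading of (2.1):

```python
import numpy as np

def greedy_bicluster(X, g, h, f, K, L, rho=1.0, max_sweeps=50):
    """Greedily move one row/column label at a time until no move helps."""
    def F(g, h):
        total = 0.0
        for k in range(K):
            for l in range(L):
                rows, cols = g == k, h == l
                if rows.any() and cols.any():
                    total += rows.mean() * cols.mean() * f(X[np.ix_(rows, cols)].mean() / rho)
        return total

    best = F(g, h)
    for _ in range(max_sweeps):
        improved = False
        for labels, n_classes in ((g, K), (h, L)):   # sweep rows, then columns
            for i in range(len(labels)):
                current = labels[i]
                for cand in range(n_classes):
                    if cand == current:
                        continue
                    labels[i] = cand
                    val = F(g, h)
                    if val > best + 1e-12:
                        best, current, improved = val, cand, True
                    labels[i] = current   # keep the best label found so far
        if not improved:
            break
    return g, h, best

X = np.kron(np.eye(2), np.ones((3, 3)))   # 6x6 matrix with two clean blocks
g0 = np.array([1, 0, 0, 1, 1, 1])         # one row mislabeled
h0 = np.array([0, 0, 0, 1, 1, 1])
g_fit, h_fit, best = greedy_bicluster(X, g0, h0, lambda x: x ** 2, K=2, L=2)
assert list(g_fit) == [0, 0, 0, 1, 1, 1]  # the mislabeled row is corrected
```

Like the procedure in the text, this sketch only ever accepts improving moves, so it can stall in a local optimum; the ordering heuristic in steps 1-3 exists precisely to escape some of those.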
For the Poisson simulation, we simulate from a block model with K = 2 row clusters and L = 3 column clusters. We vary the number of columns, n, between 200 and 1500, and we take the number of rows to be m = γn, where γ ∈ {0.5, 1, 2}.
We set the row and column class memberships as independent multinomials with probabilities p = (0.3, 0.7) and q = (0.2, 0.3, 0.5). We choose the matrix of block parameters with overall scale factor b, where b is chosen between 5 and 20; the entries of the matrix were chosen randomly, uniformly on the interval [0, 2]. Note that M → 0, so the data matrix is sparse. We generate the data conditional on the row and column classes as $X_{ij} \mid c, d \sim \mathrm{Poisson}(\mu_{c_i d_j})$.

Figure 1 presents the average bicluster misclassification rates for each sample size, and Tables 1 and 2 report the standard deviations. In all of the scenarios considered, biclustering based on the profile log-likelihood criterion performs at least as well as the other methods and shows signs of convergence even when the rate function is misspecified. Although the k-means algorithm and the DI-SIM algorithm perform well in some settings, in other settings their performance is not well-behaved. We also see that the standard deviation of the misclassification rate tends to zero for the PL biclustering, but not for the other two algorithms. The results in McSherry (2001) and Rohe and Yu (2012) suggest that the DI-SIM algorithm may be a consistent biclustering algorithm, but they require strong assumptions that do not hold in our simulations: the former requires the variance of the entries of M to be of order $\log^6(n)/n$, and the latter requires that the minimum expected degree eventually exceed $\sqrt{2n}/\log(n)$. Therefore, it is not clear whether the convergence of the DI-SIM algorithm is slow or whether those results simply do not apply in this setting.
Appendix C describes in detail the simulations for Bernoulli, Gaussian, and t data. For the Bernoulli simulation, the profile-likelihood-based method outperforms the other procedures. For the Gaussian and t-distributed data, the DI-SIM algorithm is much more competitive, but our algorithm still beats it in almost all cases.
Overall, the simulation results corroborate the conclusions of Theorem 3.1 and support the use of biclustering based on the profile log-likelihood criterion.

6. Application. In this section we use profile-likelihood-based biclustering to reveal structure in two high-dimensional datasets. For each example, we maximize the profile log-likelihood using the algorithm described in Section 5.
6.1. AGEMAP. Biclustering is commonly used with microarray data to visualize the activation patterns of thousands of genes simultaneously, and to identify patterns and distinguishing properties among groups of genes and individuals. We use the AGEMAP dataset (Zahn et al., 2007) to illustrate this process.
AGEMAP is a large microarray data set containing the log expression levels for 40 mice across 8932 genes measured on 16 different tissue types. For this analysis, we restrict attention to two tissue types: cerebellum and cerebrum. The 40 mice in the dataset belong to four age groups, with five males and five females in each group. One of the mice is missing data for the cerebrum tissue so it has been removed from the dataset.
To study the relationship between mice and gene expression levels, we bicluster the 39 × 17864 residual matrix computed from the least squares fit to the multiple linear regression model
\[
  Y_{ij} = \beta_{0j} + \beta_{1j} A_i + \beta_{2j} S_i + \varepsilon_{ij},
\]
where $Y_{ij}$ is the log-activation of gene j in mouse i, $A_i$ is the age of mouse i, $S_i$ indicates whether mouse i is male, $\varepsilon_{ij}$ is a random error, and $(\beta_{0j}, \beta_{1j}, \beta_{2j})$ is a gene-specific coefficient vector.
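The residual matrix amounts to one least squares fit per gene; a sketch with synthetic stand-in data (sizes and names are ours, not the AGEMAP data):

```python
import numpy as np

# Sketch: regress each gene's log-activation on age and sex, keep residuals.
def residual_matrix(Y, age, sex):
    m = Y.shape[0]
    Z = np.column_stack([np.ones(m), age, sex])   # design: intercept, age, sex
    beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # one coefficient column per gene
    return Y - Z @ beta                           # residuals to be biclustered

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 5))                      # 10 "mice", 5 "genes"
age = rng.normal(size=10)
sex = rng.integers(0, 2, size=10).astype(float)
R = residual_matrix(Y, age, sex)
# With an intercept in the model, each residual column sums to (nearly) zero.
assert np.allclose(R.sum(axis=0), 0.0, atol=1e-8)
```

The final assertion illustrates the dependence noted below: residual columns are constrained by the fit, so the entries of R are not independent.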
The entries of the residual matrix are not independent (for example, the sum of each column is zero). Also, the responses of many genes are likely correlated with each other. Thus, the block model required by Theorem 3.1 is not in force, so its conclusion will not hold unless the dependence between the residuals is negligible. In light of this caveat, the example should be considered as exploratory data analysis.
We perform biclustering using the profile log-likelihood based on the Gaussian criterion (5.1c). Based on the preliminary analysis in Perry and Owen (2010), there appear to be three mouse groups. To determine an appropriate number of gene groups we experimented with values between two and five; four gene groups appears to give a reasonable representation of the data. The heatmap presented in Figure 6.1 shows the results. Although the expression levels for the fourth gene group appear fairly neutral across the three mouse groups, the first three gene groups each have a visually apparent pattern. It appears that a mouse can have high expression levels in at most two of the first three gene groups: mouse group 1 has high expression for gene groups 2 and 3; mouse group 2 for gene group 1; and mouse group 3 for gene groups 1 and 3.
The three clusters of mice agree with those found by Perry and Owen (2010). That analysis identified the mouse clusters but could not attribute meaning to them. The bicluster-based analysis has deepened our understanding of the three mouse clusters while suggesting some interesting interactions between the genes.

6.2. MovieLens. Collaborative filtering uses the behavior of similar consumers to provide better product recommendations to new individuals. Since consumer habits likely vary across products, biclustering can help identify structure in the data: groups of consumers and groups of products with similar patterns. As an application, we apply biclustering to the MovieLens dataset (GroupLens Research Project, 1998). Appendix D describes the results, which separate individuals who rate movies from all time periods from individuals who rate only recent movies.
7. Discussion. In this report we developed a statistical setting for studying the performance of biclustering algorithms. Under the assumption that the data follows a stochastic block model, we derived sufficient conditions for an algorithm based on maximizing the profile-likelihood to be consistent. This is the first theoretical guarantee for any biclustering algorithm which can be applied to a broad range of data distributions and can handle both sparse and dense data matrices. These results can easily be extended to symmetric matrices. Our empirical comparisons demonstrated that the method performs well in a variety of situations and can outperform existing procedures.
It is important to note that the theoretical results operate under the assumption that the true numbers of row and column classes are known. In practice this is a simplifying assumption, and the optimal number of biclusters must be inferred from the data. While this remains an open problem in the biclustering literature, the formalization of biclustering as an estimation problem may help transfer ideas from classical model selection procedures to this setting; this is an interesting topic for future research.
Applying the profile-likelihood based biclustering algorithm to real data revealed several interesting findings. Biclustering the genes and mice in the AGEMAP data exposed an interesting pattern in the expression of certain genes and we found that at most two gene groups can have high expression levels for any one mouse. Our results from the MovieLens dataset identified a distinct difference in movie review behavior and split individuals who only rate recent movies from individuals who rate movies from all time periods. The consistency theorem proved in this report gives conditions under which we can have confidence in the robustness of these findings.

APPENDIX A: ADDITIONAL TECHNICAL RESULTS
Lemma A.1. For each n, let $X_{n,m}$, $1 \le m \le n$, be independent random variables with $\mathbb{E} X_{n,m} = 0$. Let $\rho_n$ be a sequence of positive numbers, and let $\mathcal{I}_n$ be a subset of the power set $2^{[n]}$ with $\inf\{|I| : I \in \mathcal{I}_n\} \ge L_n$. Suppose that (i) $\frac{1}{n \rho_n} \sum_{m=1}^{n} \mathbb{E}|X_{n,m}|^2$ is uniformly bounded in n; (ii) the variables satisfy a Lindeberg-type condition for all $\varepsilon > 0$; and (iii)-(v) growth conditions on $|\mathcal{I}_n|$, $L_n$, and $\rho_n$ hold. Then
\[
  \sup_{I \in \mathcal{I}_n} \Big| \frac{1}{|I|\, \rho_n} \sum_{m \in I} X_{n,m} \Big| \stackrel{P}{\to} 0.
\]

Proof. Let $\varepsilon > 0$ be arbitrary. Define $Y_{n,m} = \rho_n^{-1} X_{n,m}\, 1(|X_{n,m}| \le \varepsilon \sqrt{n \rho_n})$. Fix an arbitrary $t > 0$. Set $\mu_{n,m} = \mathbb{E} Y_{n,m}$, and for $I \in \mathcal{I}_n$ define $\mu_n(I) = \frac{1}{|I|} \sum_{m \in I} \mu_{n,m}$.
For $I \in \mathcal{I}_n$, decompose the sum into its truncated part and the truncation error. Since $\mathbb{E} X_{n,m} = 0$, conditions (ii) and (iii) imply that $\sup_{I \in \mathcal{I}_n} |\mu_n(I)| \to 0$; in particular, for n large enough, $\sup_{I \in \mathcal{I}_n} |\mu_n(I)| < t/2$, uniformly over all I, so it suffices to bound the centered truncated sum. Define $\sigma^2_{n,m} = \mathbb{E}(Y_{n,m} - \mu_{n,m})^2$ and $v_n(I) = \sum_{m \in I} \sigma^2_{n,m}$, and note that $|Y_{n,m} - \mu_{n,m}| \le 2 \varepsilon \sqrt{n}$. By (i), (iv), and (v), for n large enough $v_n(I) < \varepsilon t |I| \sqrt{n} / 3$, so Bernstein's inequality gives
\[
  \Pr\Big( \sum_{m \in I} (Y_{n,m} - \mu_{n,m}) > t |I| / 2 \Big) \le 2 \exp\Big( -\frac{t |I|}{8 \varepsilon \sqrt{n}} \Big).
\]
We use this bound for each I to form a union bound over $\mathcal{I}_n$. By (iii) and (iv), it is possible to choose $\varepsilon$ such that the right-hand side goes to zero, which proves the lemma.

Proof of Lemma 4.1. For all $t > 0$, apply Lemma A.1 with $\mathcal{I}_n$ the collection of biclusters induced by pairs $(g, h) \in J_\varepsilon$, i.e. the biclusters such that $\hat p_k > \varepsilon$ for all k and $\hat q_l > \varepsilon$ for all l. Since $\mathcal{I}_n$ is a subset of the power set $2^{[nm]}$, Lemma A.1 yields the claimed uniform convergence.

Proof of Lemma 4.2. The technical assumptions on f imply that its first derivative is locally bounded. Therefore, f is locally Lipschitz continuous with Lipschitz constant $c = \sup\{|f'(\mu)| : \mu \in \mathcal{M}\}$, so $|F(g, h) - G_{M_0}(C(g), D(h))|$ is bounded by a constant multiple of $\max_{k,l} |R_{kl}(g, h)|$. From Lemma 4.1, the right-hand side converges to zero, and the result follows.
Lemma A.2. Let $C = [C_{ak}] \in \mathbb{R}^{A \times K}$ and $D = [D_{bl}] \in \mathbb{R}^{B \times L}$ be nonnegative matrices whose entries sum to one. Let $M = [\mu_{ab}] \in \mathbb{R}^{A \times B}$ be a matrix with no two identical columns. Define $\mathcal{M}$ to be the convex hull of the entries of M, and let $f : \mathbb{R} \to \mathbb{R}$ be a twice differentiable convex function with $f''$ bounded away from zero on $\mathcal{M}$. Suppose that $[C 1]_a \ge \eta$ for all a, and that $D_{bl} D_{b'l} > \delta$ for some l and some $b \ne b'$. Then the population criterion falls below its maximum by at least a positive constant, depending only on M and f, times $\eta^2 \delta$.

Proof. Let l and $b \ne b'$ be such that $D_{bl} D_{b'l} > \delta$. Since no two columns of M are identical, there exists an a such that $\mu_{ab} \ne \mu_{ab'}$. Let k be the index of the largest element in row a of matrix C; this element must be at least as large as the mean, i.e. $C_{ak} \ge [C 1]_a / K \ge \eta / K$.
Set $\kappa = \inf_{\mu \in \mathcal{M}} f''(\mu)$ and define $N = [\nu_{ab}] \in \mathbb{R}^{A \times B}$ with $\nu_{ab} = f(\mu_{ab})$. A second-order Taylor expansion then bounds the value of the criterion below its linearization by a term proportional to $\kappa$. This inequality holds for the particular choice of k and l above; for other choices, the left-hand side is nonpositive by Jensen's inequality. The result follows, with the constant determined by $\kappa$ and M.

Proof of Lemma 4.3. If $D \notin Q_\delta$, then $D_{bl} D_{b'l} > \delta$ for some l and some $b \ne b'$, and Lemma A.2 bounds the difference by $-\kappa_0 \eta^2 \delta / K^2$, where $\kappa_0$ depends only on $M_0$ and f. Similarly, if $C \notin P_\delta$, then the difference is bounded by $-\kappa_0 \eta^2 \delta / L^2$. The result of the lemma follows with $\kappa = \kappa_0 / \max\{K^2, L^2\}$.

APPENDIX B: FINITE SAMPLE RESULTS
In this appendix we derive a finite sample tail bound for the probability that the class assignments that maximize the profile likelihood are close to the true class labels. To proceed in this setting, we make stronger distributional assumptions than in the asymptotic case. Specifically, we assume here that the entries X ij |c, d follow a Gaussian distribution with mean µ c i d j and finite variance σ 2 . We proceed with the notation from the main text.
Lemma B.1 provides a tail bound on the deviations of the block averages, where $\|A\|_\infty = \max_{k,l} |A_{kl}|$ for any matrix A.

Proof. If the entries $X_{ij}$ follow a Gaussian distribution with mean $\mu_{c_i d_j}$ and variance $\sigma^2$, then $\mathbb{E}|X_{ij} - \mu_{c_i d_j}|^l \le \frac{\sigma^2}{2} \sigma^{l-2} l!$ for $l \ge 2$, so the moment conditions of Bernstein's inequality hold. It follows that for any bicluster I, Bernstein's inequality bounds the deviation of the bicluster average for all $t > 0$, and a union bound over biclusters completes the proof.

Lemma B.1 is used to establish a finite sample bound on the difference between F(g, h) and its population version.
Lemma B.2. Under assumptions (A2) and (A3), the stated bound holds for any $t > 0$. Lemma B.2 is a direct consequence of the fact that $f$ is locally Lipschitz continuous under assumptions (A2) and (A3); the details are similar to the proof of Lemma 4.2.
The next step is to show that the population version is maximized at the true class labels.
The proof of Lemma B.3 is similar to the proof of Lemma 4.3 except that the bound on the difference uses ε in place of the random value η. Some details follow.
Proof of Lemma B.3. First note that, in Lemma A.2, we can let $a$ be the index of the largest element in column $k$ of the matrix $C$; since we are restricted to the set $J_\varepsilon$, this element must be at least as large as the corresponding mean, which is bounded below by a multiple of $\varepsilon$. Therefore the bound in Lemma A.2 can be rewritten with $\varepsilon$ in place of $\eta$. Noting that the Gaussian rate function satisfies $f''(\mu) = 1$ for all $\mu \in \mathcal{M}$, the remainder of the proof is similar to the proof of Lemma 4.3.
We establish a finite sample bound by combining these results.
Proof of Theorem 3.2. Fix $\delta > 0$ and define $\mathcal{P}_\delta$ and $\mathcal{Q}_\delta$ as in Lemma 4.3. Set $r_n = \sup_{J_\varepsilon} |F(g, h) - G_{M_0}(C(g), D(h))|$, and suppose $(g, h) \in J_\varepsilon$. In this case, applying Lemma B.3 to the second term in the inequality gives the desired bound, and the result follows by applying Lemma B.2.

APPENDIX C: ADDITIONAL EMPIRICAL RESULTS
This appendix reports additional empirical results for Bernoulli, Gaussian, and non-standardized Student's t distributed data. Figures 3-5 present the average bicluster misclassification rates for each sample size, and Tables 3-5 report the corresponding standard deviations for the Bernoulli, Gaussian, and t simulations, respectively. Since the normalization for the DI-SIM algorithm is only specified for non-negative data, that algorithm is run on the un-normalized matrix in the Gaussian and non-standardized Student's t examples.
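For concreteness, a misclassification rate of this kind is typically computed by minimizing the disagreement over relabelings of the estimated clusters. A minimal sketch (the function name and the brute-force minimization over permutations are our own choices, practical only for small numbers of classes):

```python
import itertools
import numpy as np

def misclassification_rate(true_labels, est_labels, n_classes):
    """Fraction of misclassified items, minimized over cluster relabelings."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    best = 1.0
    for perm in itertools.permutations(range(n_classes)):
        relabeled = np.array(perm)[est_labels]  # apply the relabeling
        best = min(best, float(np.mean(relabeled != true_labels)))
    return best

# Example: the estimate agrees with the truth up to swapping labels 0 and 1.
truth = np.array([0, 0, 1, 1, 2, 2])
est   = np.array([1, 1, 0, 0, 2, 2])
print(misclassification_rate(truth, est, 3))  # -> 0.0
```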
For the Bernoulli simulation, we simulate from a block model with K = 2 row clusters and L = 3 column clusters. We vary the number of columns, n, from 200 to 1500, and we take the number of rows to be m = γn, where γ ∈ {0.5, 1, 2}.
We set the row- and column-class membership probabilities to p = (0.3, 0.7) and q = (0.2, 0.3, 0.5), and we choose the entries of the matrix of block parameters to be on the same scale as in Bickel and Chen (2009), varying b between 5 and 20. We generate the data conditional on the row and column classes as $X_{ij} \mid c, d \sim \mathrm{Bernoulli}(\mu_{c_i d_j})$.

For the Gaussian simulation, we simulate from a block model with K = 2 row clusters and L = 3 column clusters. We vary the number of columns, n, from 50 to 400, and we take the number of rows to be m = γn, where γ ∈ {0.5, 1, 2}.
We set the row- and column-class membership probabilities to p = (0.3, 0.7) and q = (0.2, 0.3, 0.5), and we choose the entries of the matrix of block parameters by simulating from a uniform distribution on [−1, 1], varying b between 0.5 and 2. We generate the data conditional on the row and column classes as $X_{ij} \mid c, d \sim \mathrm{Gaussian}(\mu_{c_i d_j}, \sigma = 1)$.

For the non-standardized Student's t simulation, we use the same parameters as in the Gaussian simulation, and we generate the data conditional on the row and column classes as $X_{ij} \mid c, d \sim t(\mu_{c_i d_j}, \sigma = 1)$ with four degrees of freedom.

As in the Poisson simulation, biclustering based on the profile log-likelihood criterion performs at least as well as the other methods and shows signs of convergence in all three examples. These results provide further verification of the theoretical findings and support the use of biclustering based on the profile log-likelihood criterion.
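The three simulation designs above differ only in the conditional distribution of $X_{ij}$ given the classes, so they can be generated by a single routine. A minimal sketch, assuming NumPy; the block-mean matrix `M` below is purely illustrative and does not reproduce the parameter matrices used in the simulations:

```python
import numpy as np

def simulate_block_model(m, n, M, p, q, dist="gaussian", rng=None):
    """Draw an m x n data matrix from a block model with K x L block-mean
    matrix M, row-class probabilities p, and column-class probabilities q."""
    rng = np.random.default_rng(rng)
    c = rng.choice(len(p), size=m, p=p)   # latent row classes
    d = rng.choice(len(q), size=n, p=q)   # latent column classes
    mu = M[np.ix_(c, d)]                  # entrywise means mu_{c_i d_j}
    if dist == "bernoulli":
        X = rng.binomial(1, mu)
    elif dist == "gaussian":
        X = rng.normal(mu, 1.0)
    elif dist == "t":                     # non-standardized t with 4 df
        X = mu + rng.standard_t(4, size=(m, n))
    else:
        raise ValueError(dist)
    return X, c, d

# Illustrative block means for K = 2 row clusters and L = 3 column clusters.
M = np.array([[ 0.5, -0.2,  0.1],
              [-0.3,  0.4, -0.1]])
X, c, d = simulate_block_model(200, 400, M, [0.3, 0.7], [0.2, 0.3, 0.5],
                               dist="gaussian", rng=0)
print(X.shape)  # -> (200, 400)
```

For the Bernoulli design the entries of `M` must lie in [0, 1]; the t design simply shifts standard $t_4$ noise by the block means.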

APPENDIX D: ADDITIONAL APPLICATION - MOVIELENS
The MovieLens data set consists of 100,000 movie reviews of 1682 movies by 943 users. Each user has rated at least 20 movies, and each movie is rated on a scale from one to five. In addition to the review rating, the release date and genre of each movie are available, as is some demographic information about each user, including gender, age, occupation, and zip code.
In order to retain customers, movie-renting services strive to recommend new movies to individuals who are likely to view them. Since most users only review movies that they have already seen, we can use the structure of the user-movie review matrix to identify associations between groups of users and the movies they choose to watch. Specifically, we consider the 943×1682 binary matrix X where X ij = 1 if user i has rated movie j and X ij = 0 otherwise. To find structure in X, we biclustered the rows and columns of X using the profile log-likelihood based on the Bernoulli criterion (5.1a).
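As a rough illustration of the computation, the sketch below locally maximizes a Bernoulli profile log-likelihood criterion (block counts times block entropy terms) by greedy alternating updates of the row and column labels. The criterion is the standard Bernoulli profile log-likelihood, but the greedy update scheme and all names here are our own; this is a sketch, not the paper's fitting algorithm.

```python
import numpy as np

def bern_profile_ll(X, g, h, K, L):
    """Bernoulli profile log-likelihood: sum over blocks (k, l) of
    n_kl * [phat*log(phat) + (1 - phat)*log(1 - phat)]."""
    total = 0.0
    for k in range(K):
        for l in range(L):
            block = X[np.ix_(g == k, h == l)]
            if block.size == 0:
                continue
            p = block.mean()
            if 0.0 < p < 1.0:  # blocks with p in {0, 1} contribute 0
                total += block.size * (p * np.log(p) + (1 - p) * np.log(1 - p))
    return total

def bicluster(X, K, L, n_iter=20, rng=None):
    """Greedy alternating label updates; finds a local maximizer only."""
    rng = np.random.default_rng(rng)
    m, n = X.shape
    g = rng.integers(K, size=m)
    h = rng.integers(L, size=n)
    for _ in range(n_iter):
        for i in range(m):  # best row label with all others held fixed
            scores = []
            for k in range(K):
                g[i] = k
                scores.append(bern_profile_ll(X, g, h, K, L))
            g[i] = int(np.argmax(scores))
        for j in range(n):  # best column label with all others held fixed
            scores = []
            for l in range(L):
                h[j] = l
                scores.append(bern_profile_ll(X, g, h, K, L))
            h[j] = int(np.argmax(scores))
    return g, h

# Toy example: a binary matrix with two clean row blocks and column blocks.
X = np.zeros((20, 20), dtype=int)
X[:10, :10] = 1
X[10:, 10:] = 1
g, h = bicluster(X, 2, 2, n_iter=5, rng=1)
```

Each coordinate update picks the label maximizing the criterion with the others held fixed, so the criterion never decreases; in practice one would use multiple restarts (and much faster incremental score updates) to escape poor local maxima.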
To determine a reasonable selection for the number of biclusters, we varied the number of user groups from two to three and the number of movie groups from two to seven. Qualitatively, the model with three user groups and six movie groups seems to provide a parsimonious description of the data. Figure 6 presents the heatmap of the data based on the resulting bicluster assignments, with the ordering of the clusters determined by the total number of reviews in each cluster. Table 6 reports the top ten movies in each group. The eclectic mix of genres within each movie group suggests that the rating behavior of users is not explained by genre alone. Figure D presents a boxplot comparing the distributions of the movie release years for each group. We can see a clear ordering of the movie groups by median release date. The median ages within the user groups were 32, 31, and 30.5, and the percentages of male users within each group were 67.6%, 72.8%, and 77.8%. These statistics suggest that there is some age and gender effect on the reviewing habits of the users.

Table 6: The top ten movies in each cluster, based on the total number of reviews.

Roughly speaking, user group 3 is consistently active across all movie groups with increasing activity as the popularity of the movie increases. The reviewing habits of user group 2 follow a similar pattern but to a lesser extent. In contrast, user group 1 is consistently inactive with the only exceptions being movie groups 4 and 6. From Figure D, it appears that the users in group 1 only rate recent movies whereas users in groups 2 and 3 rate movies from all time periods.
The biclusters discovered here suggest that a movie-renting service should recommend under-reviewed movies to individuals in user group 3, and it should recommend new releases to users in group 1. This information can be used in an ensemble-based recommendation engine like that of Töscher, Jahrer and Bell (2009).