Community detection in multi-relational data with restricted multi-layer stochastic blockmodel

In recent years there has been an increased interest in statistical analysis of data with multiple types of relations among a set of entities. Such multi-relational data can be represented as multi-layer graphs where the set of vertices represents the entities and multiple types of edges represent the different relations among them. For community detection in multi-layer graphs, we consider two random graph models, the multi-layer stochastic blockmodel (MLSBM) and a model with a restricted parameter space, the restricted multi-layer stochastic blockmodel (RMLSBM). We derive consistency results for community assignments of the maximum likelihood estimators (MLEs) in both models where MLSBM is assumed to be the true model, and either the number of nodes or the number of types of edges or both grow. We compare MLEs in the two models with other baseline approaches, such as separate modeling of layers, aggregating the layers and majority voting. RMLSBM is shown to have an advantage over MLSBM when either the growth rate of the number of communities is high or the growth rate of the average degree of the component graphs in the multi-graph is low. We also derive minimax rates of error and sharp thresholds for achieving consistency of community detection in both models, which are then used to compare the multi-layer models with a baseline model, the aggregate stochastic blockmodel. The simulation studies and real data applications confirm the superior performance of the multi-layer approaches in comparison to the baseline procedures.


Introduction
Over the last decade, relational data has become ubiquitous in all forms of human activity. In many applications of statistics and machine learning, one encounters relational data where the entities are represented as nodes or vertices and the relations or interactions between the entities as edges of a graph. Applications of such graphs or networks include many information systems such as social networks, the World Wide Web, user information databases in e-commerce, metabolic networks, gene regulatory networks, protein-protein interaction networks and food webs.
In the majority of cases dealt with in the literature, the relations are assumed to be of the same type, such as web page linkage, friendship, co-authorship or protein-protein interaction. However, in modern complex relational databases and networks, we often have information regarding relationships of multiple types among the nodes. For example, in the context of internet services a set of users may be connected through email, messaging, social media, etc., each one of them creating one layer or type of the user-user interaction network (Papalexakis et al. 2013). Similarly, users in a social network can have "friendship", "mentions", "following", etc. (Greene and Cunningham 2013), or researchers in academia may have co-authorship, citations, title/abstract similarity, etc., as different types of relations among themselves. In genomics data, cellular components can have different aspects of interactions among them, e.g., protein-protein physical interactions and gene co-expressions (Narayanan et al. 2010). Such multi-relational data can be represented as multi-layer graphs where multiple types of edges represent the relations and the set of vertices/nodes represents the entities (Jenatton et al. 2012).
One of the most important and widely investigated learning goals in an information network is clustering the entities on the basis of the relationships between them into densely connected subsets called "communities". From a probabilistic point of view, communities can be thought of as groups of vertices which are more likely to be connected to each other compared to the rest of the graph, i.e., the probability of having an edge between two vertices belonging to the same group is higher than that of having an edge between vertices belonging to different communities. Consequently we would observe the number of intra community edges to be higher than inter community edges.
Many researchers have proposed methods and algorithms for community detection in networks. Such methods can broadly be divided into three categories: methods based on probabilistic models, methods based on the maximization of a global objective function and those based on spectral or matrix factorization of the adjacency matrix or the Laplacian matrix. The stochastic blockmodel (Holland et al. 1983;Nowicki and Snijders 2001) is a statistical model for random graphs with a natural community structure. It is one of a large class of statistical models described in the literature for community detection in complex networks, which includes the latent variable (Handcock et al. 2007) and latent space models (Hoff et al. 2002), the degree corrected blockmodel (Karrer and Newman 2011;Zhao et al. 2012) and the mixed membership blockmodel (Airoldi et al. 2008). Various likelihood maximization based inference strategies have been proposed in the literature to simultaneously infer the block assignments and the parameters in the stochastic blockmodel, e.g., profile likelihood maximization (Bickel and Chen 2009), maximizing the conditional likelihood (Choi et al. 2012), and variational EM under mixture model settings (Daudin et al. 2008). Other strategies involve Bayesian inference using Gibbs sampling or variational methods (Latouche et al. 2011) and optimizing a modularity function over all possible partitions of the graph (Newman and Girvan 2004). See Goldenberg et al. (2010) for a detailed review of statistical inference in networks.
Several authors have also studied the conditions required on the growth of the number of communities and the degree density of networks for the estimation strategies to be consistent. Bickel and Chen (2009) and Zhao et al. (2012) studied the conditions for community detection through modularity maximization under the stochastic blockmodel and the degree corrected stochastic blockmodel respectively. Choi et al. (2012) laid down the conditions necessary for the consistency of maximum likelihood estimation under the stochastic blockmodel. This work was extended by Rohe et al. (2012) with a regularized estimator to high dimensional settings where the number of communities grows roughly as fast as the number of nodes. Celisse et al. (2012) derived consistency and Bickel et al. (2013) derived asymptotic normality of the maximum likelihood estimators and their variational approximations in the mixture model settings.
In this paper our primary focus is on the problem of detecting an underlying community structure in multi-layer networks. We assume that such networks have an implicit community structure and different observed layers manifest that underlying structure with varying amount of information and noise. As an example of a network where such an assumption is reasonable, we analyze a twitter network of British Members of Parliament (see Figure  1) where the underlying communities are based on their party memberships and the three observed layers, "mentions", "follows" and "re-tweets" manifest that structure in varying proportions. In such cases the multi-layer graph is a more accurate representation of the underlying similarity of the objects and each layer can provide only "partial" information about the data (Rocklin and Pinar 2011). The goal in such cases would be to correctly identify the underlying set of communities combining information from all three layers. Earlier approaches towards multi-relational data or multi-layer graph clustering suffer from the deficiency that they either cluster each graph independently and combine the results, or aggregate the graphs and cluster the aggregated graph. These approaches fail to take into account the dependency among the different layers, in particular the correlation among different types of edges that share the same pair of nodes. Moreover, the multiple network layers can have different characteristics in terms of sparsity and noise. Some layers may be dense but may carry little worthwhile information, whereas some layers may be extremely sparse but may carry valuable information. The aggregation process of graphs could lose the intrinsic heterogeneity of the network layers. Here we attempt to address the problem of how to efficiently cluster the nodes or entities in a network taking into account all types of layers or relations among them. Several approaches have been recently proposed in the literature for this purpose. 
Among them are approaches based on collective or joint matrix factorization (Nickel et al. 2011; Tang et al. 2009; Rocklin and Pinar 2011), non-parametric Bayesian models and latent factor models (Jenatton et al. 2012), and extensions of spectral clustering (Dong et al. 2012) and modularity (Mucha et al. 2010) to multi-layer graphs. However, there is a lack of statistical analysis of the properties of those methods. For community detection in multi-layer networks, we consider a natural extension of the standard stochastic blockmodel to multi-layer settings that we will call the "multi-layer stochastic blockmodel" (MLSBM). This model, also considered in Han et al. (2014) as "multigraph SBM", is in the spirit of multi-relational models described in Holland et al. (1983), Taskar et al. (2001) and Kemp et al. (2006). Han et al. (2014) proved the consistency of the maximum likelihood estimates (MLEs) in this model when the number of relations grows. They keep the number of nodes (and hence the number of communities) fixed. However, as we will see later in both the asymptotic analysis and the simulation studies, the MLE in this model does not perform very well when either the number of communities grows fast or the network layers are sparse on average. Hence, we propose a restricted version of this model, obtained through restrictions on the parameter space, which is capable of handling networks with a large number of communities. We call this model the "restricted multi-layer stochastic blockmodel" (RMLSBM). We derive conditions on the growth of the number of communities and the average edge density of the networks under which the MLE of the class assignment vector is consistent (in the sense that the proportion of misclassified nodes tends to 0 as the number of nodes, and possibly the number of relations as well, grows). We further derive the minimax rates of error for community detection in MLSBM and obtain thresholds for consistent community detection.
To compute the unknown class assignments and block model parameters simultaneously, we follow Daudin et al. (2008) and propose a variational estimation strategy.
The rest of the paper is organized as follows. Section 2 extends the stochastic blockmodel to multi-layer settings and defines the two models, MLSBM and RMLSBM. Section 3 settles the consistency of the community assignments through maximum likelihood estimation in the two models when the true data generating model is MLSBM. Section 4 describes a few baseline procedures and Section 5 compares the multi-layer models with the baseline models in terms of minimax error rate and sharp threshold results. Section 6 describes two estimation strategies for the MLEs in the two models. Section 7 describes the results of a simulation study to validate the theoretical results. Section 8 presents the application of the methods to the Twitter UK politics data set. Section 9 gives concluding remarks.

Extension of blockmodels to multi-layer settings
We consider an undirected multi-layer graph G = {V, E}, where the vertex set V consists of N vertices and the edge set E consists of edges of M different types representing different relations. We can view the multi-graph as a graph with vector-valued edge information, i.e., the adjacency matrix A consists of elements $A_{ij}$ which are themselves M-dimensional vectors: $A_{ij} = \{A^{(1)}_{ij}, \ldots, A^{(M)}_{ij}\}$. An alternative way to approach the problem is to view the multi-graph as a collection of M adjacency matrices $\{A^{(1)}, A^{(2)}, \ldots, A^{(M)}\}$, each of size N × N and each corresponding to one particular type of relation. The rest of the setup is similar to the regular stochastic blockmodel (SBM) for the one-layer case with K blocks (Nowicki and Snijders 2001). We assume the number of communities K is known. Let $z = \{z_1, z_2, \ldots, z_N\}$ be the community indicator vector for the N nodes, such that each $z_i$ takes exactly one value from the set {1, . . . , K} and $z_i = q$ if and only if node i belongs to community q. Conditional on the community indicator vector z, the edges are formed independently as Bernoulli random variables with probabilities depending only on the community assignments and the type of edges. In what follows we describe the two extensions of the standard SBM to multi-layer settings.
Except for the estimation algorithm, the model is always represented as a conditional block model, and z is assumed to be a fixed unknown parameter of the model that needs to be estimated from data. Conditioned on the community assignments $z_i$ and $z_j$ of the nodes, the edges are formed independently following a Bernoulli distribution, $A^{(m)}_{ij} \mid z_i, z_j \sim \text{Bernoulli}(P^{(m)}_{ij})$. The first model assigns a separate probability for the mth type of edge between nodes belonging to the qth and the lth community, independent of all other edges. We call this model the "multi-layer stochastic blockmodel" (MLSBM). The probability of an mth type of edge between nodes i and j belonging to communities q and l respectively can be written as $P^{(m)}_{ij} = P(A^{(m)}_{ij} = 1 \mid z_i = q, z_j = l) = \pi^{(m)}_{ql}$. The set of parameters for the model, $\pi = \{\pi^{(m)}_{ql} ; q \le l, \; q, l \in \{1, \ldots, K\}, \; m \in \{1, \ldots, M\}\}$, has K(K + 1)M/2 elements. This model is "saturated" in the sense that we have a different parameter for each of the different types of edges between nodes belonging to different communities. Denote the range of this parameter set or array as $\Pi = \{\pi \in [0, 1]^{K(K+1)M/2}\}$.
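As a minimal illustration of the generative mechanism just described, the following Python sketch draws an undirected M-layer graph from an MLSBM for a fixed community vector. The function name `sample_mlsbm`, the nested-list layout of `pi` and the toy parameter values are assumptions made for illustration and do not come from the paper.

```python
import random

def sample_mlsbm(z, pi, seed=0):
    """Draw M symmetric 0/1 adjacency matrices from an MLSBM.

    z  : list of community labels in {0, ..., K-1}, one per node
    pi : pi[m][q][l] = edge probability in layer m between communities q and l
         (each pi[m] is assumed to be a symmetric K x K array)
    """
    rng = random.Random(seed)
    n = len(z)
    layers = []
    for pi_m in pi:
        a = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):  # undirected, no self-loops
                if rng.random() < pi_m[z[i]][z[j]]:
                    a[i][j] = a[j][i] = 1
        layers.append(a)
    return layers

# Hypothetical toy example: N = 6 nodes, K = 2 communities, M = 2 layers.
z = [0, 0, 0, 1, 1, 1]
pi = [
    [[0.9, 0.1], [0.1, 0.8]],    # a dense layer
    [[0.3, 0.05], [0.05, 0.3]],  # a sparser layer
]
A = sample_mlsbm(z, pi)
```

Each layer has its own K × K probability array, which is what makes the parameter count K(K + 1)M/2.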
In our asymptotic settings, where both N and M grow and K grows with N, the number of parameters to be estimated in the MLSBM grows as $K^2 M$ and quickly becomes large. Hence the MLE performs poorly, especially when the individual network layers are sparse. This problem does not arise in the asymptotic settings of Han et al. (2014), where only M grows and N, K remain fixed. However, it has been empirically shown that in most real world networks the average cluster size does not grow with the size of the network (Leskovec et al. 2008; Rohe et al. 2012; Binkiewicz 2015) and consequently, K grows with N. Hence in our asymptotic settings where N grows, keeping K fixed would be rather unrealistic. This motivates us to propose the second related model, whose number of parameters grows much more slowly than that of MLSBM.
The second model assumes the probability of the mth type of edge appearing between nodes i and j is governed by two factors: the first one being the community assignment of the two nodes and the second one being the type of edge. Hence the model has two sets of parameters: a K × K parameter matrix π K×K corresponding to the community structure, and an M × 1 vector β M ×1 which contains the parameters for different types of edges. We call this model the restricted multi-layer stochastic blockmodel (RMLSBM).
Notice that in the second model, if the edges were all of the same type, we would have $\beta_m = \beta$ for all m ∈ {1, . . . , M} and we would recover the standard stochastic blockmodel, with probabilities of edges determined solely by the community assignments. On the other hand, if we did not have a community structure, but M types of edges, then $\pi_{ql}$ would be identical for all communities q, l and the probability of an edge between nodes i and j would be determined solely by the type of edge. This model can retrieve information from sparse but highly informative edge types, as the sparsity of the network layers will be captured in the $\beta_m$ parameters. Hence, although we assume the edges to be conditionally independent, this model induces two types of correlations unconditionally: among the edges of the same type and among the edges that share nodes of the same community.
The probability $P^{(m)}_{ij}$ in RMLSBM, which denotes the probability of an mth type of edge between nodes i and j belonging to communities q and l respectively, can be modeled in the following way with the logit link function: $\text{logit}(P^{(m)}_{ij}) = \pi_{ql} + \beta_m$. This model has K(K + 1)/2 + M parameters for an undirected graph. Hence, when both K and M grow, the number of parameters for this model grows at the maximum of the growth rates of $K^2$ and M. In comparison, the number of parameters in MLSBM would grow as $K^2 M$. This makes the maximum likelihood estimator in RMLSBM a regularized estimator.
For the RMLSBM to be identifiable, we require the parameters $\beta_m$ to satisfy the condition $\sum_m \beta_m = 0$. Hence we have one less free parameter. Denote the set of parameters for RMLSBM as $\pi^R = \{(\pi_{ql}, \beta_m) : q \le l, \; q, l \in \{1, \ldots, K\}, \; m \in \{1, \ldots, M\}\}$ and its range as $\Pi^R$. To prove the consistency of maximum likelihood estimation under MLSBM, we assume $\pi_{ql}, \beta_m \in (-C \log(MN^2), C \log(MN^2))$ for some constant C > 0. This condition ensures that $\pi_{ql}$ and $\beta_m$ are bounded away from ±∞.
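To make the contrast between the two parameterizations concrete, the sketch below evaluates RMLSBM edge probabilities through the inverse logit of $\pi_{ql} + \beta_m$ and counts the free parameters of both models. The helper names and the toy values of `pi_block` and `beta` are hypothetical and chosen only for illustration.

```python
import math

def rmlsbm_prob(pi_ql, beta_m):
    """Inverse logit of pi_ql + beta_m: the RMLSBM probability of a type-m edge."""
    return 1.0 / (1.0 + math.exp(-(pi_ql + beta_m)))

def n_params_mlsbm(K, M):
    """One probability per unordered community pair and per layer."""
    return K * (K + 1) // 2 * M

def n_params_rmlsbm(K, M):
    """K(K+1)/2 block effects plus M layer effects, minus one for sum(beta) = 0."""
    return K * (K + 1) // 2 + M - 1

# Hypothetical toy values: K = 2 communities, M = 3 layers, beta centred at zero.
pi_block = [[1.0, -2.0], [-2.0, 0.5]]
beta = [1.5, 0.0, -1.5]
P = [[[rmlsbm_prob(pi_block[q][l], b) for l in range(2)] for q in range(2)]
     for b in beta]

for K, M in [(5, 3), (20, 10), (100, 50)]:
    print(K, M, n_params_mlsbm(K, M), n_params_rmlsbm(K, M))
```

A layer with a strongly negative $\beta_m$ is uniformly sparser, yet it shares the block effects $\pi_{ql}$ with every other layer.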

Consistency
In this section, we discuss the consistency of maximum likelihood estimation of the proposed models under three asymptotic regimes with varying conditions imposed on the growth of the number of communities (K) and the expected total number of edges of the multi-layer graph (L). We first define a one-to-one transformation of the parameters of RMLSBM as
$$\phi^{(m)}_{ql} = \text{logit}^{-1}(\pi_{ql} + \beta_m) = \frac{\exp(\pi_{ql} + \beta_m)}{1 + \exp(\pi_{ql} + \beta_m)}. \qquad (3.1)$$
Now we assume that the data are generated from the more general model MLSBM and view RMLSBM as a MLSBM with the following restrictions on the parameters:
$$\pi^{(m)}_{ql} = \phi^{(m)}_{ql} = \text{logit}^{-1}(\pi_{ql} + \beta_m) \quad \text{for all } q \le l \text{ and } m. \qquad (3.2)$$
This way the MLE in RMLSBM can be thought of as a restricted MLE (RMLE) of MLSBM. Our aim is to investigate the consistency of both the MLE and the RMLE under three asymptotic regimes where we let either the number of nodes (N) or the number of types of edges (M) or both grow. This setup is quite appropriate for modern multi-layer networks, where data collection increases both in terms of new entities and in terms of new features or layers being added to the database. Consequently, methods are being sought which would be consistent in such situations. Some consistency results for the MLE were obtained in Han et al. (2014) in the setting where M grows, but N and consequently K remain fixed. Here we prove consistency results for the MLE in the more general asymptotic setting where N can also grow (and K grows with N). We then compare the MLE with the regularized estimator in terms of the asymptotic conditions required for consistency. The different asymptotic setups we consider under the three regimes of growth in N and M are described below.

As both
For the RMLE, we further require that M = O(N ) so that K does not exceed N .
3. As both N → ∞ and M → ∞ with M growing faster than N , i.e., M = ω(N ), for RMLE we consider two related setups: for some δ > 0 otherwise. In setting (a), we further require log M to grow slower than N for the growth of K to be meaningful. Also, in that setup if log M grows at the same rate as (log N ) β for some β > 0, the number of communities grows almost as fast as the number of nodes except for the log terms and is "highest dimensional" in the sense of Rohe et al. (2012).
Note that the first regime assumes no relation between the growth rates of N and M, while the next two regimes assume certain relations between the two growth rates. So the last two regimes can be thought of as special cases of the first one in terms of the growth rates of N and M. Naturally we expect some relaxation in the required growth conditions on K and L in the last two regimes. The asymptotic setups described above reflect this relaxation for the RMLE. However, no such relaxation is possible for the MLE. Hence we will prove that the MLE in MLSBM is consistent under the first asymptotic regime, whereas the MLE in RMLSBM (i.e., the RMLE of MLSBM under the restrictions defined by Equation (3.2)) is consistent under all three asymptotic regimes. The MLSBM, despite being intuitively the simplest extension, does not perform as well as the RMLSBM for community detection in multi-relational networks if the networks are sparse on average or contain a large number of communities.

Preliminaries
Since in this paper our primary interest is in modeling multi-layer networks where layers are sparse on average, we require the true MLSBM model probabilities $\pi^{(m)}_{ql}$ to satisfy certain sparsity conditions. As Zhao et al. (2012) pointed out, if the block model probabilities remain fixed as N increases, then the network will be unrealistically dense. In this connection it is worth noting that Snijders and Nowicki (1997) let the probabilities remain fixed and as a result the networks considered there have linearly increasing average degree, while both Bickel and Chen (2009) and Choi et al. (2012) considered networks with poly-logarithmically increasing average degree and hence gradually decaying probabilities. Here, to keep the network sparse, we scale down the block model probabilities accordingly as N increases. We introduce a new notation $\mathcal{L}$ to denote the quantity inside the asymptotic notation ω in the growth rate of L under different asymptotic setups. As an example, when $L = \omega(MN(\log N)^{3+\delta})$, we have $\mathcal{L} = MN(\log N)^{3+\delta}$. Hence $\mathcal{L}$ can be viewed as the minimum rate at which L is required to grow under a particular asymptotic setup. The blockmodel parameters are restricted to have an upper bound that decreases with increasing N except for a small finite set indexed by the triplet Q = {q, l, m} such that the expected number of edges in the set For all {q, l, m} ∉ Q, the parameters are restricted in the following way for some δ > 0 and some constant C, so that the upper bound is determined by the expected density of the network. The exact upper bound is determined by $\mathcal{L}$ and consequently by the growth rate of L, and it varies under the different asymptotic assumptions. For any arbitrary partition z of the entities in the graph, the log likelihood of the set of M adjacency matrices is
$$l(A; z, \pi) = \sum_{m=1}^{M} \sum_{i<j} \left\{ A^{(m)}_{ij} \log \pi^{(m)}_{z_i z_j} + (1 - A^{(m)}_{ij}) \log\left(1 - \pi^{(m)}_{z_i z_j}\right) \right\}. \qquad (3.4)$$
Note that for an undirected graph with no self-loops, both $A^{(m)}$ and $\pi^{(m)}$, m = 1, . . . , M, are symmetric matrices in $\{0, 1\}^{N \times N}$ and $[0, 1]^{K \times K}$ respectively.
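The Bernoulli parameters $\pi^{(m)}_{z_i z_j}$ depend both on the class assignment z and the type of relation m. For a fixed class assignment z, let $N_q$ denote the number of nodes assigned to class q, and $n_{ql}$ denote the maximum number of possible edges between classes q and l. So we have $n_{ql} = N_q N_l$ for $q \neq l$ and $n_{qq} = \binom{N_q}{2}$. For an arbitrary partition z, the MLE of π is $\hat{\pi}_{(z)}$, with elements
$$\hat{\pi}^{(m)}_{(z)ql} = \frac{1}{n_{ql}} \sum_{i<j} A^{(m)}_{ij} \, 1\{\{z_i, z_j\} = \{q, l\}\}, \qquad (3.5)$$
where 1{·} is the indicator function. Note that for a fixed partition z, the denominator $n_{ql}$ in the MLE $\hat{\pi}^{(m)}_{(z)ql}$ is the same for all edge types m. Now we define the expectation of $\hat{\pi}_{(z)}$ as $\bar{\pi}_{(z)}$ and that of l(A; z, π) as $\bar{l}_P(z, \pi)$ under the independent Bernoulli($P^{(m)}_{ij}$) model:
$$\bar{\pi}^{(m)}_{(z)ql} = \frac{1}{n_{ql}} \sum_{i<j} P^{(m)}_{ij} \, 1\{\{z_i, z_j\} = \{q, l\}\}, \qquad (3.6)$$
$$\bar{l}_P(z, \pi) = \sum_{m=1}^{M} \sum_{i<j} \left\{ P^{(m)}_{ij} \log \pi^{(m)}_{z_i z_j} + (1 - P^{(m)}_{ij}) \log\left(1 - \pi^{(m)}_{z_i z_j}\right) \right\}. \qquad (3.7)$$
Clearly, for a given z, $\hat{\pi}_{(z)}$ and $\bar{\pi}_{(z)}$ are the maximizers of the functions l(A; z, π) and $\bar{l}_P(z, \pi)$ respectively, and we let l(A; z) and $\bar{l}_P(z)$ denote the corresponding maximum values. We extend Lemma 1 of Choi et al. (2012) to multi-layer settings as follows:
$$l(A; z) - \bar{l}_P(z) = \sum_{m=1}^{M} \sum_{q \le l} n_{ql} \, D\left(\hat{\pi}^{(m)}_{(z)ql} \,\|\, \bar{\pi}^{(m)}_{(z)ql}\right) + X - E(X), \qquad (3.8)$$
where
$$X = \sum_{m=1}^{M} \sum_{i<j} A^{(m)}_{ij} \log \frac{\bar{\pi}^{(m)}_{(z) z_i z_j}}{1 - \bar{\pi}^{(m)}_{(z) z_i z_j}}. \qquad (3.9)$$
Here D(a||b) is the Kullback-Leibler divergence between two Bernoulli random variables with parameters a and b respectively. This equation decomposes the difference between the maximized likelihood and its expected value in terms of $\hat{\pi}_{(z)}$ and $\bar{\pi}_{(z)}$ for a given class assignment vector z.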
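The block-average form of the MLE and the likelihood decomposition can be checked numerically. The sketch below assumes the decomposition takes the form of Lemma 1 of Choi et al. (2012) summed over layers, i.e., l(A; z) − l̄_P(z) equals a sum of terms n_ql·D(π̂ || π̄) plus X − E(X), where X − E(X) is the sum over layers and pairs of (A − P)·logit(π̄). All function names and the randomly drawn probabilities are hypothetical.

```python
import math
import random

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bern_ll(a, p):
    """a*log(p) + (1-a)*log(1-p): Bernoulli log-likelihood per potential edge."""
    return xlogy(a, p) + xlogy(1 - a, 1 - p)

def kl(a, b):
    """Kullback-Leibler divergence D(a||b) between Bernoulli(a) and Bernoulli(b)."""
    return xlogy(a, a) - xlogy(a, b) + xlogy(1 - a, 1 - a) - xlogy(1 - a, 1 - b)

def block_averages(W, z, K):
    """Average the upper triangle of W over each unordered block {q, l}."""
    n = len(z)
    tot = [[0.0] * K for _ in range(K)]
    cnt = [[0] * K for _ in range(K)]
    for i in range(n):
        for j in range(i + 1, n):
            q, l = sorted((z[i], z[j]))
            tot[q][l] += W[i][j]
            cnt[q][l] += 1
    avg = [[tot[q][l] / cnt[q][l] if cnt[q][l] else 0.0 for l in range(K)]
           for q in range(K)]
    return avg, cnt

# Toy data: arbitrary edge probabilities P (no block structure is needed here).
rng = random.Random(1)
N, K, M = 8, 2, 2
z = [0, 0, 0, 0, 1, 1, 1, 1]  # an arbitrary partition, not necessarily the truth
P, A = [], []
for m in range(M):
    p = [[0.0] * N for _ in range(N)]
    a = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            p[i][j] = p[j][i] = rng.uniform(0.05, 0.6)
            a[i][j] = a[j][i] = 1 if rng.random() < p[i][j] else 0
    P.append(p)
    A.append(a)

lhs = kl_part = x_minus_ex = 0.0
for m in range(M):
    pi_hat, cnt = block_averages(A[m], z, K)  # MLE: observed block densities
    pi_bar, _ = block_averages(P[m], z, K)    # its expectation under P
    for q in range(K):
        for l in range(q, K):
            n_ql, h, b = cnt[q][l], pi_hat[q][l], pi_bar[q][l]
            lhs += n_ql * (bern_ll(h, h) - bern_ll(b, b))  # l(A;z) - lbar_P(z)
            kl_part += n_ql * kl(h, b)
            # X - E(X), aggregated block by block
            x_minus_ex += n_ql * (h - b) * math.log(b / (1 - b))

print(lhs, kl_part + x_minus_ex)
```

The two printed numbers agree up to floating-point error, and the divergence part is always nonnegative, which is why the two parts are bounded separately.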
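Next we turn our attention to RMLSBM. As mentioned before, we consider RMLSBM as a restricted version of MLSBM, and the MLE of RMLSBM can be viewed as a RMLE of MLSBM under the restrictions. Given a class assignment z, the RMLE $\hat{\pi}^R_{(z)} = \{\hat{\pi}_{(z)ql}, \hat{\beta}_{(z)m}\}$ is the maximizer of $l^R(A; z, \pi^R)$, the multi-layer block model log likelihood within the restricted parameter space. Substituting the estimated parameters in the likelihood function gives $l^R(A; z)$, the maximum of the likelihood function within the restricted parameter space. However, no closed form solution exists for the RMLE. Instead we have the following M + K(K + 1)/2 estimating equations:
$$\sum_{m=1}^{M} \sum_{i<j} A^{(m)}_{ij} \, 1\{\{z_i, z_j\} = \{q, l\}\} = \sum_{m=1}^{M} n_{ql} \frac{\exp(\hat{\pi}_{ql} + \hat{\beta}_m)}{1 + \exp(\hat{\pi}_{ql} + \hat{\beta}_m)}, \quad q \le l, \qquad (3.10)$$
$$\sum_{i<j} A^{(m)}_{ij} = \sum_{i<j} \frac{\exp(\hat{\pi}_{z_i z_j} + \hat{\beta}_m)}{1 + \exp(\hat{\pi}_{z_i z_j} + \hat{\beta}_m)}, \quad m = 1, \ldots, M. \qquad (3.11)$$
One of the equations is redundant since if we add the equations in (3.10), the resulting equation is identical to the sum of the equations in (3.11). Now we use the transformation defined by φ in Equation (3.1). The likelihood with respect to the new parameters can be represented as
$$l^R(A; z, \phi) = \sum_{m=1}^{M} \sum_{i<j} \left\{ A^{(m)}_{ij} \log \phi^{(m)}_{z_i z_j} + (1 - A^{(m)}_{ij}) \log\left(1 - \phi^{(m)}_{z_i z_j}\right) \right\}, \qquad (3.12)$$
and the estimating equations in (3.10) and (3.11) can be written as
$$\sum_{m=1}^{M} n_{ql} \, \hat{\phi}^{(m)}_{(z)ql} = \sum_{m=1}^{M} \sum_{i<j} A^{(m)}_{ij} \, 1\{\{z_i, z_j\} = \{q, l\}\}, \quad q \le l, \qquad (3.13)$$
$$\sum_{q \le l} n_{ql} \, \hat{\phi}^{(m)}_{(z)ql} = \sum_{i<j} A^{(m)}_{ij}, \quad m = 1, \ldots, M. \qquad (3.14)$$
Together the right hand sides of these equations are the complete and sufficient statistics for the model. Hence we have K(K + 1)/2 + M − 1 independent equations which will together determine the MLE of the K(K + 1)/2 + M − 1 free parameters in the set $\hat{\pi}^R_{(z)}$. Here it is understood that the estimation procedure ensures that the finiteness conditions on $\pi_{ql}$ and $\beta_m$ are respected, possibly by restricting $\pi_{ql}, \beta_m \in (-C \log(MN^2), C \log(MN^2))$. By the functional invariance property of the MLE, $\hat{\phi}^{(m)}_{(z)ql} = \text{logit}^{-1}(\hat{\pi}_{(z)ql} + \hat{\beta}_{(z)m})$. Note that the minimum value any $\hat{\phi}^{(m)}_{(z)ql}$ can take due to the imposed boundedness constraint is $1/(MN^2)$. This value is sufficiently small so that none of the partial sums in the left hand side of Equations (3.13) and (3.14) exceeds 1.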
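Because the RMLE has no closed form, it has to be computed numerically. The sketch below uses plain gradient ascent on the restricted log likelihood for a fixed partition z; this solver is swapped in purely for illustration and is not the estimation strategy of the paper. It assumes the logit parameterization $\pi_{ql} + \beta_m$, a conservative step size, and toy data; all names are hypothetical. The gradient components are the differences "observed minus fitted edge counts", so a stationary point solves the estimating equations.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(A, z, K, pi, beta):
    """Observed minus fitted edge counts, per block {q, l} and per layer m."""
    M, n = len(A), len(z)
    g_pi = [[0.0] * K for _ in range(K)]
    g_beta = [0.0] * M
    for m in range(M):
        for i in range(n):
            for j in range(i + 1, n):
                q, l = sorted((z[i], z[j]))
                r = A[m][i][j] - sigmoid(pi[q][l] + beta[m])
                g_pi[q][l] += r
                g_beta[m] += r
    return g_pi, g_beta

def loglik(A, z, pi, beta):
    """Restricted log likelihood for a fixed partition z."""
    total = 0.0
    for m in range(len(A)):
        for i in range(len(z)):
            for j in range(i + 1, len(z)):
                q, l = sorted((z[i], z[j]))
                p = sigmoid(pi[q][l] + beta[m])
                total += math.log(p) if A[m][i][j] else math.log(1.0 - p)
    return total

def rmle_fit(A, z, K, steps=500):
    """Gradient ascent; only the upper triangle pi[q][l], q <= l, is used."""
    M, n = len(A), len(z)
    pi = [[0.0] * K for _ in range(K)]
    beta = [0.0] * M
    step = 1.0 / (M * n * (n - 1) // 2)  # conservative: 1 / number of observations
    for _ in range(steps):
        g_pi, g_beta = score(A, z, K, pi, beta)
        for q in range(K):
            for l in range(q, K):
                pi[q][l] += step * g_pi[q][l]
        beta = [b + step * g for b, g in zip(beta, g_beta)]
        # Recentre so that sum(beta) = 0; every pi_ql + beta_m is left unchanged.
        shift = sum(beta) / M
        beta = [b - shift for b in beta]
        for q in range(K):
            for l in range(q, K):
                pi[q][l] += shift
    return pi, beta

# Toy data generated from an RMLSBM-type model with hypothetical parameters.
rng = random.Random(0)
z = [0, 0, 0, 0, 1, 1, 1, 1]
K, M, N = 2, 3, 8
true_pi = [[0.5, -1.5], [-1.5, 0.0]]
true_beta = [1.0, 0.0, -1.0]
A = []
for m in range(M):
    a = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < sigmoid(true_pi[z[i]][z[j]] + true_beta[m]):
                a[i][j] = a[j][i] = 1
    A.append(a)

pi_hat, beta_hat = rmle_fit(A, z, K)
```

The recentring step enforces the identifiability constraint without changing any fitted probability, which mirrors the remark that one estimating equation is redundant.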
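As before, we define the expectation of $\hat{\phi}_{(z)}$ as $\bar{\phi}_{(z)}$ and that of $l^R(A; z, \phi)$ as $\bar{l}^R_P(z, \phi)$ under the independent Bernoulli($P^{(m)}_{ij}$) model. Then,
$$\bar{l}^R_P(z, \phi) = \sum_{m=1}^{M} \sum_{i<j} \left\{ P^{(m)}_{ij} \log \phi^{(m)}_{z_i z_j} + (1 - P^{(m)}_{ij}) \log\left(1 - \phi^{(m)}_{z_i z_j}\right) \right\}.$$
For a given class assignment z, $\hat{\phi}_{(z)}$ and $\bar{\phi}_{(z)}$ are the maximizers of the functions $l^R(A; z, \phi)$ and $\bar{l}^R_P(z, \phi)$ respectively, and we let $l^R(A; z)$ and $\bar{l}^R_P(z)$ denote the corresponding maximum values. The difference between the maximized values of the observed and expected likelihood can be decomposed in two parts similar to Equation (3.8) as follows:
$$l^R(A; z) - \bar{l}^R_P(z) = \sum_{m=1}^{M} \sum_{q \le l} n_{ql} \, D\left(\hat{\phi}^{(m)}_{(z)ql} \,\|\, \bar{\phi}^{(m)}_{(z)ql}\right) + X - E(X), \qquad (3.16)$$
where, as before,
$$X = \sum_{m=1}^{M} \sum_{i<j} A^{(m)}_{ij} \log \frac{\bar{\phi}^{(m)}_{(z) z_i z_j}}{1 - \bar{\phi}^{(m)}_{(z) z_i z_j}}. \qquad (3.17)$$
A proof of this result can be found in the Appendix. Since the maximum of the unrestricted likelihood is at least as large as the maximum of the restricted likelihood, we have $l(A; z) \ge l^R(A; z)$ and $\bar{l}_P(z) \ge \bar{l}^R_P(z)$ for all z. Now let $\bar{z}$ denote the true partition. Further let $\hat{z}$ and $\hat{z}^R$ denote the MLEs of $\bar{z}$ under the two models MLSBM and RMLSBM respectively, i.e.,
$$\hat{z} = \arg\max_z \, l(A; z), \qquad (3.18)$$
$$\hat{z}^R = \arg\max_z \, l^R(A; z). \qquad (3.19)$$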

Main results
We give several theorems in this section as we develop towards our main result. These theorems provide insights into the conditions required under the three asymptotic regimes discussed in the beginning of Section 3, which in turn provide comparison between the asymptotic behavior of MLEs in the two models MLSBM and RMLSBM. All the proofs are given in the Appendix. The first three theorems bound the difference in the maximized log likelihood and its expected value for both MLSBM and RMLSBM as defined in Equations (3.8) and (3.16).
Theorem 1. Suppose a MLSBM and a RMLSBM, both with K classes and M layers, are fitted to the graph with adjacency matrix The first result (3.20) provides a bound for the first part of the right hand side of Equation (3.8) for MLSBM. The results (3.21) and (3.22) provide a bound that will be used in Theorem 3 to bound the first part of the corresponding likelihood decomposition for RMLSBM in Equation (3.16). In the proofs of the next two theorems, we first bound the second part of Equations (3.8) and (3.16), and then combine the results to provide a bound for the difference between the log likelihood and its expected value under any arbitrary partition z for MLSBM and RMLSBM respectively.
Theorem 2. Suppose a MLSBM with K classes and M layers is fitted to the graph whose edges A The result of this theorem holds under the given conditions irrespective of the relationship between the growth rates of M and N . We state the result under the first asymptotic regime mentioned at the beginning of Section 3 since we do not get any relaxation in the assumption regarding the total expected number of edges if we assume certain relations between the growth rates of M and N .
The next theorem states that the restricted likelihood in RMLSBM is also asymptotically well behaved under five independent sets of conditions corresponding to the three asymptotic regimes discussed at the beginning of Section 3. The first two sets of conditions correspond to regime 1, the third set of conditions corresponds to regime 2, and the last two sets of conditions correspond to regime 3.
Theorem 3. Assume that a RMLSBM with K classes and M layers is fitted to the graph whose edges A (m) ij are independent Bernoulli(P (m) ij ) trials. If we further assume any of the following five sets of conditions with respect to the growth of the properties of the model under different asymptotic settings: (i) both M and N grow, where C is a constant, and the total expected number of edges of the entire multi-layer graph where C is a constant, and the total expected number of edges of the entire multi-layer graph L = ω(M N (log N ) 3+δ ) for some δ > 0; (iii) M is either a constant or grows slower than N , i.e., where C is a constant, and the total expected number of edges of the entire multi-layer graph L is ω(N (log N ) 3+δ ) for some δ > 0; (iv) M grows and N is either a constant or grows slower than M , i.e., M = ω(N ), where C is a constant, and the total expected number of edges of the entire multi-layer graph L = ω(M N (log N ) 1+δ ) for some δ > 0; (v) M grows and N is either a constant or grows slower than M , i.e., M = ω(N ), for all i < j, where C is a constant, and the total expected number of edges of the entire multi-layer graph L is larger than the the smaller of M (log M ) 2+δ (log N ) 1+δ and M N (log N ) 1+δ for some δ > 0; then, max It is clear from Theorem 2 and Theorem 3 that in RMLSBM, the bound on the likelihood can be established both for relatively milder conditions on the expected total number of edges and relatively faster growth conditions on the number of communities. As we will see in Theorem 5 and the discussion following it, this enables RMLSBM to be a more attractive model for community detection either when the number of communities is large or when we have relatively sparser graphs. Now we are ready to state our main results which show that when the true data generating process is a K-class MLSBM, the fraction of nodes misclustered by the MLEs and the RMLEs converge to zero under different asymptotic regimes. 
We define the number of "misclustered" nodes N e (ẑ) as the number of incorrect class assignments underẑ, counted for every node whose true class underz is not in the majority within its estimated class underẑ (Choi et al. 2012).
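The definition of misclustered nodes translates directly into a short computation: within each estimated class, every node whose true class is not the most frequent one is counted as an error. The sketch below is one reading of that definition; the function name is hypothetical, and ties for the majority class are broken arbitrarily, which does not affect the count.

```python
from collections import Counter

def n_misclustered(z_true, z_hat):
    """Count nodes whose true class is not the majority true class
    within their estimated class."""
    by_est = {}
    for t, e in zip(z_true, z_hat):
        by_est.setdefault(e, Counter())[t] += 1
    # In each estimated class, all nodes outside the largest true class are errors.
    return sum(sum(c.values()) - max(c.values()) for c in by_est.values())

z_true = [0, 0, 0, 1, 1, 1]
print(n_misclustered(z_true, [1, 1, 1, 0, 0, 0]))  # relabelled but perfect
print(n_misclustered(z_true, [0, 0, 1, 1, 1, 1]))  # one node moved
```

The count is invariant to permutations of the estimated labels, which is needed because community labels are identifiable only up to relabelling.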
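The previous results (Theorems 1, 2, 3) hold for any $P^{(m)}_{ij}$ whenever they are bounded as described in the theorems. Now we assume further structure on the probabilities, namely a MLSBM. Denote the true partition as $\bar{z}$, and under the true partition, let the true block model parameter array be $\bar{\pi}$. Hence, under MLSBM we have $P^{(m)}_{ij} = \bar{\pi}^{(m)}_{\bar{z}_i \bar{z}_j}$. Consequently, $\bar{l}_P(\bar{z}, \pi)$ from Equation (3.7) is maximized by the true model parameter $\bar{\pi}$, and we have the maximized expected likelihood as
$$\bar{l}_P(\bar{z}) = \sum_{m=1}^{M} \sum_{i<j} \left\{ \bar{\pi}^{(m)}_{\bar{z}_i \bar{z}_j} \log \bar{\pi}^{(m)}_{\bar{z}_i \bar{z}_j} + \left(1 - \bar{\pi}^{(m)}_{\bar{z}_i \bar{z}_j}\right) \log\left(1 - \bar{\pi}^{(m)}_{\bar{z}_i \bar{z}_j}\right) \right\}. \qquad (3.23)$$
On the other hand, the expected restricted likelihood is maximized by the parameter array $\bar{\pi}^R$ under the restricted parameter space of RMLSBM. Note that this is different from the true model parameter array $\bar{\pi}$ due to the restrictions imposed on the parameter space. Using the transformation introduced in Equation (3.1), the maximized expected restricted likelihood is
$$\bar{l}^R_P(\bar{z}) = \sum_{m=1}^{M} \sum_{i<j} \left\{ \bar{\pi}^{(m)}_{\bar{z}_i \bar{z}_j} \log \bar{\phi}^{(m)}_{\bar{z}_i \bar{z}_j} + \left(1 - \bar{\pi}^{(m)}_{\bar{z}_i \bar{z}_j}\right) \log\left(1 - \bar{\phi}^{(m)}_{\bar{z}_i \bar{z}_j}\right) \right\}. \qquad (3.24)$$
The next theorem relates the difference between observed and true likelihood with the fraction of misclustered nodes $N_e(\hat{z})$ and the expected total number of edges L to establish a bound for the misclustering rate.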
Theorem 4. Suppose the data are generated according to a K-class MLSBM with membership vector $\bar{z}$ and parameter array $\bar{\pi}$, the conclusion of Theorem 2 holds, and the following conditions hold with respect to the model sequence: for all blockmodel classes q = 1, . . . , K, class size $N_q$ grows as $s = \min_q \{N_q\} = \Omega(N/K)$, and over all distinct class pairs (q, l) and Note that condition (3.25) is very similar to condition (ii) of Theorem 3 in Choi et al. (2012), with the total number of edges for the single layer case being replaced by the average number of edges L/M in each layer for the multi-graph. This ensures that any two rows in any of the layer matrices $\bar{\pi}^{(m)}$ of $\bar{\pi}$ differ in at least one entry by at least a constant times $LK/(MN^2)$. Also, when we take into account the asymptotic conditions required on the growth of K and L for the result of Theorem 2 to hold, i.e., $K = O(N^{1/2})$ and $L = \omega(MN(\log N)^{3+\delta})$ with M and N both growing, then we have $LK/(MN^2) = \omega\left((\log N)^{3+\delta}/N^{1/2}\right)$. As argued in Choi et al. (2012), if L is close to its least possible rate of growth, $LK/(MN^2)$ goes to 0 for large N and the condition is not too prohibitive. For example, if $L = MN(\log N)^{\beta}$ with β > 4, then $(\log N)^{\beta} = o(N^{1/2})$, so $LK/(MN^2)$ goes to 0 and the condition is not overly restrictive. We state the corresponding conclusion for the restricted likelihood estimation (for RMLSBM) in the next theorem, i.e., the class membership assignment vector estimated through maximum likelihood estimation in the restricted model RMLSBM is consistent under data generated from the MLSBM.
Theorem 5. Suppose the data are generated according to a K-class MLSBM with membership vector $\bar{z}$ and parameter array $\bar{\pi}$, the conclusion of Lemma 3 holds, and the following conditions hold with respect to the model sequence: for all blockmodel classes $q = 1, \ldots, K$, the class size $N_q$ grows as $s = \min_q \{N_q\} = \Omega(N/K)$, and condition (3.27) holds over all distinct class pairs $(q, l)$. Then, under any of the five sets of growth conditions in Theorem 3, the bound (3.28) holds. Here g in condition (3.27) and the growth rate h depend on the asymptotic conditions imposed on K and L. The growth rate h can be determined from g by the relationship $h = KL/(MNg)$. In particular, g and h are specified separately for each asymptotic setup, beginning with (i) $K = O(N^{1/2})$, $L = \omega(MN(\log N)^{3+\delta})$ with M and N both growing arbitrarily.
Note that in Theorem 5, we have used the generic notations g and h to denote functions of network properties such as N, K and L. The functions g and h vary across asymptotic setups. This is because the regularity condition (3.27) on the difference among the elements of the block model probability matrices should be as unrestrictive as possible. Note that in our results, we have chosen g in such a way that if L is close to its least possible rate of growth, then g asymptotically decays to 0 under the assumed asymptotic setup. This ensures that our condition (3.27) is not overly restrictive. It also enables us to understand and contrast the asymptotic behavior of the RMLE from a unified point of view.

Sparse networks
The results of all previous theorems imply that for sparse multi-layer networks, consistency can be achieved with a large number of relatively sparse graphs as long as they together satisfy the edge density requirement. In the case when M grows slower than N, in MLSBM we do not get any relaxation in the required growth condition on the total expected number of edges from all the graph layers combined, and it remains $\omega(MN(\log N)^{3+\delta})$ for $K = O(N^{1/2})$. However, in RMLSBM we only require the total expected number of edges from all layers to be $\omega(N(\log N)^{3+\delta})$ for $K = O(N^{1/2})$ (Condition (iii) of Theorem 3). This implies that we only require the expected number of edges per layer to be $\omega(N(\log N)^{3+\delta}/M)$ on average. For perspective, if M grows faster than $(\log N)^{3+\delta}$, then the average number of edges per layer needs to grow only at $O(N)$, which is the sparse bounded degree regime. This case is extremely challenging for single layer networks. In comparison, the consistency of the MLE in MLSBM requires the average expected number of edges per layer to be $\omega(N(\log N)^{3+\delta})$ (Choi et al. 2012), and hence the average degree per layer must grow at least as $(\log N)^{3+\delta}$. Thus, under RMLSBM, consistency can be achieved with a large number of relatively sparse layers. This is particularly important as most modern applications of community detection in multi-layer graphs fall under this asymptotic scenario.
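As a rough numerical illustration of the two requirements, the reference scales can be compared for a concrete N and M. This is only a sketch: the results are asymptotic (the $\omega(\cdot)$ conditions require growth strictly faster than these scales), constants are ignored, and the function name and the choice $\delta = 0.1$ are illustrative assumptions rather than anything prescribed by the theorems.

```python
from math import log

def per_layer_degree_scale(n_nodes, n_layers, delta=0.1, restricted=True):
    """Reference scale for the average degree per layer (constants ignored).

    MLSBM:  total edges ~ M * N * (log N)^(3+delta) -> degree ~ (log N)^(3+delta)
    RMLSBM: total edges ~ N * (log N)^(3+delta)     -> degree ~ (log N)^(3+delta) / M
    """
    base = log(n_nodes) ** (3 + delta)
    return base / n_layers if restricted else base

# Hypothetical network size, chosen only for illustration.
n, m = 10_000, 500
mlsbm_scale = per_layer_degree_scale(n, m, restricted=False)
rmlsbm_scale = per_layer_degree_scale(n, m, restricted=True)
print(f"MLSBM per-layer degree scale:  {mlsbm_scale:.1f}")
print(f"RMLSBM per-layer degree scale: {rmlsbm_scale:.2f}")
```

The ratio of the two scales is exactly M, which is the relaxation described above.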

A large number of communities
Under MLSBM, consistent community detection is possible when the number of communities grows as $K = O(N^{1/2})$ and the total expected number of edges is $\omega(MN(\log N)^{3+\delta})$ as both M and N grow. However, if we assume $K = O((MN)^{1/2-\epsilon})$ for some $\epsilon > 0$, then we require the total expected number of edges to be $\omega(M^2 N(\log N)^{3+\delta})$, which is unrealistically dense. On the other hand, under RMLSBM consistent estimation is possible with comparable edge density even when the number of communities grows faster, either as $K = O((MN)^{1/2-\epsilon})$ when both M and N grow but $M = O(N)$, or at the rate allowed by Condition (iv) when N grows slower than M (Conditions (ii) and (iv) of Theorem 3). Hence the restricted model is advantageous for community detection in networks with a large number of communities.

Baseline procedures
We define three intuitively simple baseline procedures for community detection in multi-layer networks. The first two are based on aggregating the layers of the graph and the third one is an ensemble of results from single layer community detection through majority voting.
The first aggregate procedure, which we call "agg-mean", creates a binary network on the nodes by adding an edge between two nodes if they are connected in more than half of the layers. Hence an edge between two nodes, $A^{agg-mean}_{ij}$, is a Bernoulli random variable with probability $P(\sum_m A^{(m)}_{ij} > M/2)$. However, this method of collapsing a multi-layer graph into a single layer graph is not very useful for the sparse graph regimes we are interested in, because the probability that two nodes are connected in more than half of the layers is vanishingly small. Hence the new graph created by this procedure will have asymptotically few edges.
A more appropriate aggregate measure is to create a network by adding an edge whenever $\sum_m A^{(m)}_{ij} > 0$. We call this procedure "agg-sparse". Note that in this case the edge between two nodes, $A^{agg-sparse}_{ij}$, is a Bernoulli random variable with probability $1 - \prod_m (1 - P^{(m)}_{ij})$, assuming edges are independent across layers given the community assignments. Clearly this network is also generated by a SBM with the same community assignment vector as the original multi-layer network. The probability of an edge, given the block assignments, can also be written in terms of those of the original network as $\pi^{agg}_{ql} = 1 - \prod_m (1 - \pi^{(m)}_{ql})$. Hence from known results on single layer SBM, a maximum likelihood procedure will be able to recover the node assignments consistently (Choi et al. 2012). From now on "aggregate SBM" will refer to this sparse model. We compare this baseline aggregate SBM with the multi-layer models, MLSBM and RMLSBM, in terms of minimax rates (Gao et al. 2015) and consistency thresholds (Mossel et al. 2014; Abbe and Sandon 2015; Hajek et al. 2014) in the next section.
The third baseline procedure performs community assignment through majority voting: a node is assigned to a cluster if it belongs to that cluster in the majority of the cluster assignments obtained through MLEs in the individual layers. The cluster labels obtained from different single layer MLEs are aligned with each other by solving the linear sum assignment problem.
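The three baselines can be sketched in a few lines. This is a minimal illustration on toy data, not the implementation used in the paper: layers are assumed to be symmetric 0/1 adjacency matrices stored as nested lists, the function names are hypothetical, and label alignment is done by brute force over all permutations as a stand-in for a linear sum assignment solver (practical only for small K).

```python
from collections import Counter
from itertools import permutations

def agg_mean(graphs):
    """Edge if the pair is connected in more than half of the layers."""
    m, n = len(graphs), len(graphs[0])
    return [[int(i != j and 2 * sum(g[i][j] for g in graphs) > m)
             for j in range(n)] for i in range(n)]

def agg_sparse(graphs):
    """Edge if the pair is connected in at least one layer."""
    n = len(graphs[0])
    return [[int(i != j and any(g[i][j] for g in graphs))
             for j in range(n)] for i in range(n)]

def align(reference, labels, k):
    """Relabel `labels` with the permutation that best agrees with `reference`.

    Brute force over k! permutations (assumption: small k); a linear sum
    assignment solver would replace this step for larger k.
    """
    best = max(permutations(range(k)),
               key=lambda p: sum(p[b] == a for a, b in zip(reference, labels)))
    return [best[b] for b in labels]

def majority_vote(assignments, k):
    """Align every layer's labels to the first layer, then vote node by node."""
    ref = assignments[0]
    aligned = [ref] + [align(ref, z, k) for z in assignments[1:]]
    return [Counter(col).most_common(1)[0][0] for col in zip(*aligned)]

# Toy example: three layers on three nodes.
g1 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
g2 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
g3 = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
mean_graph = agg_mean([g1, g2, g3])
sparse_graph = agg_sparse([g1, g2, g3])

# Toy example: three single-layer clusterings of four nodes, labels permuted.
votes = majority_vote([[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 0]], k=2)
```

Aligning to the first layer is one simple convention; other reference choices are possible and may give different tie-breaking behaviour.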

Minimax rates and sharp thresholds
In this section we derive the minimax rates of misclassification error and sharp thresholds for consistency of community detection in MLSBM and the aggregate SBM. For this analysis, we further assume that all the layers are informative of the underlying community assignments even though the quality of that information in terms of "signal to noise ratio" can vary, i.e., either all layers have more intra-community edges compared to inter-community edges or vice versa; formally, this is a condition on the entries $\pi^{(m)}_{ql}$ that holds for all q, l, m. To align notations and settings with Zhang and Zhou (2015), we slightly modify the growth condition on class sizes of Theorems 4 and 5 to $N_q \in [N/(sK), sN/K]$ with $s \ge 1$, and redefine the parameter space of our undirected symmetric MLSBM with no self loops accordingly, with P, z, $N_q$, s, N, K, M as defined previously. Note that the parameters $a^{(m)}$ and $b^{(m)}$ represent the lowest intra-community probability and the highest inter-community probability for layer m respectively. As per assumption, $a^{(m)} > b^{(m)}$ within a layer m; however, no relationship is assumed among the parameters across layers. We define $I^{(m)}$ as the Renyi divergence (Van Erven and Harremoës 2014) of order 1/2 between two Bernoulli distributions $\mathrm{Bern}(a^{(m)}/N)$ and $\mathrm{Bern}(b^{(m)}/N)$, i.e., $I^{(m)} = -2\log\left(\sqrt{\frac{a^{(m)}}{N}\frac{b^{(m)}}{N}} + \sqrt{\left(1-\frac{a^{(m)}}{N}\right)\left(1-\frac{b^{(m)}}{N}\right)}\right)$. Let $\bar{z}$ denote the true community labels of the MLSBM and $\hat{z}$ be an estimate of it. Then we define the mis-clustering rate of $\hat{z}$ with respect to $\bar{z}$ up to permutations as the minimum over $\delta$ of the Hamming distance $d_H$ between $\delta(\hat{z})$ and $\bar{z}$, where $\delta(\cdot)$ is a permutation of the community labels and $d_H(\cdot)$ is the Hamming distance. Then we have the following result for MLSBM (proved in the Appendix).
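Both quantities defined here are easy to compute directly. The sketch below assumes the standard closed form of the order-1/2 Renyi divergence between two Bernoulli distributions and normalizes the Hamming distance by the number of nodes; the function names are hypothetical, and the permutation search is brute force, so it is only meant for a small number of communities.

```python
from itertools import permutations
from math import log, sqrt

def renyi_half_bernoulli(p, q):
    """Renyi divergence of order 1/2 between Bern(p) and Bern(q)."""
    return -2.0 * log(sqrt(p * q) + sqrt((1.0 - p) * (1.0 - q)))

def misclustering_rate(z_hat, z_true, k):
    """Fraction of mislabeled nodes, minimized over label permutations."""
    n = len(z_true)
    best = n
    for perm in permutations(range(k)):
        mismatches = sum(perm[a] != b for a, b in zip(z_hat, z_true))
        best = min(best, mismatches)
    return best / n

# Illustrative values (assumptions): N = 1000 nodes, a = 30, b = 25 in one layer.
n_nodes = 1000
divergence = renyi_half_bernoulli(30 / n_nodes, 25 / n_nodes)
rate = misclustering_rate([1, 1, 1, 0], [0, 0, 1, 1], k=2)
```

In the example, relabeling the estimate (swapping 0 and 1) leaves a single mislabeled node out of four.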
Theorem 6. Under the assumption that for any s ∈ [1, 5/3] and some sequence , at least a constant fraction of nodes are mis-clustered.
The above theorem implies that for MLSBM the minimax risk of error decays exponentially, and if $N\sum_m I^{(m)}/K \to \infty$, the rate goes to 0 asymptotically, i.e., exact recovery of community labels is possible. Moreover, from the proof of Theorem 6 in the Appendix, there exists a procedure which achieves this rate. On the other hand, if $N\sum_m I^{(m)}/K = O(1)$, then the minimax risk of error is lower bounded by a constant (see the part on the lower bound in the proof in the Appendix), implying that consistent recovery is not possible in such situations.
Since the model "agg-sparse" is itself a single layer SBM and m P we have the following result using Theorem 1.1 of Zhang and Zhou (2015).
The previous two theorems state results about the fundamental properties of the two models which allow us to compare the models without going into the specifics of the method used to compute the class assignments in practice.
Since the Renyi divergence $I^{(m)} \ge 0$ for all m, we have $\sum_{m'} I^{(m')} \ge I^{(m)}$ for all m. Hence the minimax rate for MLSBM is lower than that of every individual single layer SBM. Moreover, since Renyi divergence is convex, we have $\frac{1}{M}\sum_m I^{(m)} \ge \frac{1}{M} I_{agg}$ asymptotically. This can be shown using Jensen's inequality with the concave functions $\log(x)$ and $\sqrt{x}$, where $x = b^{(m)}/a^{(m)}$ (see Theorem 11 of Van Erven and Harremoës (2014) for a proof), and then using the asymptotic form of the divergence (Zhang and Zhou 2015). Hence the minimax rate of MLSBM is at most that of the aggregate graph. Note that equality in the above inequality is achieved if and only if all the $I^{(m)}$'s are equal and $b^{(m)}/a^{(m)}$ is equal for all m. We recognize the quantities $b^{(m)}/a^{(m)}$ and $I^{(m)}$ as signal to noise ratios in the mth layer. Hence the MLSBM has a lower minimax rate compared to the aggregate SBM as long as the signal quality in different layers varies. This result becomes intuitively apparent if we note from the proof of the above theorems that, given the parameters are known or accurately estimated, the penalized maximum likelihood (ML) decision rule, which attains the minimax rate of error in MLSBM, weights the edges from different layers by $c^{(m)}$ before adding. The penalty terms also get weighted by $k^{(m)}$ before being added. The weight $c^{(m)}$ can be thought of as a measure of the signal to noise ratio. Hence, layers with high signal to noise ratio, i.e., high quality information for the purpose of community detection, get more weight. In contrast, the penalized ML decision rule in the aggregate graph SBM by construction adds layers without weighting. Hence intuitively the result on minimax rates makes sense: if all layers contain the same amount of information, then it is immaterial whether the decision rule weights the graphs by information content or not, but in all other cases giving more weight to the more informative layers pays off.
Moreover, while it is clear that MLSBM has a lower minimax rate than the individual layer SBMs, the same is not trivially true for the aggregate graph. Since $I^{(m)}$ can be written in terms of the signal to noise ratio, for $I_{agg}$ to be large the sums of the probabilities $\sum_m a^{(m)}$ and $\sum_m b^{(m)}$ must be well separated. This is not always guaranteed, as large $a^{(m)}$'s and $b^{(m)}$'s with a relatively small difference can overshadow a large difference in smaller $a^{(m)}$'s and $b^{(m)}$'s when added. We will take this point up again in the next section, where we discuss sharp thresholds for consistency.
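A small numerical check makes the overshadowing effect concrete. The sketch below is an illustration under stated assumptions: two layers with hypothetical parameters (one dense layer with a weak signal, one sparse layer with a strong signal), the closed form of the order-1/2 Renyi divergence between Bernoulli distributions, and an aggregate edge probability of $1 - \prod_m (1 - p^{(m)})$, i.e., independent layers combined by "at least one edge".

```python
from math import log, sqrt

def renyi_half(p, q):
    """Renyi divergence of order 1/2 between Bern(p) and Bern(q)."""
    return -2.0 * log(sqrt(p * q) + sqrt((1.0 - p) * (1.0 - q)))

def aggregate_prob(probs):
    """P(at least one edge) for independent layers with the given probabilities."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

n = 1000
a = [30.0, 4.0]  # hypothetical a^(m): layer 1 dense, layer 2 sparse
b = [25.0, 1.0]  # hypothetical b^(m): layer 1 weak signal, layer 2 strong signal

layer_divergences = [renyi_half(am / n, bm / n) for am, bm in zip(a, b)]
multi_layer_total = sum(layer_divergences)
agg_divergence = renyi_half(aggregate_prob([am / n for am in a]),
                            aggregate_prob([bm / n for bm in b]))

print("per layer:", layer_divergences)
print("sum over layers:", multi_layer_total)
print("aggregate graph:", agg_divergence)
```

With these numbers the aggregate divergence falls below not only the sum over layers but also the divergence of the sparse layer alone, because the dense, weakly informative layer dominates the summed probabilities.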
We note that the model RMLSBM is a MLSBM with a restricted parameter space $\Pi_R$. Hence Theorem 6 gives the minimax rate under the restricted parameter space, with the divergence in the mth layer expressed through $\phi$, the transformation of the parameters in RMLSBM as defined before. In particular, we have $\mathrm{logit}(\phi^{(m)}_a) = a + \beta_m$. The rate for the aggregate SBM under RMLSBM can similarly be obtained using the aggregate SBM result above. This implies that (a) if RMLSBM is the true data generating model then it has a lower minimax rate compared to each of the individual layers, and (b) by the earlier discussion it also has a lower minimax rate compared to the aggregate SBM constructed from a RMLSBM graph, since neither of the two quantities in the equality condition is equal for all m.

Sharp consistency thresholds
We derive sharp thresholds for strong and weak consistency for community detection (Mossel et al. 2014;Abbe and Sandon 2015) in MLSBM and the aggregate SBM under two scenarios: sparse graph with average degree per layer o(log n) and ultra-sparse graph with average degree per layer o(1).
In the first case, let $a^{(m)} = \alpha > 0$ for all m. Then Corollary 4.1 of Zhang and Zhou (2015) gives, assuming $K = N^{o(1)}$, the sharp threshold for the existence of a strongly consistent estimator for the mth layer SBM. Clearly, if the threshold is met in each of the layers, then it will be met in the aggregate SBM as well. However, in the more realistic case where this threshold is not met in all the layers, whether the aggregate SBM has a strongly consistent estimator depends on whether the sums of probabilities are separated well enough to meet the threshold, which in turn depends on the relatively denser layers. To see this, note that this threshold can be written as a ratio whose denominator, for the aggregate graph, is dominated by the dense layers; hence the difference between a and b must be large in the dense layers for the aggregate to be consistent. In other words, strong signals in sparse layers will be ignored if the signals in dense layers are not strong. On the other hand, for MLSBM, strong consistency is achieved if any of $N I^{(m)}/K \to \infty$ or their sum goes to infinity. This implies a threshold condition (of the form "> 1") that is achieved if at least one of the layers achieves the consistency threshold or the layers together achieve it. By the earlier argument, this threshold consists of a sum of normalized signal to noise ratios; hence all layers, dense or sparse, get equal weight in determining the threshold. The consistency threshold for RMLSBM follows from Theorem 6. Here we note that the threshold for RMLSBM is also a sum of normalized signal to noise ratios. However, since the parameter space is restricted, the difference between the inter- and intra-community parameters is uniform across layers, and variations in the above sum come only from the normalizing factor due to the layer-specific sparsity parameter.
Qualitatively, the minimax rate and consequently the threshold in MLSBM take into account variations in both signal quality and sparsity while adding contributions from different layers. RMLSBM tries to estimate the signal to noise ratio in each layer by two parameters: one global parameter which signifies the aggregate signal quality, and one layer-specific parameter which signifies sparsity. Hence although RMLSBM ignores the variation in signal quality, it attempts to reduce the undue influence of dense layers by taking into account the variation in sparsity. The aggregate SBM, on the other hand, does not take into account either the signal quality or the sparsity, and hence is heavily influenced by dense layers irrespective of signal quality. Hence both RMLSBM and the aggregate SBM would perform well if all the layers have similar signal strength and similar density. If the layers do not have similar density but the signal strength across layers can be reasonably well approximated by an average signal strength, RMLSBM will still be able to detect it through the noise and perform well. Clearly, RMLSBM and the aggregate graph will not perform well if both signal strength and sparsity of layers vary widely, and we need to resort to MLSBM in such cases.
In the bounded degree case, while consistent recovery is not possible in each of the layers since the graph is not fully connected (only detection is possible), consistent recovery is still possible in the multi-layer models. The condition for consistent recovery in MLSBM is stated in terms of $a^{(m)} = o(1)$ and $b^{(m)}$. Note that the condition for detection or weak recovery, defined as finding a partition correlated with the true community structure, for two communities is $(a-b)/\sqrt{a+b} > \sqrt{2}$ (Mossel et al. 2012, 2013).

Estimation using mixture model approach
Simultaneous maximum likelihood estimation of parameters and class assignments in the stochastic blockmodel is a difficult problem (Nowicki and Snijders 2001; Choi et al. 2012; Rohe et al. 2012). The same difficulties remain in the MLSBM and its restricted version. Consequently, to obtain an estimation algorithm here, we view the MLSBM as a mixture model with discrete latent variables Z. In this case, $Z_i$ is a missing random variable that follows a multinomial distribution with K parameters: $Z_i \sim \mathrm{Mult}(1, \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K))$. We follow the framework laid out by Daudin et al. (2008) to simultaneously estimate the conditional blockmodel parameters and the class assignments with the variational EM technique.
The derivations for MLSBM are straightforward extensions of the corresponding formulas in Daudin et al. (2008) and are omitted in this paper, while the update rules for RMLSBM are derived in the Appendix. The update steps for MLSBM and RMLSBM are also provided in the Appendix under Algorithm 1 and Algorithm 2 respectively.

Simulation results
In this section we numerically test the asymptotic results and compare the performance of the methods through a simulation study. We generate data from the more general model, MLSBM. We then compare the relative performance of the two multi-layer methods (MLE and RMLE) between themselves as well as with single layer methods and baseline methods such as majority voting and MLE in aggregate SBM. The comparison is done under various settings on the number of nodes N , the number of communities K, the number of types of relations M , and the expected total number of edges L.
Since the true class labels of the nodes are known in simulated data, we compare the class assignments from different methods with the true labels. We use correct clustering rate (CCR) and normalized mutual information (NMI) as measures of similarity between partitions. The CCR counts the fraction of nodes whose cluster assignment matches the true class label (as determined by the true class label of the majority of nodes in that cluster). The higher the CCR, the better the performance of the clustering method. The NMI is an information theoretic measure of the mutual dependence or similarity of two random variables. The NMI takes values in the range of 0 to 1, with 0 indicating random cluster assignment with respect to the true class labels, and 1 indicating perfect match between the true and assigned clusters. If NMI is 0, it means even though the cluster assignment was not completely random and done according to some algorithm, the solution presents no information regarding the true class labels. Since the results in terms of CCR are very similar to that of NMI, we omit those results here to save space.
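The two measures can be sketched as follows. This is an illustrative implementation rather than the one used for the reported results: function names are hypothetical, and NMI is normalized by the arithmetic mean of the two entropies, which is one of several common conventions and an assumption here.

```python
from collections import Counter
from math import log

def ccr(true_labels, pred_labels):
    """Correct clustering rate: each cluster is mapped to its majority true label."""
    by_cluster = {}
    for t, p in zip(true_labels, pred_labels):
        by_cluster.setdefault(p, []).append(t)
    correct = sum(Counter(ts).most_common(1)[0][1] for ts in by_cluster.values())
    return correct / len(true_labels)

def entropy(labels):
    """Shannon entropy (natural log) of the empirical label distribution."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(true_labels, pred_labels):
    """Mutual information divided by the mean of the two entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    ct, cp = Counter(true_labels), Counter(pred_labels)
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    denom = entropy(true_labels) + entropy(pred_labels)
    # Convention (assumption): two trivial one-cluster partitions count as identical.
    return 2.0 * mi / denom if denom > 0 else 1.0

truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]      # same partition, labels swapped
unrelated = [0, 1, 0, 1]    # carries no information about the truth
```

Both measures are invariant to relabeling of the clusters, so no explicit label alignment is needed before calling them.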
In all the simulation studies we repeat the experiments 50 times and take the average of our measures across them. We first generate the node labels independently from a multinomial distribution with probabilities P (Z i = k) = α k . Then we generate the data using the node labels and M different connectivity matrices, all of which give larger probability to connections within groups in comparison to the connections between groups. However, we vary the "signal to noise ratio" (SNR) from layer to layer by varying the ratio of the diagonal and off diagonal elements of the parameter matrix.
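The data-generating step described here can be sketched as follows, under the assumptions that each layer is an undirected graph without self loops and that edges are drawn independently given the labels. The function name, the seed handling and the toy connectivity matrices are illustrative choices, not the settings used in the study.

```python
import random

def sample_mlsbm(n_nodes, alpha, layer_matrices, seed=0):
    """Draw labels Z_i ~ Mult(1, alpha) and one symmetric 0/1 graph per layer."""
    rng = random.Random(seed)
    k = len(alpha)
    z = rng.choices(range(k), weights=alpha, k=n_nodes)
    graphs = []
    for pi in layer_matrices:
        adj = [[0] * n_nodes for _ in range(n_nodes)]
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                # Edge probability depends only on the two block labels.
                if rng.random() < pi[z[i]][z[j]]:
                    adj[i][j] = adj[j][i] = 1
        graphs.append(adj)
    return z, graphs

# Two toy layers with different signal to noise ratios (diagonal / off-diagonal).
strong = [[0.30, 0.10], [0.10, 0.30]]   # ratio 3
weak = [[0.22, 0.20], [0.20, 0.22]]     # ratio only slightly above 1
labels, layers = sample_mlsbm(40, alpha=[0.5, 0.5], layer_matrices=[strong, weak])
```

Repeating this with different seeds gives the independent replications over which the accuracy measures are averaged.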
We consider two scenarios: (i) all layers are sparse and have strong SNR, (ii) the layers are mixed in terms of sparsity and signal strength in the following way: two layers are sparse and have strong signal, two layers are dense and have weak signal, and one layer is dense with strong signal. While the first scenario is a rather idealistic scenario where all layers are "similar" in the sense that they are sparse and strongly informative about the underlying community structure, the second scenario (also considered in Papalexakis et al. (2013)) is more realistic in applications. For the first scenario, the SNR is kept at 3-4 and sparsity is varied slightly from layer to layer in such a way that variational EM algorithm for community detection on each of the layer individually gives very similar performance. The connectivity matrix parameters are then sampled from a uniform distribution within a small range so as to maintain SNR requirement while having different values for each of the entries of the matrix. For the second scenario, the informative strong signal layers have a SNR of 3 while the non-informative weak signal layers have a SNR only marginally greater than 1. We again sample the actual values of the parameters from a uniform distribution within a small range.
The initial guess for the variational algorithm in both MLE and RMLE is obtained by a two step procedure. On a randomly selected layer we first run spectral clustering to generate an initial guess and then we use this to run a variational EM algorithm on that layer. We use the class assignment and fitted SBM parameters from that layer as our initial guess for the MLSBM parameters. In our simulation results described below, the final solution of class assignments for both the MLE and the RMLE mostly turns out to be an improved estimate of the true class assignments irrespective of which layer we choose to initialize the method.

Fixed K and M while N increases
In this simulation, we take M = 5 types of edges or network layers, each with a separate connectivity matrix inducing a different network according to the schemes described above. We keep the number of communities K fixed at 10 and vary the number of nodes N from 100 to 600. The aim of this study is to compare the two multi-layer methods with the single layer methods and baseline methods in terms of the number of nodes required to achieve a consistent estimation of community assignment with moderately low number of communities. Figures 2(a) and (b) display the results from this study for the two scenarios respectively. Clearly the MLE in MLSBM and RMLSBM reach NMI of close to 1 faster than the single layer ones as well as majority voting as the number of nodes increases. The algorithm in aggregate layer performs similarly to that in MLSBM and RMLSBM for the first (all strong signal) scenario (Figure 2(a)), however it performs poorly for the second (mixed signals) scenario (Figure 2(b)). This shows that aggregating edges across layers works fine if the information quality is similar across layers, but it is not robust if the information content changes across layers. The accuracy of majority voting behaves similarly to the single layer ones. Moreover, for a small number of nodes, the MLE in RMLSBM performs better than all the other methods considered in both scenarios.

Fixed N and M while K increases
In this simulation, we test the performance of the multi-layer methods against the single layer and baseline methods with increasing number of communities. We fix the number of nodes N and the number of layers M at 400 and 5 respectively, while we let K increase from 6 to 22 in steps of 4. The results from this simulation study are displayed in Figures 2(c) and (d). Whereas the accuracy of community detection in all the single layer methods and the majority voting decreases rapidly with increasing number of communities, the multi-layer methods explored here, especially the RMLSBM, perform well even with a large number of communities. Between RMLSBM and MLSBM, RMLSBM clearly outperforms MLSBM as the number of communities grows. This simulation also serves as a test of robustness of RMLSBM for small number of communities. We notice that in both scenarios, RMLSBM behaves similarly to MLSBM and does not break down for small number of communities. In the all-strong scenario, the MLE in aggregate SBM outperforms both MLSBM and RMLSBM for small communities, but similar to MLSBM, its accuracy also quickly drops as K increases (Figure 2(c)). In the mixed signal scenario, the MLE in aggregate SBM performs much worse compared not only to MLSBM and RMLSBM, but also to majority voting and the best performing MLE among the individual layers. To put things into perspective, for the all-strong scenario, while the NMI for MLSBM, aggregate SBM, majority voting and the single layer SBMs reduce below 0.5, it settles to a value close to 0.8 for RMLSBM as the number of communities increases to 20.

Fixed N and K while M increases
In this simulation, we keep the number of nodes N and the number of communities K fixed at 300 and 15 respectively, while we increase the number of layers M gradually from 3 to 12. For this simulation, each layer of the multi-layer network was generated from a K-class SBM with a simple connectivity matrix given by $P_{K \times K} = \epsilon(\lambda I_K + 1_{K \times K} - I_K)$. In the first scenario, the parameters are $\epsilon = 0.10 + U(-0.02, 0.02)$ and $\lambda = 3$, while in the second scenario, the parameters are $\epsilon = 0.09 + U(-0.03, 0.03)$ and $\lambda = U(1.5, 3)$. Here $U(a, b)$ is a random number generated from the uniform distribution between a and b. Note that in the first scenario, all layers are sparse and have strong signals, while in the second scenario, we let both sparsity and signal strength vary across the layers. This second scenario is a good test of the robustness of different multi-layer methods.
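The connectivity matrices for the two scenarios can be built as in the sketch below. It assumes the reading in which the off-diagonal entries equal a sparsity level $\epsilon$ and the diagonal entries equal $\epsilon\lambda$, so that $\lambda$ is the ratio of intra- to inter-community probabilities; the function names are hypothetical.

```python
import random

def connectivity(k, eps, lam):
    """K x K matrix with eps * lam on the diagonal and eps elsewhere."""
    return [[eps * lam if q == l else eps for l in range(k)] for q in range(k)]

def sample_layer_params(scenario, rng):
    """Draw (eps, lam) for one layer under scenario 1 or scenario 2."""
    if scenario == 1:
        return 0.10 + rng.uniform(-0.02, 0.02), 3.0
    return 0.09 + rng.uniform(-0.03, 0.03), rng.uniform(1.5, 3.0)

rng = random.Random(1)
k, n_layers = 15, 5
scenario_one = [connectivity(k, *sample_layer_params(1, rng)) for _ in range(n_layers)]
scenario_two = [connectivity(k, *sample_layer_params(2, rng)) for _ in range(n_layers)]
```

Under these ranges every entry stays within (0, 0.36], so all entries are valid edge probabilities.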
We compare the performance of MLE in MLSBM and RMLSBM with majority voting and aggregate SBM in terms of the accuracy of community detection in Figures 2(e) and (f). The curves for majority voting in both figures remain almost flat with increasing number of layers, indicating that the accuracy of community detection does not improve with more layers. The MLE of aggregate SBM performs well initially, but its accuracy quickly falls with increasing number of layers as the model assumption that $\sum_m A^{(m)}_{ij} > 1$ with vanishing probability breaks down. For MLSBM, the accuracy increases initially; however, the improvement quickly slows down and both the curves in Figures 2(e) and (f) flatten with increasing layers. This is because the number of parameters to be estimated also grows quickly with the number of layers, which reduces efficiency. For RMLE, the accuracy of community detection generally increases with increasing number of layers and is almost always higher than that of all other methods.
The three studies clearly point out the advantages of the multi-layer methods over the single layer ones and the baseline ones, as well as the relative advantage of RMLSBM over MLSBM within the scope of the simulations.

Twitter UK politics dataset
In this section we test our methods on a real dataset on interactions between British Members of Parliament (MPs) in the social networking site Twitter curated by Greene and Cunningham (2013). Although the original dataset consists of 419 nodes, we only considered the largest subset that is connected across all layers for our analysis. Hence our multi-layer network consists of 381 nodes. The different layers of network we have correspond to three direct relations: "mentions", "follows" and "retweets", and three derived relations: "mentioned by the same person (co-mentions)", "followed by the same person (co-follows)", and "retweeted by the same person (co-retweets)". All relations are assumed to be binary by assigning one if the relation is true for at least one case (e.g., if at least one person follows both MP i and MP j, then the relation "co-follows" between the two MPs is true). All the relations individually can be represented as graphs. For the graphs with direct relations, "mentions", "follows" and "retweets", a directed edge from node i to node j implies that MP i mentioned, followed or retweeted respectively MP j at least once in his/her tweets. We converted all directed edges into undirected edges for this analysis. Average degrees of nodes in different network layers are presented in Table 1. Note that among the direct layers, "follows" is relatively dense compared to "mentions" and "retweets", while the derived networks are overall much denser compared to the direct ones. The goal here is to cluster the MPs into communities based on the information about their twitter activities. The ground truth communities are known to be consisting of five communities corresponding to the political affiliations of the MPs: 152 Conservative, 178 Labour, 39 Liberal Democrat, 5 SNP and 7 Other MPs. The clustering quality is assessed through NMI and CCR as before.
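The preprocessing described here (binarizing, dropping edge direction, and deriving "co-" relations) can be sketched on a toy example. This is an illustration only: the function names are hypothetical, and the derived relation is computed from the rows of the same matrix, which assumes the people doing the following, mentioning or retweeting are themselves rows of the matrix; in the actual dataset they may include accounts outside the set of MPs.

```python
def symmetrize(directed):
    """Undirected 0/1 graph: i and j are linked if either direction is present."""
    n = len(directed)
    return [[int(i != j and (directed[i][j] or directed[j][i]))
             for j in range(n)] for i in range(n)]

def co_relation(directed):
    """co[i][j] = 1 if at least one row u relates to both target i and target j."""
    n_targets = len(directed[0])
    return [[int(i != j and any(row[i] and row[j] for row in directed))
             for j in range(n_targets)] for i in range(n_targets)]

# Toy "follows" matrix: row u, column v is 1 if u follows v.
follows = [
    [0, 1, 1],  # node 0 follows nodes 1 and 2
    [0, 0, 0],  # node 1 follows nobody
    [0, 1, 0],  # node 2 follows node 1
]
follows_undirected = symmetrize(follows)
co_follows = co_relation(follows)
```

In the toy example, nodes 1 and 2 are linked in the "co-follows" layer because node 0 follows both of them.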
Part (a) of Table 2 reports the performance of the algorithm for the six individual layers considered. Note that the performance of the derived networks is worse compared to the direct ones despite being denser. Clearly the signal in favor of the ground truth is stronger in the "direct networks" compared to the "derived networks". The performance of majority vote and of the MLEs in aggregate SBM, MLSBM and RMLSBM on multi-layer networks constructed from the three direct layers and from all layers together is given in part (b) of Table 2. In both cases the multi-layer methods outperform the baseline methods, and between the two multi-layer methods, RMLE outperforms MLE. From the results for direct networks, we note that the performance of multi-layer methods is not affected by the inclusion of relatively sparse layers ("mentions", "retweets"), and multi-layer methods perform better than the densest layer ("follows"), as long as the signal strength is high in all layers. However, the performance deteriorates as the signal quality worsens with the inclusion of the poorly performing derived networks. RMLSBM is more robust towards such layers with poor signal compared to MLSBM. The MLE in aggregate SBM performs poorly in the full network due to the number of layers in that network being too large.

Discussions
In this paper we extended the stochastic block model to the multi-layer setting with two related models, MLSBM and its restricted version RMLSBM. The community assignments through maximum likelihood estimation in both models are consistent under data generated from the more general model MLSBM with suitable conditions on the growth rate of the number of communities, the number of types of layers, and the total number of edges of the multi-layer graph. We also derived minimax rates of error and sharp thresholds for consistency of community detection in MLSBM, RMLSBM and a baseline model, the SBM obtained by aggregating the layers. We compared the proposed methods with the MLEs in single layer networks as well as two baseline methods, MLE in the aggregate SBM and majority voting, through results on asymptotic consistency and simulation. We demonstrated advantages of the MLE in RMLSBM over the MLEs from single-layer SBMs as well as majority voting and the MLE in MLSBM, both in the asymptotic consistency analysis and the simulation studies, when either the number of communities is large or the graph layers are relatively sparse. This includes the case when the individual layers have bounded average degree, which is an extremely challenging case for single layer networks. We would like to emphasize that handling the bounded degree case would not be possible with the usual MLSBM extension. Both baseline methods suffer from deficiencies that limit their ability to detect communities in multi-layer networks effectively. While the aggregation of graphs performs poorly if the community structure information contained in different layers is heterogeneous, majority voting fails to infer community structure correctly from a large number of layers with weak signals. The observations of this paper are in line with previous work in regression settings where a parsimonious model with similar accuracy is preferred over a model with a large number of parameters.
The RMLSBM approximates the MLSBM quite well with fewer parameters for most multi-layer networks which are sparse or have a large number of communities. Hence in such cases the RMLSBM outperforms the MLSBM.

Derivation of variational inference for RMLSBM
We derive the update rules for RMLSBM, starting from the complete data log likelihood of the restricted model. The likelihood of the observed data can be obtained by summing the complete data likelihood over all possible values of the unobserved missing class assignment labels Z. However, note that the number of all possible assignments grows exponentially as $K^N$, and the sum quickly becomes computationally intractable even for moderate N. Hence instead we use the EM algorithm for mixture models, where the unobserved class assignments are treated as missing values. However, one then needs to compute the conditional distribution of the missing values (class assignments here) given the observed data, i.e., $P(Z|A)$. Unfortunately, as argued by Daudin et al. (2008), $P(Z|A)$ is itself intractable, since the probability of the latent class assignment of a node depends not only on the observed edges connected to that node, but also on the connectivity pattern of the whole network. The variational approximation restricts the search for optimal class assignments to a smaller set by assuming that the class assignments follow a multinomial distribution with parameters known as variational parameters. It aims at maximizing an expression containing the log likelihood and the negative of the Kullback-Leibler (KL) divergence between the true probability distribution $P(Z|A)$ and its variational approximation $R_A(\cdot)$. If the approximation coincides with the true distribution, then the KL divergence is zero and the variational approximation is the same as the regular EM. So the new objective function to be optimized is a lower bound of $l(A)$. Here we constrain $R_A$ to be a product of multinomial densities. The variational distribution $R_A(Z)$ has the interpretation of being an approximation of $P(Z|A)$.
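For concreteness, the objects named in this derivation can be written out as follows. This is a hedged sketch rather than a transcription of the paper's displays: it assumes the logistic parametrization $\mathrm{logit}\, P(A^{(m)}_{ij} = 1 \mid Z_i = q, Z_j = l) = \pi_{ql} + \beta_m$ for RMLSBM, mixture weights $\alpha_q$, indicator variables $Z_{iq}$, and the product form of the variational distribution used in the framework of Daudin et al. (2008), with $\tau_{iq}$ denoting the variational probabilities.

```latex
% Complete data log likelihood (assumed logistic form); Z_{iq} = 1 if node i is in class q
\log P(A, Z) = \sum_{i}\sum_{q} Z_{iq}\log\alpha_q
  + \sum_{i<j}\sum_{q,l} Z_{iq} Z_{jl} \sum_{m}
    \Big[ A^{(m)}_{ij}(\pi_{ql}+\beta_m) - \log\big(1+e^{\pi_{ql}+\beta_m}\big) \Big]

% Variational lower bound on the observed data log likelihood l(A)
J(R_A) = l(A) - \mathrm{KL}\big(R_A(\cdot)\,\|\,P(\cdot \mid A)\big) \le l(A)

% Restriction of R_A to a product of multinomials with parameters \tau_i
R_A(Z) = \prod_{i=1}^{N}\prod_{q=1}^{K} \tau_{iq}^{Z_{iq}}, \qquad \sum_{q}\tau_{iq} = 1
```

Because the KL divergence is nonnegative, $J(R_A)$ equals $l(A)$ exactly when $R_A$ coincides with $P(\cdot \mid A)$, which is the case in which the procedure reduces to regular EM.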
In the E step of the variational EM algorithm, we compute the variational estimates of the class assignment probabilities for each node. Given the model parameters $\alpha$, $\pi$, $\beta$, the variational parameters $\tau$ can be computed by minimizing the function in Equation (9.1) with the constraint that $\sum_q \tau_{iq} = 1$ for all $i$. The solution for the $(t+1)$th EM step can be readily obtained as

In the M step we estimate the parameters of the model by maximizing the approximate likelihood. Since we do not have a closed form solution for the parameters $\pi$ and $\beta$, we use a gradient-based quasi-Newton method (the BFGS optimization algorithm) to simultaneously optimize the objective function with respect to all the parameters. The gradients of the objective function with respect to $\pi$ and $\beta$ are given in Equations (9.2) and (9.3). The two algorithms corresponding to the two models are described in Algorithm 1 and Algorithm 2 respectively.

Algorithm 1: Variational EM algorithm for MLSBM
  while the convergence criterion on the parameters is not met and t < t_max do
    // E-step: compute variational estimates τ = {τ_iq}
    while the convergence criteria on τ are not met and s < s_max do
      update the variational estimates τ_iq for each node i and class q
      s = s + 1
    end
    // Normalize the variational estimates so that they sum to 1 for each i
    // M-step: use the BFGS optimization method to find the parameters
    (π^(t+1), β^(t+1)) = arg max_{π,β} J(π, β)
    t = t + 1
  end
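The E step can be sketched as a fixed-point iteration. The update formula did not survive in the text above, so the code below rests on an assumption: the connection probability between classes $q$ and $l$ in layer $m$ is taken to be logistic in $\pi_{ql} + \beta_m$, and the update is the usual mean-field one for such a model. The function name `e_step`, the simultaneous updating of all nodes, and the toy inputs are illustrative choices, not the authors' implementation.

```python
import math


def e_step(A, tau, alpha, pi, beta, n_iter=20):
    """Mean-field E-step sketch (assumed logistic link, not confirmed by the source).

    A[m][i][j] : 0/1 adjacency in layer m (symmetric, zero diagonal)
    tau[i][q]  : current variational probability that node i is in class q
    alpha[q]   : class proportions
    pi[q][l]   : block effect on the logit scale
    beta[m]    : layer effect on the logit scale
    """
    n_nodes, n_classes, n_layers = len(tau), len(alpha), len(A)
    for _ in range(n_iter):
        new_tau = []
        for i in range(n_nodes):
            log_w = []
            for q in range(n_classes):
                s = math.log(alpha[q])
                for j in range(n_nodes):
                    if j == i:
                        continue
                    for l in range(n_classes):
                        for m in range(n_layers):
                            eta = pi[q][l] + beta[m]
                            # Bernoulli log likelihood with logit eta, weighted by tau[j][l]
                            s += tau[j][l] * (A[m][i][j] * eta - math.log1p(math.exp(eta)))
                log_w.append(s)
            # Normalize on the log scale so that each row sums to 1
            mx = max(log_w)
            w = [math.exp(v - mx) for v in log_w]
            total = sum(w)
            new_tau.append([v / total for v in w])
        tau = new_tau  # all nodes are updated simultaneously from the previous tau
    return tau


# Toy two-layer network with two clear groups {0, 1} and {2, 3} (hypothetical data)
layer = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
A = [layer, layer]
tau0 = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]
tau_hat = e_step(A, tau0, alpha=[0.5, 0.5], pi=[[1.0, -2.0], [-2.0, 1.0]], beta=[0.0, -0.5])
```

In a full variational EM loop this step would alternate with a numerical M step (for example a BFGS routine) that updates $\pi$ and $\beta$ with $\tau$ held fixed.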

Proofs of consistency results
Proof of Equation (3.16)

Proofs of main results
Before we describe the proofs of Theorems 1 and 2, we need the following lemma.

Lemma 1. For a fixed $z$, the number of values that $\hat\pi_{(z)}$ can take is bounded as $|\hat\Pi_{(z)}| \le (N/K + 1)^{MK(K+1)}$, and the number of values that $\hat\pi^R_{(z)}$ can take is bounded as

where $\hat\Pi_{(z)}$ and $\hat\Pi^R_{(z)}$ denote the range of $\hat\pi_{(z)}$ and $\hat\pi^R_{(z)}$ respectively for a fixed $z$.

Proof. We first determine the size of the set of all possible values that the MLE of the parameter array $\pi$ can take in the MLSBM. Notice that from Equation (3.5) the estimate $\hat\pi^{(m)}$ of the parameter matrix for any layer $m$ can take any of $\prod_{q \le l}(n_{ql} + 1)$ values, since its $K(K+1)/2$ upper diagonal components ($\hat\pi^{(m)}_{ql}$, $q \le l$, $q, l \in \{1, \ldots, K\}$) can take any of the $n_{ql} + 1$ values in the set $\{0, 1/n_{ql}, \ldots, 1\}$ independently. Hence, $|\hat\Pi| = \prod_m \prod_{q \le l}(n_{ql} + 1)$. However, this is subject to the constraint that $\sum_{q \le l} n_{ql} = \binom{N}{2}$. This implies that, for each layer, $|\hat\Pi|$ contains a product of $\binom{K+1}{2}$ positive terms whose sum is fixed. So $|\hat\Pi|$ is maximized when the terms are all equal, i.e., $n_{ql} = \binom{N}{2} / \binom{K+1}{2}$ uniformly across all $m$. Hence we have the following inequality:

$$|\hat\Pi| \le \left(\frac{\binom{N}{2}}{\binom{K+1}{2}} + 1\right)^{M\binom{K+1}{2}} \le \left(\frac{N}{K} + 1\right)^{MK(K+1)}.$$

Now we turn our attention to the set of values the MLE of the parameter array in the RMLSBM can take. Note that Equations (3.13) and (3.14) together represent $K(K+1)/2 + M$ equations involving partial sums of the MLEs of the $K(K+1)/2 + M$ elements in the parameter array $\pi^R$ (although the equations are written in terms of the transformation $\phi$ for convenience, they actually represent the same equations as Equations (3.10) and (3.11)). The right hand sides of the equations together are the sufficient statistics under the RMLSBM. Note that due to the identifiability constraint, we have only $K(K+1)/2 + M - 1$ free parameters. On the other hand, one of the equations in the set is also redundant, since adding together the first $M$ equations represented by Equation (3.13) and adding the remaining $K(K+1)/2$ equations represented by Equation (3.14) yield the same equation, and hence there is one linear dependence. This set of equations determines the MLE of $\pi^R$.
Hence the size of the set of all distinct solutions $\hat\pi^R$ is at most the number of possible systems of equations. To determine the latter, we notice that the right hand side of each of the first set of $M$ equations can take $N(N+1)/2 + 1$ values from the set $\{0, 2/[N(N+1)], \ldots, 1\}$, while the right hand side of each of the next set of $K(K+1)/2$ equations can take $M n_{ql} + 1$ values from the set $\{0, 1/(M n_{ql}), \ldots, 1\}$. So the size of the set of possible values the estimated parameter array $\hat\pi^R$ can take is at most

$$\prod_{q \le l}(M n_{ql} + 1) \times \left(\frac{N(N+1)}{2} + 1\right)^{M}.$$

The first term is maximized as before when all the $n_{ql}$'s are equal, i.e., $n_{ql} = \binom{N}{2} / \binom{K+1}{2}$. The second term is a fixed quantity. So we have

Lastly notice that the transformation defined by Equation (3.1) is an onto function but not necessarily one-to-one, so one or more parameter arrays $\pi^R$ map to one $\phi$. Hence for every estimate $\hat\phi$ there exists a corresponding estimate array $\hat\pi^R$. Therefore we have

For brevity of notation, henceforth we remove the subscript $(z)$ from $\pi_{(z)}$, $\pi^R_{(z)}$ and $\phi_{(z)}$, which denote the set of parameters of the MLSBM, the set of parameters of the RMLSBM, and the transformation of the set of parameters of the RMLSBM respectively for a fixed $z$. We also remove the subscript $(z)$ from $\hat\Pi_{(z)}$ and $\hat\Pi^R_{(z)}$.
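The counting argument in the proof of Lemma 1 relies on the fact that a product of terms $(n_{ql} + 1)$ with a fixed sum of the $n_{ql}$ is largest when all terms are equal. A minimal numerical check is sketched below for hypothetical small values of $N$, $K$ and $M$. It searches over all nonnegative integer splits of the $\binom{N}{2}$ pairs, which is a relaxation: not every split corresponds to actual block sizes, so the bound can only be looser than needed.

```python
import math


def n_values(block_pair_counts, n_layers):
    """Number of distinct MLE arrays: prod_m prod_{q<=l} (n_ql + 1)."""
    per_layer = math.prod(n + 1 for n in block_pair_counts)
    return per_layer ** n_layers


def compositions(total, parts):
    """Yield all tuples of `parts` nonnegative integers that sum to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


N, K, M = 6, 2, 2                     # hypothetical toy sizes
total_pairs = math.comb(N, 2)         # number of node pairs, 15
n_terms = math.comb(K + 1, 2)         # number of block pairs (q <= l), 3

best = max(compositions(total_pairs, n_terms), key=lambda c: n_values(c, M))
equal_split_bound = (total_pairs / n_terms + 1) ** (M * n_terms)
looser_bound = (N / K + 1) ** (M * K * (K + 1))
```

For these sizes the maximizing split is the equal one, and the looser closed-form bound in terms of $N/K$ dominates it by a wide margin.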

Proof of Theorem 1
The proof for the unrestricted case follows the structure of the proof of Theorem 1 in Choi et al. (2012). Following the arguments in the aforementioned paper, we first notice that for a fixed $z$, each estimate $\hat\pi$

Recall that $\hat\Pi$ denotes the set of values the estimate array $\hat\pi$ can take for a fixed class assignment $z$. In Lemma 1, we have bounded the size of this set as $|\hat\Pi| \le (N/K + 1)^{MK(K+1)}$. Now we consider the event that $\sum_{q \le l} n_{ql} \sum_m D(\hat\pi^{(m)}_{ql} \,\|\, \bar\pi^{(m)}_{ql})$ is at least as large as some $\epsilon > 0$, and derive an upper bound for its probability of occurrence:

Hence for all $\epsilon > 0$, we have over all $K^N$ possible class assignments $z$,

The proof for the restricted case, although it follows the same structure as before, is more involved, as we need to deal with estimating equations instead of closed form solutions. Note that for a fixed $z$, the left hand side of each of the $M$ estimating equations in (3.13) is a sum of $N(N+1)/2$ independent Bernoulli random variables with mean

for $q \le l$, $q, l \in \{1, \ldots, K\}$. Now since these $K(K+1)/2 + M$ estimating equations together determine the MLE $\hat\pi^R$ of the RMLSBM, the probability of any realization of $\hat\pi^R$ is bounded by the joint probability of the occurrence of the estimating equations. Note that although the equations within each of the two sets (3.13) and (3.14) are independent of each other, the two sets of equations are not independent of each other. Hence, because of the inequalities $P(A \cap B) \le P(A)$ and $P(A \cap B) \le P(B)$, we obtain the bounds in Equations (9.5) and (9.6). For brevity, we denote the right hand sides of Equations (9.5) and (9.6) by $\exp(-E_1)$ and $\exp(-E_2)$ respectively. From Lemma 1, we have the size of the set of all possible values $\hat\pi^R$ can take

Now we consider the event that $E_i$ is at least as large as some $\epsilon > 0$ for $i = 1, 2$ respectively.

Hence for all $\epsilon > 0$, we have over all $K^N$ possible class assignments $z$,

Proof of Theorem 2
First we note that $X$, as defined in Equation (3.9), is a sum of bounded independent random variables, because each element $X^{(m)}_{ij}$ in the sum is bounded by $C = 2\log(\sqrt{M}N)$ in absolute value. So we can use a Bernstein type inequality for sums of bounded independent random variables (Chung and Lu 2006) to obtain

Combining this inequality with the result in Theorem 1, we have over all possible $K^N$ class assignments $z$,

which goes to zero asymptotically as $N$ grows under the stated growth conditions on $K$ and $L$. So we have $\max_z |l(A; z) - \bar l_P(z)| = o_P(L)$.
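The role of a Bernstein type inequality can be illustrated numerically. The sketch below uses the classical one-sided Bernstein bound, $P(S - E S \ge t) \le \exp\{-t^2 / (2(\sigma^2 + Ct/3))\}$ for a sum $S$ of independent variables whose centered values are at most $C$, which may differ in constants from the version in Chung and Lu (2006) used in the proof. It compares the bound with the exact tail of a binomial sum; the values of $n$ and $p$ are hypothetical.

```python
import math


def bernstein_bound(t, variance, c):
    """Classical Bernstein upper bound on P(S - E[S] >= t)."""
    return math.exp(-t * t / (2.0 * (variance + c * t / 3.0)))


def binomial_upper_tail(n, p, k):
    """Exact P(S >= k) for S ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


n, p = 100, 0.1           # hypothetical sum of 100 Bernoulli(0.1) variables
mean = 10                 # n * p
variance = n * p * (1 - p)

# Each centered Bernoulli variable is bounded by 1, so c = 1.
comparison = {
    t: (binomial_upper_tail(n, p, mean + t), bernstein_bound(t, variance, 1.0))
    for t in (5, 10, 20)
}
```

Each entry of `comparison` pairs the exact tail probability with the bound, and the bound is never smaller than the exact value.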

Proof of Theorem 3
The proof for the RMLSBM is a slight modification of the earlier proof for the MLSBM. As before, we need to bound the two terms in the decomposition of the difference between the maximized likelihood and its expected value defined in Equation (3.16). For that we write the first part in the right hand side of (3.16), which we call $E_3$ here for brevity, in terms of the quantities we have already bounded in Theorem 1. We begin by noticing that, since the Kullback-Leibler divergence $D(a\|b)$ is convex, we can use a reverse of Jensen's inequality (Simic 2009; Budimir et al. 2001)

To derive the inequality, we used $-\log(\hat\phi_{ql})$ as our convex function of $\hat\phi_{ql}$ on the interval $[1/(MN^2), 1 - 1/(MN^2)]$ to obtain a reverse of the "log-sum inequality". Summing the two inequalities over $m$ and over $q, l$ respectively, we have

and

Hence $E_3$ is bounded by the minimum of the above two upper bounds. Since the first part in the right hand side of the above two inequalities is bounded by the same quantity, we take the inequality for which the second part is smaller. Under the conditions on the growth of $L$ in the theorem, the minimum of the two second parts is $o(L)$. Consequently, under the growth conditions mentioned under different asymptotic settings, $\max_z |l^R(A; z) - \bar l^R_P(z)| = o_P(L)$.
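The convexity of the Kullback-Leibler divergence between Bernoulli distributions, on which the argument above rests, is easy to check numerically. The following sketch (function name and grid are illustrative) evaluates $D(a\|b) = a\log(a/b) + (1-a)\log\{(1-a)/(1-b)\}$ and verifies midpoint convexity jointly in $(a, b)$ on a small grid.

```python
import math


def bernoulli_kl(a, b):
    """KL divergence D(a || b) between Bernoulli(a) and Bernoulli(b), with 0 < b < 1."""
    def term(x, y):
        # Convention 0 * log(0 / y) = 0
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return term(a, b) + term(1.0 - a, 1.0 - b)


grid = [0.05, 0.2, 0.5, 0.8, 0.95]
points = [(a, b) for a in grid for b in grid]


def midpoint_convex(p1, p2):
    """Check D(midpoint) <= average of D at the two endpoints."""
    mid = ((p1[0] + p2[0]) / 2.0, (p1[1] + p2[1]) / 2.0)
    return bernoulli_kl(*mid) <= (bernoulli_kl(*p1) + bernoulli_kl(*p2)) / 2.0 + 1e-12


all_convex = all(midpoint_convex(p1, p2) for p1 in points for p2 in points)
```

A grid check of this kind is only a sanity test; the convexity itself is a standard property of the divergence.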
Note that the terms $\bar l_P(\bar z) - \bar l_P(\hat z)$ and $l(A, \hat z) - l(A, \bar z)$ are nonnegative quantities, as mentioned earlier.
The rest of the proof requires the concepts of partition and refinement as laid out in Choi et al. (2012). We briefly review the concepts here and apply them to the MLSBM and its regularized version, the RMLSBM. Let $[N]$ denote the set of integers $\{1, 2, \ldots, N\}$. Any multi-layer blockmodel induces a partition of the $M$ upper triangular probability matrices. Formally, we define a partition of $\{P^{(m)}_{ij}\}_{i<j}$ into $U$ subsets $\{S_1, \ldots, S_U\}$ by a mapping $\Theta$ that assigns each pair $(i, j)$, $i < j$, to one of the $U$ subsets. Note that the partitions induced on all $M$ probability matrices are the same, since the partition is a function only of the indices and not of the type of edges. There exists a bijection between the set $[U]$ and the upper triangular part of the parameter matrices of the MLSBM, so we can write $\pi_{\Theta(i,j)} = \pi_{z_i z_j}$. In the MLSBM, for a general partition, we define $S_u = \{(i, j) : \Theta(i, j) = u, i < j\}$, so that we can define the log likelihood under this partition as

It is easy to see that $\bar l^*_P(\Theta_z) = \bar l_P(z)$, where $\Theta_z$ is the partition corresponding to the block model assignment $z$. A refinement $\Theta'$ of a partition $\Theta$ further subdivides the subsets in $\Theta$ into subgroups or sub-partitions, so that $\Theta'(i_1, j_1) = \Theta'(i_2, j_2) \Rightarrow \Theta(i_1, j_1) = \Theta(i_2, j_2)$ for $i_1 < j_1$ and $i_2 < j_2$. From Lemma A2 of Choi et al. (2012), it can be easily obtained that $\bar l^*_P(\Theta) \le \bar l^*_P(\Theta')$.

One such refinement is constructed in the following way (Choi et al. 2012). We consider a $K$ class MLSBM with true membership vector $\bar z$ and let $\Theta_z$ denote the partition of $\{P^{(m)}_{ij}\}_{i<j}$ for any $z$. Now, for a given membership class under $z$, partition the corresponding set of nodes into subclasses according to the true class assignment $\bar z$ of each node. Then remove one node from each of the two largest subclasses so obtained, and group them together as a pair; continue this pairing process until no more than one nonempty subclass remains. If the pair $(i, j)$ is chosen by the above procedure, then $z_i = z_j$ and $\bar z_i \ne \bar z_j$.
Define C 1 as the number of (i, j) pairs selected by the above method. Since at least one of i or j is misclustered, we have N e (z)/2 ≤ C 1 ≤ N e (z).
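The pairing procedure used to build the refinement can be sketched in a few lines. The function name and the toy label vectors below are hypothetical; the code follows the verbal description only: within each estimated class, nodes are grouped by true class, and one node is repeatedly taken from each of the two largest remaining subclasses until at most one nonempty subclass is left.

```python
import heapq
from collections import defaultdict


def pair_misclustered(z, z_true):
    """Return pairs (i, j) with z[i] == z[j] and z_true[i] != z_true[j]."""
    pairs = []
    # by_class[q][q_true] = nodes assigned to class q whose true class is q_true
    by_class = defaultdict(lambda: defaultdict(list))
    for node, (q, q_true) in enumerate(zip(z, z_true)):
        by_class[q][q_true].append(node)

    for subclasses in by_class.values():
        # Max-heap on subclass size, stored as negative sizes
        heap = [(-len(nodes), label) for label, nodes in subclasses.items()]
        heapq.heapify(heap)
        while len(heap) >= 2:
            size_a, a = heapq.heappop(heap)   # largest subclass
            size_b, b = heapq.heappop(heap)   # second largest subclass
            pairs.append((subclasses[a].pop(), subclasses[b].pop()))
            # Push back the subclasses that are still nonempty
            if size_a + 1 < 0:
                heapq.heappush(heap, (size_a + 1, a))
            if size_b + 1 < 0:
                heapq.heappush(heap, (size_b + 1, b))
    return pairs


# Toy example: nodes 3 and 6 are placed in the wrong class
z_est = [0, 0, 0, 0, 1, 1, 1, 1]
z_true = [0, 0, 0, 1, 1, 1, 0, 1]
pairs = pair_misclustered(z_est, z_true)
```

In this toy example two nodes are misclustered and the procedure returns two pairs, so the number of pairs lies between half the number of misclustered nodes and the number of misclustered nodes.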
Next, for each of the $C_1$ pairs, find all other distinct indices $k$ for which condition (3.26) of the theorem is satisfied. Let $C_2$ denote the total number of distinct triples that can be formed in this manner. For each of the $C_2$ such triples $(i, j, k)$, we remove $P_{ik}$ and $P_{jk}$ from their previous subset assignment under $\Theta_z$ and place them in a new distinct two element subset. The partition so created is a refinement of the original partition $\Theta_z$, and we call this refined partition $\Theta'_z$. Condition (3.26) of the theorem implies that for each pair of classes $(q, l)$, there exists at least one class $c$ that satisfies,

Consequently, for any of the $C_1$ pairs of nodes under the true partition, the number of triples we obtain is at least as large as the cardinality of the smallest class. Hence $C_2$ is at least as large as $C_1 s$, where $s$ is the size of the smallest class. Now as per assumption, $s = \Omega(N/K)$. Hence we can bound the difference in the likelihood by a term of order $\frac{N_e(z)}{N}\Omega(L)$.

Since the above procedure is valid for any class assignment vector $z$, we can apply it to the maximum likelihood estimate $\hat z$ as well. Note that $\hat z$ induces the partition $\Theta_{\hat z}$ of the probability matrices $\{P^{(m)}_{ij}\}_{i<j,\, m \in \{1, \ldots, M\}}$ and its refinement $\Theta'_{\hat z}$ increases the likelihood, i.e., $\bar l^*_P(\Theta_{\hat z}) \le \bar l^*_P(\Theta'_{\hat z})$. Also we have $\bar l^*_P(\Theta_{\hat z}) = \bar l_P(\hat z)$. Consequently we have,

Combining this with the result from Equation (3.25), we have $N_e(\hat z) = o_P(N)$.

Proof of Theorem 5
Before we proceed with the proof we need two lemmas. The first lemma bounds the difference between the maximized expected likelihoods from the unrestricted and the restricted models under the true partition. The second lemma uses this result along with the result of Theorem 3 to bound the difference between the maximized expected likelihood for the restricted model under the RMLE and the maximized expected likelihood for the unrestricted model under the true partition.
Lemma 2. Under the true partition $\bar z$, if any of the five sets of conditions in Theorem 3 on the growth of the multi-layer blockmodel parameters holds, then $\bar l_P(\bar z) - \bar l^R_P(\bar z) = o_P(L)$, where $L$ is the expected number of edges in the multi-layer graph under the corresponding set of conditions.

Proof. For large $N$, subtracting Equation (3.24) from Equation (3.23) we have

where $C_1$ is a constant and $R = \log$

The inequality in step 2 comes from the upper bound on $D(p\|q)$, which can be derived as follows. Without loss of generality, we can assume that $p > q$, so that $D(p\|q) \le p \log\frac{p}{q} \le p_{\max} \log\frac{p_{\max}}{q_{\min}}$. Next we replace $p_{\max}$ and $q_{\min}$ by the assumption on the lower and upper bounds of the restricted block model probabilities given in Equation (

Lemma 3. Under the true partition $\bar z$ and the RMLE of the partition $\hat z^R$ (i.e., the MLE in the restricted model RMLSBM), we have $\bar l_P(\bar z) - \bar l^R_P(\hat z^R) = o_P(L)$ whenever the conclusion of Theorem 3 holds.
Proof. Note that $\bar l_P(\hat z^R) \ge \bar l^R_P(\hat z^R)$, since the maximum of the unrestricted likelihood $\bar l_P(z)$ is uniformly larger than or equal to the maximum of the restricted likelihood $\bar l^R_P(z)$ for all $z$. Moreover, $\bar z$ maximizes $\bar l_P(\cdot)$ and hence $\bar l_P(\bar z) - \bar l^R_P(\hat z^R) \ge 0$. Notice that $l^R(A, \hat z^R) - l^R(A, \bar z)$ is nonnegative, since the observed restricted likelihood is maximized at $\hat z^R$. So we have

by Lemma 2 and Theorem 3.

Now we are ready to show that the class membership assignment vector estimated through maximum likelihood estimation in the restricted model RMLSBM is consistent under data generated from the MLSBM. We define the regularized partition $\Theta^R$ of the matrices of probabilities between nodes $P^{(m)}_{ij}$, computed according to the restricted model RMLSBM, and its refinement $\Theta'^R$ in exactly the same way. We further define the corresponding restricted log likelihood associated with this partition $\Theta^R$ as $\bar l^{*R}_P(\Theta^R)$. For convenience we again resort to the transformation defined by Equation (3.1). For any membership assignment $z^R$ from the RMLSBM, let $\Theta^R_{z^R}$ be the corresponding partition of $P^{(m)}_{ij}$. It follows from this definition that $\bar l^{*R}_P(\Theta^R_{z^R}) = \bar l^R_P(z^R)$. Hence we have

Now we specialize to $\hat z^R$. Since $\Theta'^R$ is a refinement of $\Theta^R$, it increases the restricted likelihood, i.e., $\bar l^{*R}_P(\Theta'^R_{z^R}) \ge \bar l^{*R}_P(\Theta^R_{z^R})$. Using this and the fact that

The left hand side is $o(L)$ by Lemma 3, and hence,

Proof of Theorem 6
For brevity we mention here only the results and proofs that differ from the proof contained in Zhang and Zhou (2015) and refer the reader to the aforementioned paper for a complete description of the techniques involved. We define the homogeneous/symmetric multi-layer stochastic blockmodel as the MLSBM with the parameter space $\Theta^{ML}_1$ that has all intra-block connection probabilities equal to each other as well as all inter-block connection probabilities equal to each other for each layer. As before, we assume no relation among the connection probabilities of one layer with those of another layer. The parameter space can be written as in Equation (9.9). Note that this model space is homogeneous and uniquely determined by $z$, i.e., given the community assignments $z$, the block model parameters are uniquely determined. This model space is also closed under permutations, in the sense that the model obtained through permuting the class labels also belongs to $\Theta^{ML}_1$. We further define a submodel of this where the block sizes are all (almost) the same, as this is the least favorable case for community detection in terms of the size of communities (see Section 5.1 of Zhang and Zhou (2015)). The parameter space can be written as

The submodel spaces $\Theta^{ML}_0$ and $\Theta^{ML}_L$ are also homogeneous and closed under permutation. Let $\hat z$ be the class assignment obtained from some procedure under consideration. We break the proof up into two parts: the first one proves a lower bound for the minimax risk and the second one shows that there exists an algorithm which attains the lower bound.

Lower bound
It was argued in Section 5.1 of Zhang and Zhou (2015) that $\Theta^{ML}_1$ is the least favorable subspace of $\Theta^{ML}$, using the property of being closed under permutation. Hence, a lower bound on the minimax rates established on $\Theta^{ML}_1$ will also be a good lower bound for the larger parameter space $\Theta^{ML}$. Since the supremum over a larger space is always greater than or equal to the supremum over any of its subspaces, the lower bound on $\Theta^{ML}_1$ is trivially a lower bound for the larger space, but being a least favorable subspace makes it match the rate. Throughout this section (proof of the lower bound) we assume $K \ge 3$. The proof for the case $K = 2$ follows from Zhang and Zhou (2015) with the same modifications described below for the $K \ge 3$ case.
We start with a couple of lemmas. The next lemma, due to Zhang and Zhou (2015), shows that for any homogeneous parameter space which is closed under permutation (e.g., $\Theta^{ML}_1$ and all its submodels defined above), the minimum global Bayesian risk of $\hat z$ under the uniform prior is the same as the minimum of the local Bayesian risk for the first node. The local Bayesian risk for one node needs to be computed under an appropriate local loss function. Zhang and Zhou (2015) defined such a local loss function as the average over all possible permutations of $\hat z$ that minimize the distance from the true class assignment. Let $S_z(\hat z) = \{\hat z' = \delta(\hat z) : d_H(z, \hat z') = \inf_{\delta'} d_H(z, \delta'(\hat z))\}$. Then the local loss function is defined as in Equation (9.12).

Lemma 4. (Lemma 2.1 of Zhang and Zhou (2015)) Let $\Lambda$ be any homogeneous parameter space which is closed under permutation and $\tau$ be a uniform prior over the elements of $\Lambda$. Defining the global Bayesian risk as $B_\tau(\hat z) = \frac{1}{|\Lambda|}\sum_{z \in \Lambda} E[r(z, \hat z)]$ and the local Bayesian risk for the first node (under the local loss function) as $B_\tau(\hat z_1) = \frac{1}{|\Lambda|}\sum_{z \in \Lambda} E[r(z_1, \hat z_1)]$, we have

$$\inf_{\hat z} B_\tau(\hat z) = \inf_{\hat z} B_\tau(\hat z_1).$$

Now we have the following lemma on the local Bayesian risk for the first node in the parameter space $\Theta^{ML}_L$ under a uniform prior.
Lemma 5. Letẑ be an estimated class assignment from some procedure in the block model defined by (9.11). Let τ be a uniform prior over all elements in Θ M L L . For the first node, the local Bayesian risk, B τ (ẑ 1 ) = 1 , and X Proof. We follow the proof of Lemma 5.1 in Section 6.2 of Zhang and Zhou (2015). Define Θ M L L 1 as a subset of the parameter space of Θ M L L such that the class to which the first node belongs to is always of size N K + 1, i.e., Θ M L L 1 = {(z, P ij ) ∈ Θ M L L : N z 1 = N K + 1}. Letting x 2 = ( N K + 1)S 2 it was shown in Section 6.2 of Zhang and Zhou (2015) that the ratio of the cardinality of the set Θ M L L 1 to that of Θ M L L is a constant, i.e., |Θ M L L 1 |/|Θ M L L | = x 2 /N ≥ for some > 0. Consequently, For each z ∈ Θ M L L 1 , we define k (z ) = z 1 as the class to which the first node belongs to. Let k(z ) be the set of indices of the communities of size N K . Since the first community is of size N K + 1, k (z ) does not belong to k(z ). Now we define a new assignment z(z ) based on z as follows and z(z ) i = z i for all i ≥ 2. Clearly z(z ) ∈ Θ M L L 1 , differs from z only in the first node and by definition has a distance 1 from it. Moreover for any two distinct class assignments z , z ∈ Θ M L L 1 , z = z , the new assignments based on them z(z ) and z(z ) are also different . This implies that Θ M L L 1 = {z(z ) : z ∈ Θ M L L 1 }. Consequently, Next we will derive a lower bound for the Bayes risk, infẑ B τ (ẑ 1 ). Conditional on z or z(z ), the distribution of A in MLSBM involves a collection of M adjacency matrices. We define two sets J 0 and J 1 as follows Hence, f (A C ), (9.15) and (9.16) where the function f (A C ) is a function involving connections from node 1 to nodes not in J 0 ∪ J 1 and all connections not involving node 1. Letẑ B attains the infimum of the local Bayes risk. 
Since d H (z , z(z )) = 1, the loss with respect to the local loss function defined in Equation (9.12) is r(z 1 ,ẑ B 1 ) = d H (z 1 ,ẑ B 1 ) which is a 0-1 loss. Thenẑ B 1 is the Bayes estimator with respect to the local 0-1 loss function and consequentlyẑ B 1 would be the mode of the posterior distribution, i.e., 1i . (9.17) Hence we have, To derive the probability in the above lower bound, let ). Hence the moment generating function (MGF) of Z i is, The MGF, M Z i (t) is minimized at t * = 1 2 and the minimum value is for N = N K , we obtain for any δ > 0, .
We note that q m (w) is a probability mass function for all m ∈ {1, . .
. . , N }, be i.i.d random variables with probability mass function q m (w). Then we have, ) can take 3 values, ±c (m) and 0. The first two values correspond to the cases when X (1/2). The second one follows similarly. Hence we have, . . Clearly Since the ratio of δ to the square root of variance goes to infinity as N goes to infinity by the central limit theorem we have, Consequently from Equation (9.20), The last inequality is obtained by replacing N by N K . If however, N m I (m) /K = O(1), we can choose a δ so that N δ/K is also a constant. Then considering the cases a Now we need to obtain the minimax lower bound for the larger parameter space Θ M L in the next lemma which concludes the proof for lower bound.
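The statement earlier in this proof that the moment generating function is minimized at $t^* = 1/2$ can be checked numerically in a single-layer setting. The sketch below makes an assumption that is not recoverable from the text: $Z = c\,(X - Y)$ with $X \sim \mathrm{Bernoulli}(q)$, $Y \sim \mathrm{Bernoulli}(p)$ independent and $c = \log\frac{p(1-q)}{q(1-p)}$, as in arguments of the Zhang and Zhou (2015) type. Under this assumption the minimum value equals $(\sqrt{pq} + \sqrt{(1-p)(1-q)})^2$. The probabilities used are hypothetical.

```python
import math


def mgf(t, p, q):
    """MGF at t of Z = c * (X - Y), X ~ Bernoulli(q), Y ~ Bernoulli(p) (assumed form)."""
    c = math.log(p * (1 - q) / (q * (1 - p)))
    return (1 - q + q * math.exp(t * c)) * (1 - p + p * math.exp(-t * c))


def closed_form_minimum(p, q):
    """Value of the MGF at t = 1/2 in closed form."""
    return (math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q))) ** 2


p, q = 0.3, 0.1                       # hypothetical intra- and inter-block probabilities
grid = [i / 100 for i in range(101)]  # t in [0, 1]
t_star = min(grid, key=lambda t: mgf(t, p, q))
```

With several layers, independence across layers would make the corresponding MGF a product of such factors, each symmetric about $t = 1/2$, so the same minimizer applies under the stated assumption.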

Lemma 7. (Lower bound) Under the assumption that
for some sequence N = o(1) and some s > 0. Moreover, if Proof. By the argument of Zhang and Zhou (2015), for K = 2, Θ M L 0 is the least favorable case for Θ M L . Hence we can keep the same lower bound for Θ M L (obviously the lower bound holds since Θ M L 0 is a subspace of Θ M L ). However for K ≥ 3, this is not the case and we can improve the lower bound. The least favorable case consists of the case where at least a constant proportion of communities are of the size N sK . Define Θ M L L to contain all z ∈ Θ M L such that a constant proportion of communities have size N K , and another constant proportion of communities have size N K and all other communities are much larger in size. Then using identical arguments as Lemmas 4 and 5 we have, Combining these two cases we have the result for the entire parameter space Θ M L .

Upper bound
To prove the upper bound, we develop a penalized likelihood type algorithm similar to Zhang and Zhou (2015) and show that its risk is upper bounded by the lower bound obtained in the previous step. We note that in the homogeneous MLSBM case ($\Theta^{ML}_0$ and $\Theta^{ML}_1$), i.e., when all the intra-community connection probabilities are $a^{(m)}/N$ and all the inter-community connection probabilities are $b^{(m)}/N$ for layer $m$, the log likelihood function is

The maximum likelihood estimator is given by $\hat z^{MLE} = \arg\max_z T(z)$, where $T(z)$ is given by Equation (9.24), $c^{(m)} > 0$ is defined in Lemma 5, and $k^{(m)} = \log\frac{1 - b^{(m)}/N}{1 - a^{(m)}/N}$. However, in general the parameter space will not be homogeneous. Under the more general parameter space $\Theta^{ML}$, we still define an identical form of the penalized likelihood estimator as $\hat z^{MLE}$. Let $\bar z$ be the true class assignment and $\hat z \in \Theta^{ML}_0$ be an arbitrary class assignment satisfying $r(\bar z, \hat z) = R/N$, where $0 < R < N$ is a positive integer. Then note that

where $\alpha(\hat z, \bar z) = \{(i, j) : i < j, \bar z_i = \bar z_j, \hat z_i \ne \hat z_j\}$ and $\gamma(\hat z, \bar z) = \{(i, j) : i < j, \bar z_i \ne \bar z_j, \hat z_i = \hat z_j\}$. Henceforth we use the shorthands $\alpha$ and $\gamma$ respectively to denote these sets. Let $P_R = P(\hat z \in \Theta^{ML}_0 : r(\bar z, \hat z) = R/N, T(\hat z) \ge T(\bar z))$. We want to bound $P_R$, which is the probability that an arbitrary class assignment $\hat z$ which disagrees with the truth $\bar z$ in exactly $R$ places (after permutations) can maximize $T(z)$, i.e., $P(T(\hat z) \ge T(\bar z))$. We start with the following lemma.
Lemma 8. Let $\hat z$ be an arbitrary class assignment satisfying $r(\bar z, \hat z) = R/N$, where $0 < R < N$ is a positive integer. Then there exists a sequence $\epsilon \to 0$, independent of $\hat z$, such that

A lower bound on the size of the sets $\alpha$ and $\gamma$ was given in Lemma 5.3 of Zhang and Zhou (2015). We use the result directly here: for an arbitrary assignment $\hat z \in \Theta^{ML}_0$ satisfying $r(\bar z, \hat z) = R/N$, where $0 < R < N$ is a positive integer, the quantity $\min(|\alpha(\hat z, \bar z)|, |\gamma(\hat z, \bar z)|)$ is bounded below as in Equation (9.28). Using this lower bound for both $|\alpha|$ and $|\gamma|$ immediately yields the result.
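The loss $r(\bar z, \hat z)$ used throughout this part of the proof is a misclassification proportion that ignores the labeling of the classes. A minimal sketch is given below, assuming the loss is the Hamming distance minimized over all relabelings of the estimated classes and divided by $N$; the function names and toy label vectors are hypothetical. The brute force search over all $K!$ permutations is only practical for small $K$.

```python
import itertools


def hamming(z, z_hat):
    """Number of positions where two label vectors disagree."""
    return sum(a != b for a, b in zip(z, z_hat))


def misclassification_loss(z, z_hat, n_classes):
    """Hamming distance minimized over relabelings of z_hat, divided by N."""
    best = min(
        hamming(z, [perm[q] for q in z_hat])
        for perm in itertools.permutations(range(n_classes))
    )
    return best / len(z)


z_bar = [0, 0, 1, 1, 2, 2]
relabeled = [1, 1, 2, 2, 0, 0]   # same partition, different labels
one_error = [1, 1, 2, 0, 0, 0]   # node 3 is misclustered
loss_relabeled = misclassification_loss(z_bar, relabeled, 3)
loss_one_error = misclassification_loss(z_bar, one_error, 3)
```

An assignment with $r(\bar z, \hat z) = R/N$ therefore disagrees with the truth in exactly $R$ positions after the best relabeling.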
Let Γ(z) denotes an equivalent class for z consisting of all permutations of z. In order to use an union bound for P R , we need to count the cardinality of the set of Γs which have distance R fromz. Next we use Proposition 5.2 in Zhang and Zhou (2015) which states that |{Γ : ∃ẑ ∈ Γ s.t r(z,ẑ) = R/N }| ≤ min{( eN K R ) R , K N }, to conclude through a union bound that, Proof. The proof technique is similar to Zhang and Zhou (2015); we only modify the proof in places to suit our objective while keeping the approach the same. We first prove the result for the subspace Θ M L 0 and then extend it for Θ M L . We first consider the case K → ∞, break the assumption ≤ BN − (R−1)/6 . The penultimate step follows by replacing 1 − 2η by 8/9 and the last step follows since /4 − 3 2 /4 ≥ /6 for large N and small η and respectively. Hence ). The proof for finite K is similar and hence omitted. Now we prove the upper bound result for the entire parameter space Θ M L . The proof for the case K ≥ 3 is similar to the proof for Θ M L 0 with the result in (9.28) being replaced by Lemma A.1. of Zhang and Zhou (2015). However, for K = 2, we proceed as in Section A.2. of Zhang and Zhou (2015) and assume without loss of generality that N 2 = N 2 . Let r(z,ẑ) = R/N and define the sets α and γ as before. Note that R ≤ N/2 since distance between the two class assignments d(z,ẑ) = min(d H (z,ẑ), N − d H (z,ẑ)). We also have |α| + |γ| = R(N − R) if r(z,ẑ) = R/N . Hence from Equation (9.27) we have, The proof is similar to the one for Θ M L 0 and we only specify the specific results here omitting the technicalities. Let 0 ≤ ≤ 1/8 and recall that our assumption for K = 2 case is that N m I (m) 2 → ∞. We have the following 3 cases in parallel to the 3 cases earlier, and hence E[r(z,ẑ)] = (1 + o(1))R 0 /N .