Empirical Bayes Estimation for the Stochastic Blockmodel

Inference for the stochastic blockmodel is currently of burgeoning interest in the statistical community, as well as in various application domains as diverse as social networks, citation networks, brain connectivity networks (connectomics), etc. Recent theoretical developments have shown that spectral embedding of graphs yields tractable distributional results; in particular, a random dot product latent position graph formulation of the stochastic blockmodel informs a mixture of normal distributions for the adjacency spectral embedding. We employ this new theory to provide an empirical Bayes methodology for estimation of block memberships of vertices in a random graph drawn from the stochastic blockmodel, and demonstrate its practical utility. The posterior inference is conducted using a Metropolis-within-Gibbs algorithm. The theory and methods are illustrated through Monte Carlo simulation studies, both within the stochastic blockmodel and beyond, and experimental results on a Wikipedia data set are presented.


Introduction
The stochastic blockmodel (SBM) is a generative model for network data introduced in Holland et al. (1983). The SBM is a member of the general class of latent position random graph models introduced in Hoff et al. (2002). These models have been used in various application domains as diverse as social networks (vertices may represent people with edges indicating social interaction), citation networks (who cites whom), connectomics (brain connectivity networks; vertices may represent neurons with edges indicating axonsynapse-dendrite connections, or vertices may represent brain regions with edges indicating connectivity between regions), and many others. For comprehensive reviews of statistical models and applications, see Fienberg (2010), Goldenberg et al. (2010), Fienberg (2012). In general, statistical inference on graphs is becoming essential in many areas of science, engineering, and business.
The SBM supposes that each of n vertices is assigned to one of K blocks. The probability of an edge between two vertices depends only on their respective block memberships, and the presence of edges are conditionally independent given block memberships. By letting τ i denote the block to which vertex i is assigned, a K × K matrix B is defined as the probability matrix such that the entry B τi,τj is the probability of an edge between vertices i and j. The block proportions are represented by a K-dimensional probability vector ρ. Given a stochastic blockmodel graph, estimating the block memberships of vertices is an obvious and important task. Many approaches have been developed for estimation of vertex block memberships, including likelihood maximization (Bickel and Chen, 2009, Choi et al., 2012, Celisse et al., 2012, Bickel et al., 2013, maximization of modularity (Newman, 2006), spectral techniques (Rohe et al., 2011, Sussman et al., 2012, Fishkind et al., 2013, and Bayesian methods (Snijders and Nowicki, 1997, Nowicki and Snijders, 2001, Handcock et al., 2007, Airoldi et al., 2008. Latent position models for random graphs provide a framework in which graph structure is parametrized by a latent vector associated with each vertex. In particular, this paper considers the special case of the latent position model known as the random dot product graph model, introduced in Nickel (2006) and Young and Scheinerman (2007). In the random dot product model, each vertex is associated with a latent vector, and the presence or absence of edges are independent Bernoulli random variables, conditional on these latent vectors. The probability of an edge between two vertices is given by the dot product of the corresponding latent vectors. A stochastic blockmodel can be defined in terms of a random dot product graph model for which all vertices that belong to the same block share a common latent vector.
When analyzing random dot product graphs, the first step is often to estimate the latent positions, and these estimated latent positions can then be used for subsequent analysis. Obtaining accurate estimates of the latent positions will consequently give rise to accurate inference (Sussman et al., 2014), as the latent vectors determine the distribution of the random graph. Sussman et al. (2012) describes a method for estimating the latent positions in a random dot product graph using a truncated eigen-decomposition of the adjacency matrix. Athreya et al. (2013) proves that for a random dot product graph, the latent positions estimated using adjacency spectral graph embedding converge in distribution to a multivariate Gaussian mixture. This suggests that we may consider the estimated latent positions of a K-block SBM as (approximately) an independent and identically distributed sample from a mixture of K multivariate Gaussians.
In this paper, we demonstrate the utility of an estimate of this multivariate Gaussian mixture as an empirical prior distribution in a Bayesian inference methodology for estimating block memberships in a stochastic blockmodel graph. This paper is organized as follows. In Section 2, we formally present the stochastic blockmodel as a random dot product graph model and describe how the theorem of Athreya et al. (2013) motivates our mixture of Gaussians empirical prior. We then present our empirical Bayes methodology for estimating block memberships in the stochastic blockmodel, and the Markov chain Monte Carlo (MCMC) algorithm that implements the Bayesian solution. In Section 3, we present simulation studies and an experimental analysis demonstrating the performance of our empirical Bayes methodology. Finally, Section 4 discusses further extensions and provides a concluding summary.

Empirical Bayes estimation for block memberships in a stochastic blockmodel
Network data on n vertices may be represented as an adjacency matrix A ∈ {0, 1} n×n . We consider simple graphs, so that A is symmetric (undirected edges imply A ij = A ji for all i, j), hollow (no self-loops implies A ii = 0 for all i), and binary (no multi-edges or weights implies A ij ∈ {0, 1} for all i, j). For our random graphs, the vertex set is fixed; it is the edge set that is random.

The Stochastic Blockmodel as a Random Dot Product Graph Model
Definition 1 (Random Dot Product Graph). A random graph G with adjacency matrix A is said to be a random dot product graph (RDPG) if Thus, in the RDPG model, each vertex i is associated with a latent vector X i . Furthermore, conditioned on the latent positions X, the edges A ij ind ∼ Bern( X i , X j ).
For an RDPG, we also define the n × n edge probability matrix P = XX T ; P is symmetric and positive semidefinite and has rank at most d. Hence, P has a spectral decomposition given by P ∈ R n×n , and U P ∈ R n×d has orthonormal columns, and S P ∈ R d×d is diagonal matrix with non-negative non-increasing entries along the diagonal. It follows that there exists an orthonormal W n ∈ R d×d such that U P S 1/2 P = XW n . (This introduces obvious non-identifiability.) Without loss of generality, we suppose X = U p S 1/2 P . Letting U A ∈ R n×d and S A ∈ R d×d be the adjacency matrix versions of U P and S P , the adjacency spectral graph embedding (ASGE) of A to dimension d is given by The stochastic blockmodel can be formally defined as a random dot product graph for which all vertices that belong to the same block share a common latent vector, according to the following definition.
Definition 2 ((Positive Semidefinite) Stochastic Blockmodel). An RDPG can be parameterized as an SBM with K blocks if the number of distinct rows in X is K. That is, let the probability mass function f associated with the distribution F of the latent positions X i be given by the mixture of point masses f = k ρ k δ ν k , where the probability vector ρ ∈ (0, 1) K satisfies K k=1 ρ k = 1 and the distinct latent positions are represented by ν = [ν 1 | · · · |ν K ] ∈ R K×d . Thus the standard definition of the SBM with parameters ρ and B = νν is seen to be an RDPG with X i iid ∼ k ρ k δ ν k .
Additionally, for identifiability purposes, we impose the constraint that the block probability matrix B have distinct rows; that is, B k,· = B k ,· for all k = k .
In this setting, the block memberships τ 1 , . . . , τ n |K, ρ iid ∼ Discrete([K], ρ) such that τ i = τ j if and only if X i = X j . Let N k be the number of vertices such that τ i = k; we will condition on N k = n k throughout. Given a graph generated according to the stochastic blockmodel, our goal is to assign vertices to their correct blocks.
To date, Bayesian approaches for estimating block memberships in the SBM have typically involved a specification of the prior on the block probability matrix B = νν ; the Beta distribution (which includes the uniform distribution as a special case) is often chosen as the prior Nowicki, 1997, Nowicki andSnijders, 2001). Facilitated by our re-casting of the SBM as an RDPG and motivated by recent theoretical advances described in Section 2.2 below, we will instead derive an empirical prior for the latent positions ν themselves.

The Empirical Prior
Recently, Athreya et al. (2013) proved that for an RDPG the latent positions estimated using adjacency spectral graph embedding converge in distribution to a multivariate Gaussian mixture. We can express this more formally in the following theorem and corollary to motivate our empirical Bayes prior.
Theorem 3 (Athreya et al. (2013)). Let G be an RDPG with d-dimensional latent positions X 1 , . . . , X n iid ∼ F , and assume distinct eigenvalues for the second moment matrix of F . Let X ∈ R n×d be the (uncentered) principal components version of X so that the columns of X are orthogonal. Then for each row X i of X and where the integral denotes a mixture of the covariance matrices and, with ∆ = E[X j X j ], The special case of the SBM gives rise to the following corollary.
Corollary 4. In the setting of Theorem 3, suppose G is an SBM with K blocks. Then, if we condition on X i = ν k , we obtain This gives rise to the mixture of normals approximation X 1 , · · · , X n iid ∼ k ρ k ϕ k for the estimated latent positions obtained from the adjacency spectral embedding. That is, based on these recent theoretical results, we can consider the estimated latent positions as (approximately) an independent and identically distributed sample from a mixture of multivariate Gaussians.
A similar Bayesian method for latent position clustering of network data is proposed in Handcock et al. (2007). Their latent position cluster model is an extension of Hoff et al. (2002), wherein all the key features of network data are incorporated simultaneously -namely clustering, transitivity (the probability that the adjacent vertices of a vertex having a connection), and homophily on attributes (the tendency of vertices with similar features to possess a higher probability of presenting an edge). The latent position cluster model is similar to our model, but they use the logistic function instead of the dot product as their link function.
Our theory gives rise to a method for obtaining an empirical prior for ν using the adjacency spectral embedding. Subsequently, an independent Metropolis-Hasting-within-Gibbs algorithm will be employed to sample from the posterior.
Given the estimated latent positions X 1 , . . . , X n obtained via the spectral embedding of the adjacency matrix A, the next step is to cluster these X i using Gaussian mixture models (GMM). There are a wealth of methods available for this task; we employ the model-based clustering of Fraley and Raftery (2002). This mixture estimate, in the context of Corollary 4, quantifies our uncertainty about ν, suggesting its role as an empirical Bayes prior distribution. That is, our empirical Bayes prior distribution for ν is expressed as is the density function of a multivariate normal distribution with mean µ k and covariance matrix Σ k estimated from the estimated latent positions X i and the indicator I S (ν) enforces homophily and block identifiability constraints for the stochastic blockmodel via Algorithm 1 Empirical Bayes estimation using the adjacency spectral embedding empirical prior 1: Given graph G 2: Obtain adjacency spectral embedding X 3: Obtain empirical prior via GMM of X 4: Sample from the posterior via Metropolis-Hasting-within-Gibbs

Models and Algorithms
This section presents the models and algorithms we will use to investigate the utility of our empirical Bayes inference methodology for estimating block memberships in a stochastic blockmodel graph, dubbed ASGE and presented in Section 2.3.3.
For comparison purposes, we construct a hierarchy of models: Exact, Gold, ASGE, and Flat. The names of these models are coined from the respective prior distributions used for the latent positions ν. The Exact model is formulated as a primary benchmark model where all model parameters except the block membership vector τ are assumed known. The Gold model is a secondary benchmark model where ν and τ are the unknown parameters; the gold standard mixture of Gaussians prior distribution for ν takes its hyperparameters to be the true latent positions and theoretical limiting covariances motivated by the distributional results from Athreya et al. (2013) presented above. Continuing this naming convention, our ASGE model uses an empirical prior estimated from the adjacency spectral graph embedding. Finally, we consider the Flat model, since in the absence of the adjacency spectral graph embedding theory a natural choice for the prior on ν is the uniform distribution on the parameter space, giving rise to the formulation of an alternative to our empirical Bayes prior distribution for ν.
The adjaceny spectral graph embedding theory suggests that we might expect increasingly better performance as we go from Flat to ASGE to Gold to Exact. (As a teaser, we hint here that we will indeed see precisely this progression, in Section 3.)

"Exact"
In the setting of Corollary 4, for an adjacency matrix A, the likelihood for the block membership vector τ ∈ [K] n and the latent positions ν ∈ R K×d is given by In the case that the latent positions ν and the block membership probabilities ρ are known, we can draw inferences about the block memberships τ based on the posterior f (τ |A, ν, ρ) via an Exact Gibbs sampler. First, We appeal to Gibbs sampling methods to sample the posterior of τ , which can be updated sequentially. The idea behind this method is to first posit a full conditional posterior distribution of τ . Let τ −i = τ \ τ i denote the block memberships for all but vertex i. Then, conditioning on τ −i , we have Therefore, the posterior distribution for τ i is a multinomial with parameter ρ * i given by Hence, for our Exact Gibbs sampler, once a vertex is selected, the exact calculation of ρ * i is available and sampling from the Multinomial(ρ * i ) is straightforward. Initialization of τ will be τ

"Gold"
For this model, we assume the block membership probability vector ρ is known and that both ν and τ are unknown. In order to obtain a posterior distribution for τ and ν given the data A, a prior distribution on the latent positions ν is required. In this section, we describe what we call the Gold prior distribution, which will be used for comparison with our empirical Bayes procedure presented below in Section 2.3.3.
Let the true value for the latent positions be represented by ν * . Based on Corollary 4, we can suppose that the prior distribution for ν k follows a (truncated) multivariate Gaussian centered at ν * k and with covariance matrix Σ * k = (1/n)Σ k given by the theoretical limiting distribution for the adjacency spectral embedding X presented in Eqn (1). This corresponds to the approximate distribution of X i if we condition on τ i = k.
Inference for τ and ν is based on the posterior distribution, f (τ, ν|A, ρ), estimated by samples obtained from a Gibbs sampler for τ and an Independent Metropolis sampler for ν. Thus, the posterior distribution for the unknown quantities is given by where f (ν|{ν * k }, {Σ * k }) is a prior distribution for the latent positions ν, with {ν * k } and {Σ * k } being the hyperparameters for this prior. Theory suggests we consider Our Gibbs sampler for τ will be identical to that for the Exact model except the initial state τ (0) will be given by τ , the block assignment vector given by GMM clustering of the estimated latent positions X as described in Section 2.2.
For the Metropolis-within-Gibbs sampler for ν, the prior distribution f (ν|{ν * k }, {Σ * k }) will be employed as the proposal distribution. We generate a proposed state ν k ∼ f (ν k |ν * k , Σ * k ) with the acceptance probability defined as where ν k in the denominator denotes the current state. The initialization of ν is

"ASGE"
In the previous section we proposed using, as a prior for ν, a (truncated) multivariate normal mixture based on the theoretical limiting distribution of the adjacency spectral embedding X. This gold standard prior can be thought of as an oracle; however, in practice the theoretical ν * k and Σ * k are not available.
This section presents a procedure to obtain the posterior samples for the unknown parameters by using an empirical Bayes prior as in Eqn (2). This is the case where the block memberships τ , the latent positions ν, and the block membership probabilities ρ are assumed unknown. Thus, our empirical posterior is given by where θ represents the hyperparameters in the unit simplex ∆ K for the prior distribution for ρ and the µ k , Σ k are from the estimated GMM for X described in Section 2.2.
By choosing a conjugate Dirichlet prior for ρ, f (ρ|θ) = Dirichlet(θ) for θ ∈ ∆ K , we can marginalize the posterior distribution over ρ as follows: Then the resulting prior distribution is given by , which follows a Multinomial-Dirichlet distribution with parameters θ and n. Therefore, the marginal posterior distribution can be expressed as Metropolis-Hasting-within-Gibbs sampling is carried out using the marginal posterior distribution for τ and ν. To update the block membership vector τ , a standard Gibbs update is performed using its full-conditional distribution, The procedure consists of visiting each τ i for i = 1, . . . , n and executing Algorithm 2. Similar to the Gold model described in Section 2.3.2, we initialize τ with τ (0) = τ , the block assignment vector obtained from the Gaussian mixture modeling. Again, the distribution in Eqn (2) is utilized as the prior distribution for ν. As with the Gold model, the proposal distribution for the Metropolis sampler is the prior distribution.

"Flat"
In the event that no special prior information is available, a natural choice of prior is the uniform distribution on the parameter space. This results in the formulation of the Flat model as an alternative to an empirical Bayes prior distribution for ν. In this section, we consider a flat prior distribution on the constraint set S, where the marginal posterior distribution for τ and ν is given by The Gibbs sampler for τ is identical to the procedure presented in Section 2.3.3. As for the Metropolis sampler for the latent positions ν, the flat prior distribution is used as the proposal. However, we initialize ν by generating it from the prior distribution of ν as in the ASGE model, i.e. f (ν|{ µ k }, { Σ k }). Table 1 provides a summary of our Bayesian modeling schemes.

Performance Comparisons
We illustrate the performance of our ASGE model via various Monte Carlo simulation experiments and one real data experiment. Specifically, we consider in Section 3.1 a K = 2 stochastic blockmodel, in Section 3.2 a generalization of this K = 2 SBM to a more general random dot product graph, in Section 3.3 a K = 3 stochastic blockmodel, and in Section 3.4 a three-class Wikipedia graph example. We demonstrate the utility of the ASGE model for estimating vertex block assignments via comparison to competing methods.
Throughout our performance analysis, we generate posterior samples of τ and ν for a large number of iterations for two parallel Markov chains. The percentage of misassigned vertices per iteration is calculated and used to compute Gelman-Rubin statistics to check convergence of the chains. The posterior inference for τ is based on iterations after convergence. Performance is evaluated by calculating the vertex block assignment error. This procedure is repeated multiple times to obtain estimates of the error rates.

A Simulation Example with K = 2
Consider the stochastic blockmodel parameterized by The block proportion vector ρ indicates that each vertex will be in block 1 with probability ρ 1 = 0.6 and in block 2 with probability ρ 2 = 0.4. Edge probabilities are determined by the entries of B, independent and a function of only the vertex block memberships. This model can be parameterized as an RDPG in R 2 where the distribution F of the latent positions is a mixture of point masses positioned at ν 1 ≈ (0.5489, 0.3446) with prior probability 0.6 and ν 2 ≈ (0.3984, 0.5842) with prior probability 0.4.
For each n ∈ {100, 250, 500, 750, 1000}, we generate random graphs according to the SBM with parameters as provided in Eqn (6). For each graph G, the spectral decomposition of the corresponding adjacency matrix A provides the estimated latent positions X. Subsequently, GMM is used to cluster the (embedded) vertices, the result of which (estimated block memberships τ ) is then employed as the initial point in the Gibbs step for updating τ . The mixture component means µ k and variances Σ k determine our empirical Bayes ASGE prior for the latent positions ν. To avoid the model selection quagmire we assume d = 2 and K = 2 are known in this experiment.  Results comparing various competing methods are presented in Figure 2. As expected, the error decreases for all models as the number of vertices n increases. As previously explained in Section 2.3, the Exact and Gold models formulated in this study are perceived as benchmark models; it is expected that these models will show the best performance -for the Exact model, all the parameters are assumed known apart from the block memberships τ , while in the case of the Gold model, although the latent positions ν and τ are unknown parameters, their prior distributions were taken from the true latent positions and the theoretical limiting covariances. The main message from Figure 2 is that our empirical Bayes model, ASGE, is vastly superior to that of both the alternative Flat model and the GMM algorithm (the sign test p-value for the paired Monte Carlo replicates is less than 10 −10 for both comparisons for all n) and indeed achieves Gold /Exact performance by n = 1000.
As an aside, we note that when we put a flat prior directly on B, we obtain results indistinguishable from our Flat model on the latent positions. Figure 2: Comparison of vertex block assignment methodologies for the K = 2 SBM considered in Section 3.1. Shaded areas represent standard errors. The plot indicates that utilizing a multivariate Gaussian mixture estimate for the estimated latent positions as an empirical Bayes prior (ASGE ) can yield substantial improvement over both the GMM vertex assignment and the Bayesian method with a Flat prior. See text for details and analysis.

A Dirichlet Mixture RDPG Generalization
Here we generalize the simulation setting presented in Section 3.1 above to the case where the latent position vectors are distributed according to a mixture of Dirichlets as opposed to the SBM's mixture of point masses.
That is, we consider X i iid ∼ k ρ k Dirichlet(r · ν k ). Note that the SBM model presented in Section 3.1 is equivalent to the limit of this mixture of Dirichlets model as r → ∞.
For n = 500, we report illustrative results using r = 100, for comparison with the SBM results from Section 3.1. Specifically, we obtain mean error rates of 0.4194, 0.2865, and 0.3705 for Flat, ASGE, and GMM, respectively; the corresponding results for the SBM, from Figure 2, are 0.3456, 0.2510, and 0.3910. Thus we see that, while the performance is slightly degraded, our empirical Bayes approach works well in this RDPG generalization of Section 3.1's SBM. This demonstrates robustness of our method to violation of the SBM assumption.

A Simulation Example with K = 3
Our final simulation study considers the K = 3 SBM parameterized by In a same manner as Section 3.1, the model is parameterized as an RDPG in R 3 where the distribution of the latent positions is a mixture of point masses positioned at ν 1 ≈ (0.68, 0.20, −0.30), ν 2 ≈ (0.68, −0.36, −0.02), ν 3 ≈ (0.68, 0.16, 0.33) with equal probabilities. In this experiment, we assume that d = 3 and K = 3 are known. Table 2 displays error rate estimates for this case, with n = 150 and n = 300. In both cases, the ASGE model yields results vastly superior to the Flat model; e.g., for n = 300 the mean error rate for Flat is approximately 11% compared to a mean error rate for ASGE of approximately 1%. Based on the paired samples, the sign test p-value is less than 10 −10 for both values of n. While the results of the GMM appear competitive to the results of our empirical Bayes with ASGE prior in terms of mean and median error rate, the paired analysis shows again that the ASGE prior is superior, as seen by sign test p-values < 10 −10 for both values of n.
As an illustration, Figure 3 presents the histogram of the differential number of errors made by the ASGE model and GMM for n = 300. From Table 2, we see that for n = 300, empirical Bayes with ASGE prior has a mean error rate of 1 percent (3 errors per graph) and a median error rate of 1/3 percent (1 error per graph), while GMM has a mean and median error rate of 1 percent. The histogram shows that for most graphs, empirical Bayes with ASGE prior performs as well as or better than GMM. (NB: In the histogram presented in Figure 3, eight ouliers in which ASGE performed particularly poorly are censored at a value of 10; we believe these outliers are due to chain convergence issues.) n = 150 n = 300  Table 2: Error rate estimates for the K = 3 SBM considered in Section 3.3. Figure 3: Histogram (500 Monte Carlo replicates) of the differential number of errors made by ASGE and GMM for the K = 3 SBM considered in Section 3.3, with n = 300, indicating the superiority of ASGE over GMM. For most graphs, emprical Bayes with ASGE prior performs as well as or better than GMM -the sign test for this paired sample yields p ≈ 0.

Wiki Experiment
In this section we analyze an application of our methodology to the Wikipedia graph. The vertices of this graph represent Wikipedia article pages and there is an edge between two vertices if either of the associated pages hyperlinks to the other. The full data set consists of 1382 vertices -the induced subgraph generated from the two-hop neighborhood of the page "Algebraic Geometry." Each vertex is categorized by hand into one of six classes -P eople, P laces, Dates, T hings, M ath, and Categories -based on the content of the associated article. (The adjacency matrix and the true class labels for this data set are available at http://www.cis.jhu.edu/~parky/Data/data.html.) We analyze a subset of this data set corresponding to the K = 3 classes P eople, P laces, and Dates, labeled here as Class 1, 2 and 3, respectively. After excluding three isolated vertices in the induced subgraph generated by these three classes, we have a connected graph with a total of m = 828 vertices; the classconditional sample sizes are m 1 = 368, m 2 = 269, and m 3 = 191. Figure 4 presents one rendering of this graph (obtained via one of the standard force-directed graph layout methods, using the command layout.drl in the igraph R package); Figure 5 presents the adjacency matrix; Figure 6 presents the pairs plot for the adjacency spectral embedding of this graph into R 3 . (In all figures, we use red for Class 1, green for Class 2 and blue for Class 3.) Figures 4, 5, and 6 indicate clearly that this Wikipedia graph is not a pristine SBM -real data will never be; nonetheless, we proceed undaunted.   We illustrate our empirical Bayes methodology, following Algorithm 1, via a bootstrap experiment. We generate bootstrap resamples from the adjacency spectral embedding depicted in Figure 6, with n = 300 (n 1 = n 2 = n 3 = 100). This yields X (b) for each bootstrap resample b = 1, . . . , 200. As before, GMM is used to cluster the (embedded) vertices and obtain block label estimates τ and mixture component means µ k and variances Σ k for each cluster k of the estimated latent positions X (b) . The clustering result from the GMM algorithm for one resample is presented in Figure 7. (We choose d = 3 for the adjacency spectral embedding dimension because a common and reasonable choice is to use d = K, which choice is justified in the SBM case (Fishkind et al., 2013).) The GMM clustering provides the empirical prior and starting point for our Metropolis-Hasting-within-Gibbs sampling (Algorithm 2) using the subgraph of the full Wikipedia graph induced by X (b) . (NB: For this Wikipedia experiment, the assumption of homophily is clearly violated; as a result, the constraint set used here is given by S = {ν ∈ R K×d : ∀i, j ∈ [K], 0 ≤ ν i , ν j ≤ 1}.) Figure 7: Illustrative empirical prior for one bootstrap resample (n = 300) for our Wikipedia experiment; colors represent true classes, K = 3 estimated Gaussians are depicted with level curves, and symbols represent GMM cluster memberships.
Classification results for this experiment are depicted via boxplots in Figure 8. We see from the boxplots that using the adjacency spectral empirical prior does yield statistically significant improvement; indeed, our paired sample analysis yields sign test p-values less than 10 −10 for both ASGE vs Flat and ASGE vs GMM. We have shown that using the empirical ASGE prior has improved performance compared to the Flat prior and GMM on this Wikipedia dataset. However, Figure 8 also indicates that ASGE performance on this data set, while representing a statistically significant improvement, might seem not very good in absolute terms: the mean probability of misclassification over bootstrap resamples is L ≈ 0.456 for ASGE versus L ≈ 0.476 for both Flat and GMM. That is, empirical Bayes using the adjacency spectral prior provides a statistically significant but perhaps unimpressive 2% improvement in the error rate. (Note that chance performance is L = 2/3.) Given that the Bayes optimal probability of misclassification L * is unknown, we consider inf φ∈C L(φ) where C denotes the class of all classifiers based on class-conditional Gaussians. This yields an error rate of approximately 0.401. Note that this analysis assumes that a training set of n = 300 labeled exemplars is available, which training information is not available in our empirical Bayes setting. Nonetheless, we see that our empirical Bayes methodology using the ASGE prior improves more than 25% of the way from the Flat and GMM performance to this (presumably unattainable) standard. As a final point, we note that a k-nearest neighbor classifier (again, with a training set of n = 300 labeled exemplars) yields an error rate of approximately 0.338, indicating that the assumption of class-conditional Gaussians was unwarranted. (Indeed, this is clear from Figure 6.) That our ASGE provides significant performance improvement despite the fact that our real Wikipedia data set so dramatically violates the stochastic block model assumptions is a convincing demonstration of the robustness of the methodology.

Conclusion
In this paper we have formulated an empirical Bayes estimation approach for block membership assignment. Our methodology is motivated by recent theoretical advances regarding the distribution of the adjaceny spectral embedding of random dot product and stochastic blockmodel graphs. To apply our model, we derived a Metropolis-within-Gibbs algorithm for block membership and latent position posterior inference.
Our simulation experiments demonstrate that the ASGE model consistently outperforms the GMM clustering used as our emprical prior as well as the alternative Flat prior model -notably, even in our Dirichlet mixture RDPG model wherein the SBM assumption is violated. For the Wikipedia graph, our ASGE model again performs admirably, even though this real data set is far from an SBM.
We considered only simple graphs; extension to directed and weighted graphs is of both theoretical and practical interest.
To avoid the model selection quagmire, we have assumed throughout that the number of blocks K and the dimension of the latent positions d are known. Model selection is in general a difficult problem; however, automatic determination of both the dimension d for a truncated eigen-decomposition and the complexity K for a Gaussian mixture model estimate are important practical problems and thus have received enormous attention in both the theoretical and applied literature. For our case, Fishkind et al. (2013) demonstrates that the SBM embedding dimension d can be successfully estimated, and Fraley and Raftery (2002) provides one common approach to estimating the number of Gaussian mixture components K. We note that since d = K is justified for the adjacency spectral embedding dimension of an SBM, it may be productive to investigate simultaneous model selection methodologies for d and K. Moreover, robustness of the empirical Bayes methodology to misspecification of d and K is of great practical importance.
Finally, we note that we have made heavy use of the dot product kernel. Tang et al. (2013) provides some useful results for the case of a latent position model with unknown kernel, but we see extending our empirical Bayes methodology to this case as a formidable challenge. Recent results on the SBM as a universal approximation to general latent position graphs (Airoldi et al., 2013, Olhede andWolfe, 2013) suggest, however, that this challenge, once surmounted, may provide a simple consistent framework for empirical Bayes inference on general graphs.
In conclusion, adopting an empirical Bayes approach for estimating block memberships in a stochastic blockmodel, using an empirical prior obtained from a Gaussian mixture model estimate for the adjacency spectral embeddings, can significantly improve block assignment performance.