1 Introduction

A network can be represented as a graph \({\mathbb {G}}=(V,E)\), where V is a set of nodes and \(E\subseteq V\times V\) is a set of edges indicating the pairs of nodes which have interacted. The graph can be characterised by the adjacency matrix \(\mathbf {A}\in \{0,1\}^{n\times n}\), where \(n=\vert {V}\vert \) and for \(1\le i,j\le n\), \(A_{ij}=\mathbb {1}_E\{(i,j)\}\), such that \(A_{ij}=1\) if a link between the nodes i and j exists, and \(A_{ij}=0\) otherwise. The graph is said to be undirected if \((i,j)\in E \iff (j,i) \in E\) and \(\mathbf {A}\) is constrained to be symmetric; otherwise, the graph is said to be directed. It will be assumed that a node cannot link to itself, implying \(\mathbf {A}\) is a hollow matrix.
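
As a minimal illustration of this representation, the following Python sketch (with a hypothetical edge list) builds the symmetric, hollow adjacency matrix of a small undirected graph:

```python
import numpy as np

# Hypothetical edge list for a small undirected graph with n = 4 nodes.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1   # undirected: (i,j) in E iff (j,i) in E

assert np.array_equal(A, A.T)    # symmetric
assert np.all(np.diag(A) == 0)   # hollow: no self-loops
```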

Latent space models (Hoff et al. 2002) represent a flexible approach to statistical analysis of networks: each node i is assigned a latent position \(\varvec{x}_i\) in a d-dimensional latent space \({\mathbb {X}}\), and edges between pairs of nodes are typically generated independently, with the probability of observing a link between nodes i and j obtained through a kernel function \(\psi :{\mathbb {X}}\times {\mathbb {X}}\rightarrow [0,1]\) of the respective latent positions: \({\mathbb {P}}(A_{ij}=1)=\psi (\varvec{x}_i,\varvec{x}_j)\). Different ideas and techniques for embedding observed graphs into low dimensional spaces are explored in the literature (for a survey, see, for example, Cai et al. 2018). Random dot product graphs (RDPGs) (Young and Scheinerman 2007) are a popular class of latent position models, where \({\mathbb {X}}\subseteq {\mathbb {R}}^d\), and the function \(\psi (\cdot )\) is an inner product \(\langle \cdot ,\cdot \rangle \) on \({\mathbb {X}}\times {\mathbb {X}}\). RDPGs are analytically tractable and have therefore been extensively studied; a survey of the existing statistical inference techniques is presented in Athreya et al. (2017).

The stochastic blockmodel (SBM) (Holland et al. 1983) is the classical statistical model for clustering graphs (Snijders and Nowicki 1997; Nowicki and Snijders 2001). Assuming K communities, each node is assigned a community membership \(z_i\in \{1,\dots ,K\}\) with probabilities \(\varvec{\theta }\), a point in the \((K-1)\)-dimensional probability simplex. The probability of a link only depends on the community allocations \(z_i\) and \(z_j\) of the two nodes. Given a symmetric matrix \(\mathbf {B}\in [0,1]^{K\times K}\) of inter-community probabilities, then independently \({\mathbb {P}}(A_{ij}=1)=B_{z_iz_j}\). Stochastic blockmodels have appealing statistical properties, and can approximate any independent-edge network model well if the number of communities is sufficiently large (Bickel and Chen 2009; Wolfe and Olhede 2013). Stochastic blockmodels can also be easily represented as random dot product graphs: each community is assigned a latent position, which is common to all the nodes belonging to the cluster, and \(\mathbf {B}\) is obtained from the inner products of those positions. Hence, in this framework, \(d={{\,\mathrm{rank}\,}}(\mathbf {B})\le K\).

Spectral clustering (von Luxburg 2007) provides a consistent statistical estimation procedure for the latent positions and communities in SBMs (Rohe et al. 2011; Lei and Rinaldo 2015) and more generally in random dot product graphs (Tang et al. 2013; Sussman et al. 2014). Rubin-Delanchy et al. (2017) directly links adjacency and Laplacian spectral embedding to the generalised random dot product graph (GRDPG), an extension of the RDPG, and advocates for Gaussian mixture modelling (GMM) of the rows of the embedding. Alternatives to spectral clustering include variational methods (Celisse et al. 2012) and pseudo-likelihood approaches (Amini et al. 2013). SBMs have been extended to the directed case (Wang and Wong 1987; Rohe et al. 2016), and appropriate embeddings for co-clustering, in most cases based on the singular value decomposition (SVD), have been derived in the literature (Dhillon 2001; Malliaros and Vazirgiannis 2013; Zheng and Skillicorn 2015).

One of the practical issues with spectral embedding, and in general with most graph embedding algorithms, is that it requires a suitable prespecified latent dimensionality d (usually \(d\ll \vert {V}\vert \)) as input, and, subsequently, a suitable number of clusters K, chosen crucially conditional on the previous choice of d. For a practical example of this procedure on a real world network, see Priebe et al. (2019). In general, in spectral clustering, similarly to what practitioners do in principal component analysis (PCA), the investigator examines the scree-plot of the eigenvalues and chooses the dimension based on the location of elbows in the plot (Jolliffe 2002), or uses the eigengap heuristic (see, for example, von Luxburg 2007). Automatic methods for thresholding have also been suggested (Zhu and Ghodsi 2006; Chatterjee 2015). A relevant body of literature is also devoted to methods for the selection of the number of communities in stochastic blockmodels (Zhao et al. 2011; Bickel and Sarkar 2016; Newman and Reinert 2016; Franco Saldaña et al. 2017; Chen and Lei 2018). Often, practitioners simply set \(d=K\), assuming that \(\mathbf {B}\) has full rank in the stochastic blockmodel framework. Under the full rank assumption, one may estimate \(d=K\) as the number of eigenvalues of \(\mathbf {A}\) which are larger than \(\sqrt{n}\) (Chatterjee 2015; Lei 2016). In this work, the problem of selecting d is approached from the perspective of variable selection in model based clustering, which is widely studied in the literature (Fowlkes et al. 1988; Law et al. 2004; Tadesse et al. 2005; Raftery and Dean 2006; Maugis et al. 2009; Witten and Tibshirani 2010). Similarly, the problem of correctly selecting the number of clusters is also common in K-means or GMMs, since it is usually required to specify a number of components in the mixture. Usually the parameter is chosen by minimising information criteria (for example, AIC or BIC). A widely used selection criterion is the Integrated Classification Likelihood (ICL) of Biernacki et al. (2000). Exact versions of the ICL based on the adoption of conjugate prior distributions for the model parameters have been obtained in the literature (for example, Côme and Latouche 2015; Wyse et al. 2017). Numerous other techniques have also been proposed for GMMs with unknown number of components (Mengersen et al. 1996; Richardson and Green 1997; Stephens 2000; Nobile 2004; Dellaportas and Papageorgiou 2006; Miller and Harrison 2018).

Clearly, the sequential approach to estimating d and K is suboptimal, and it is desirable to jointly estimate the two parameters, a problem which has received little attention in the literature. This article addresses the problem in a Bayesian framework, proposing a novel methodology to automatically and simultaneously select d and K. Techniques for selection of K in GMMs will be incorporated within the spectral embedding framework, allowing K and d, the number of communities and the dimension of the latent positions, to be random and learned from the data. A structured Bayesian model for simultaneously inferring the dimension of the latent space, the number of communities, and the community allocations is proposed. The model is based on asymptotic results (Athreya et al. 2016; Rubin-Delanchy et al. 2017; Tang and Priebe 2018) on the leading components of spectral embeddings, obtained for d fixed and known. The asymptotic theory is combined with realistic assumptions about the remaining components of the embedding, empirically tested and justified on simulated data. Furthermore, extensions to the directed and bipartite case will be discussed. The proposed model has multiple advantages: the latent dimension d and the number of communities K are modelled separately, and the Bayesian framework allows for automatic selection of the two parameters. The model also allows estimation of d even when \(d<K\), and gives insights into the goodness-of-fit of the stochastic blockmodel on observed network data, based on the embedding. The method is tested on simulated data and applied to real world computer and transportation networks. It should be noted that Yang et al. (2019) have simultaneously and independently proposed a similar inferential procedure within a frequentist framework.

The article is organised as follows: Sect. 2 introduces adjacency spectral and Laplacian embeddings and the GRDPG. The novel Bayesian model for selection of the appropriate dimension of the latent space is discussed in Sect. 3, followed by careful illustration of posterior inference procedures in Sect. 4. Section 5 discusses the effects of the curse of dimensionality on the model, and suggests a remedy. Extensions of the model are presented in Sect. 6, and results and applications are finally discussed in Sect. 7.

2 GRDPG and spectral embeddings

In this work, the stochastic blockmodel will be interpreted as a specific case of a generalised random dot product graph (GRDPG) (Rubin-Delanchy et al. 2017). For \(d>0\), let \(d_+,d_-\) be non-negative integers such that \(d_++d_-=d\). Let \({\mathbb {X}}\subseteq {\mathbb {R}}^{d}\) be such that \(\forall \ \varvec{x},\varvec{x}^\prime \in {\mathbb {X}}\), \(0\le \varvec{x}^\top {\mathbf {I}}(d_+,d_-)\varvec{x}^\prime \le 1\), where

$$\begin{aligned} \mathbf {I}(p,q) = \text {diag}(1,\ldots ,1,-1,\ldots ,-1) \end{aligned}$$

with p ones and q minus ones. Let \({\mathcal {F}}\) be a probability measure on \({\mathbb {X}}\), \(\mathbf {A}\in \{0,1\}^{n\times n}\) be a symmetric matrix and \(\mathbf {X}=(\varvec{x}_1,\dots ,\varvec{x}_n)^\top \in {\mathbb {X}}^n\). Then \((\mathbf {A},\mathbf {X})\sim \text {GRDPG}_{d_+,d_-}({\mathcal {F}})\) if \(\varvec{x}_1,\dots ,\varvec{x}_n\overset{iid}{\sim }{\mathcal {F}}\) and for \(i<j\), independently,

$$\begin{aligned} {\mathbb {P}}(A_{ij}=1)=\varvec{x}_i^\top {\mathbf {I}}(d_+,d_-)\varvec{x}_j. \end{aligned}$$

To represent the K-community stochastic blockmodel with inter-community probabilities \(\mathbf {B}\) as a GRDPG, \({\mathcal {F}}\) can be chosen to have mass concentrated on atoms \(\varvec{\mu }_1,\dots , \varvec{\mu }_K \in {\mathbb {R}}^{d}\) such that \(\varvec{\mu }_i^\top {\mathbf {I}}(d_+,d_-)\varvec{\mu }_j=B_{ij}\ \forall \ i,j\in \{1,\dots ,K\}\). For consistent estimation of the latent positions in a SBM, interpreted as a GRDPG, Rubin-Delanchy et al. (2017) suggest fitting a Gaussian mixture model with K components to the d-dimensional adjacency or Laplacian spectral embedding.
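
As an illustration of this construction, the following Python sketch recovers latent positions \(\varvec{\mu }_1,\dots ,\varvec{\mu }_K\) from an illustrative matrix \(\mathbf {B}\) via its spectral decomposition, ordering the positive-eigenvalue components first so that the signature matches \(\mathbf {I}(d_+,d_-)\); the matrix \(\mathbf {B}\) shown is hypothetical:

```python
import numpy as np

# Illustrative symmetric inter-community probability matrix (K = 3).
B = np.array([[0.50, 0.30, 0.10],
              [0.30, 0.40, 0.20],
              [0.10, 0.20, 0.60]])

evals, evecs = np.linalg.eigh(B)
keep = np.abs(evals) > 1e-10              # d = rank(B) non-null eigenvalues
evals, evecs = evals[keep], evecs[:, keep]
order = np.argsort(-evals)                # positive eigenvalues first
evals, evecs = evals[order], evecs[:, order]

d_plus, d_minus = int(np.sum(evals > 0)), int(np.sum(evals < 0))
mu = evecs * np.sqrt(np.abs(evals))       # row k is the latent position of community k

I_pq = np.diag(np.r_[np.ones(d_plus), -np.ones(d_minus)])
assert np.allclose(mu @ I_pq @ mu.T, B)   # mu_i^T I(d+, d-) mu_j = B_ij
```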

Adjacency spectral embedding (ASE) and Laplacian spectral embedding (LSE) are two common techniques to embed the adjacency matrix of an undirected graph into a latent space of dimension d. Suppose \(\mathbf {A}\in \{0,1\}^{n\times n}\) is a symmetric adjacency matrix of an undirected graph with n nodes. Then:

Definition 1

(ASE – Adjacency spectral embedding) For \(d\in \{1,\ldots ,n\}\), consider the spectral decomposition \( \mathbf {A} = \hat{\varvec{\Gamma }}\hat{\varvec{\Lambda }}\hat{\varvec{\Gamma }}^\top + \hat{\varvec{\Gamma }}_\perp \hat{\varvec{\Lambda }}_\perp \hat{\varvec{\Gamma }}_\perp ^\top , \) where \(\hat{\varvec{\Lambda }}\) is a \(d\times d\) diagonal matrix containing the top d eigenvalues in magnitude, in decreasing order, \(\hat{\varvec{\Gamma }}\) is an \(n\times d\) matrix containing the corresponding orthonormal eigenvectors, and the matrices \(\hat{\varvec{\Lambda }}_\perp \) and \(\hat{\varvec{\Gamma }}_\perp \) contain the remaining \(n-d\) eigenvalues and eigenvectors. The adjacency spectral embedding \(\hat{\mathbf {X}}=[\hat{\varvec{x}}_{1},\dots ,\hat{\varvec{x}}_{n}]^\top \) of \(\mathbf {A}\) in \({\mathbb {R}}^d\) is \( \hat{\mathbf {X}} = \hat{\varvec{\Gamma }}\vert \hat{\varvec{\Lambda }}\vert ^{1/2}\in \mathbb R^{n\times d}, \) where the operator \(\vert \cdot \vert \) applied to a matrix returns the absolute value of its entries.

Definition 2

(LSE – Laplacian spectral embedding) For \(d\in \{1,\ldots ,n\}\), consider the Laplacian matrix \( \mathbf {L} = \mathbf {D}^{-1/2}\mathbf {A}\mathbf {D}^{-1/2},\ \mathbf {D} = \text {diag}(\sum \nolimits _{j=1}^n A_{ij}), \) and its spectral decomposition \( \mathbf {L} = \tilde{\varvec{\Gamma }}\tilde{\varvec{\Lambda }}\tilde{\varvec{\Gamma }}^\top + \tilde{\varvec{\Gamma }}_\perp \tilde{\varvec{\Lambda }}_\perp \tilde{\varvec{\Gamma }}_\perp ^\top . \) The Laplacian spectral embedding \(\tilde{\mathbf {X}}=[\tilde{\varvec{x}}_{1},\dots ,\tilde{\varvec{x}}_{n}]^\top \) of \(\mathbf {A}\) in \({\mathbb {R}}^d\) is \(\tilde{\mathbf {X}} = \tilde{\varvec{\Gamma }}\vert \tilde{\varvec{\Lambda }}\vert ^{1/2}\).

The modified Laplacian \(\mathbf {D}^{-1/2}\mathbf {A}\mathbf {D}^{-1/2}\) (Rohe et al. 2011) is preferred to \(\mathbf {I}_n - \mathbf {D}^{-1/2}\mathbf {A}\mathbf {D}^{-1/2}\) since the eigenvalues of the former lie in \((-1,1)\), providing a convenient interpretation for disassortative networks (Rubin-Delanchy et al. 2016).
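
A minimal numpy sketch of Definitions 1 and 2 (assuming a dense adjacency matrix with no isolated nodes; for large sparse graphs, a truncated eigensolver such as scipy.sparse.linalg.eigsh would be preferable):

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """ASE (Definition 1): top-d eigenpairs of A by magnitude."""
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(-np.abs(evals))[:d]          # decreasing magnitude
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))

def laplacian_spectral_embedding(A, d):
    """LSE (Definition 2): embed L = D^{-1/2} A D^{-1/2}."""
    inv_sqrt_deg = 1.0 / np.sqrt(A.sum(axis=1))   # assumes no isolated nodes
    L = inv_sqrt_deg[:, None] * A * inv_sqrt_deg[None, :]
    return adjacency_spectral_embedding(L, d)
```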

Intuitively, the estimation procedure proposed by Rubin-Delanchy et al. (2017) is justified because, taking a graph with m nodes, and restricting the attention to the first n nodes, with \(n<m\), the following central limit theorem holds: \(\mathbf {Q}_m\hat{\varvec{x}}_{i} \longrightarrow {\mathbb {N}}\{\varvec{\mu }_{z_i}, m^{-1}\varvec{\Sigma }(\varvec{\mu }_{z_i})\}\) in distribution as \(m\rightarrow \infty \), for \(i=1,\dots ,n\), where \(\mathbf {Q}_m\) is a matrix from the indefinite orthogonal group \(\mathbb O(d_+,d_-)\) and \(\varvec{\Sigma }(\varvec{\mu }_{z_i})\) can be analytically computed (Rubin-Delanchy et al. 2017). The result holds for d fixed and known, but in this work it is of interest to treat d as a random, unknown parameter. If an m-dimensional embedding is considered, with \(m>d\), then asymptotic theory implies an approximate normal distribution with non-zero means and a full covariance within each cluster for the top-d components of the embedding; but, to the best of our knowledge, no theoretical results have been obtained for the remaining \(m-d\) columns. It is therefore necessary to propose a model for the remaining part of the embedding, which will be carefully described in Sect. 3, and empirically justified and assessed in Sect. 7.1.

3 A Bayesian model for SBM embeddings

For simplicity, the embeddings will be generically denoted as \({\mathbf {X}}=[\varvec{x}_1,\dots ,\varvec{x}_n]^\top \in {\mathbb {R}}^{n\times m},\ \varvec{x}_i\in {\mathbb {R}}^m\) for some m, \(d\le m\le n\). In this article, m is always assumed to be fixed and obtained from a preprocessing step. Choosing an appropriate value of m is arguably much easier than choosing the correct d, and, in the proposed model, the correct d can be recovered for any choice of m, as long as \(d\le m\). Let \(\mathbf {X}_{:d}\) denote the first d columns of \({\mathbf {X}}\), and \(\mathbf {X}_{d:}\) the \(m-d\) remaining columns. The notation \(\varvec{x}_{i,:d}\) denotes the first d elements \((x_{i,1},\dots ,x_{i,d})\) of the vector \(\varvec{x}_i\), and similarly \(\varvec{x}_{i,d:}\) denotes the last \(m-d\) elements \((x_{i,d+1},\dots ,x_{i,m})\). Suppose there are a latent dimension d, K communities, and latent community assignments \(\varvec{z}=(z_1,\dots ,z_n)\). The latent positions of nodes within each community are assumed to be generated from an m-dimensional community specific Gaussian distribution:

$$\begin{aligned} \varvec{x}_i \vert d,z_i,\varvec{\mu }_{z_i},\varvec{\Sigma }_{z_i}, \varvec{\sigma }^2_{z_i} \sim {\mathbb {N}}_m \left( \begin{bmatrix} \varvec{\mu }_{z_i} \\ \varvec{0} \end{bmatrix}, \begin{bmatrix} \varvec{\Sigma }_{z_i} &{} \varvec{0} \\ \varvec{0} &{} \varvec{\sigma }^2_{z_i}\mathbf {I}_{m-d} \end{bmatrix} \right) . \end{aligned}$$
(1)

Following the results in Sect. 2, the initial components \(\varvec{x}_{i,:d}\) are assumed to have unconstrained mean vector \(\varvec{\mu }_k\in {\mathbb {R}}^d\) and positive definite \(d\times d\) covariance matrix \(\varvec{\Sigma }_k\). In contrast, for \(\varvec{x}_{i,d:}\), two constraints are imposed: the mean is an \((m-d)\)-dimensional vector of zeros, and the covariance is a diagonal matrix \(\varvec{\sigma }^2_k\mathbf {I}_{m-d}\) with positive entries \(\varvec{\sigma }^2_k=(\sigma ^2_{k,d+1},\dots ,\sigma ^2_{k,m})\). The validation of the model assumptions will be discussed in detail in Sect. 7.1. Assuming group-specific covariances \(\varvec{\sigma }^2_k\) for \(\mathbf {X}_{d:}\) adds extra complexity to the model, and implies that d cannot take the simple interpretation of being the number of dimensions relevant for clustering (see, for example, Raftery and Dean 2006). However, empirical evidence, discussed in Sects. 5 and 7.1, suggests that the last \(m-d\) components of the embedding contain a cluster structure which reflects the communities in the top-d dimensions. Following the discussion in the previous sections, the parameter d actually represents the dimension of the latent positions that generate the network, equivalent to the rank of the unobserved matrices \({\mathbb {E}}(\mathbf {A})\) and \(\mathbf {B}\in [0,1]^{K\times K}\).

In order to complete the model specification, conjugate priors can be placed on the parameters as follows:

$$\begin{aligned} (\varvec{\mu }_{k},\varvec{\Sigma }_{k})\vert d \overset{iid}{\sim }&\text {NIW}_d(\varvec{0}, \kappa _0,\nu _0+d-1,\varvec{\Delta }_d),\nonumber \\ \sigma ^{2}_{k,j} \overset{iid}{\sim }&\text {Inv-}\chi ^2(\lambda _0,\sigma _{0}^2),\ j=d+1,\dots ,m, \nonumber \\ d\vert \varvec{z} \sim&\text {Uniform}\{1,\dots ,K_\varnothing \}, \nonumber \\ z_i\vert \varvec{\theta }\overset{iid}{\sim }&\text {Multinoulli}(\varvec{\theta }),\ i=1,\dots ,n,\nonumber \\ \varvec{\theta }\vert K \sim&\text {Dirichlet}\left( {\alpha }/{K},\dots ,{\alpha }/{K}\right) , \nonumber \\ K \sim&\text {Geometric}(\omega ), \end{aligned}$$
(2)

where, if \(n_k= \sum \nolimits _{i=1}^n \mathbb {1}_k\{z_i\}\) is the size of community k, \(K_\varnothing =\sum _{k=1}^K\mathbb {1}_{\mathbb N_+}\{n_k\}\) is the number of non-empty communities. In (2), \(\text {NIW}_d\) denotes the d-dimensional normal inverse Wishart distribution, and \(\text {Inv-}\chi ^2\) is a scaled inverse chi-square distribution with \(\lambda _0\) degrees of freedom and scaling parameter \(\sigma ^2_0\).
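
For concreteness, the hierarchy in (2) can be forward-sampled as in the following sketch; the hyperparameter values are the defaults later suggested in Sect. 7, while \(\sigma ^2_0=0.1\), \(\varvec{\Delta }_d=\mathbf {I}_d\), and the cap \(d\le m\) are illustrative choices. The scaled inverse chi-square draw uses the standard identity \(\text {Inv-}\chi ^2(\lambda _0,\sigma ^2_0)=\text {Inv-Gamma}(\lambda _0/2,\lambda _0\sigma ^2_0/2)\):

```python
import numpy as np
from scipy.stats import invwishart, invgamma, geom, dirichlet

rng = np.random.default_rng(0)
n, m = 100, 10
kappa0, nu0, lambda0, sigma02, alpha, omega = 1.0, 1.0, 1.0, 0.1, 1.0, 0.1

K = int(geom.rvs(omega, random_state=rng))              # K ~ Geometric(omega)
theta = dirichlet.rvs([alpha / K] * K, random_state=rng)[0]
z = rng.choice(K, size=n, p=theta)                      # z_i ~ Multinoulli(theta)
K_ne = len(np.unique(z))                                # non-empty communities
d = int(rng.integers(1, min(K_ne, m) + 1))              # d | z ~ Unif{1,...,K_ne}, capped at m

# NIW draw for one community: Sigma ~ IW(nu0 + d - 1, Delta_d), mu | Sigma ~ N(0, Sigma/kappa0).
Sigma = np.atleast_2d(invwishart.rvs(df=nu0 + d - 1, scale=np.eye(d), random_state=rng))
mu = rng.multivariate_normal(np.zeros(d), Sigma / kappa0)

# Scaled Inv-chi^2(lambda0, sigma02) coincides with Inv-Gamma(lambda0/2, lambda0 * sigma02 / 2).
sigma2 = invgamma.rvs(lambda0 / 2, scale=lambda0 * sigma02 / 2, size=m - d, random_state=rng)
```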

Yang et al. (2019) have simultaneously and independently proposed a model similar to (1) in a frequentist framework, arriving at similar assumptions. The conjecture on the distribution of \(\mathbf {X}_{d:}\) is essentially the same, except for the choice of the diagonal elements of the cluster-specific covariance matrix: Yang et al. (2019) use a common variance parameter \(\sigma ^2_k\) for the last \(m-d\) columns of the embedding, whereas an \((m-d)\)-dimensional vector of variances \(\varvec{\sigma }^2_k\) is used in this paper. Additionally, as a second difference from Yang et al. (2019), the full model proposed here will also incorporate a second-level community cluster structure on these vectors of variances, which will be introduced in Sect. 5.

Note that the condition \(d\le K\) is explicitly enforced in (2). More specifically, \(d\le K_\varnothing \); this avoids an artificial matching between d and K through empty clusters, which are given non-zero probability mass under the Dirichlet-Multinoulli prior on \((\varvec{\theta },\varvec{z})\). One can also model d and K separately in an analogous way, changing the prior p(d) to, for example,

$$\begin{aligned} d \sim \text {Geometric}(\delta ), \end{aligned}$$
(3)

independently of K and \(\varvec{z}\); this will later be referred to as the unconstrained model. The alternative prior (3) is particularly useful in practical applications and provides a meaningful interpretation of d: when \(d\le K\), then \(d=\text {rank}(\mathbf {B})\), whereas \(d>K\) suggests that the observed data might deviate from the stochastic blockmodel assumption, providing a useful diagnostic for model validation and goodness-of-fit.

The likelihood associated with the spectral embedding \(\mathbf {X}\in {\mathbb {R}}^{n\times m}\) obtained from a stochastic blockmodel can be expressed as:

$$\begin{aligned} L({\mathbf {X}}) = \prod _{i=1}^n \bigg \{\phi (\varvec{x}_{i,:d};\varvec{\mu }_{z_i},\varvec{\Sigma }_{z_i})\prod _{j=d+1}^m \phi (x_{i,j};0,\sigma _{z_i,j}^2)\bigg \}, \end{aligned}$$

where \(\phi (\cdot )\) denotes the (possibly multivariate) Gaussian density function. Hence, the posterior distribution, up to a normalising constant, takes the form

$$\begin{aligned}&p(\{\varvec{\mu }_k\},\{\varvec{\Sigma }_k\},\{\varvec{\sigma }^2_{k}\},\varvec{z},\varvec{\theta },K,d\vert \mathbf {X}) \propto L(\mathbf {X}) \times \\&\prod _{k=1}^K \bigg \{p(\varvec{\mu }_k,\varvec{\Sigma }_k\vert d) \prod _{j=d+1}^{m} p(\sigma ^2_{k,j}\vert d)\bigg \} \prod _{i=1}^n p(z_i\vert \varvec{\theta })p(K)p(d). \end{aligned}$$

The \(\text {NIW}_d(\varvec{0},\kappa _0,\nu _0+d-1,\varvec{\Delta }_{d})\) prior on the pair \((\varvec{\mu }_{k},\varvec{\Sigma }_{k})\) is conjugate and yields a conditional posterior \((\varvec{\mu }_{k},\varvec{\Sigma }_{k})\vert \mathbf {X},\varvec{z},d\sim \text {NIW}_d(\varvec{m}^{(k)}_{:d},\kappa _{n_k},\nu _{n_k}+d-1,\mathbf {D}^{(k)}_{:d})\). By standard methods for inference in a multivariate Gaussian mixture model with NIW prior, the covariance matrix \(\varvec{\Sigma }_k\) can be explicitly integrated out from the posterior to obtain the marginal \( p(\varvec{\mu }_k\vert \mathbf {X},\varvec{z}, d) = t_{\nu _{n_k}}\{\varvec{\mu }_k\vert \varvec{m}^{(k)}_{:d},\mathbf {D}^{(k)}_{:d}/(\kappa _{n_k}{\nu _{n_k}})\}, \) the density of the multivariate Student t distribution with \(\nu _{n_k}\) degrees of freedom, where \(\nu _{n_k} =\nu _0+n_k\), \(\kappa _{n_k} =\kappa _0+n_k\),

$$\begin{aligned} \varvec{m}^{(k)}_{:d}&= \sum \nolimits _{i:z_i=k}\varvec{x}_{i,:d}\big / \kappa _{n_k},\nonumber \\ \mathbf {D}^{(k)}_{:d}&= \varvec{\Delta }_d+ \sum \nolimits _{i:z_i=k} \varvec{x}_{i,:d}\varvec{x}_{i,:d}^\top - \kappa _{n_k}\varvec{m}^{(k)}_{:d}{\varvec{m}^{(k)}_{:d}}^\top . \end{aligned}$$
(4)

Hence, \(\varvec{\mu }_k\) could easily be resampled in a simple Gibbs sampling step, conditional on the current value of d and on the community allocations \(\varvec{z}\). In this work, the location vectors \(\varvec{\mu }_k\) are collapsed out too, but the above distribution is instructive for presenting other distributional results below, and could also be used if the objective of the analysis is to recover the explicit form of the latent positions.

In a multivariate Gaussian model with conjugate NIW prior, it is also possible to analytically express the marginal likelihood of the observations assigned to each community-specific Gaussian, conditional on the assignments \(\varvec{z}\) and on the dimension of the latent space d:

$$\begin{aligned} \left. p\left( \mathbf {X}^{(k)}_{:d}\right| d,\varvec{z}\right) =&\ \pi ^{-n_kd/2}\frac{\kappa _0^{d/2}\vert \varvec{\Delta }_d\vert ^{(\nu _0+d-1)/2}}{\kappa _{n_k}^{d/2}\vert \mathbf {D}^{(k)}_{:d}\vert ^{(\nu _{n_k}+d-1)/2}} \nonumber \\&\times \prod _{i=1}^d \frac{\Gamma \{(\nu _{n_k}+d-i)/2\}}{\Gamma \{(\nu _0+d-i)/2\}}, \end{aligned}$$
(5)

where \(\mathbf {X}^{(k)}_{:d}\) is the subset of rows of \(\mathbf {X}_{:d}\) such that \(z_i=k\).
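
A log-domain implementation of the updates (4) and the marginal likelihood (5) could look as follows; the function name and default hyperparameters are illustrative:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_niw(Xk, kappa0=1.0, nu0=1.0, Delta=None):
    """Log of (5): marginal likelihood of the first d columns of the rows in
    community k, using the posterior updates (4) with prior mean zero."""
    nk, d = Xk.shape
    Delta = np.eye(d) if Delta is None else Delta
    kappa_n, nu_n = kappa0 + nk, nu0 + nk
    m_k = Xk.sum(axis=0) / kappa_n                          # (4)
    D_k = Delta + Xk.T @ Xk - kappa_n * np.outer(m_k, m_k)  # (4)
    i = np.arange(1, d + 1)
    return (-nk * d / 2 * np.log(np.pi)
            + d / 2 * (np.log(kappa0) - np.log(kappa_n))
            + (nu0 + d - 1) / 2 * np.linalg.slogdet(Delta)[1]
            - (nu_n + d - 1) / 2 * np.linalg.slogdet(D_k)[1]
            + np.sum(gammaln((nu_n + d - i) / 2) - gammaln((nu0 + d - i) / 2)))
```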

Given the \(\text {Inv-}\chi ^2(\lambda _0,\sigma ^2_0)\) prior, the posterior for \(\sigma ^2_{k,j}\) is \(\text {Inv-}\chi ^2(\lambda _{n_k},s^{(k)}_j)\), where \(\lambda _{n_k} = \lambda _0 + n_k\), and \(s^{(k)}_j= \left\{ \lambda _0\sigma ^2_{0} + \sum \nolimits _{i:z_i=k} x_{i,j}^2\right\} /\lambda _{n_k}\). Similar calculations give the full marginal likelihood for the remaining portion of the embedding \(\mathbf {X}^{(k)}_{d:}\):

$$\begin{aligned} p\left. \left( \mathbf {X}^{(k)}_{d:}\right| d,\varvec{z}\right) =&\ \pi ^{-n_k(m-d)/2} \bigg \{\frac{\Gamma (\lambda _{n_k}/2)}{\Gamma (\lambda _0/2)}\bigg \}^{m-d} \nonumber \\&\times \prod _{j=d+1}^{m}\frac{(\lambda _0\sigma ^2_{0})^{\lambda _0/2}}{\left( \lambda _{n_k}s^{(k)}_j\right) ^{\lambda _{n_k}/2}}. \end{aligned}$$
(6)
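
Analogously, a sketch of the log of (6) for the last \(m-d\) columns of the rows assigned to community k:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_noise(Xk_rest, lambda0=1.0, sigma02=0.1):
    """Log of (6): marginal likelihood of the last m - d columns of the rows
    in community k, under zero-mean Gaussians with Inv-chi^2 variance priors."""
    nk, m_minus_d = Xk_rest.shape
    lambda_n = lambda0 + nk
    s = (lambda0 * sigma02 + (Xk_rest ** 2).sum(axis=0)) / lambda_n
    return (-nk * m_minus_d / 2 * np.log(np.pi)
            + m_minus_d * (gammaln(lambda_n / 2) - gammaln(lambda0 / 2))
            + np.sum(lambda0 / 2 * np.log(lambda0 * sigma02)
                     - lambda_n / 2 * np.log(lambda_n * s)))
```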

Also, note that the probabilities \(\varvec{\theta }\) associated to the community assignment can be easily integrated out, resulting in the following marginal likelihood, conditional on K:

$$\begin{aligned} p(\varvec{z}\vert K)= \frac{\Gamma (\alpha )\prod _{k=1}^K \Gamma (n_k+\alpha /K)}{\Gamma (\alpha /K)^K\Gamma (n+\alpha )}. \end{aligned}$$
(7)
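
In code, (7) is a short combination of log-gamma terms, sketched below:

```python
import numpy as np
from scipy.special import gammaln

def log_p_z_given_K(z, K, alpha=1.0):
    """Log of (7): marginal probability of the allocations z, theta integrated out."""
    n_k = np.bincount(z, minlength=K)    # community sizes (empties included)
    return (gammaln(alpha) - K * gammaln(alpha / K)
            + np.sum(gammaln(n_k + alpha / K)) - gammaln(len(z) + alpha))
```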

The distributional results presented in (4)–(7) (for a proof, see, for example, Murphy 2007) are the building blocks for the MCMC sampler which is used to perform Bayesian inference on the model parameters of interest.

4 Inference via MCMC

Since the full posterior is not analytically tractable, inference is performed using MCMC sampling with trans-dimensional moves (Green 1995). The main objective of the analysis is to cluster the nodes, and therefore the locations \(\varvec{\mu }_k\), the variance parameters \(\varvec{\Sigma }_k\) and \(\varvec{\sigma }^2_k\) and the community probabilities \(\varvec{\theta }\) are considered as nuisance parameters and integrated out. Essentially, in this type of collapsed Gibbs sampler (Liu 1994), four moves are available (Richardson and Green 1997), described in the subsequent four subsections. A similar sampler is used by Wyse and Friel (2012) to estimate the number of clusters in stochastic blockmodels.

4.1 Change in the community allocations

A fully collapsed Gibbs update for each community assignment is available:

$$\begin{aligned} p(z_i=k\vert \varvec{z}_{-i},{\mathbf {X}},d,K) \propto&\ p(z_i=k\vert \varvec{z}_{-i},d,K) \nonumber \\&\times p({\varvec{x}}_i\vert \{{\varvec{x}}_j\}_{j\ne i:z_j=k},d). \end{aligned}$$
(8)

In the special case where \(d=K_\varnothing \) and \(n_{z_i}=1\), the full conditional distribution for \(z_i\) assigns probability one to retaining the same value since the model does not permit \(d>K_\varnothing \). Otherwise, from (7):

$$\begin{aligned} p(z_i=k\vert \varvec{z}_{-i},d,K) \propto \frac{n_k^{-i} + \alpha /K}{n-1+\alpha }, \end{aligned}$$
(9)

where \(n_k^{-i}=n_k-\mathbb {1}_k(z_i)\). Similarly, the remaining term in (8), \(p({\varvec{x}}_i\vert \{{\varvec{x}}_j\}_{j\ne i:z_j=k},d)\), can be obtained as the ratio of marginal likelihoods

$$\begin{aligned} p({\varvec{x}}_i\vert \{{\varvec{x}}_j\}_{j\ne i:z_j=k}, d) = \frac{p({\varvec{x}}_i,\{{\varvec{x}}_j\}_{j\ne i:z_j=k}\vert d)}{p(\{{\varvec{x}}_j\}_{j\ne i:z_j=k}\vert d)}. \end{aligned}$$
(10)

The ratio (10) can be decomposed as the product of two ratios of marginal likelihoods. Using (5), the first ratio is equivalent to a multivariate Student’s t density (Murphy 2007):

$$\begin{aligned}&p(\varvec{x}_{i,:d}\vert \{\varvec{x}_{j,:d}\}_{j\ne i:z_j=k},d) = \nonumber \\&\quad t_{\nu _{n_k^{-i}}}\left( \varvec{x}_{i,:d}\,\middle \vert \,\varvec{m}^{(k),-i}_{:d},\frac{\kappa _{n_k^{-i}}+1}{\kappa _{n_k^{-i}}\nu _{n_k^{-i}}} \mathbf {D}^{(k),-i}_{:d}\right) , \end{aligned}$$
(11)

where the additional superscript \(-i\) denotes a cluster quantity that is computed excluding the allocation \(z_i\) of the i-th node, and \(t_\nu (\cdot \vert \varvec{\mu },\varvec{\Sigma })\) denotes the density function of a (possibly multivariate) Student’s t distribution with \(\nu \) degrees of freedom, location \(\varvec{\mu }\), and shape matrix \(\varvec{\Sigma }\). The second ratio, which accounts for the last \(m-d\) dimensions, has the form

$$\begin{aligned} p(\varvec{x}_{i,d:}\vert \{\varvec{x}_{j,d:}\}_{j\ne i:z_j=k},d) = \prod _{j=d+1}^m t_{\lambda _{n_k^{-i}}}\left( x_{i,j}\,\middle \vert \,0,s^{(k),-i}_j\right) . \nonumber \\ \end{aligned}$$
(12)
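
Putting (9), (11) and (12) together, the unnormalised full conditional (8) could be computed as in the following sketch; the tuple format of the per-community summaries (computed with node i held out) is a hypothetical convention, not part of the model:

```python
import numpy as np
from scipy.stats import multivariate_t, t as student_t

def allocation_logprobs(x, stats, d, K, alpha=1.0):
    """Unnormalised log of (8) for node i, combining (9), (11) and (12).
    `stats[k]` holds the community-k posterior summaries computed with node i
    held out: (n_k, kappa_n, nu_n, m_d, D_d, lambda_n, s_rest)."""
    logp = np.empty(K)
    for k, (n_k, kappa_n, nu_n, m_d, D_d, lambda_n, s_rest) in enumerate(stats):
        shape = (kappa_n + 1) / (kappa_n * nu_n) * D_d
        logp[k] = (np.log(n_k + alpha / K)                          # (9), up to a constant
                   + multivariate_t.logpdf(x[:d], loc=m_d,
                                           shape=shape, df=nu_n)    # (11)
                   + student_t.logpdf(x[d:], df=lambda_n,
                                      scale=np.sqrt(s_rest)).sum()) # (12)
    return logp   # normalise with, e.g., scipy.special.softmax
```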

4.2 Split or merge two communities

To vary the number of communities, move proposals inspired by Sequentially-Allocated Merge-Split sampling (Dahl 2003; Jain and Neal 2004) are used here. Two indices i and j are sampled at random from the n nodes, and without loss of generality assume \(z_i\le z_j\). If \(z_i=z_j\), then the single cluster is split, whereas if \(z_i<z_j\) the two clusters are merged. In both move types, node i will remain in the same cluster, denoted \(z^\star _i=z_i\). In the merge move, all elements of cluster \(z_j\) are reassigned to cluster \(z_i\) (with any higher-indexed clusters subsequently decremented). For the split move, node j is first reassigned to cluster \(K^\star =K+1\) with new allocation \(z^\star _{j}=K^\star \); then, in random order, the remaining nodes currently allocated to cluster \(z_i\) are randomly reassigned to clusters \(z_i\) or \(K^\star \) with probability proportional to their predictive distribution from the generative model (10). Denoting the resulting product of renormalised predictive densities from these reallocations by \(q(K^\star ,\varvec{z}^\star \vert K, \varvec{z})\), the acceptance probability for a split move, for example, is

$$\begin{aligned}&\alpha (K^\star ,\varvec{z}^\star \vert K,\varvec{z}) =\nonumber \\&\min \left\{ 1, \frac{p(\mathbf {X}\vert d,K^\star ,\varvec{z}^\star ) p(d\vert \varvec{z}^\star ,K^\star )p(\varvec{z}^\star ,K^\star ) }{p(\mathbf {X}\vert d,K,\varvec{z}) p(d\vert \varvec{z},K)p(\varvec{z},K) q(K^\star ,\varvec{z}^\star \vert K, \varvec{z})} \right\} . \end{aligned}$$
(13)

The ratio of densities for \(\mathbf {X}\) in (13) will depend only upon the rows of the matrix corresponding to the cluster being split (or similarly, merged), and these expressions will decompose as products of terms for the first d and remaining \(m-d\) components [cf. (11), (12)].

4.3 Create or remove an empty community

Adding or removing empty communities whilst fixing \(\varvec{z}\) corresponds to proposing \(K^\star =K+1\) or \(K^\star =K-1\) respectively, although the latter proposal is not possible if \(K=K_\varnothing \), meaning there are currently no empty communities. The acceptance probability is simply

$$\begin{aligned} \alpha (K^\star \vert K )= \min \left\{ 1, \frac{p(\varvec{z}\vert K^\star )p(K^\star )q_\varnothing }{p(\varvec{z}\vert K)p(K)} \right\} , \end{aligned}$$

where the proposal ratio \(q_\varnothing =2\) if \(K^\star =K_\varnothing \), \(q_\varnothing =0.5\) if \(K=K_\varnothing \) and \(q_\varnothing =1\) otherwise.

4.4 Change in the latent dimension

This move is only required when the latent dimension is not marginalised out. Given a current value d, a new value \(d^\star \) is proposed from a density \(q(d^\star \vert d)\propto \xi ^{\vert {d^\star -d}\vert }\mathbb {1}_{\mathcal {D}}(d^\star )\) on a neighbourhood \( {\mathcal {D}}=\{\max \{1,d-l\},\dots ,d-1,d+1,\dots ,\min \{d+l,m\}\}, \) typically with \(l\le 5\) and \(\xi \in (0,1)\). The acceptance ratio reduces to

$$\begin{aligned} \alpha (d^\star \vert d) = \min \left\{ 1, \frac{p(\mathbf {X}\vert d^\star ,K,\varvec{z})p(d^\star \vert \varvec{z})}{p(\mathbf {X}\vert d,K,\varvec{z})p(d\vert \varvec{z})}\frac{q(d\vert d^\star )}{q(d^\star \vert d)} \right\} . \end{aligned}$$

Notably, if \(d^\star >d\), the ratio \(p(\mathbf {X}\vert d^\star ,K,\varvec{z})/p(\mathbf {X}\vert d,K,\varvec{z})\) only depends on the first \(d^\star \) components of the embedding, since the last \(m-d^\star \) components remain independent by (1).
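
A sketch of the proposal mechanism, returning both \(d^\star \) and the log proposal ratio \(\log \{q(d\vert d^\star )/q(d^\star \vert d)\}\) needed in the acceptance probability:

```python
import numpy as np

def propose_dimension(d, m, l=5, xi=0.8, rng=None):
    """Sample d* from q(d*|d) on the neighbourhood D of Sect. 4.4, returning
    d* and log{q(d|d*)/q(d*|d)} for use in the acceptance ratio."""
    rng = np.random.default_rng() if rng is None else rng
    def weights(c):
        cand = np.array([j for j in range(max(1, c - l), min(c + l, m) + 1) if j != c])
        w = xi ** np.abs(cand - c)
        return cand, w / w.sum()
    cand, q = weights(d)
    d_star = int(rng.choice(cand, p=q))
    cand_rev, q_rev = weights(d_star)
    return d_star, np.log(q_rev[cand_rev == d][0]) - np.log(q[cand == d_star][0])
```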

4.5 Inferring communities

Markov chain Monte Carlo samplers for mixture models with varying number of clusters are well known to be affected by label switching (Jasra et al. 2005), since the likelihood is invariant to permutations of the cluster labels. However, the estimated posterior similarity between nodes i and j, \({\hat{\pi }}_{ij}=\hat{\mathbb P}(z_i=z_j\vert \mathbf {X})=\sum _{s=1}^{M} \mathbb {1}_{z_i^{(s)}}\{z_j^{(s)}\}/M\), is invariant to label switching. Communities can be estimated from the MCMC chains using the posterior similarity matrix \(\{{\hat{\pi }}_{ij}\}\) and the PEAR method (maximisation of the posterior expected adjusted Rand index, Fritsch and Ickstadt 2009). Alternatively, if a configuration with a fixed number of clusters K is required, the clusters can be estimated using hierarchical clustering with average linkage, using \(1-{\hat{\pi }}_{ij}\) as the distance measure (Medvedovic et al. 2004).
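
The hierarchical-clustering alternative can be sketched with scipy as follows, given an \(M\times n\) array of posterior allocation samples (PEAR maximisation itself is not reproduced here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def estimate_communities(Z, K):
    """Z: (M, n) array of MCMC allocation samples. Builds the posterior
    similarity matrix pi_ij and clusters on the distance 1 - pi_ij."""
    M, n = Z.shape
    psm = np.zeros((n, n))
    for z in Z:                     # pi_ij = (1/M) sum_s 1{z_i^(s) = z_j^(s)}
        psm += z[:, None] == z[None, :]
    psm /= M
    dist = squareform(1.0 - psm, checks=False)       # condensed form
    return fcluster(linkage(dist, method='average'), K, criterion='maxclust')
```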

5 Second-level clustering of community variances

Empirical analyses of simulations from the stochastic blockmodel show that identifying and clearly separating the K clusters in \(\mathbf {X}_{d:}\) is particularly difficult for the sampler in settings where \(d\ll m\). The problem is particularly evident when \(m=n\) and d is small. In this case, it has been assessed empirically that the within-cluster variances of the true communities in simulated datasets converge to similar values, such that \(\sigma ^2_{k,j}\approx \sigma ^2_{\ell ,j}\) for \(j\gg d\) and \(k\ne \ell \). Therefore, when m is large enough, the selected model tends to be under-specified: the correct dimension d is identified, but the true number of communities K is underestimated. This is also one of the main reasons why it is not advisable to directly fit a Gaussian mixture model on \(\mathbf {X}\in {\mathbb {R}}^{n\times m}\) and allow K to be random, ignoring the role of d.

The problem is illustrated in Fig. 1, which shows the within-cluster variance for each dimension, obtained from performing ASE for a simulation of \(n=500\) nodes from a stochastic blockmodel containing five communities with well-separated mean locations. In Fig. 1, the clustering structure \(\varvec{z}\) is assumed to be known. More details about simulations of SBMs are given in Sect. 7.1. In the plot, the within-cluster variances of three of the five communities of the simulated graph fluctuate around the same values on each dimension larger than d. For dimensions larger than approximately 150, four of the five clusters have approximately the same variance on the subsequent dimensions. Therefore, when \(m=n\), the MCMC sampler selects the MAP estimate \({\hat{K}}=2\) for parsimony, and increases the variance of the Gaussian distributions on the first two dimensions, on which the clusters are well separated.

Fig. 1

Empirical within-block variance and total variance for the adjacency embedding obtained from a simulated 5-block SBM with \(d=2\), \(n=500\), d-dimensional means \(\varvec{\mu }_1=[0.7,0.4], \varvec{\mu }_2=[0.1,0.1], \varvec{\mu }_3=[0.4,0.8], \varvec{\mu }_4=[-0.1,0.5]\) and \(\varvec{\mu }_5=[0.3,0.5]\), and \(n_k=100\) for \(k=1,\dots ,K\). The right panel is the left panel plot zoomed in to the first 25 dimensions

The solution proposed here is to assume shared variance parameters between some of the clusters for dimensions larger than d. Specifically, each community \(k\in \{1,\dots ,K\}\) is assigned a second-level cluster allocation \(v_k\in \{1,\dots ,H\}\), with \(H\le K\). If \(v_k=v_{\ell }\), then for \(j>d\), \(\sigma ^2_{k,j}=\sigma ^2_{\ell ,j}\). Formally,

$$\begin{aligned} \varvec{x}_i \vert d,K,z_i,v_{z_i} \sim&\ {\mathbb {N}}_m \left( \begin{bmatrix} \varvec{\mu }_{z_i} \\ \varvec{0} \end{bmatrix}, \begin{bmatrix} \varvec{\Sigma }_{z_i} &{} \varvec{0} \\ \varvec{0} &{} \varvec{\sigma }^2_{v_{z_i}}\mathbf {I}_{m-d} \end{bmatrix} \right) , \\ v_k \vert K,H \sim&\ \text {Multinoulli}(\varvec{\phi }),\ k=1,\dots ,K, \\ \varvec{\phi }\vert H \sim&\ \text {Dirichlet}\left( {\beta }/{H},\dots ,{\beta }/{H}\right) , \\ H\vert K\sim&\ \text {Uniform}\{1,\dots ,K\}. \end{aligned}$$

Essentially, the vector \(\varvec{v}=(v_1,\dots ,v_K)\) defines a clustering of communities. Note that if \(H=1\), all the communities are assigned to the same second-level cluster, and the problem of selecting d essentially reduces to an ordinal version of the feature selection problem in clustering (Raftery and Dean 2006). Also, if \(H=1\), then there is no information about the clusters in the last \(m-d\) components of the embedding.

Under this extended model, the posterior distribution for \(\sigma ^2_{k,j}\) changes due to the aggregation of communities in the second level. Under the \(\text {Inv-}\chi ^2(\lambda _0,\sigma ^2_0)\) prior, the posterior is \(\text {Inv-}\chi ^2(\lambda _{n_{\bullet k}},s^{(\bullet k)}_j)\), where \(n_{\bullet k} = \sum \nolimits _{\ell :v_\ell =k} n_{\ell }\) and

$$\begin{aligned} s^{(\bullet k)}_j= \left\{ \lambda _0\sigma ^2_{0} + \sum \nolimits _{i:v_{z_i}=k} x_{i,j}^2\right\} \Big /\lambda _{n_{\bullet k}}. \end{aligned}$$

Calculations similar to (6) give the correct form of the marginal likelihood for the right-hand side of the embedding matrix, restricted to a given value of \(v_k\). Clearly, \(\varvec{\phi }\) can again be marginalised out, yielding the marginal likelihood

$$\begin{aligned} p(\varvec{v}\vert H) = \frac{\Gamma (\beta )\prod _{h=1}^H \Gamma (\sum _{k=1}^K \mathbb {1}_{h}\{v_k\}+\beta /H)}{\Gamma (\beta /H)^H\Gamma (K+\beta )}. \end{aligned}$$

The MCMC sampler described in Sect. 4 must be slightly adapted. For the Gibbs sampling move in Sect. 4.1, the product of univariate Student’s t densities in (12) is modified using the appropriate \((\lambda _{n_{\bullet k}},s^{(\bullet k)}_j)\) pair. For the change in dimension, \(p(\mathbf {X}\vert d,K,\varvec{z}, \varvec{v})\) should be computed using the shared variances and the allocations \(\varvec{v}\). When an empty community is proposed, as in Sect. 4.3, the ratio \(p(\varvec{v}^\star \vert K)/p(\varvec{v}\vert K)\) must be added, limited to the second-level allocation of the additional community. The value \(v_k\) for the proposed empty cluster can simply be chosen at random from \(\{1,\dots ,H\}\). Finally, for the split-merge move in Sect. 4.2, if \(z_i=z_j\) for the two selected nodes, then the new community arising from the split inherits the second-level allocation, \(v_{z_i}=v_{z_j}\). Alternatively, if \(z_i\ne z_j\), the second-level allocation of the merged community is sampled at random from \(\{v_{z_i},v_{z_j}\}\).

Finally, three additional moves are required: resampling the second-level cluster allocations \(\varvec{v}\) using a Gibbs sampling step; proposing a second-level split-merge move; and adding or removing an empty second-level cluster. When \(\varvec{\phi }\) and the parameters of the Gaussian distributions are marginalised out, the second-level allocations are resampled according to the following equation:

$$\begin{aligned}&p(v_k=h\vert \varvec{v}_{-k},{\mathbf {X}},\varvec{z},d,K) \propto p(v_k=h\vert \varvec{v}_{-k},K) \nonumber \\&\quad \times p\left( \mathbf {X}^{(k)}_{d:}\left| \left\{ \mathbf {X}^{(\ell )}_{d:}\right\} _{\ell \ne k:v_\ell =h},v_k=h,d,K\right. \right) , \end{aligned}$$
(14)

where the independence assumption between \(\mathbf {X}^{(k)}_{:d}\) and \(\mathbf {X}^{(k)}_{d:}\) is used. Similarly to (9):

$$\begin{aligned} p(v_k=h\vert \varvec{v}_{-k},K) = \frac{\sum _{\ell \ne k} \mathbb {1}_h\{v_\ell \}+\beta /H}{K-1+\beta }. \end{aligned}$$

The calculations for the second term in (14) are similar to (10):

$$\begin{aligned}&p\left( \mathbf {X}^{(k)}_{d:}\left| \left\{ \mathbf {X}^{(\ell )}_{d:}\right\} _{\ell \ne k:v_\ell =h}\right. ,v_k=h,d,K\right) = \\&\quad \frac{\left. p\left( \mathbf {X}^{(k)}_{d:},\left\{ \mathbf {X}^{(\ell )}_{d:}\right\} _{\ell \ne k:v_\ell =h}\right| v_k=h,d,K\right) }{\left. p\left( \left\{ \mathbf {X}^{(\ell )}_{d:}\right\} _{\ell \ne k:v_\ell =h}\right| d,K\right) }, \end{aligned}$$

which can be computed using (6). The second-level split-merge move and the proposal of an empty cluster follow the same guidelines as in Sects. 4.2 and 4.3.

Potentially, the model could be extended further using the same reasoning: from the plot in Fig. 1, it is clear that the different clusters begin to share the same variance at different points in the plot. Empirically, all the variances approximately converge to the same values at large dimensions, and it is therefore possible to identify a \((K-1)\)-vector of discrete points in \(\{d,d+1,\dots ,m\}\) at which different community variances coalesce. For the plot in Fig. 1, such a vector could be \((d,d,d,150,n)\), with \(d=2\) and \(n=m=500\).

6 Extension to directed and bipartite graphs

A directed graph \({\mathbb {G}}=(V,E)\) has the property that \((i,j)\in E\) does not necessarily imply \((j,i)\in E\), meaning the corresponding adjacency matrix \(\mathbf {A}\in \{0,1\}^{n\times n}\) is not, in general, symmetric. Directed graphs are useful for representing directed interaction networks, such as email traffic patterns; knowing that individual i broadcasts emails to individual j does not immediately imply that j also issues communications to i. In a random dot product graph context, it can be assumed that each node has two underlying latent positions \(\varvec{x}_i\) and \(\varvec{x}_i^\prime \), characterising, respectively, its behaviour as a source and as a destination of the connection. Therefore, \(\mathbb P(A_{ij}=1)=\varvec{x}_i^\top \varvec{x}_j^\prime \). For a stochastic blockmodel, \({\mathbb {P}}(A_{ij}=1)= B_{z_iz_j}=\varvec{\mu }_{z_i}^\top \varvec{\mu }_{z_j}^\prime \) for latent positions \(\varvec{\mu }_{z_i},\varvec{\mu }_{z_j}^\prime \in {\mathbb {R}}^d\), where the matrix \(\mathbf {B}\in [0,1]^{K\times K}\) is in this case asymmetric.

Definition 3

(Adjacency embedding of the directed graph) Given a directed graph with adjacency matrix \(\mathbf {A}\in \{0,1\}^{n\times n}\), and an integer \(d\in \{1,\dots ,n\}\), consider the singular value decomposition

$$\begin{aligned} \mathbf {A} = \begin{bmatrix} \hat{\mathbf {U}}&\hat{\mathbf {U}}_\perp \end{bmatrix} \begin{bmatrix} \hat{\mathbf {D}} &{} \mathbf {0} \\ \mathbf {0} &{} \hat{\mathbf {D}}_\perp \end{bmatrix} \begin{bmatrix} \hat{\mathbf {V}}^\top \\ \hat{\mathbf {V}}^\top _\perp \end{bmatrix} = \hat{\mathbf {U}}\hat{\mathbf {D}}\hat{\mathbf {V}}^\top + \hat{\mathbf {U}}_\perp \hat{\mathbf {D}}_\perp \hat{\mathbf {V}}^\top _\perp , \end{aligned}$$

where \(\hat{\mathbf {D}}\in {\mathbb {R}}_+^{d\times d}\) is a diagonal matrix containing the top d singular values in decreasing order, \(\hat{\mathbf {U}}\in {\mathbb {R}}^{n\times d}\) and \(\hat{\mathbf {V}}\in \mathbb R^{n\times d}\) contain the corresponding left and right singular vectors, and the matrices \(\hat{\mathbf {D}}_\perp \), \(\hat{\mathbf {U}}_\perp \), and \(\hat{\mathbf {V}}_\perp \) contain the remaining \(n-d\) singular values and vectors. The d-dimensional directed adjacency embedding of \(\mathbf {A}\) in \({\mathbb {R}}^{d}\) is defined as the pair \( \hat{\mathbf {X}}=\hat{\mathbf {U}}\hat{\mathbf {D}}^{1/2},\ \hat{\mathbf {X}}^\prime =\hat{\mathbf {V}}\hat{\mathbf {D}}^{1/2}\).

Writing \(\hat{\mathbf {X}}=\hat{\mathbf {U}}\hat{\mathbf {D}}^{1/2}\) and \(\hat{\mathbf {X}}^\prime =\hat{\mathbf {V}}\hat{\mathbf {D}}^{1/2}\), the rows of \(\hat{\mathbf {X}}\) characterise the activities of each node as a source, and the rows of \(\hat{\mathbf {X}}^\prime \) characterise the same nodes as destinations.
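
A direct numpy sketch of Definition 3; the same code applies unchanged to the rectangular bi-adjacency matrices discussed at the end of this section:

```python
import numpy as np

def directed_adjacency_embedding(A, d):
    """Directed adjacency embedding (Definition 3); A may be rectangular,
    as for the bi-adjacency matrix of a bipartite graph."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # singular values in decreasing order
    X = U[:, :d] * np.sqrt(s[:d])          # source positions, X = U D^{1/2}
    X_prime = Vt[:d].T * np.sqrt(s[:d])    # destination positions, X' = V D^{1/2}
    return X, X_prime
```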

The model in (1) can be easily adapted to directed graphs. Treating the embeddings \(\mathbf {X}\) and \(\mathbf {X}^\prime \) as independent, each is modelled separately using the same Gaussian structure and prior distributions (1), except for three parameters which are initially assumed common to both embeddings: the latent dimension d, the number of communities K and the vector of node assignments to those communities, \(\varvec{z}\).

In some contexts it will be more relevant to consider different community membership structures for the same set of nodes when considering them as source nodes or destination nodes. In this case, let K denote the number of source communities and \(K^\prime \) denote the number of destination communities; similarly let \(\varvec{z}\) denote the assignments of nodes to source communities, and \(\varvec{z}^\prime \) the allocations to destination communities. The problem of jointly learning \(\varvec{z}\) and \(\varvec{z}^\prime \) (as well as d) is commonly known as co-clustering, and the corresponding network model is known as the stochastic co-blockmodel (ScBM) (Rohe et al. 2016), or Latent Block Model (LBM) (Govaert and Nadif 2010). Given an asymmetric matrix \(\mathbf {B}\in [0,1]^{K\times K^\prime }\), then \({\mathbb {P}}(A_{ij}=1)=B_{z_iz^\prime _j}\). From a random dot product graph perspective, it is assumed that \(B_{z_iz^\prime _j}=\varvec{\mu }_{z_i}^\top \varvec{\mu }^\prime _{z^\prime _j}\), for some latent positions \(\varvec{\mu }_{z_i},\varvec{\mu }^\prime _{z^\prime _j}\in {\mathbb {R}}^d\) and \(d={{\,\mathrm{rank}\,}}(\mathbf {B})\le \min (K,K^\prime )\).

The Bayesian model for ScBMs can be easily represented as a separate model for \(\mathbf {X}\) and \(\mathbf {X}^\prime \), of the form given in (1), with the latent dimension of the embedding d now the only common parameter. Inference via MCMC can be performed in an equivalent way to the method described in Sect. 4; the only difference is in the expression of the acceptance ratio for a change in the shared latent dimension d, but the procedure can exploit the results obtained in Sect. 4.4, using the fact that

$$\begin{aligned}&p(\mathbf {X},\mathbf {X}^\prime \vert d,K,K^\prime ,\varvec{z},\varvec{z}^\prime ) = \prod _{k=1}^{K} \left. p\left( \mathbf {X}^{(k)}_{:d}\right| d\right) \left. p\left( \mathbf {X}^{(k)}_{d:}\right| d\right) \\&\quad \times \prod _{k^\prime =1}^{K^\prime } \left. p\left( \mathbf {X^{\prime }}^{(k^\prime )}_{:d}\right| d\right) \left. p\left( \mathbf {X^{\prime }}^{(k^\prime )}_{d:}\right| d\right) , \end{aligned}$$

where all the marginal likelihoods can be equivalently obtained from (5). Furthermore, the model can be appropriately modified when \(d\ll m\) to include the second-level cluster allocations proposed in Sect. 5.

Finally, in bipartite graphs, the observed nodes can be partitioned into two sets V and \(V^\prime \), with \(V\cap V^\prime =\varnothing \) and \(E\cap \{(V\times V) \cup (V^\prime \times V^\prime )\}=\varnothing \). Assume that V plays the role of the set of source nodes and \(V^\prime \) of the set of destination nodes. Bipartite graphs are usually represented by a rectangular bi-adjacency matrix \(\mathbf {A}\in \{0,1\}^{n\times n^\prime }\), with \(n^\prime =\vert {V^\prime }\vert \). In this case, it is still possible to apply the methods described in this section to the SVD embedding obtained from the rectangular matrix \(\mathbf {A}\). Note that the ScBM extends trivially to the bipartite graph case, which is essentially a special case of a directed graph, with the cluster configurations for source and destination nodes now inescapably unrelated and each node possessing only one latent representation in \({\mathbb {R}}^d\).

7 Applications and results

The Bayesian latent feature network models described in this article have been applied to both simulated and real world network data from undirected and directed graphs. The real network data analysed are an undirected network obtained from the Santander bike hires in London, and the Enron Email Dataset; details are given in the corresponding sections.

The model and MCMC sampler have been tested using different combinations of the hyperparameters, showing robustness to the prior choice. In the absence of strong prior information about the community structure, it is advisable to use the usual uninformative values \(\kappa _0=\nu _0=\lambda _0=\alpha =\beta =1\), and \(\omega =\delta =0.1\). For the proposal of change in dimension (cf. Sect. 4.4), \(\xi =0.8\). These values have been used as default settings for the MCMC sampler in the next sections. Inferential performance is sensitive to extreme values of the variance parameters, relative to the prior mean, but otherwise robust. In practice, therefore, the expectation of the prior for the variance parameters should be chosen to be on the same scale as the observed data. The cluster configuration could be suitably initialised using K-means for some pre-specified K, usually chosen according to the scree-plot criterion. The second-level clusters have been initialised by setting \(H=K\). In order to set the prior covariances to a realistic value, the correlations in the \(\varvec{\Delta }_d\) matrices could be set to zero, and the elements on the diagonal of \(\varvec{\Delta }_d\) to the average within-cluster variance based on the K-means cluster configuration. Similarly, the prior \(\sigma ^2_{0j}\) values could be set to the total variance on the corresponding column of the embedding.

In Sects. 7.2 and 7.3, the algorithms were initialised using the above guidelines, and run for a total of samples with burn-in, for a number of different chains to check for convergence.

Fig. 2

Adjacency embedding for an undirected graph with nodes, \(K=5\), obtained from a symmetric \(\mathbf {B}\in [0,1]^{K\times K}\) with \(B_{k\ell }\sim \text {Beta}(1.2,1.2)\), \(\varvec{\theta }=(1/K,\dots ,1/K)\) and \(d=2\). For (e) and (f), \(m=50\) is used

7.1 Synthetic data and model validation

In order to validate the model assumptions in (1), stochastic blockmodels have been simulated and the fit of the proposed model has been evaluated on the estimated latent positions. A stochastic blockmodel can be simulated starting from a matrix \(\mathbf {B}\in [0,1]^{K\times K^\prime }\) containing the probabilities of connection between communities, and a vector \(\varvec{\theta }\) of community allocation probabilities. For an undirected graph, \(K=K^\prime \) and the constraint \(B_{k\ell }=B_{\ell k}\) is imposed; similarly, for directed graphs with a shared cluster configuration (cf. Sect. 6), \(K=K^\prime \). Each element \(B_{k\ell }\) of \(\mathbf {B}\) could be generated, for example, from independent beta draws.

The latent dimension d corresponds to the rank of the matrix \(\mathbf {B}\), and a random matrix \(\mathbf {B}\) generated from independent beta draws has full rank with probability 1. Therefore, to simulate \(d<K\), a low-rank approximation of \(\mathbf {B}\) must be used to generate the network. For undirected graphs, a truncated spectral decomposition can be used: \(\tilde{\mathbf {B}}=\varvec{\Gamma }_d\varvec{\Lambda }_d\varvec{\Gamma }_d^\top \) (recall Definition 1). Similarly, for directed and bipartite graphs, the truncated SVD is an appropriate approximation: \(\tilde{\mathbf {B}}= \mathbf {U}_d\mathbf {D}_d\mathbf {V}_d^\top \) (see Definition 3). Note that under this low-rank approximation, it must be checked that each element \({\tilde{B}}_{k\ell }\in [0,1]\). The procedure is summarised in Algorithm 1.

Algorithm 1 Simulation of a stochastic blockmodel with latent dimension \(d\le K\)
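
A Python sketch consistent with the description of Algorithm 1 above; the rejection loop implements the required check that every entry of \(\tilde{\mathbf {B}}\) lies in [0, 1], and balanced allocation probabilities \(\varvec{\theta }=(1/K,\dots ,1/K)\) are assumed as in the simulations below:

```python
import numpy as np

def simulate_sbm(n, K, d, a=1.2, b=1.2, rng=None):
    """Simulate an undirected SBM whose matrix B has rank d <= K."""
    rng = np.random.default_rng() if rng is None else rng
    while True:
        B = rng.beta(a, b, size=(K, K))
        B = np.tril(B) + np.tril(B, -1).T               # symmetrise
        evals, evecs = np.linalg.eigh(B)
        top = np.argsort(-np.abs(evals))[:d]
        B_tilde = (evecs[:, top] * evals[top]) @ evecs[:, top].T  # rank-d truncation
        if B_tilde.min() >= 0.0 and B_tilde.max() <= 1.0:  # entries must lie in [0, 1]
            break
    z = rng.choice(K, size=n)                            # theta = (1/K, ..., 1/K)
    P = B_tilde[np.ix_(z, z)]
    A = (rng.uniform(size=(n, n)) < P).astype(int)
    A = np.triu(A, 1)                                    # keep i < j: hollow ...
    return A + A.T, z, B_tilde                           # ... and symmetric
```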

The algorithm can also be extended to directed and bipartite graphs, as noted in the sketch above. In this section, each element \(B_{k\ell }\) of \(\mathbf {B}\) was generated from a \(\text {Beta}(1.2,1.2)\) distribution, which produces communities with a moderate level of separation. For the given choice of K, the community allocation probabilities in the simulations were chosen to be \(\varvec{\theta }=(1/K,\dots ,1/K)\), providing balanced clusters.

If a stochastic blockmodel is simulated using Algorithm 1, all the parameters are known, and the fit of model (1) can be evaluated using the true underlying cluster allocations \(\varvec{z}\).

7.1.1 Empirical analysis of spectral embeddings

Figure 2 illustrates results for a synthetic undirected stochastic blockmodel with \(d=2\) and \(K=5\). Figure 2a shows the scatterplot of the first two columns of the adjacency embedding \(\mathbf {X}\), coloured using the true underlying communities. The plot shows well-separated clusters, which can be suitably modelled using a Gaussian mixture. Figure 2b shows the scatterplot of the next two dimensions. Clearly, the community mean locations are significantly different from zero in just the first two dimensions. This is further illustrated in Fig. 2c, which plots the empirical within-cluster means for each dimension, obtained using the known clustering \(\varvec{z}\). Fig. 2c gives empirical evidence that the form \([\varvec{\mu }_{z_i}^\top ,\varvec{0}^\top ]^\top \) for the mean of \(\varvec{x}_i\) in (1) suitably describes what is observed in SBMs. Similarly, Fig. 2d shows the empirical within-cluster variances for each dimension, again obtained using the known values of \(\varvec{z}\) from the simulation. In Fig. 2d, the difference between the within-cluster and overall variance is evident only in the first two dimensions, after which the quantities are of the same order of magnitude. The plot also shows that the within-cluster variances differ across communities, suggesting that it is appropriate to have cluster-specific values of \(\sigma ^2_{j,k}\) for \(j>d\); this phenomenon can also be witnessed in Fig. 2b, and Fig. 1 in Sect. 5. Nevertheless, if the MCMC sampler were run on the simulated data in Fig. 2, it could also be appropriate to use a second-level clustering with \(H=3\), since the variances of three of the five communities are approximately the same for dimensions larger than \(d=2\). Furthermore, for small m and fairly large n, it seems from Fig. 2d that the vector \(\varvec{\sigma }^2_k\) could be approximated by a constant \(\sigma ^2_k\) as in Yang et al. (2019). However, for small values of n and large m, as in Fig. 1, parameter vectors \(\varvec{\sigma }^2_k\) are clearly required. If a constant \(\sigma ^2_k\) is used, the inferential procedure is essentially identical, but (6) should be modified accordingly.

Figure 2e shows the histogram of the empirical correlation coefficients \(\rho _{ij}^{(k)},\ i,j=1,\dots ,m,\ i<j,\) for each community, obtained from the known \(\varvec{z}\). The cluster-specific correlation coefficients \(\rho _{12}^{(k)}\) between \(\mathbf {X}_1\) and \(\mathbf {X}_2\) are represented by bullet points in the plot, suggesting dependence for at least one of the clusters, confirming the result of Rubin-Delanchy et al. (2017), and providing further evidence that cluster-specific full covariance matrices \(\varvec{\Sigma }_{k}\) should be used for the covariance of \(\varvec{x}_i\) in (1). On the other hand, the empirical within-cluster correlations for \(\mathbf {X}_{d:}\) tend to be small and centred around 0, suggesting that the assumption of independence is appropriate in that part of the model in (1). Finally, Fig. 2f plots the marginal log-likelihood as a function of d, using the known cluster configuration \(\varvec{z}\). The marginal likelihood strongly favours the true value \(d=2\), resulting in a posterior distribution essentially consisting of a point mass at the true value.

Fig. 3

Simulated adjacency embedding for a bipartite \(250\times 300\) graph with \(K=5\) and \(K^\prime =3\), obtained from \(\mathbf {B}\in [0,1]^{K\times K^\prime }\) with \(B_{k\ell }\sim \text {Beta}(1.2,1.2)\), \(\varvec{\theta }=(1/K,\dots ,1/K)\), \(\varvec{\theta }^\prime =(1/K^\prime ,\dots ,1/K^\prime )\), and \(d=2\). For (f), \(m=50\)

Figure 3 shows similar results for a simulated bipartite graph with separate community structures for nodes as sources and destinations, with \(d=2\), \(K=5\) and \(K^\prime =3\). Again, the clusters in the scatterplot of \(\mathbf {X}_1^\prime \) and \(\mathbf {X}_2^\prime \) in Fig. 3a are well-separated and can be easily estimated using the Gaussian mixture model. Figure 3b, c show the empirical within-cluster means for each dimension, obtained using \(\varvec{z}\) and \(\varvec{z}^\prime \). The zero-mean assumption for the columns with index larger than d seems to hold even for a relatively small number of nodes per community. Figure 3d, e show the empirical within-cluster variances for each dimension. Again, it is confirmed that the variances are different for each community, even on the last \(m-d\) components. The within-cluster variances on the last \(m-d\) components also seem to be decreasing, showing that for small graphs the parameter vector \(\varvec{\sigma }^2_k\) could be preferable to a constant \(\sigma ^2_k\) as in Yang et al. (2019). In Fig. 3f, the marginal likelihood strongly favours the true value \(d=2\), which again results in a point mass posterior centred at the true value. Similar considerations hold for the correlations, with results which are similar to the plots in Fig. 2e for the undirected graph.

Table 1 Results of the inferential procedure for undirected SBMs simulated using different \((d,K)\) pairs

7.1.2 Model parameter estimation

Table 1 shows the results from inference for model (1) when applied to SBMs simulated with different values of d and K, for all the possible combinations of the models considered in this article. For each \((d,K)\) pair, an undirected SBM with nodes is generated, and the MCMC sampler is run three times, each with samples after a burn-in. The samplers are initialised using the guidelines discussed in the introduction to Sect. 7. The table reports the averaged values of d, \(K_\varnothing \) and \(H_\varnothing \) obtained from the MCMC chains. The posterior mean values of d and K are extremely close to the true values, implying that the results support the proposed model.

From Table 1, the performance of adjacency and Laplacian spectral embedding for recovering d, K and \(\varvec{z}\) seems to be equivalent. Similar results for \(\mathbf {A}\) and \(\mathbf {L}\) are also obtained for the application on real data in Sect. 7.2. Note that, for fixed values of K, Priebe et al. (2019) demonstrate that, in practical applications, LSE tends to better capture the affine structure in stochastic blockmodels, whereas ASE better identifies the core-periphery structure. Also, Table 1 shows that the constrained and unconstrained models seem to give an equivalent performance. The difference between the constrained model (2) and unconstrained model (3) will be more evident in Sects. 7.2 and 7.3, where the data might deviate from the stochastic blockmodel assumption. Overall, for synthetic data, the model seems robust and able to detect the correct d and K in a variety of different settings.

Table 2 Results for the MCMC sampler on simulated undirected SBMs for different values of m, with and without second-order clustering

7.1.3 Second-level clustering

It is also possible to evaluate the effect of the second-order clustering on the estimation of the community structure \(\varvec{z}\) and the parameters d and K. For the same simulated graph, the MCMC sampler has been run with and without the second-order clustering, using the same settings as the simulation in Table 1. Note that the absence of second-order clustering corresponds to the case \(H=K\). The procedure is repeated for different values of m, with n fixed, to study the effect of second-order clustering as m increases. The results of the simulations are summarised in Table 2 for two different simulated graphs. The table reports the maximum a posteriori (MAP) values of d and \(K_\varnothing \), and the adjusted Rand index (ARI) (Hubert and Arabie 1985) for the estimated clustering of the nodes, obtained by setting \(K_\varnothing \) to its MAP estimate. For simplicity, only the results for the unconstrained model using ASE are reported. For the case when H is unknown, the table reports the posterior mean of \(H_\varnothing \).

In all the simulation settings, the model was able to recover the correct value of d, but when m is large, only the second-order clustering model allows K to be estimated correctly. When the second-order clustering is not used and m is very large relative to d and K, the MAP estimate \({\hat{K}}_\varnothing \) tends to underestimate the true number of communities, and the inferred clustering structure is therefore also negatively affected. The table also shows that as \(m\rightarrow n\), the estimated value of \(H_\varnothing \) tends to decrease towards 1. Hence, to reduce the computational burden for large m, it would be possible to set \(H=1\), corresponding to the scenario \(\varvec{\sigma }^2_k=\varvec{\sigma }^2\ \forall k\), studied in Raftery and Dean (2006) in the context of variable selection in GMMs. Table 2 also suggests that the model with second-order clustering is robust to the choice of m. This is one of the main advantages of the proposed model: the correct d can be recovered for any choice of m, provided \(d\le m\). Choosing an upper bound m is easier than choosing the correct d, especially because of the robustness of the model. On the other hand, choosing large values of m makes the MCMC sampler computationally more expensive. A suitable compromise would be to set m based on common criteria for estimating d (for example, the profile likelihood criterion of Zhu and Ghodsi 2006, sketched below). For a given value \(d^\star \) obtained using such a criterion, it would be appropriate to choose \(m>d^\star \), and then correct the initial estimate of d using the proposed model.
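The Zhu and Ghodsi (2006) criterion mentioned above fits, for each candidate split point, two Gaussians with a common variance to the ordered scree values, and maximises the resulting profile log-likelihood. A minimal sketch under those assumptions (an illustration, not the code accompanying this article) follows.

```python
import numpy as np
from scipy.stats import norm

def zhu_ghodsi(vals):
    """Profile-likelihood elbow selection of Zhu and Ghodsi (2006):
    vals are eigenvalues or singular values, in any order."""
    vals = np.sort(vals)[::-1]
    p = len(vals)
    ll = np.full(p - 1, -np.inf)
    for q in range(1, p):
        s1, s2 = vals[:q], vals[q:]
        # pooled maximum-likelihood variance under a common-variance model
        var = (np.sum((s1 - s1.mean())**2) + np.sum((s2 - s2.mean())**2)) / p
        sd = max(np.sqrt(var), 1e-12)
        ll[q - 1] = (norm.logpdf(s1, s1.mean(), sd).sum()
                     + norm.logpdf(s2, s2.mean(), sd).sum())
    return int(np.argmax(ll)) + 1                # candidate d*

# e.g. d_star = zhu_ghodsi(np.linalg.svd(A, compute_uv=False)); choose m > d_star
```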

7.2 Undirected graphs: Santander bikes

The Santander Cycles hire scheme is a bike sharing system implemented in central London. Transport for London periodically releases data on the bike hires. Considering these data as a network, the nodes correspond to bike sharing stations, and an undirected edge between stations i and j is drawn if at least one ride between the two stations is completed within the time period considered. In this example, one week of data were considered, from 5 September until 11 September 2018. The total number of stations used is \(n=783\), and the resulting adjacency matrix is fairly dense. Note that it is possible to collect and return a bike at the same docking station; the corresponding self-loops, amounting to less than \(1\%\) of |E|, have been removed, since our modelling framework is specifically developed for hollow binary matrices.
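The preprocessing just described amounts to building a hollow, symmetric binary adjacency matrix from the journey records. A hedged sketch follows; the file name and column labels ('StartStation Id', 'EndStation Id') are assumptions about the TfL CSV layout, not details given in this article.

```python
import numpy as np
import pandas as pd

rides = pd.read_csv('journeys.csv')                          # hypothetical file name
src, dst = rides['StartStation Id'], rides['EndStation Id']  # assumed column labels
stations = pd.unique(pd.concat([src, dst]))
idx = {s: i for i, s in enumerate(stations)}
n = len(stations)

A = np.zeros((n, n), dtype=int)
for u, v in zip(src, dst):
    if u != v:                                   # drop self-loops (< 1% of |E|)
        A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1  # undirected, binary edge
```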

Fig. 4 Posterior distributions and scree-plots for the Santander bike network data using adjacency and Laplacian embeddings, for the unconstrained model (3). The dashed lines in (a) and (b), and the large bullet point in (c), represent MAP estimates of d

The results of the Bayesian inferential procedure, using the unconstrained prior (3) for d, applied to the adjacency and Laplacian embeddings for the Santander bike network are presented in Fig. 4. The initial value of K was set to 10, with \(m=25\), but similar estimates were obtained using different starting points for K and different values of m. It is interesting to note the different shapes of the posterior barplots of \(K_\varnothing \) and \(H_\varnothing \) in Fig. 4a, b, showing that the second-level clustering is crucial to obtain a more accurate model fit when the adjacency embedding is used. On the other hand, for the Laplacian embedding, the posteriors for \(K_\varnothing \) and \(H_\varnothing \) are extremely similar, suggesting that the second-level clustering is not required for \(m=25\). The MAP values \(d=11\) (adjacency) and \(d=12\) (Laplacian) correspond to the elbow in the scree-plots (see Fig. 4c for \(\mathbf {A}\)).

Note that, especially in the case of the adjacency embedding, d and K have similar values, suggesting that the network might be well described by a stochastic blockmodel. Similarly, the constrained model (1), with \(d\sim \text {Uniform}\{1,\dots ,K_\varnothing \}\), returns the same MAP estimates for d, but the constraint \(d\le K_\varnothing \) results in a larger number of small clusters. Since the posterior for d is approximately a point mass, the posterior of \(K_\varnothing \) essentially reduces to the probability mass function obtained from the unconstrained model, restricted to \(K_\varnothing \ge d\) and renormalised.
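This renormalisation argument is easy to make concrete. In the sketch below, the unconstrained pmf is a placeholder (the actual posterior values are not reproduced here); the point is only the mechanics of restricting to \(K_\varnothing \ge d\) and rescaling.

```python
import numpy as np

d_map = 11                                       # approximate point-mass posterior for d
K_support = np.arange(1, 26)
pmf_unconstrained = np.exp(-0.5 * (K_support - 12.0)**2 / 4.0)  # hypothetical shape
pmf_unconstrained /= pmf_unconstrained.sum()

# constrained posterior: zero out K < d_map, then renormalise
pmf_constrained = np.where(K_support >= d_map, pmf_unconstrained, 0.0)
pmf_constrained /= pmf_constrained.sum()
```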

Fig. 5 Santander bike sharing stations in London and maximum a posteriori estimates of the cluster allocations of the stations, obtained using hierarchical clustering with distance \(1-{\hat{\pi }}_{ij}\), \(K=11\). Stations in the same convex hull share the same cluster

The resulting estimated clustering for the unconstrained model (3), based on the adjacency embedding with \(K=11\) (equal to the MAP estimate of d), is plotted in Fig. 5 and shows a clear structure: with a few exceptions, the largest communities have approximately the same geographical extension. This is expected, since the bikes are free for the first 30 minutes and have limited speed, and are therefore mostly used for short journeys. Two clusters are significantly smaller than the others, corresponding to tourist areas around Westminster, Covent Garden and Buckingham Palace. On the other hand, two clusters have a large geographical extension, covering the East and West London areas. For the adjacency embedding, the MAP clustering obtained from the constrained model is almost identical. The PEAR method (Fritsch and Ickstadt 2009) suggests \(K=7\) communities instead. Similarly, if the Laplacian embedding is used, the MAP clustering structure suggested by PEAR has \(K=7\) communities for the unconstrained model (3) and \(K=12\) for the constrained model (1).
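The clustering step described in the caption of Fig. 5 can be sketched as follows: average the pairwise co-clustering indicators over the MCMC samples to obtain \({\hat{\pi }}_{ij}\), then cut a hierarchical clustering of the distance \(1-{\hat{\pi }}_{ij}\) into K groups. Average linkage is used here as an example; the linkage choice is an assumption, not a detail given in this article.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def map_clustering(samples, K):
    """samples: (T x n) numpy array of sampled allocation vectors z.
    Returns labels in 1..K from hierarchical clustering of 1 - pi_hat."""
    T, n = samples.shape
    pi_hat = np.zeros((n, n))
    for z in samples:                            # posterior similarity matrix
        pi_hat += (z[:, None] == z[None, :])
    pi_hat /= T
    dist = 1.0 - pi_hat
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(Z, t=K, criterion='maxclust')
```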

Fig. 6 Scree-plot of the singular values of \(\mathbf {A}\) and posterior distributions of \(K_\varnothing \) and \(H_\varnothing \) for the Enron data. The large bullet points in (a), and the dashed lines in (b)–(e), represent MAP estimates of d

7.3 Directed graphs: Enron email dataset

Next, the algorithm is applied to a directed network: the Enron email dataset. The Enron database is a corpus of emails sent by the employees of the Enron corporation. The version of the Enron data analysed here is described in Priebe et al. (2005), and consists of \(n=184\) nodes connected by directed edges: an edge \(i\rightarrow j\) is drawn if employee i sent an email to employee j.

The results of analysing this network are presented in Fig. 6. The initial value of K was set to 10, with \(m=25\), but again similar results were obtained using different starting points for K and different values of m. The plots in Fig. 6b, c report the estimated posterior distributions of \(K_\varnothing \) and \(H_\varnothing \) for the constrained (1) and unconstrained (3) models. Interestingly, the MAP estimate for d coincides with the MAP estimate for K in the unconstrained model, consistent with a full-rank matrix \(\mathbf {B}\). For the constrained model, the MAP for K exceeds the MAP for d by 1, allowing for \(\text {rank}(\mathbf {B})<K\). Overall, d and K have similar values, suggesting that the graph might be well described by a directed stochastic blockmodel.

The posterior distributions for \(K_\varnothing \) and \(H_\varnothing \) in Fig. 6b, c are fairly different, showing that the second-level clustering might be relevant for this model. Inference on the model without second-level clustering confirms this impression: the posteriors for \(K_\varnothing \), presented in Fig. 6d, e, have a more symmetric shape, and the MAP latent dimension is \(d = 9\). As before, the MAP for K is \(d+1=10\), providing some evidence for the possibility \(\text {rank}(\mathbf {B})<K\).

From Fig. 6a, the selected MAP values \(d=6\) and \(d=9\) for the models with and without second-level clustering seem to be a tradeoff between the two most popular criteria for selection of the appropriate latent dimension: the eigengap heuristic suggests \(d=5\) if the second largest difference is considered, and the elbow in the scree-plot is approximately located around \(d\approx 15\).
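Both heuristics mentioned here operate on the scree values in Fig. 6a. A brief sketch of the eigengap heuristic follows (an illustration only; order=2 selects the second largest difference, which the text above reports as giving \(d=5\) for these data).

```python
import numpy as np

def eigengap(vals, order=1):
    """Return the dimension preceding the order-th largest gap between
    consecutive sorted singular/eigenvalues (order=1: largest gap)."""
    vals = np.sort(vals)[::-1]
    gaps = vals[:-1] - vals[1:]                  # differences between consecutive values
    return int(np.argsort(gaps)[::-1][order - 1]) + 1

# e.g. eigengap(np.linalg.svd(A, compute_uv=False), order=2)
```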

8 Conclusion

In this article, a novel Bayesian model has been proposed for automatic and simultaneous estimation of the number of communities and latent dimension of stochastic blockmodels, interpreted as special cases of generalised random dot product graphs. The Bayesian framework allows the number of communities K and latent dimension d to be treated as random variables, with associated posterior distributions. The postulated model is based on asymptotic results in the theory of network embeddings and random dot product graphs, and has been validated on synthetic datasets, showing good performance at recovering the latent parameters and communities. The model has been extended to directed and bipartite graphs, using SVD embeddings and allowing for co-clustering.

Overall, the main advantage of the proposed methodology is to allow for an arbitrarily large value of m, the number of columns (dimension) of the embedding at the first stage of the analysis, and then to treat d and K separately, allowing for the case \(d=\hbox {rank}(\mathbf {B})<K\), which is often overlooked in the literature. Problems arising from overspecification of m are tackled using a second-level clustering procedure. The model also provides an automated procedure and criterion to select the dimension of the embedding and an appropriate number of communities. If d is not constrained to be less than or equal to K, the model additionally provides empirical evidence of the goodness-of-fit of a stochastic blockmodel for the observed data. Applications to real-world network datasets show encouraging results in recovering the correct d, when compared to commonly used heuristic methods, as well as the community structure.

9 Supplementary material

The Python code and datasets are available at https://www.github.com/fraspass/sbm.