Latent Space Approaches to Community Detection in Dynamic Networks

Embedding dyadic data into a latent space has long been a popular approach to modeling networks of all kinds. While clustering has been done using this approach for static networks, this paper gives two methods of community detection within dynamic network data, building upon the distance and projection models previously proposed in the literature. Our proposed approaches capture the time-varying aspect of the data, can model directed or undirected edges, inherently incorporate transitivity and account for each actor's individual propensity to form edges. We provide Bayesian estimation algorithms, and apply these methods to a ranked dynamic friendship network and world export/import data.


Introduction
Researchers are often interested in detecting communities within dyadic data. These dyadic data are represented as networks with a certain number of actors which can form amongst themselves relationships/connections called edges. Some examples of such data include social networks, collaboration networks, biological networks, food webs, power grids, and linguistic networks. These dyadic data can have directed or undirected edges, have zero-one or weighted edges, and can come in the form of static or dynamic (time-varying) networks. Clustering these data into communities can lead to a better understanding of the organization of the objects in the network and, for dynamic networks, of how this organization evolves over time. Xing et al. (2010) developed a dynamic mixed membership stochastic blockmodel. This work builds on the stochastic blockmodel (Holland et al., 1983), further developed into the mixed membership blockmodel (Airoldi et al., 2005). In the work of Xing et al. (2010), each actor has an individual (time-varying) membership probability vector and, based on this probability vector, can choose certain roles with which to interact with other actors. A different approach, to be taken in this paper, begins with the work of Hoff et al. (2002), where the actors are embedded either within a latent Euclidean space, referred to as the distance model, or within a hypersphere, referred to as the projection model. Handcock et al. (2007) used the distance model and performed community detection on static networks. In our models, at each time point each actor belongs to one of a fixed number G of clusters; this cluster assignment may change over time. We will denote the latent position of actor i at time t as X_it and the cluster assignment for actor i at time t as Z_it, a G-dimensional vector in which one element is 1 and the others are zero. We will also let X_t = (X_1t, ..., X_nt) and Z_t = (Z_1t, ..., Z_nt).
While the dependency structure of the model may vary, we assume throughout the paper that, given the latent positions X_t, Y_t and Y_s, s ≠ t, are conditionally independent; in many cases (such as binary networks) this assumption can be further extended such that y_ijt and y_i'j't are conditionally independent given X_t.
In community detection within a latent space approach, we use the decomposition

π({Y_t}, {X_t}, {Z_t}) = π({Y_t} | {X_t}) π({X_t} | {Z_t}) π({Z_t}).    (1)

The idea here is that the edge probabilities are determined by some underlying attributes which are captured in the latent variables. Thus if we detect a community in the network it is because there is a corresponding cluster of attributes. For example, if we see in a social network a group of close friends, this close group, or community, exists because these friends are similar in some fundamental ways, i.e., they have attributes that are clustered together.

Distance model
Within the context of the distance model, the network is embedded within a latent Euclidean space, where the probability of edge formation increases as the Euclidean distance between actors decreases. Let D(X_t) denote the n × n distance matrix constructed such that D(X_t)_ij = d_ijt = ‖X_it − X_jt‖. In general we will assume that the density of Y_t can be written as a function of the distance matrix D(X_t) and some set of likelihood parameters, which we will denote as θ. For example, the original likelihood for binary networks in Hoff et al. (2002) is

logit(P(y_ijt = 1 | X_t, θ)) = α − d_ijt,    (2)

where in this context θ = {α}. Variants of this likelihood have been proposed, such as in Sarkar and Moore (2005), Krivitsky et al. (2009), and Sewell and Chen (2015b); this last was then extended to account for a wide range of weighted networks in Sewell and Chen (2016). Other likelihoods may be better suited for various other types of weighted edges (see, e.g., Sewell and Chen, 2015a). Handcock et al. (2007) clustered static network data by clustering the latent positions via a normal mixture model. This cannot be directly applied to dynamic network data, since the latent positions must have some sort of temporal dependency imposed. We therefore propose applying the model-based longitudinal clustering model given by Sewell et al. (2016) to the latent positions. Our focus here is the modeling of the latent positions, which can then be used with whatever likelihood formulation is most appropriate to the data. We will now describe this model for the latent variables.
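To make the distance-model likelihood concrete, here is a minimal numerical sketch of the edge probability logit P(y_ijt = 1) = α − d_ijt (the value of α is illustrative, and this is not the authors' code):

```python
import math

def edge_prob_distance(x_i, x_j, alpha):
    """Hoff et al. (2002)-style distance-model edge probability:
    logit P(y_ijt = 1) = alpha - d_ijt, with d_ijt the Euclidean
    distance between the two latent positions."""
    eta = alpha - math.dist(x_i, x_j)
    return 1.0 / (1.0 + math.exp(-eta))

# Actors closer together in the latent space are more likely to form an edge.
p_near = edge_prob_distance((0.0, 0.0), (0.1, 0.0), alpha=1.0)
p_far = edge_prob_distance((0.0, 0.0), (3.0, 0.0), alpha=1.0)
```

Because the linear predictor is decreasing in the distance, p_near exceeds p_far for any fixed α.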
We make two assumptions on the latent positions and the cluster assignments. First, the cluster assignments are assumed to follow a Markov process, i.e.,

P(Z_it | Z_i1, ..., Z_i(t−1)) = P(Z_it | Z_i(t−1)).

Second, given the current cluster assignments and all previous cluster assignments and latent positions, we assume the current latent positions depend only on the previous latent positions and the current cluster assignments, i.e.,

π(X_t | X_1, ..., X_(t−1), Z_1, ..., Z_t) = π(X_t | X_(t−1), Z_t).

The joint density of the latent positions and the cluster assignments is then given as

π({X_t, Z_t}_{t=1}^T) = ∏_{i=1}^n [ ∏_{g=1}^G ( β_0g N(X_i1 | µ_g, Σ_g) )^{Z_i1g} ∏_{t=2}^T ∏_{h=1}^G ∏_{k=1}^G ( β_hk N(X_it | λµ_k + (1 − λ)X_i(t−1), Σ_k) )^{Z_i(t−1)h Z_itk} ],    (3)

where N(X | µ, Σ) is the normal density with mean vector µ and covariance matrix Σ evaluated at X. Thus the communities are each modeled as a multivariate normal distribution in the latent space, with mean µ_g and covariance matrix Σ_g. Since these refer to the location and shape of the g-th community in the latent network space, we will refer to µ_g and Σ_g as the g-th community location and community shape respectively. The mean of the latent position X_it is then modeled as λµ_g + (1 − λ)X_i(t−1), λ ∈ (0, 1), which is a blending of the current cluster effect µ_g with the individual temporal effect X_i(t−1); hence we will refer to λ as the blending coefficient. The β_0g's determine the probabilities of initially belonging to the g-th community, and the β_hk's determine the probabilities of transitioning from the h-th community to the k-th community. We will therefore refer to the vectors β_0 = (β_01, ..., β_0G) and β_h = (β_h1, ..., β_hG), h = 1, ..., G, respectively as the initial clustering parameter and the transition parameter for group h. Cox and Cox (1991) and Banerjee et al. (2005) gave many contexts in which there has been empirical evidence that embedding data onto a hypersphere and/or using cosine distances is preferable to a Euclidean space/distances. Here we continue this tradition by embedding dynamic network data onto the hypersphere.
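Returning to the distance-model dynamics described above (Markov cluster transitions governed by β_0 and the β_h's, with latent positions blended via λ), the generative process for a single actor can be sketched as follows; all parameter values are illustrative, not taken from the paper:

```python
import random

def simulate_actor_path(T, mus, beta0, beta, lam, sigma, rng):
    """Sketch of the latent dynamics for one actor in the distance model:
    the cluster assignment follows a Markov chain (beta0 initial, beta
    transition rows), and X_t is drawn around lam*mu_g + (1-lam)*X_{t-1},
    where lam is the blending coefficient. Spherical noise (sigma) stands
    in for the community shape matrices."""
    G = len(mus)
    z = rng.choices(range(G), weights=beta0)[0]
    x = [rng.gauss(m, sigma) for m in mus[z]]
    path = [(z, x)]
    for _ in range(1, T):
        z = rng.choices(range(G), weights=beta[z])[0]
        x = [rng.gauss(lam * m + (1 - lam) * xi, sigma)
             for m, xi in zip(mus[z], x)]
        path.append((z, x))
    return path

rng = random.Random(0)
mus = [(-1.0, 0.0), (1.0, 0.0)]      # community locations (illustrative)
beta0 = [0.5, 0.5]                   # initial clustering parameter
beta = [[0.9, 0.1], [0.1, 0.9]]      # "sticky" transition parameters
path = simulate_actor_path(T=10, mus=mus, beta0=beta0, beta=beta,
                           lam=0.8, sigma=0.1, rng=rng)
```

With sticky rows in beta, the actor tends to stay near one community location, drifting toward the other only after a cluster transition.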
Projection model

In this section we assume the more specific, but most commonly encountered, context of directed binary edges (the model to be proposed can be simplified for undirected edges). In the projection model, every actor is embedded within some latent hypersphere; the probability of an edge forming between two actors depends on the angle, rather than the Euclidean distance, between them. Thus it is the angle between any two actors that represents the "closeness" of the actors. Though the latent space is strictly a Euclidean space R^p rather than a hypersphere, it is more helpful to think of the positions within R^p as unit vectors on a (p − 1)-dimensional hypersphere, with individual edge propensities reflected in the magnitudes of the latent positions. Our proposed likelihood of the adjacency matrices adapts the likelihood of the projection model originally proposed by Hoff et al. (2002), and extends Durante and Dunson (2014) to allow for directed edges. The specific form of the likelihood is given as

P(Y_t | X_t, θ) = ∏_{i≠j} p_ijt^{y_ijt} (1 − p_ijt)^{1 − y_ijt},    (4)

logit(p_ijt) = α + s_j X_it'X_jt    (5)
             = α + ‖X_it‖ (s_j ‖X_jt‖) cos(φ_ijt),    (6)

where φ_ijt is the angle between X_it and X_jt; in this context θ = {α, s}, where α reflects a baseline edge propagation rate and s = (s_1, ..., s_n) is a vector of actor-specific parameters that reflect how the tendency of the actors to receive edges relates to the tendency to send edges. We therefore refer to s as the receiver scaling parameters. While (5) is simpler, (6) makes it clear how the probability of an edge from i to j is made up of some constant plus the product of the sending effect of i, the receiving effect of j, and the closeness between i and j in the latent space, as measured by the cosine of the angle between the two actors. The question remains as to how to perform clustering. With the projection model the latent positions are embedded within a hypersphere, and thus the clustering must be done in a fundamentally different way than for the distance model. Since we would expect a group of highly connected actors to have small angles between them all, we propose clustering based on the angles of the actors' latent positions.
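As a numerical illustration of the projection-model edge probability, the following sketch assumes the logit form α + s_j X_it'X_jt described above (parameter values are illustrative, and this is not the authors' code):

```python
import math

def edge_prob_projection(x_i, x_j, alpha, s_j):
    """Sketch of the projection-model edge probability:
    logit P(y_ijt = 1) = alpha + s_j * <x_it, x_jt>
                       = alpha + |x_it| (s_j |x_jt|) cos(phi_ijt)."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    return 1.0 / (1.0 + math.exp(-(alpha + s_j * dot)))

# A small angle (aligned directions) yields a much higher edge probability
# than a large angle (opposed directions), all else being equal.
p_aligned = edge_prob_projection((2.0, 0.0, 0.0), (2.0, 0.0, 0.0),
                                 alpha=-5.0, s_j=1.0)
p_opposed = edge_prob_projection((2.0, 0.0, 0.0), (-2.0, 0.0, 0.0),
                                 alpha=-5.0, s_j=1.0)
```

Holding the magnitudes fixed, only the cosine of the angle changes between the two calls, which is exactly the "closeness" term in (6).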
We first assume that the latent positions follow a hidden Markov model, with the cluster assignments as the hidden states. That is, the cluster assignments follow a Markov process (i.e., given Z_i(t−1), Z_it is conditionally independent of Z_i(t−s) for any s > 1), and, given the cluster assignments Z_t, the latent positions X_t are assumed to be conditionally independent of X_s for any s ≠ t.
imsart-generic ver. 2014/10/16 file: DNC14.tex date: May 19, 2020

The joint density on the latent positions and cluster assignments is given as

π({X_t, Z_t}_{t=1}^T) = ∏_{i=1}^n [ ∏_{g=1}^G β_0g^{Z_i1g} ∏_{t=2}^T ∏_{h=1}^G ∏_{k=1}^G β_hk^{Z_i(t−1)h Z_itk} ] ∏_{t=1}^T ∏_{i=1}^n ∏_{g=1}^G N(X_it | r_i u_g, τ_i^{−1} I_p)^{Z_itg},    (7)

where I_p is the p × p identity matrix. As with the distance model of Section 2.1, the communities are modeled as multivariate normal distributions within the latent space. Here r = (r_1, ..., r_n), the radii of the means of the X_it's, are individual effects representing the individual propensities to send edges; hence we refer to r as the sender propensities. u_g is the unit vector corresponding to the direction of the g-th community, and hence we refer to the u_g's as the community directions. τ = (τ_1, ..., τ_n) are the precision parameters, and β_0 = (β_01, ..., β_0G) and β_h = (β_h1, ..., β_hG), h = 1, ..., G, are again respectively the initial clustering parameter and the transition parameter for group h. From (7) we can see how the different aspects of the network are captured in the joint density of {X_t}_{t=1}^T and {Z_t}_{t=1}^T. The clusters are completely determined by the community directions u_g. Thus if two actors belong to the same cluster then they have the same mean direction, and therefore the model will deem these two actors as similar (based on the cosine of their angle). The permanence and transience of the clusters are captured in the transition parameters β_h, h = 1, ..., G. The individual effects are captured by the sender propensities r and the receiver scaling parameters s. To see this more clearly, notice that the square of the individual sending effect (and of the scaled individual receiving effect), ‖X_it‖², has mean pτ_i^{−1} + r_i²; under the quite reasonable assumption that an increase in r_i does not imply a decrease in τ_i^{−1} (we would expect the opposite to occur), we see that r_i has a direct effect on the individual effect. The difference between individual i's sending and receiving effects is given by the i-th receiver scaling parameter s_i.
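The moment calculation above can be checked numerically: for X ~ N(r u, τ^{−1} I_p) with ‖u‖ = 1, E[‖X‖²] = pτ^{−1} + r². The following Monte Carlo sketch verifies this (illustrative parameter values; u is taken along the first coordinate axis without loss of generality):

```python
import random

def sq_norm_mean(p, tau, r):
    """E[||X||^2] for X ~ N(r*u, tau^{-1} I_p), ||u|| = 1:
    p * tau^{-1} + r^2, as stated in the text."""
    return p / tau + r ** 2

rng = random.Random(1)
p, tau, r = 3, 175 / 2.3 ** 2, 2.3   # illustrative values
draws = []
for _ in range(20000):
    # Mean r along the first axis, zero elsewhere; variance 1/tau per axis.
    x = [rng.gauss(r if d == 0 else 0.0, (1.0 / tau) ** 0.5) for d in range(p)]
    draws.append(sum(xi * xi for xi in x))
mc_estimate = sum(draws) / len(draws)
```

The Monte Carlo average agrees closely with the closed-form value pτ^{−1} + r².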
Note that the parameterization (5) of the likelihood (4) is not identifiable, as s and X_t can be scaled arbitrarily. The estimation is done within a Bayesian framework, however, and thus by fixing the hyperparameters corresponding to the priors on the unknown parameters, the posterior distribution is identifiable.

Estimation
Our estimation is done within the Bayesian framework, with the goal of finding the maximum a posteriori (MAP) estimators of the unknown parameters and latent positions.

MCMC for the distance model
We propose a Markov chain Monte Carlo (MCMC) method to obtain posterior modes to estimate the latent positions and model parameters of the distance model given in Section 2.1. Specifically, we implement a Metropolis-Hastings (MH) within Gibbs sampler.
We assign priors to the model parameters and latent positions; with the exception of the latent positions and θ, these priors are conjugate with respect to the full conditional distributions, which are given in the supplementary material. For the latent positions, MH steps are necessary. The context-specific form of the likelihood will determine whether the likelihood parameters θ can be sampled directly or whether the user needs to implement MH steps here as well (see Sections 4.1 and 5.1 for examples).

Variational Bayesian inference for the projection model

Polson et al. (2013) gave a data augmentation scheme for logistic models by utilizing the Pólya-Gamma distribution. This scheme starts by introducing a random variable ω_ijt which, given η_ijt, follows PG(1, η_ijt), where PG(b, c) denotes the Pólya-Gamma distribution with parameters b > 0 and c ∈ R. This auxiliary variable ω_ijt is conditionally independent of y_ijt given η_ijt. Polson et al. show that the conditional joint density of y_ijt and ω_ijt can be written as

f(y_ijt, ω_ijt | η_ijt) ∝ exp{ (y_ijt − 1/2) η_ijt − ω_ijt η_ijt² / 2 } PG(ω_ijt | 1, 0),

where PG(ω|b, c) is the Pólya-Gamma density with parameters b and c evaluated at ω. This data augmentation leads to tractable forms for the full conditional distributions of the model parameters and latent positions, leading to efficient and accurate estimation for binary data using Gibbs sampling (Choi and Hobert, 2013), the EM algorithm (Scott and Sun, 2013) and, as we will show here, variational Bayes (VB) approaches. Using Polson et al.'s work we may either implement a Gibbs sampler, as each full conditional distribution belongs to a well known family from which we can sample, or alternatively we may implement a mean field VB algorithm. Unlike an MCMC approach, which obtains samples approximately from the posterior distribution, the VB algorithm here iteratively finds an approximation to the posterior density π({X_t, Z_t}_{t=1}^T, θ, θ_p | {Y_t}_{t=1}^T), where θ_p denotes all the remaining model parameters corresponding to the prior on {X_t, Z_t}_{t=1}^T. Using the mean field VB implies that we are finding a factorized approximation Q of the posterior which minimizes the Kullback–Leibler divergence between the true posterior and Q. This factorized form will be given shortly.
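A key ingredient that the Pólya-Gamma augmentation contributes to the mean field updates is that each ω_ijt is replaced by its expectation, which has a closed form: for ω ~ PG(b, c), E[ω] = (b / 2c) tanh(c / 2). The small sketch below encodes this generic fact about the PG family (it is not code from the paper):

```python
import math

def pg_mean(b, c):
    """Mean of a Polya-Gamma PG(b, c) random variable:
    E[omega] = (b / (2c)) * tanh(c / 2), with limit b/4 as c -> 0."""
    if abs(c) < 1e-8:
        return b / 4.0
    return (b / (2.0 * c)) * math.tanh(c / 2.0)

# In the VB updates each omega_ijt is replaced by E[omega_ijt] under
# q(omega_ijt) = PG(1, c_ijt), turning the logistic likelihood into a
# quadratic (Gaussian-like) form in eta_ijt.
E_omega = pg_mean(1.0, 2.0)
```

This is what makes each factorized update available in closed form rather than requiring numerical optimization.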
VB procedures have been gaining popularity in large part due to their greatly decreased computational cost in comparison with most sampling methods. Salter-Townshend and Murphy (2013) applied VB to the static latent space cluster model for networks given by Handcock et al. (2007) (which is a static form of the distance model). Within this iterative scheme, the factorized distributions of the latent positions and many of the model parameters required numerical optimization techniques, as a closed form analytical solution was unavailable.
By utilizing the projection model as described in Section 2.2, however, we can find closed form solutions for each iteration, thereby reducing the computational cost involved in the estimation algorithm.
We place conjugate priors on the unknown model parameters. To estimate the posterior π({X_t, Z_t}_{t=1}^T, θ, θ_p | {Y_t}_{t=1}^T), we use the factorized approximation Q, which takes the form

Q({X_t, Z_t}_{t=1}^T, Ω, θ, θ_p) = q({X_t}_{t=1}^T) q({Z_t}_{t=1}^T) q(Ω) q(θ) q(θ_p),    (22)

where Ω = {ω_ijt}_{t, i≠j}. Using these priors, the factorized distributions on the right hand side of (22) all belong to well known families of distributions. The exact forms are given in the supplementary material. Of interest is the computational time required for our proposed methods, and in particular how the VB algorithm decreases the computational time required. We recorded the times required to implement both our VB approach (500 iterations) and the corresponding Gibbs sampler (50,000 samples drawn), letting n be 100, 200, 400, 600, 800, and 1,000. The times are given graphically in Figure 1. From this we can see that the VB algorithm yields a drastic reduction in computational cost. We will see in Section 4, however, that the performance of the VB approach and the Gibbs sampler are very similar.

Initialization
Our context involves a high dimensional estimation problem, and so how we initialize the MCMC or the VB algorithm plays a non-negligible role in the performance. We performed a small sensitivity analysis of the starting conditions of our algorithms, the details of which can be found in the supplementary material. The results indicated that under some conditions the VB algorithm for the projection model can be sensitive to the initialization scheme, though it did not appear that either of the MCMC algorithms (the Gibbs sampler for the projection model and the MH within Gibbs sampler for the distance model) were particularly sensitive. The full details on how we initialized the algorithms are given in the supplementary material.

Number of communities
An implicit challenge underlying the previous discourse is that in practice we do not in general know the number of communities G. We found the strategy given by Handcock et al. (2007) to be quite successful in our simulation study (see Section 4.3). We briefly summarize this method and refer the interested reader to the original source for more details.
Rather than estimating the integrated likelihood π({Y_t}_{t=1}^T | G) as would typically be done, we instead consider the joint distribution of the observed network data and the unobserved latent positions, using our MAP estimators {X̂_t}_{t=1}^T as the fixed values of the latent positions, i.e., π({Y_t}_{t=1}^T, {X̂_t}_{t=1}^T | G). We can rewrite this as

π({Y_t}_{t=1}^T, {X̂_t}_{t=1}^T) = ∫ π({Y_t}_{t=1}^T | {X̂_t}_{t=1}^T, θ) π(θ) dθ × ∫ π({X̂_t}_{t=1}^T | θ_p) π(θ_p) dθ_p,    (23)

where all distributions are implicitly conditioning on G. The two integrals on the right hand side of (23) can each be estimated via the Bayesian information criterion (BIC), thus allowing us to find the BIC approximation of π({Y_t}_{t=1}^T, {X̂_t}_{t=1}^T | G). Rather than using maximum likelihood estimators of θ and θ_p in computing the BICs, we used the MAP estimators, as was also done in, e.g., Fraley and Raftery (2007). We remark that for the projection model, since the posterior modes found by the VB algorithm and the Gibbs sampler perform comparably (see Section 4.1), this BIC model selection method is still valid for the VB estimates. This is because we only need the posterior mode, and hence any inaccuracies in the posterior variances/covariances of the parameters induced by approximating the posterior distribution with the VB factorized distribution will not affect the BIC criterion. One last note is that we utilized recursive relations identical or similar to those given in Sewell et al. (2016) in order for the number of terms required to compute π({X_t}_{t=1}^T | θ_p) to be linear, rather than exponential, in T.
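The two-part BIC combination described above can be sketched as follows; the function and argument names are hypothetical, and the conventions for counting parameters and observations are illustrative:

```python
import math

def bic(loglik_at_mode, n_params, n_obs):
    """Standard BIC approximation of an integral:
    -2 * loglik + n_params * log(n_obs) (lower is better here)."""
    return -2.0 * loglik_at_mode + n_params * math.log(n_obs)

def combined_bic(loglik_y_given_x, k_theta, n_edges,
                 loglik_x, k_theta_p, n_latent):
    """Sketch of the model selection in the text: approximate each of the
    two integrals on the right hand side of (23) by its own BIC and sum."""
    return (bic(loglik_y_given_x, k_theta, n_edges)
            + bic(loglik_x, k_theta_p, n_latent))
```

One would compute this combined criterion for each candidate G and select the G with the best value.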

Method evaluation
We simulated 200 binary networks, each with n = 100 actors and T = 10 time points. These 200 data sets were subdivided evenly in two different ways. First, half of the data sets were generated according to the distance model and the other half via the projection model. Second, half of the data sets had sticky cluster transition probabilities, letting the β_hh's take large values (recall that β_h is the transition parameter for group h), while the other half had more transitory transition probabilities, letting the β_hh's take more moderate values. In summary, we had 50 data sets from the distance model with sticky transition probabilities, 50 from the distance model with transitory transitions, 50 from the projection model with sticky transition probabilities, and 50 from the projection model with transitory transitions. Details on how the data were generated will be given shortly.
We compared the various methods in four ways. The first was to evaluate how well each model explains the data used to fit it. To this end we obtained in-sample edge predictions and computed the AUC (area under the receiver operating characteristic curve); a value of one implies a perfect fit, whereas a value of 0.5 implies that the predictions are no better than random. As a good in-sample fit may be due to overfitting the data, we also looked at one step ahead predictions: we obtained one step ahead predicted probabilities and computed their correlation with the true one step ahead probabilities. We stress, however, that prediction is not the primary purpose of this methodology; the goal is rather to accurately recover hidden communities in the network. We thus compared the true clustering assignments with the estimated clustering assignments using two measures. The first is the corrected Rand index (CRI), which can be viewed as a measure of misclassification. Values close to 1 indicate nearly identical clustering assignments, and values near zero indicate what one might expect from two random clustering assignments. Second, we computed the variation of information (VI) (Meilă, 2003). The VI is a true metric, and hence a smaller VI value implies that the two clusterings being compared are closer to being identical.
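Both clustering comparison measures can be computed from the contingency table of the two label vectors; the following is a self-contained sketch (not the code used in the study):

```python
import math
from collections import Counter

def _comb2(n):
    return n * (n - 1) // 2

def corrected_rand(a, b):
    """Corrected (adjusted) Rand index between two clusterings a and b."""
    n = len(a)
    nij = Counter(zip(a, b))            # contingency table counts
    ai, bj = Counter(a), Counter(b)     # marginal cluster sizes
    sum_ij = sum(_comb2(v) for v in nij.values())
    sum_a = sum(_comb2(v) for v in ai.values())
    sum_b = sum(_comb2(v) for v in bj.values())
    expected = sum_a * sum_b / _comb2(n)
    max_index = (sum_a + sum_b) / 2.0
    return (sum_ij - expected) / (max_index - expected)

def variation_of_information(a, b):
    """Variation of information (Meila, 2003): VI = H(a) + H(b) - 2 I(a, b);
    zero iff the clusterings are identical, smaller means closer."""
    n = len(a)
    nij = Counter(zip(a, b))
    ai, bj = Counter(a), Counter(b)
    vi = 0.0
    for (x, y), c in nij.items():
        p = c / n
        vi -= p * (math.log(c / ai[x]) + math.log(c / bj[y]))
    return vi
```

Both measures are invariant to relabeling the clusters, which is why they are suitable for comparing estimated assignments against the truth.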
For each of the 200 simulations we compared six methods. The first two are the VB algorithm and the Gibbs sampler for the projection model. The third is the distance model; here we used the likelihood formulation found in the dynamic latent space model of Sewell and Chen (2015b), given as

logit(P(y_ijt = 1 | X_t, θ)) = β_IN (1 − d_ijt / s_j) + β_OUT (1 − d_ijt / s_i),    (24)

where β_IN and β_OUT are global parameters that reflect the relative importance of popularity and activity respectively, the s_i's are actor-specific parameters that reflect the tendency to send and receive edges, and d_ijt is the distance between actors i and j within the latent Euclidean space at time t. Estimation is done by putting a bivariate normal prior on (β_IN, β_OUT), a Dirichlet prior on the s_i's, and incorporating these parameters into the MH within Gibbs MCMC algorithm of Section 3.1. The fourth and fifth methods were the clustering models of Handcock et al. (2007) and of Krivitsky et al. (2009), implemented in the latentnet R package (Krivitsky and Handcock, 2008, 2015). These latter two models cluster static networks via a latent space approach; to apply them to dynamic networks, clustering was performed at each time point and then combined sequentially using the relabeling algorithm given in Papastamoulis and Iliopoulos (2010). Note that these two methods, being static models, cannot be used to perform one step ahead predictions. Lastly we used the temporal exponential random graph model (TERGM) (Hanneke et al., 2010), as implemented in the btergm R package (Leifeld et al., 2015). The terms we specified for the TERGM were the total number of edges in the network, the number of reciprocated edges, the number of transitive triples, the number of cyclic triples, in-degrees and out-degrees, the number of lagged reciprocated edges, and the stability of the network. Note that this method can be used to determine in-sample predictions and one step ahead predictions, but has no functionality for determining cluster assignments.
All MCMC methods were used to obtain 50,000 samples. For the data sets generated according to the distance model, we set the blending coefficient λ = 0.8, the dimension of the latent space p = 2, the total number of clusters G = 6, and the likelihood parameters β_IN = 0.3 and β_OUT = 0.7. We set the community locations µ_g to be (−0.03, 0), (−0.01, 0), (0.01, 0), (0.03, 0), (0, 0.02), and (0, −0.02). We drew the community shapes Σ_g, g = 1, ..., G, from W^{−1}(13, (1 × 10^{−5}) I_2), and the initial clustering parameter β_0 ∼ Dir(10, ..., 10); for h = 1, ..., 6, the transition parameter β_h for group h was set to be proportional to max_j(1/‖X_j1‖). Finally, the adjacency matrices were simulated according to (24). This led to an average density of the simulated networks (taken over all time points of all simulations) of 0.221 and 0.222 for sticky and transitory transition probabilities respectively. The average modularity (again averaged over all time points of all simulations) was 0.299 and 0.287 for sticky and transitory transition probabilities respectively, giving a measure of how well separated the clusters are. Specifically, the modularity (originally defined by Clauset et al., 2004, for undirected networks) as implemented in the igraph package (Csardi and Nepusz, 2006) is

(1 / (2S_t)) Σ_{i≠j} [ Y*_ijt − k_it k_jt / (2S_t) ] 1{Z_it = Z_jt},

where S_t is the number of edges in the network at time t, Y*_t is the n × n symmetric adjacency matrix constructed by setting Y*_ijt = Y*_jit = (Y_ijt + Y_jit)/2, and k_it is the average of the in-degree and out-degree of actor i at time t. For comparison, an Erdős–Rényi graph with comparable density has on average a modularity of 0.076, and a network consisting of 5 fully connected subgraphs which are fully disconnected from one another has a modularity of 0.8 (and a density of 0.19).
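The modularity quantity just described can be sketched directly (a pure-Python version of the symmetrized Newman-style modularity; the toy network is illustrative):

```python
def modularity(Y, z):
    """Modularity on the symmetrized matrix Y*_ij = (Y_ij + Y_ji)/2,
    with k_i the i-th row sum of Y* (the average of in- and out-degree)
    and 2S the sum of all entries of Y*."""
    n = len(Y)
    Ys = [[(Y[i][j] + Y[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    k = [sum(row) for row in Ys]
    two_S = sum(k)
    return sum((Ys[i][j] - k[i] * k[j] / two_S) / two_S
               for i in range(n) for j in range(n) if z[i] == z[j])

# Two connected pairs with no edges between them: the partition that
# matches the pairs scores much higher than one that splits them.
Y = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
q_good = modularity(Y, [0, 0, 1, 1])
q_bad = modularity(Y, [0, 1, 0, 1])
```

Well-separated clusters give modularity well above the near-zero values expected from an Erdős–Rényi graph of the same density.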
For the data sets generated according to the projection model, we set the total number of clusters G = 6, the dimension of the latent space p = 3, the baseline propagation rate α = −5, the initial clustering parameter β_0 = (1/6, ..., 1/6), and the community directions u_g, specified in spherical coordinates with angles in degrees. For h = 1, ..., 6, the transition parameter β_h was set to be proportional to (exp(const · u_h'u_1), ..., exp(const · u_h'u_6)). For sticky transition probabilities we set the constant equal to 8, which yields probabilities from 0.68 to 0.96 of remaining in the same cluster, and for transitory transition probabilities we set the constant equal to 5, which yields probabilities from 0.52 to 0.83 of remaining in the same cluster. For i = 1, ..., 100, we simulated the receiver scaling parameters s_i ∼ N(1, 0.15), the sender propensities r_i ∼ N(2.3, 0.05²), and set the precision parameters τ_i = 175/r_i². The cluster assignments {Z_t}_{t=1}^T and latent positions {X_t}_{t=1}^T were drawn according to (7). Finally, the adjacency matrices were simulated according to (4) and (5). This led to an average modularity of 0.305 and 0.279 for sticky and transitory transition probabilities respectively. The average density of the simulated networks was 0.183 and 0.191 for sticky and transitory transition probabilities respectively. Table 1 gives the simulation results. The AUC values show that the TERGM fits the data poorly, but all the other methods fit rather comparably. However, looking at the CRI and VI we see that the static methods are overfitting: they provide good predicted probabilities for the observed data used to fit the model, but do not do so well at capturing the underlying truth. The correlations between the estimated one step ahead probabilities and the true probabilities are much higher for our methods than for the TERGM.
Note that both the projection model and the distance model provide good predictive performance regardless of the true geometry of the latent space and regardless of the cluster transition probability matrix. Once we start looking at the CRI and the VI, which are of primary importance with respect to the goals of the proposed work, we notice several things. First, when using the projection model, the VB algorithm and the Gibbs sampler yield very similar performance. Second, when the geometry of the latent space is misspecified, our proposed models still perform quite well, and in fact perform similarly to the correctly specified model. Lastly, we note that the performance of the static methods deteriorates when the probabilities of changing clusters increase. The VI values are also given graphically in Figure 2, visually demonstrating the performance disparities between the dynamic and static methods.

Table 1: Simulation results from data generated according to the distance and projection models with both sticky and transitory cluster transition probabilities. The median values are reported, with standard deviations in parentheses. The AUC corresponds to the data used to fit the model, the Correlation (one step ahead) values correspond to the correlation between the estimated probabilities and the true probabilities, and the CRI and VI are the corrected Rand index and variation of information respectively between the true and estimated cluster assignments.

Sensitivity study
It is not obvious how to choose the values of the hyperparameters from Section 3. In the above simulation study as well as in Section 5, we used an automatic selection method for these hyperparameters, the details of which can be found in the supplementary material. It is important, however, to determine how sensitive the estimation procedures are to the choice of hyperparameters. To this end we analyzed 100 data sets simulated according to the projection model and 100 according to the distance model, in each case fitting the data using the model with the correct geometry. Each set of 100 data sets was evenly divided between sticky and transitory cluster transition probabilities. For each simulation we evaluated the clustering performance using CRI and VI.
For each simulation we set the hyperparameters in the following way. For the distance model, we drew ν_λ ∼ Unif(0.5, 1) and fixed ξ_λ = 1, fixed a = 3 and drew b ∼ Unif(0.01, 0.05), and fixed c = 1.001 and drew d ∼ N(10, 2.5²). For the projection model, we drew c ∼ Γ(20, 0.5), a*_2 ∼ N(600, 100²), and b*_2 ∼ Γ(1, 0.05), and fixed b*_3 = 100. Table 2 provides the results from this sensitivity analysis. From this we see that the projection models still perform quite well, although the Gibbs sampler for the projection model has a larger standard deviation of the performance measures. What we should immediately notice, however, is the appalling performance of the distance model when ξ_λ = 1. Upon closer inspection we noticed that the parameter estimates of the blending coefficient λ were in nearly all cases very close to zero, which means that the model was not using much of the cluster information to predict the latent positions. As a remedy, we altered this part of the sensitivity analysis, drawing ν_λ ∼ Unif(0.7, 0.95) and fixing ξ_λ = 5 × 10^{−4}, thereby setting a very low prior probability that λ is small. With this alteration we see from Table 2 that the clustering performance is quite satisfactory. In summary, the estimation methods are not particularly sensitive to the selection of hyperparameters, with the exception of those associated with λ.

Table 2: Simulation results testing prior sensitivity for data generated according to the distance and projection models with both sticky and transitory cluster transition probabilities. The median values are reported, with standard deviations in parentheses.

Fig 2:
The Variation of Information (VI) values from the simulation study are given here graphically, separated by the underlying true geometry (distance/projection) and the type of transition (sticky/transitory).

BIC model selection
The last simulation study evaluates the BIC method described in Section 3.4. Due to the increased computational cost of fitting the model for several values of G, we generated 15 data sets each from the distance model and the projection model (30 total). We fitted both the distance and projection models to each data set for G ∈ {3, ..., 9}, and selected the G with the optimal BIC value. One important comment is that the BIC method of Section 3.4 is not appropriate for selecting the geometry of the latent space, i.e., for choosing whether we should use the distance or the projection model. Instead we used the deviance information criterion (DIC) (Spiegelhalter et al., 2002) to make this distinction. We originally attempted to use DIC to choose both the geometry and the number of clusters, but DIC performed extremely poorly at determining G. DIC was, however, perfect at selecting the geometry (in this simulation study) once the optimal number of clusters had been chosen (via BIC). Therefore, based on this simulation study, we recommend to the practitioner the admittedly inelegant procedure of first using the BIC (as described in Section 3.4) to choose G for each geometry, and then using DIC to compare the two resulting models with differing geometries. Figure 3 provides the results. As mentioned above, DIC perfectly selected the geometry, and so we only present the BIC values for the model with the correctly specified geometry for varying G. Specifically, Figure 3 gives the average ranking of the BIC values, where low rankings indicate better BIC values. From this we see that the true number of clusters (6) is frequently chosen as the optimal number of clusters, and values of G far from the truth rank poorly.

Newcomb's fraternity data
Newcomb (1956) discussed data collected on 17 male college students who were previously unknown to each other. These 17 students, as part of Newcomb's study, agreed to live together for sixteen weeks (though the data set excludes the ninth week due to school vacation). For each week, every student ranks the other 16 students from 1 (most favored) to 16 (least favored). In this context Y_t is the t-th n × n adjacency matrix whose i-th row, denoted y_it, records how the i-th actor ranks the other n − 1 actors. Without loss of generality, assume that the rankings go, in order of most favored to least favored, from 1 to n − 1. Then we let o_it = (o_i1t, o_i2t, . . . , o_i(n−1)t) denote the (n − 1) × 1 vector which is the ordering of the rank vector y_it (e.g., if y_1t = (0, 4, 3, 1, 2) then o_1t = (4, 5, 3, 2)). We assume that, conditioning on (X_t, Ψ), y_it is independent of y_i′t for i ≠ i′.
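The conversion from a rank vector y_it to its ordering o_it can be sketched as:

```python
def ordering(y, i):
    # y: actor i's rank vector (1-based ranks; the entry for actor i itself is 0)
    # returns the other actors' (1-based) indices, most favored first
    others = [j for j in range(1, len(y) + 1) if j != i]
    return sorted(others, key=lambda j: y[j - 1])
```

For the example in the text, `ordering([0, 4, 3, 1, 2], 1)` recovers the ordering (4, 5, 3, 2).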
The likelihood we use is the Plackett-Luce likelihood of Sewell and Chen (2015a), in which s = (s_1, . . . , s_n) are actor-specific parameters indicating each actor's social reach; for identifiability, Σ_{i=1}^n s_i = 1. As a Plackett-Luce model (Plackett, 1975), it satisfies Luce's choice axiom, which can be characterized by actor i ranking actor j over actor k with the same probability regardless of which other actors are included in the set to be ranked. See Sewell and Chen (2015a) for further motivation and details of this model. As this likelihood depends on the latent positions through the distances D(X_t), we implement the distance model of Section 2.1. This flexible framework allows us to detect communities through the latent positions of the students. Estimation is done by placing a Dirichlet prior on s and incorporating these parameters into the MH within Gibbs MCMC algorithm of Section 3.1.
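While we do not reproduce the exact parameterization of Sewell and Chen (2015a) here, the generic Plackett-Luce log-likelihood for one actor's ordering, given positive "worth" values for the ranked actors (in the model these depend on the distances D(X_t) and on s), can be sketched as:

```python
import math

def plackett_luce_loglik(order, w):
    # order: actor indices (1-based), most to least favored
    # w: list of positive worths; w[j-1] is actor j's worth
    # Each ranked actor is chosen from those not yet ranked, with probability
    # proportional to its worth -- the sequential Plackett-Luce construction.
    ll = 0.0
    remaining = list(order)
    for j in order:
        ll += math.log(w[j - 1]) - math.log(sum(w[k - 1] for k in remaining))
        remaining.remove(j)
    return ll
```

With equal worths, every ordering of m actors is equally likely, giving log-likelihood −log(m!).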
For G = 2, . . . , 9 (thus allowing a maximum of nine clusters), we ran 100,000 iterations of the MCMC algorithm of Section 3.1. For each of the eight chains, we used a short MCMC chain (the same chain for each G) following the no-clustering model of Sewell and Chen (2015a) to initialize the latent positions {X_t}_{t=1}^T and the actor-specific likelihood parameters s; for the remaining prior parameters we used the generalized EM algorithm given by Sewell et al. (2016).
The BIC method described in Section 3.4 led us to choose five communities; these BIC values ranged from −13,531 to −13,066. The MCMC chain converged relatively quickly, as seen in Figure 4a, which provides a trace plot of the posterior value for all 100,000 samples. Figure 4b shows the corresponding ACF plot; the autocorrelation decays at a reasonable rate, which, together with Figure 4a, indicates good mixing. Geweke's diagnostic test, as implemented in the coda R package (Plummer et al., 2006), yielded a p-value of 0.611 using a burn-in of 5,000, implying convergence.
The goodness of fit was evaluated using the pseudo-R^2 value described in Sewell and Chen (2015a), which takes values in [0, 1), with higher values implying a better fit. We obtained a pseudo-R^2 of 0.575, slightly less than the 0.622 obtained by Sewell and Chen. We find this satisfactory: although we impose more structure on the prior of the latent positions via the clustering, we lose little in terms of model fit.
This data set has been analyzed many times since its genesis, and several of these analyses have focused at least in part on community detection. Nakao and Romney (1993), when analyzing Newcomb's fraternity data, created similarity matrices for each time point and then performed multidimensional scaling to obtain latent network positions, visually determining the communities. Moody et al. (2005) used various visualization methods and also commented on clusters noticed via visual inspection. Sewell and Chen (2015a) provided a detailed analysis of Newcomb's fraternity data which included a post-hoc analysis of the subgroup formation.
An important advantage of our proposed approach over these ad hoc or post hoc methods is the ability to compute the posterior probabilities of pairwise membership in the same cluster; that is, we can quantify the uncertainty of our hard clustering assignments. These quantities are easily computed from the MCMC output, and hence we can determine whether the previously described results are reasonable according to our analysis. Figure 5 depicts the pairwise posterior probabilities of two actors belonging to the same cluster at week 7 (chosen as a stabilized representation of the dynamic cluster memberships). Dark shaded regions indicate high probabilities, and light regions indicate low probabilities. If a method estimates that two actors belong to the same cluster, then a square (our proposed method), triangle (Nakao and Romney), circle (Moody et al.), or asterisk (Sewell and Chen) is given in the appropriate cell. Note that all methods other than the proposed do not assign clusters to all actors in the network. From this figure we see that the methods often agree on pairs belonging to the same cluster for most of the very high pairwise probabilities, and most often agree on pairs not belonging to the same cluster for the low pairwise probabilities. For the numerical values of the pairwise posterior probabilities for week 7 as well as for all other weeks, see the supplementary material.

Fig 5: Pairwise probabilities of actors belonging to the same cluster at week 7. Actors (rows/columns) are ordered according to the MAP estimates of the communities. Different methods' estimates are given by the methods' corresponding shapes in the appropriate cell.

Figure 6 shows the latent space with the MAP estimators of the latent positions, thus showing the overall structure of the subgroups of the network; all actors at all time points are shown. Figure 7 shows the latent positions at weeks 1, 7, and 15. The community structure stabilized at around week 4, did not change at all until week 12, and changed only slightly until week 14. We can characterize our five communities, referencing these groups using the shapes given in Figures 6 and 7. The community matches well with communities discovered by Nakao and Romney, Sewell and Chen, and the main community discovered by Moody et al. Once all its members had joined within the first few weeks (none departed), it remained constant for the remainder of the study, until student 14 joined in the final week. The • community seemed to be the opposite, in that it was the most transient. Similar to the • community, the community was also fairly transient, with many students leaving and some joining throughout the study. The + community was characterized by students joining and remaining in the community, and in this manner was similar to the community. The + community was also the most popular group in terms of rankings received, unlike the community, which was more isolated and not very popular, and matches well with a community discovered by Sewell and Chen. The community evolved into the least popular group (until the least popular student, 16, formed his own community in the last two weeks), consisting of several of those students Nakao and Romney termed "outliers" and several of the students that Sewell and Chen described as having departed the main communities.
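The pairwise posterior probabilities shown in Figure 5 can be estimated from the MCMC output as the fraction of posterior draws in which two actors share a label. A minimal sketch, assuming hard labels have been extracted from the Z_it draws for a fixed time point:

```python
import numpy as np

def pairwise_coclustering(z_samples):
    # z_samples: (S, n) array; row s holds the cluster labels of all n actors
    # at one MCMC draw (for a fixed time point, e.g., week 7)
    S, n = z_samples.shape
    P = np.zeros((n, n))
    for z in z_samples:
        P += (z[:, None] == z[None, :])  # 1 where a pair shares a label
    return P / S
```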
As the network was completely nascent at the first week, it is hardly surprising that a number of actors switched communities, especially during the beginning of the study. Our model was able to capture this evolution of the network, unlike clustering algorithms which assume constant cluster assignments over time. In all there were 15 transitions, 8 of which occurred during the first three transition periods and 5 during the last two. This implies that the subgroup formation of the social network was fairly stable after week four, though the stability of the network faltered at the end of the semester; this last comment regarding the deterioration of the network stability also corroborates statements made by various other researchers (e.g., Nakao and Romney, 1993; Krivitsky and Butts, 2012; Sewell and Chen, 2015a).
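The transition counts above can be tallied directly from the hard label trajectories (a sketch; Z here is a matrix of MAP cluster labels over time, not the indicator vectors of the model):

```python
import numpy as np

def transitions_per_period(Z):
    # Z: (T, n) array of hard cluster labels over T time points
    # returns, for each of the T-1 transition periods, how many actors
    # switched communities
    return (Z[1:] != Z[:-1]).sum(axis=1)
```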

World trade data
We consider world trade data with the goals of determining trade blocs and gleaning what information we can from these blocs. We look at annual export and import data between countries during the years 1964 to 1976 (so T = 13). A (directed) trade relation is established from country i to country j, i.e., Y_ijt = 1, if country i exports some non-negligible amount of goods to country j during year t. During this period the set of countries is, for a variety of reasons, not constant, and so we only include the n = 111 countries which existed throughout the entirety of the study period. Thus we have thirteen 111 × 111 binary adjacency matrices. As this is primarily a pedagogical example, we chose these years to strike a balance between a large number of time points and a large number of countries. The data we used were obtained through the Economic Web Institute at http://www.economicswebinstitute.org/worldtrade.htm, originally obtained through the IMF Direction of Trade Yearbook.
To detect trade blocs within the binary trade relations data, we implemented both the distance model and the projection model, letting G take values from 2 to 9. Using the procedure described in Sections 3.4 and 4.3, we selected the projection geometry with four clusters; the BIC values for the projection model ranged from −35,838 to −34,448. Figure 8a provides a trace plot of the posterior value of all 100,000 samples, and Figure 8b provides the ACF plot. From these we see evidence of convergence and good mixing. Geweke's diagnostic test yielded a p-value of 0.210 using a burn-in of 35,000, implying convergence.

Most of the blocs are relatively densely interconnected, as seen in Table 3. The exception is Bloc 1, a global trade bloc with nations representing all inhabited continents, which is loosely interconnected. This community is also the most transitory, as seen in Table 4, which gives the estimated values of β_0 and β_h, h = 1, . . . , 4. It is intuitive that these two things should coincide, in that trade blocs that are not actively trading with each other should be more likely to lose member nations to other trade blocs. Bloc 2 is the largest bloc, averaging 49 nations per year, and involves, with very few exceptions, only eastern hemisphere nations, indicating that geography may play a role in the formation of trade blocs. Bloc 3 consists of the U.S.S.R., several eastern European countries, and most of Latin America. This gives quantitative evidence in favor of claims of close ties between the U.S.S.R. and Latin America and of Soviet influence in the western hemisphere (e.g., Blasier, 1988). Bloc 4 is indicative of a very interesting vestigial effect of French colonization: of the countries that belonged to Bloc 4, France and her former colonies constitute 2/3. French colonial policy required her colonies to import only from or through France, export only to France, and ship using French vessels (Grier, 1999). That France and her former colonies behave similarly as participants in world trade gives evidence that colonial policy established a longer-term trend.

Fig 9: Estimates of latent locations (plotting the unit vectors indicating direction and ignoring the magnitude of the vectors that correspond to individual effects) of countries in the international export/import data. The four communities have been labeled along segments from the origin to the communities' centers.

Table 3: Densities within and between each of the four communities, averaged over all time points. These densities are computed by dividing the total number of edges by the total possible number of edges.

         1     2     3     4
   1  0.10  0.14  0.08  0.07
   2  0.14  0.25  0.15  0.12
   3  0.08  0.15  0.24  0.05
   4  0.07  0.12  0.05  0.19
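The within- and between-bloc densities of Table 3 can be computed from an adjacency matrix and hard bloc assignments along these lines (a sketch for a single time point; averaging over the T matrices gives the reported values):

```python
import numpy as np

def block_densities(Y, z, G):
    # Y: (n, n) binary directed adjacency matrix (no self-loops)
    # z: length-n array of bloc labels in {0, ..., G-1}
    D = np.zeros((G, G))
    for g in range(G):
        for h in range(G):
            rows, cols = np.where(z == g)[0], np.where(z == h)[0]
            # exclude the diagonal when counting possible within-bloc edges
            possible = len(rows) * len(cols) - (len(rows) if g == h else 0)
            D[g, h] = Y[np.ix_(rows, cols)].sum() / possible if possible else 0.0
    return D
```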

Discussion
Community detection is an important topic in network analysis. We have extended the commonly used distance and projection latent space models to incorporate clustering of dynamic network data, utilizing the temporal information to build the model. This model can handle directed or undirected dynamic network data, and can also be used to model a wide range of weighted network data. We have also given the first, to our knowledge, clustering model corresponding to the projection model of Hoff et al. (2002), Durante and Dunson (2014), and others. This model can also handle directed or undirected dynamic network data, and the VB algorithm we have described provides computationally fast estimation of the model. While the VB algorithm using the projection model for binary networks is relatively fast, the corresponding Gibbs sampler we have also implemented is time-intensive for larger networks, as seen in Figure 1. This burden could potentially be alleviated by adapting the likelihood approximation method first derived by Raftery et al. (2012) for binary networks. For the distance model, we expect that creating a VB algorithm would be non-trivial and context specific; we therefore leave that for future research.

Table 4: Estimates of the initial clustering parameter (first row) and transition parameters (last four rows).
In this paper we have discussed a method of selecting the number of clusters and the latent space geometry. However, a difficult topic we have not yet addressed is selecting the dimension of the latent space. Durante and Dunson (2014) developed a non-parametric approach to this problem in a simpler setting, which may inspire similar strategies for selecting the dimensionality of the latent space in our context. A very useful area of future research would be to construct a unifying model selection method to determine the latent space geometry, the dimension of the latent space, and the number of clusters.
One last comment is that the clustering models we have proposed are based on the assumption that actors within a cluster are more likely to form edges than actors in different clusters. While we expect this to be the most common context, there may be scenarios in which it does not hold. Instead, there may be varying roles that the actors can take on, and these roles need not be densely interconnected; that is, actors in the same community may not be densely connected to each other. In such a case a blockmodel approach would be more appropriate for modeling the data.