Fitting Position Latent Cluster Models for Social Networks with latentnet

latentnet is a package to ﬁt and evaluate statistical latent position and cluster models for networks. Hoﬀ, Raftery, and Handcock (2002) suggested an approach to modeling networks based on positing the existence of an latent space of characteristics of the actors. Relationships form as a function of distances between these characteristics as well as functions of observed dyadic level covariates. In latentnet social distances are represented in a Euclidean space. It also includes a variant of the extension of the latent position model to allow for clustering of the positions developed in Handcock, Raftery, and Tantrum (2007). The package implements Bayesian inference for the models based on an Markov chain Monte Carlo algorithm. It can also compute maximum likelihood estimates for the latent position model and a two-stage maximum likelihood method for the latent position cluster model. For latent position cluster models, the package provides a Bayesian way of assessing how many groups there are, and thus whether or not there is any clustering (since if the preferred number of groups is 1, there is little evidence for clustering). It also estimates which cluster each actor belongs to. These estimates are probabilistic, and provide the probability of each actor belonging to each cluster. It computes four types of point estimates for the coeﬃcients and positions: maximum likelihood estimate, posterior mean, posterior mode and the estimator which minimizes Kullback-Leibler divergence from the posterior. You can assess the goodness-of-ﬁt of the model via posterior predictive checks. It has a function to simulate networks from a latent position or latent position cluster model.


Introduction
Latent position models for social networks postulate an existence of an unobserved "social space" within which the probability of a tie between two actors is modeled as a function of some measure of distance between the latent space positions of these actors.Introduced by Hoff, Raftery, and Handcock (2002), these models were extended by Handcock, Raftery, and Tantrum (2007) to model grouping among actors as spherical Gaussian clusters in the latent space, and applied to non-binary data by Hoff (2005).
The R (R Development Core Team 2007) package latentnet implements the model described by Handcock et al. (2007), with slight modifications.The package implements Bayesian inference for the models based on an Markov chain Monte Carlo (MCMC) algorithm.It can also compute maximum likelihood estimates for the latent position model and a two-stage maximum likelihood method for the latent position cluster model.
As in Handcock et al. (2007), the package uses Minimum Kullback-Leibler (MKL) positions (Shortreed, Handcock, and Hoff 2006) for visualizing the posterior distribution of latent space positions and MKL label-switching algorithm (Stephens 2000) to deal with nonidentifiability of cluster assignments.It also adds support for nonbinary response variables, allowing them to be modeled as either binomial (with fixed, user-supplied number of trials) or Poisson.This paper describes this package, and, as examples, fits two standard datasets.The first, Sampson's "liking" network, among monks, is used an an example of a binary response.The second, the Highland Tribal alliances data, demonstrates nonbinary models.It is not the goal to describe all the functionality and options available, and more information about all of the commands and methods mentioned can be gotten through R's help system.
The latentnet software itself is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=latentnet and can be installed directly using, for example, install.packages("latentnet").
The latentnet package is part of the statnet suite of packages.The installer for the statnet suite installs all the required packages for you along with latentnet.It is easier for beginners or those interested in more general network models, visualization or simulation (including exponential family models).

Stochastic network models
We consider data on the relations between a set of n actors.The observations consist of a relational tie y i,j and covariates {x i,j,k } p k=1 measured on each ordered pair of actors i, j = 1, . . ., n.In the most simple cases, y i,j is a dichotomous variable, indicating the presence or absence of some relation of interest, such as friendship, collaboration, transmission of information or disease, etc. Denote the n × n sociomatrix Y = [y i,j ] and the dyadic covariates via the n × n × p array x = [x i,j,k ] .In the case of a binary relationship we denote y i,j = 1 if there exists a relation from actor i to actor j, while y i,j = 0 will denote that no such directed relation exists.This can be thought of as a graph in which the nodes are actors and the edge set is {(i, j) : y i,j = 1}.
The sociomatrix Y can be viewed as a random variable with a sample space of Y ⊆ {0, 1} n(n−1) .We consider models for Pr(Y = y|x), the probability mass function of Y given the values of the covariates.Most papers in this special issues focus of statistical exponential family models (ERGM).In this paper we focus on a class of hierarchical models.The class of models assume that the conditional probability of a tie between two actors, given the covariates, depends only on the distance between them in an unobserved "social space."The model also assumes that ties between actors occur independently, given their distance apart and the covariates.

Model specification
The model postulates that each actor has an unobserved position in a d-dimensional Euclidean latent social space.The most general form for the models fit by latentnet is Intuitively, Z = {z i } are the positions in social space of the actors.Equation (1) states that the ties between individuals are independent, given the positions Z = {Z i } n i=1 in social space of the two individuals.Here Z i ∈ R d is the d−dimensional location of the ith individual.Equation (2) states that the conditional probability depends on the covariates and latent positions only through their conditional mean given their distance apart and directed pair specific covariates.Equation (3) expresses this conditional mean in terms of a predictor function, η i,j , and a known link function g.Equation (4) states that the predictor function is linear in the covariates and the distance apart.The package uses Euclidean distance, denoted by |Z i − Z j |, although other distances functions are possible.
Note that is slightly different from the specification of Handcock et al. (2007), in that latent space positions are not constrained and there is no multiplier for their distance.The specification in Handcock et al. (2007) is now implemented in latentnetHRT (see the section on "Package history" below).
The density-link combinations of (2)-( 3) that are currently implemented are given in Table 1.
For example, if the tie is dichotomous we can choose the Bernoulli family for the model.Then (2)-( 3) specify a logistic regression model in which the probability of a tie depends on the Euclidean distance between Z i and Z j in social space: As in Handcock et al. (2007), when fitting in a Bayesian context, as an hierarchical model, prior distribution is imposed on the latent space positions and covariate coefficients, and hyperpriors on the latent space prior parameters: Intuitively, the positions of the actors are drawn from a mixture of Gaussians.Each component of the mixture represents a difference group (i.e., class, category, type, etc) and the positions form a relative cluster of actors within the space.The model of Hoff et al. (2002) is essentially the case with G = 1.
Within this specification the user can specify d, G and the hyper parameters . The prior density of the means is effected by ω 2 .The prior delineation of groups is controlled by σ 2 0 and α.Larger values of the within-cluster variance (σ 2 0 ) lead to lower belief in cluster separation and a lower value of the within-cluster variance degrees of freedom (α) represents greater diversity in within-group variation.
The user can also specify the characteristics that influence the estimation algorithm.See Section 2.2 for details.
As part of the MCMC algorithm, an auxiliary variable K is introduced representing a hard clustering of latent space position within each iteration.This variable is useful computationally as the prior distribution can be reexpressed as:

Package history
The package latentnet was originally developed, by Mark S. Handcock, Peter Hoff, Susan Shortreed, and Jeremy Tantrum, to implement Handcock et al. (2007), but has been substantially rewritten by Pavel Krivitsky to enhance the capabilities and implement the new specification.The package latentnet implements the specification in this paper.To get the original specification in Handcock et al. (2007) use the package latentnetHRT.This corresponds to version 0.7 of the original latentnet.

Model specification and fitting
In this section, we describe how to fit latent position and latent cluster models for networks.This is done using the function ergmm.
The function ergmm constructs an internal representation of the model to be fitted, finds reasonable initial values for the MCMC algorithm, runs the MCMC sampler, and performs post-processing of the MCMC sample, returning an object of S3 class ergmm, described in Section 3.1.
The first, and the only mandatory, argument to ergmm is formula, to specify the network object to be used as the response variable in the model and the model to fit.

A simple motivating example
To get an idea of what the package can do we fit a standard data set in the social network literature.In 1968, Sampson collected data on 18 monks and their interpersonal relations (Sampson 1969).Each monk was asked about positive relations with the other monks and reciprocity was not required, thus the graph is directed.The data contain 56 directed ties between the 18 monks.There are no covariates in this data set, thus estimation is of a single covariate (the intercept) and the positions of the actors.The data is available in the package via the data function and a description can be found using the help function: Let's take a look at the names of the monks (followed by their order in the original): R> data("sampson") R> network.vertex.names(samplike) [1] "Romul_10" "Bonaven_5" "Ambrose_9" "Berth_6" "Peter_4" [6] "Louis_11" "Victor_8" "Winf_12" "John_1" "Greg_2" [11] "Hugh_14" "Boni_15" "Mark_7" "Albert_16" "Amand_13" [16] "Basil_3" "Elias_17" "Simp_18" and at the cohesive sub-groups designated by Sampson (1969).
This quick fit and display creates as many questions as it answers.Did the algorithm converge?How well does the model fit?Are the MLE estimates good, or should alternatives be used?
We will return to these later in the paper.But first let's look at specifying more sophisticated models.

Model formula
The first, and the only mandatory, argument to ergmm is formula, to specify the network object, defined in package network, to be used as the response variable in the model, and the covariates.This is done using the standard syntax for the statnet suite of packages: a two-sided R formula, with the response variable on the left-hand side and the covariates on the right-hand side, specified as function-like calls (Handcock, Hunter, Butts, Goodreau, and Morris 2003).

Terms that can be specified in the formula
Terms specific to latentnet are as follows: Intercept By default, linear predictor η i,j has at least an intercept term: x 1,i,j ≡ 1. Adding a term "-1" removes it.This is not recommended unless the other covariates are selected to make the intercept term redundant.

Latent space and clustering
The term latent adds a Euclidean latent space distance term −|Z i − Z j | to the linear predictor of the probability of a tie between i and j, η i,j , optionally modeling clustering of the latent space positions.The parameters to the term are as follows: var.mul, var (optional) Prior within-cluster variance σ 2 0 .
For those parameters that have an argument ending in -.mul, the parameter can either be specified directly by passing the argument without the ending, or indirectly, by specifying the parameter with the ending, used as a scaling constant for picking a sensible prior.(This is the default.)See Section 2.4 for an explanation of how -.mul is used.
Other covariates An arbitrary dyadic covariate term can be added to the model by specifying x k,i,j directly with the term latentcov(x).The first argument, x, can either be a matrix of covariates for each dyad, or an edge attribute name of the response network.If an optional parameter attrname is specified, it is used as the user-visible name of the term.latentnet also implements all dyad-independent terms available in statnet.Note that the model does not admit terms which introduce conditional dyad-dependence.
In each term, it is possible to set prior mean (ξ k ) and variance (ψ 2 k ) using mean and var respectively.At this time, all covariate coefficients are modeled as a priori independent.

An example of a latent position fit with clustering
We next quickly fit a latent position model with two dimensions and three groups to the same network samplike, using defaults for all other algorithmic inputs.We first set the random number seed.This is not necessary but enables the results to be reproduced.The "seed" argument can be any integer, here 3141.

R> set.seed(3141)
The G parameter fits 3 groups.The optional argument verbose is used to make ergmm print out diagnostics of its progress:

Sensible prior for the hyper-parameters
The latent position model requires specifying a number of hyperpriors, and the model is sensitive to their choice.For example, too high a prior intra-cluster variance leads to clusters blurring together, while too low a prior intra-cluster variance creates a posterior mode in which all the clusters are concentrated at a point, causing the fit to "collapse".For user convenience, latentnet implements a heuristic that produces adequate results on a variety of networks.The heuristic used is as follows: In a typical network, ties even within a particular cluster form far from a complete graph, so the amount of space occupied by a cluster tends to increase with the number of nodes in the cluster, faster than a normal distribution with a fixed variance.
Intuitively, the node within each cluster needs a certain amount of space.The volume of a d-dimensional hypersphere is proportionate to its radius to the d'th power, so it makes sense to make the prior variance proportional to the d'th root of the cluster size.
For the dyadic covariate coefficients, the heuristic represents an a priori belief that coefficients of covariates with high magnitudes would, themselves, tend to have lower magnitudes to compensate.Thus, their respective prior variances are divided by the mean square of the covariate.

Controlling the fit and refining the computations
The control argument can be used to change the characteristics of the MCMC algorithm and how ergmm fits the given model.The control list is generated by the ergmm.controlfunction (R has good reasons for this) in a similar fashion as glm.control (which can also be consulted for background).
All of these are passed as arguments to the ergmm.controlfunction.For a simple list of these, type args(ergmm.control) in R, or for more information on each, type help("ergmm.control").Some of the more commonly used arguments include: burnin The integer value assigned to this argument specifies how long the MCMC chain should proceed before beginning the sampling process.It also affects the length of the pilot runs for the adaptive sampling.
interval The integer value assigned to this argument specifies how long the MCMC chain should run between successive draws.
sample.sizeThe integer value assigned to this argument specifies how many draws from the MCMC chain should be made.

Nonbinary edge weights
An ergmm call has two optional parameters, family and fam.par, to specify the model for the response variable; an optional argument response, giving either the name of the edge attribute to be used as a response or a sociomatrix, is also useful, since, at their core, the networks are binary.Currently implemented families are given in Table 1.
For example, consider the Highland Tribes dataset, available in the package netdata (Handcock et al. 2003) and sourced from UCINET (Borgatti, Everett, and Freeman 1999).These are 16 tribes in the Eastern Central Highlands of New Guinea, with each pair of tribes report to be in one of three states of relation: alliance, antagonism, or neutrality.While a multinomial model might be more appropriate, for the purposes of demonstration, we code alliance as 2, antagonism as 0, and neutrality as 1, and fit a binomial model with 2 trials.This dataset is included in latentnet, and this particular coding is contained in the network edge attribute sign.012.Thus, to fit this model, we use: R> data("tribes") R> tribes.fit<-ergmm(tribes ~latent(d = 2, G = 3), + family = "binomial", fam.par = list(trials = 2), + response = "sign.012",verbose = 1)

Output format and visualization
In this section, we detail what the ergmm produces as output, how the summary function works, and how the plot function can be used to get sophisticated graphical representations of the model fit.Let's start with the model output.

Class ergmm
Running ergmm produces an object of S3 class "ergmm", containing results of the run, diagnostic information, and some auxiliary data that can be used to reproduce it exactly.The entries returned depend on the relevant options passed to ergmm (i.e. the tofit argument): sample is an object of class ergmm.par.list(detailed below), containing the parameter configurations drawn using the MCMC algorithm, as well as some diagnostic information such as acceptance rates of some of the Hastings proposals.It also has attributes Q, containing the posterior individual-node cluster assignment probabilities after labelswitching, and breaks, containing information about which configurations came from which threads, in case of multi-threaded runs.
model is an object of class ergmm.model(see below), containing an internal representation of the model being fitted; prior is an object of class ergmm.par,containing the information on the prior distribution used for the fit: its elements have names of the form var.par, where var is the variable whose prior distribution is being specified, and par is the name of the prior for this variable (the argument that would be passed to the term in the model formula).For example, Z.var gives the prior within-cluster variance for Z.This object can be passed as an argument to ergmm to produce a fit with an identical prior distribution.If used in this manner, it overrides any prior settings specified in the model formula.
control is a list of parameters that affect fitting but not the posterior distribution.
mcmc.mle and mcmc.pmode are objects of class ergmm.par,containing the highest-likelihood and highest-posterior-density MCMC sample encountered.
pmode is an object of of class ergmm.par,containing an approximate posterior mode (at this time, this posterior mode does not incorporate the prior cluster probability).
mle contains a maximum likelihood estimator.
mkl contains the configuration that minimizes Kullback-Leibler divergence between the predicted distribution and the posterior prediction.
burnin.start and main.startcontain the configurations at which the burn-in and the sampling run, respectively, were started.
Storing parameter configurations: ergmm.parand ergmm.par.listclass, primarily by disabling partial name matching.The correspondence between model parameters and elements of ergmm.par is given in Table 2.
A series of configurations, such as an MCMC sample, is stored in the ergmm.par.listclass, which, in addition to providing access to the samples of individual variables using the $ operator, as with ergmm.par,also implements indexing using [[]] to extract individual configurations (ergmm.par)and [] to extract subsets (ergmm.par.list).This is analogous to indexing ordinary lists.

Class ergmm.model
A formula passed to ergmm is parsed into an internal representation of a network model, of class ergmm.model,containing the response network (Yg), optional response attribute (response), family and family parameters (family and fam.par, respectively), the dimensionality of latent space (d), number of clusters (G), presence or absence of the intercept in the model (intercept), and the number, names, and matrices of covariates (p, coef.names, and x, respectively).This structure is returned as a part of an ergmm, and can be passed in place of a formula to rerun a particular model fit (though hyperpriors would need to be specified using the prior argument to ergmm).
A print method is available for ergmm.model, to pretty-print the information about an ergmm.modelobject.

Summarizing model fits
Although the results of latent position model fits are usually reported by plotting the latent space positions, latentnet does provide a convenient way to extract various summary statistics of the posterior.A summary method is implemented for ergmm objects, and returns an summary.ergmmobject, which a print method available for it, or can be accessed as a list.
The method summary.ergmmtakes the following arguments: ergmm.fit (required) An object of class ergmm to summarize.
quantiles (optional) For posterior mean, which quantiles to estimate.
se (optional) Whether to compute standard errors for those point estimates for which they are available.
The resulting summary.ergmmobject is a list containing the following elements: ergmm is an object of class ergmm, containing the fit from which the summary was generated.
model is an object of class ergmm.model,containing the model that was fit.
pmean, mkl, mle, etc. are objects of class ergmm.par,containing the requested point estimates.Some of them also have an element Z.pZK, which contains the posterior probability each node being in each cluster; coef.table, which contains a data frame with the coefficients and, depending on the type of estimate, their standard errors or quantiles; and cor and cov, which contain the correlation and variance-covariance matrices of coefficient estimates, respectively.
For example, let's take a look at the posterior probabilities of group membership for each monk.this, by feeding point estimates of latent space positions to network package's plot.networkfunction.
By default, a call to plot will plot the minimum Kullback-Leibler (MKL) estimates (see Shortreed et al. 2006, for the details) for: R> plot(samplike.fit) A convenient way to visualize clustering is by plotting each node as a small pie chart with the slices of the pie being the proportions of MCMC draws in which that node belonged to each cluster.This is done by passing pie=TRUE to the function.For better readability, we recommend using argument vertex.cex to increase the size of each plotting symbol, leading to the following code: R> plot(samplike.fit,pie = TRUE, vertex.cex= 2.5) The term vertex is used as a synonym for "node" here for consistency with the sna package.See Figure 2 for the resulting output.and thus producing an "animation" of the sampler's evolution.
Plotting the Tribes fit is a little more complicated: to show whether an edge represents friendly a hostile relationship, the colors need to be specified manually.Here, edges representing alliances are green and enmities red: R> plot(tribes.fit,edge.col= as.matrix(tribes,"gama", + m = "a") * 3 + as.matrix(tribes, "rova", m = "a") * + 2, pie = TRUE)

Diagnostics and algorithm tuning
How well did the algorithm do in fitting the model?Is the MCMC algorithm "mixing"?Did it converge?
The quality and speed of the MCMC algorithm used to produce a Bayesian model fit depends on, among other things, the variances and the covariance structure of proposals.To help evaluate it, latentnet provides MCMC output diagnostics and uses adaptive sampling to tune the proposal parameters.

mcmc.diagnostics method for ergmm
A quick and easy way to view basic diagnostics for an ergmm fit is using mcmc.diagnosticsmethod.This function uses the facilities of R package coda to compute Raftery-Lewis convergence diagnostics and plot autocorrelations, trace plots, and posterior density estimates for a subset of variables being simulated.Thus, to see if our fit to the binary monastery network mixed well, we can use R> mcmc.diagnostics(samplike.fit) which prints out the Raftery-Lewis diagnostic: Quantile (q) = 0.025 Accuracy (r) = +/-0.0125Probability (s) = 0.95 The plots produced by mcmc.diagnostics are in Figure 3.

A simple proposal tuner
While it is usually easy to see if the sampler mixed well, finding those proposals that would produce good mixing is not trivial.Approaches based on the approximate posterior (such as Raftery and Lewis 1996) are not appropriate with the proposal types used, so a more empirical approach is used.
The burn-in is split into a series of pilot runs.A simple heuristic and user-supplied values are used to set the initial proposal parameters.After each pilot run, the acceptance rates are checked.If they fall below a very low threshold (or above a very high threshold), the pilot run is repeated with the proposal variances of the variables involved reduced (or increased) by a large factor.If they do not, the proposal standard deviations and coefficients are updated as follows: For single-variable proposals, the proposal standard deviation is multiplied by the sample acceptance rate of the second half of the pilot run, and divided by the target acceptance rate.
For multivariate proposals, the empirical covariance in the second half of the pilot run of the variables involved is computed, its Cholesky decomposition is taken, and the resulting coefficients used to generate a multivariate normal proposal are multiplied by a scaling factor which is, again, multiplied by the sample acceptance rate and divided by the target acceptance rate.
These are then used by the next pilot run, or, if the burnin is over, in the sampling run.
Following Neal and Roberts (2006), 0.234 is used as the default target acceptance rate.

Simulation of latent position and latent cluster networks
The package provides a way to simulate from an latent position model.This is straightforward, since at every level of the hierarchy, the model is dyad-independent.The model can be specified either by a formula or an ergmm object (typically a ergmm function model fit).

Simulation from the posterior of an ERGMM fit
The method simulate for ergmm fits simulates a network conditional on a random draw from the posterior.It is invoked simply by calling simulate with an ergmm object as its first argument.For example, Figure 4: A random network generated from the posterior of the Sampson's Monks fit.
Alternatively, it is possible to pass an ergmm.modelobject to simulate, along with the prior, specified as a list described in Section 3.1, and a configuration of parameters, given as an ergmm.parobject.The configuration of parameters does not need to have all the parameters specified: those missing will be generated from their prior distribution.For example, R> likemonks1 <-with(samplike.fit,simulate(model, + par = sample[[1]], prior = prior)) R> plot(likemonks1) will generate and plot a network from the first draw from the posterior.

Assessment of model fit via posterior predictive checks
As an assessment of the model fit we can consider posterior predictive checks.This is a goodness of fit method based on how similar the networks simulated from the posterior predictive distribution are to the original for some higher-order statistics of interest, such as the distribution of geodesic distances.To conduct such an analysis for the monk's data we can use the gof function.The function specifies the higher-order statistics via a GOF formula.Any of those in the package ergm are available.The default statistics are idegree, odegree and geodesic distance.
R> samplike.fit.gof<-gof(samplike.fit)To get a graphical summary we can use the plot function: R> plot(samplike.fit.gof) This produces three figures corresponding to in-degree distribution, out-degree distribution and geodesic distribution.Only the first figure, corresponding to the in-degree distribution is reproduced in the paper (Figure 6).We see that the model reproduces the observed statistics reasonable well except for the a tendency to somewhat under produce monks of degree 3 and over produce monks of degree 4. That is, there is a tendency for the actual monks to have degree 2 or 3 more often, and 4 less often than the model expects. in degree proportion of nodes q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Goodness−of−fit diagnostics estimated groups; Labels the Sampson's groups

Figure 1 :
Figure 1: MLE positions for a fit on Sampson's Monks.

d
(required) Dimensionality of the latent space d.G (optional) Number of spherical Gaussian clusters into which the latent space positions are grouped G.The default is to fit a "plain" latent space model (G = 0).

Figure 2 :
Figure 2: MKL positions for a fit on Sampson's Monks.The pie charts on the right-hand plot show each vertex's posterior probability of assignment to each cluster.
It is also possible to specify other point estimates using argument what.For example, R> plot(samplike.fit,what = "pmean") and R> plot(samplike.fit,what = "pmode") will plot the posterior means and the approximate posterior modes (if the latter were requested in the tofit argument to ergmm), respectively; while R> plot(samplike.fit,what = 4) will plot the positions from the 4'th MCMC draw.The latter is useful for diagnosing slow mixing, by placing it inside a loop with R function Sys.sleep:R> for (i in 1:samplike.fit$control$sample.size)

Figure 5 :
Figure 5: Two-dimensional posterior densities of cluster locations for Sampson's Monks.

Figure 6 :
Figure6: A goodness-of-fit plot for Sampson's Monks for the in-degree distribution of the monks.Note the envelopes and the disjunction between the proportions of monks with indegree 3 and 4.
Each iteration of MCMC, as well as starting values, summary statistics, and hyperprior parameters are represented as instances of the ergmm.parS3 class, which extends the list