ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks

We describe some of the capabilities of the ergm package and the statistical theory underlying it. This package contains tools for accomplishing three important, and inter-related, tasks involving exponential-family random graph models (ERGMs): estimation, simulation, and goodness of ﬁt. More precisely, ergm has the capability of approximating a maximum likelihood estimator for an ERGM given a network data set; simulating new network data sets from a ﬁtted ERGM using Markov chain Monte Carlo; and assessing how well a ﬁtted ERGM does at capturing characteristics of a particular network data set.


Introduction
The ergm package for R (R Development Core Team 2007), a cornerstone of the statnet suite of packages for statistical network analysis, provides tools for modeling networks based on a well-studied class of models called exponential-family random graph models (ERGMs) or p-star models (Holland and Leinhardt 1981;Wasserman and Pattison 1996;Robins, Pattison, Kalish, and Lusher 2007a). In particular, the package allows users to obtain approximate (or, in some cases, exact) maximum likelihood estimates (MLEs); simulate random networks from a specified ERGM; and perform graphical goodness-of-fit checks of the type described by . This article describes some of the technical

ERGMs in a nutshell
The purpose of ERGMs, in a nutshell, is to describe parsimoniously the local selection forces that shape the global structure of a network. To this end, a network dataset, like those depicted in Figure 1, may be considered like the response in a regression model, where the predictors are things like "propensity for individuals of the same sex to form partnerships" or "propensity for individuals to form triangles of partnerships". In Figure 1(b), for example, it is evident that the individual nodes appear to cluster in groups of the same numerical labels (which turn out to be students' grades, 7 through 12); thus, an ERGM can help us quantify the strength of this intra-group effect. The information gleaned from use of an ERGM may then be used to understand a particular phenomenon or to simulate new random realizations of networks that retain the essential properties of the original. Handcock, Hunter, Butts, Goodreau, and Morris (2008) say more about the purpose of modeling with ERGMs; yet in this article, we focus primarily on technical details.
In the remainder of the article, we introduce the two network datasets of Figure 1 that will be used for illustrative purposes (Section 2), provide a brief technical summary of what an ERGM is (Section 3), and list a few examples of ERGMs along with numerous references in which these models are developed more fully (Section 4). Section 5 describes the algorithm for producing approximate maximum likelihood estimates used by the ergm package, while Sections 6 and 7 describe the simulation and goodness-of-fit capabilities, respectively, of the package. The focus of this article is restricted to the technical aspects of modeling using ERGMs, so it does not provide a proper discussion of the purpose of ERGMs nor how best to construct ERGMs in practice. There is a large body of literature on these topics, and the interested reader might turn to the special issue of Social Networks (Robins and Morris 2007), which contains several related articles and which provides extensive lists of references. Additional aid on the practical use of the statnet suite of packages is given by Goodreau et al. (2008a) in the form of a short tutorial.

Network datasets in ergm
Several network datasets are included with the ergm package. To see a list of them, type: R> data(package = "ergm") In this article, we use the samplike and faux.mesa.high networks, depicted in Figure 1, to illustrate various aspects of the ergm package functionality. To learn more about these particular datasets, or any of the datasets included with the ergm package, it is possible to view their corresponding documentation files by using the help function, or, equivalently, the question mark, as follows: R> help("samplike") R> ? "faux.mesa.high" In the samplike dataset of Figure 1(a), each node represents a monk within a particular monastery and a directed edge from one to another indicates that the first named the second as one of the three monks he likes the most, at any of the three distinct time points when the survey was administered; type help("sampson") for more details. Three groups, identified by Sampson (1968) after analyzing the trends in the pattern of ties over time, are indicated by the three different shapes and colors of nodes. Note that the definition of group membership in this case is endogenous: membership is not a measured attribute of the node, like age, that is independent of the relational structure, but instead a latent cluster defined by the structure of relations. The default behavior of the plotting function for a directed network like samplike is to place arrows at one end of each line segment, indicating the direction of each edge.
In the faux.mesa.high dataset of Figure 1(b), each node represents a student in grades seven through twelve at a hypothetical yet realistic school (or middle school-high school pair) in the United States, and each edge indicates a mutual friendship in which each node named the other as one of his or her top five male or top five female friends; see the appendix in Goodreau et al. (2008a) for the origin of this data set. Boys are depicted by square nodes and girls by many-sided polygons that appear circular; the label for each node indicates the grade in school (seventh through twelfth). In contrast to the samplike network, the nodal attributes in this case, grade and sex, are exogeneous. This difference means they can serve as predictors in a generative model for the friendships. Since the edges indicate mutual friendships, this is an undirected network.
In both figures 1(a) and 1(b), it is evident that the colors used for displaying the nodes are related to the clustering of the nodes. However, the plotting function does not consider these colors in any way when positioning the nodes; it only considers the pattern of edges and non-edges that exist in the network. Since the colors in the samplike are derived from the pattern of edges, the clustering by color is tautological. The fact that the nodes from faux.mesa.high cluster by grades, however, is different. In this case the clustering reveals a qualitative fact about this network, and the ergm package allows us to analyze properties like this quantitatively. The exact R code used to produce both of these plots is given in Appendix A-note that, because the algorithm used for the plots has a random element, this code will not produce exactly the same layouts as in Figure 1.

ERGM specification
Let the random matrix Y represent the adjacency matrix of an unvalued (binary) network and let Y denote the support of Y. Then we may think of Y as the set of all obtainable networks. Typically, as in this article, one fixes the number n of individuals, so that Y is a subset of all n × n matrices whose entries are all zero or one and whose diagonal entries are all zero-since the (i, j) entry indicates an edge from i to j, forcing the diagonal to be zero means that self-partnerships are disallowed. In the undirected case, Y contains only symmetric matrices.

The model
The distribution of Y can be parameterized in the form where θ ∈ Ω ⊂ R q is the vector of model coefficients and g(y) is a q-vector of statistics based on the adjacency matrix y (Frank and Strauss 1986;Wasserman and Pattison 1996).
Model (1) may be expanded by replacing g(y) with g(y, X) to allow for additional covariate information X about the network, as described in Section 4.3. The denominator, is the normalizing factor that ensures that Equation 1 is a legitimate probability distribution. Specification of Y, including the number of nodes, n, is an important yet often overlooked aspect of model (1). If, for instance, an edge denotes a heterosexual sexual relationship, then each element of Y should contain certain structural zeros, namely, all Y ij for which both i and j represent nodes of the same sex. At its largest, for a fixed n, Y may contain up to N = 2 n(n−1) networks, a very large number even for moderate-sized n, which makes calculation of κ(θ, Y) the primary barrier to inference using this model.

Change statistics
An alternative specification of the model (1) clarifies the interpretation of the θ coefficients.
To articulate this alternative, we first introduce the notion of a vector of change statistics. Such a vector is a function of three things: A particular choice g(·) of statistics defined on a network, a particular network y, and a particular pair of different nodes (i, j) that is either ordered or unordered, respectively, according to whether y is directed or undirected. We define the vector of change statistics as where y + ij and y − ij represent the networks realized by fixing y ij = 1 or y ij = 0, respectively, while keeping all the rest of the network exactly as in y itself. In other words, δ g (y) ij is the change in the value of the network statistic g(y) that would occur if y ij were changed from 0 to 1 while leaving all of the rest of y fixed.
In terms of the change statistic vector, model (1) may be shown to imply the following distribution of the Bernoulli variable Y ij , conditional on the rest of the network: where the logit function is defined by logit(p) = log[p/(1 − p)] and Y c ij represents the rest of the network other than the single variable Y ij . When the network statistics involve covariates X in addition to y, as we will describe in Section 4.3, we may add X to the notation and write δ g (y, X) ij . Equation 3 reveals two facts: First, the probability on the left hand side depends on y c ij only through the change statistics δ g (y) ij , not on g(y + ij ) or g(y − ij ) themselves. In many cases, it is much easier to calculate δ g (y) ij than it is to calculate g(y + ij ) or g(y − ij ), and this fact can lead to efficient computational algorithms. As an example of this phenomenon, the Erdős-Rényi model of Section 4.1 implies that δ g (y) ij = 1 for all y and for all i and j.
Second, Equation 3 says that each component of the θ vector may be interpreted as the increase in the conditional log-odds of the network, per unit increase in the corresponding component of g(y), resulting from switching a particular Y ij from 0 to 1 while leaving the rest of the network fixed at Y c ij . For examples of these kinds of interpretations for an actual dataset, refer to Sections 4 and 6 of Goodreau et al. (2008a).
The specific statistics that may be included in the g(y) vector in the ergm package are listed and described in . In the next section, we illustrate the use of only a small fraction of the available terms on the samplike and faux.mesa.high datasets. It is important to remember that because samplike is directed and faux.mesa.high is undirected, there are certain types of model terms that may not be used with one or the other of these datasets. A summary of these restrictions may be found in the table in Appendix A of Morris et al. (2008).

Examples of ERGMs
Here, we offer a brief glimpse at some important categories of ERGMs. We also give numerous references for readers interested in delving more deeply into the intricacies of building ERGMs, which is a subject that we cannot adequately cover given the limited space available and the limited scope of this article.

Bernoulli and Erdős-Rényi models
In the remainder of this article, we take to be the parameter space. The dimension q of Ω is at most N − 1 (for the "saturated" model), although it is typically much smaller than this. For example, if the dyads Y ij = Y ji of an undirected network are mutually independent and Y consists of the set of all possible undirected networks, the model can be written where θ ij = logit [P θ,Y (Y ij = 1)] is the log-odds of a tie in the (i, j) dyad and [Recall that the logit, or log-odds, function is defined by logit p = log p − log(1 − p).] Thus, for model (4), q = n(n − 1)/2 and the elements of the vector g(y) are just y ij . This model is often called a Bernoulli network. The special case where the dyads have a common probability implies that q = 1, g(y) = i<j y ij is the number of partnerships in the network, and the single coefficient θ can be interpreted as the common log-odds of partnership formation within any dyad. The mathematical properties of this homogeneous Bernoulli network model, also known as the Erdős-Rényi model, have been extensively studied-see Albert and Barabási (2002) and the many references therein-but the simplicity and homogeneity that make it tractable also make it less useful as a realistic model for social phenomena.
Consider the samplike dataset, which is contained in the ergm package and which may be accessed, once ergm is loaded, by typing

R> data("sampson")
Some information about the dataset may be obtained by typing help("sampson") or help("samplike"). To obtain summary statistical information, we may type either samplike or summary(samplike): Note that the ergm command requires the formula format in R, much like other regressionlike functions such as lm for linear regression or glm for generalized linear models. For ergm, the formula should be of the form network object~<model term 1> + <model term 2> + ...

R> samplike
where the model terms determine the elements of the g(y) vector. The ergm function allows many possible model terms other than edges; a complete catalog is given in Morris et al. (2008). Holland and Leinhardt (1981) appear to be the first to propose log-linear models for social networks. Suppose that we take Y to be the set of all directed graphs, with independent dyads [i.e., the pairs (Y ij , Y ji ) are independent for different choices of {i, j}] and the following model for tie probabilities:

The p 1 model
In this specification, each dyad has its own probability distribution. This model can represent arbitrary nodal indegree and outdegree marginal distributions and strength of reciprocity (or mutuality) within dyads. It can be written in log-linear form as which is a special case of Equation 1 with q = 3n(n − 1)/2. Holland and Leinhardt (1981) refer to this as the p 1 model, though they point out that it has too many coefficients to be useful statistically and instead consider a simplified version of model (5) in which all of the ρ ij are equal (to ρ) and φ ij = θ + α i + β j . In this model, ρ may be considered the mutuality effect; α i and β j are the sender and receiver effects, respectively, of the ith and jth nodes; and θ is the overall edge effect, analogous to the intercept in a linear regression. We may fit this model using maximum likelihood estimation as follows (because ergm uses a complex algorithm to fit this model, as described in Section 5.1, the ergm command may take a few minutes): R> model2 <-ergm(samplike~edges + sender + receiver + mutual, + seed = 2345, verbose = TRUE) R> summary (model2 The seed argument is used here and elsewhere in the article to make the output exactly reproducible, since the fitting algorithm is random; however, we have noticed that there exist some differences among the output produced by some platforms for some of the examples nonetheless. We obtained these results using R version 2.6.2 on a Windows machine using ergm version 2.1 and network version 1.3. The verbose = TRUE argument results in a lot of output as the algorithm proceeds. The control = control.ergm(check.degeneracy = TRUE) option would have invoked a check for degeneracy that takes quite a bit of time for this particular model due to the fact that it contains 36 parameters. Model degeneracy is discussed briefly by Handcock et al. (2008) in this volume, and in more detail by Handcock (2003b,a).
Note that summary(model1) produced a lot of information besides the coefficient estimates. The standard errors and loglikelihood values on which the deviance, AIC and BIC values are based use stochastic approximations discussed in Hunter and Handcock (2006). Also note that there are only 17 sender effects (sender2 through sender18) and 17 receiver effects, even though there are 18 nodes. This is because inclusion of all 18 effects would result in a linear dependency among the statistics of g(y), which should be avoided here, as in all statistical models.
The estimates above are approximate maximum likelihood estimates obtained using a stochastic algorithm based on Markov Chain Monte Carlo (MCMC); hence, the results will not be exactly the same for all runs. However, if exact reproducibility of coefficient estimates is important, it is possible to seed the random number generator manually using the seed argument to the ergm function; type ?ergm for more details. In this particular case, the formula in Equation 1 simplifies considerably and it is possible using numerical methods to find the exact maximum likelihood estimator, i.e., without resorting to a stochastic MCMC-based algorithm. Holland and Leinhardt (1981) do this, but unfortunately it is not possible to compare their results directly with ours because they use a slightly different dataset.
These early simple ERGMs, which have no exogeneous covariates and assume independence across dyads, have been substantially extended and generalized in subsequent social network literature. Based on developments in spatial statistics (Besag 1974), Frank and Strauss (1986) introduce forms of dependence with Markov structure. Wasserman and Pattison (1996) incorporate exogeneous and endogenous nodal attributes  and make a distinction between explanatory and response variables (Robins, Pattison, and Wasserman 1999), resulting in social influence (Robins, Pattison, and Elliott 2001b) and social selection (Robins, Elliott, and Pattison 2001a) models. These generalizations essentially allow analysis of networks with "colors" on the nodes, where "color" is used to indicate the attribute values conceptually as Figure 1 does literally. Recent developments include new forms of dependency structures, to take into account more general neighborhood effects. These models relax the one-step Markovian dependence assumptions, allowing investigation of longer range configurations, such as longer paths in the network or larger cycles (Pattison and Robins 2002). Models for bipartite (Faust and Skvoretz 1999) and tripartite (Mische and Robins 2000) network structures have also been developed.

Exogenous covariates and dyadic independence
Attribute information is easily incorporated into an ERGM (Fienberg and Wasserman 1981).
Suppose we wish to examine the impact of p exogenous attributes represented by an n × n × p array, X, whose ijkth element is the value of the kth attribute for the potential edge represented by the Bernoulli random variable Y ij . Note that this construction allows the attributes to be functions of nodal covariates. For instance, X ijk might be the absolute difference in ages between nodes i and j, say, |age i − age j |. As a practical matter, note in this example that it is not actually necessary to store the entire X array; it is much simpler to store only the vector giving the age of each node along with a rule for how to calculate X ijk when needed.
To modify the ERGM of Equation 1 to allow X to influence the probability distribution of Y, we replace g(y) by g(y, X), indicating that the statistics depend on the attribute information in addition to the relationship information. As an example, suppose that g(y, X) includes the following terms (where the equivalent ergm-package codings are given in square brackets): # of edges in y [ergm code: edges] # of edges between students of the same grade, counted separately for each possible grade [nodematch("Grade", diff = TRUE)] # of edges involving males, with male-male edges counted twice [nodefactor("Sex")] We might say that this model contains terms for the overall number of edges, a differential homophily effect for grade, and a main effect for sex. We may fit this model using the faux.mesa.high dataset as follows: R> data("faux.mesa.high") R> model3 <-ergm(faux.mesa.high~edges + nodematch("Grade", diff = TRUE) + + nodefactor("Sex")) R> summary (model3 We see that in each grade, students are more likely to make friends with those in their own grade than those in other grades. Furthermore, girls are more likely than boys to make friends in this network. These coefficients may be interpreted as described in Section 3.2, and these interpretations will be familiar to practitioners of logistic regression. For instance, we can say that if all other covariate values are the same, then an individual female's odds of forming a tie with a particular student are exp{0.3743} = 1.454 times those of an individual male, conditional on the rest of the network. See Goodreau et al. (2008a) for further interpretation of output such as this.
The output above is not based on a stochastic MCMC algorithm, so the code should always produce exactly the same values. This is not true of the stochastically obtained summary of model2 in Section 4.2. Before explaining the difference, we emphasize that a dyad in a network is the random variable representing the state of the relationship(s) between two given nodes. In other words, a dyad in an undirected network is simply a single Y ij , whereas a dyad in a directed network is a pair (Y ij , Y ji ). We now articulate the idea of dyadic independence via a couple of definitions. To understand Definition 1, it may be helpful to review the definition of the change statistic vector δ g (y, X) ij in Section 3.2.
Definition 1 A dyadic independence term is a term in an ERGM for which the corresponding network change statistic(s) in the δ g (y, X) ij vector-or δ g (y) ij if there are no covariatesmay always be calculated, regardless of the values of i and j, without knowing anything about y except possibly (in the case of a directed network) the value of y ji .
The table in Appendix A of Morris et al. (2008) indicates which of the terms currently available in statnet are dyadic independence terms. Examples include each of the terms used in model2 and model3: edges, receiver, sender, mutual, nodematch. and nodecov.
Definition 2 A dyadic independence ERGM is an ERGM all of whose terms are dyadic independence terms.
In the case of dyadic independence models for undirected networks, the conditional probabil- may be replaced by the unconditional, or marginal, probability P θ,Y (Y ij = 1). This results in an enormous simplification of the likelihood function, as described in Section 5.2, allowing the exact calculation of the maximum likelihood estimator. The output from fitting model3 is a case in point.
The situation is almost this simple for dyadic independence ERGMs for directed networks. However, a dyad in a directed network consists of two edges, so even in a dyadic independence model, it is possible that P θ,Y (Y ij = 1) depends on Y ji . This is exactly the situation of the p 1 model of Section 4.2, which is why ergm employs a stochastic MCMC algorithm in the model2 example. Currently, there are only two terms in statnet-mutual and asymmetricthat require a dyadic independence model to use MCMC. See Morris et al. (2008) or type ?ergm.terms for full descriptions of these terms. Any dyadic independence ERGM for a directed network not containing either of these two terms exploits the same exact maximum likelihood calculation used for model3 above.
To conclude this section, we draw an important distinction between dyadic independence and linear independence. It is always important to ensure that there are no linear dependencies among the terms in an ERGM, and linear dependencies can arise with either dyadic independence or dyadic dependence terms. For instance, it was to avoid linear dependence that the sender1 and receiver1 statistics are eliminated from model2 of Section 4.2, even though both sender and receiver are dyadic dependence terms, whether or not we include sender1 and receiver1 statistics in g(y). Many other statnet terms eliminate (or can be made to eliminate) certain statistics in order to avoid linear dependencies; for examples, see all terms that use the base argument in the summary table of Appendix A in Morris et al. (2008).

Dyadic dependence models
One commonly used class of dyadic dependence models-i.e., models that do not satisfy Definition 2-exhibit Markov dependence in the sense defined by Frank and Strauss (1986). For these models, dyads that do not share a node are conditionally independent, an idea analogous to the nearest neighbor concept in spatial statistics. Sometimes, a homogeneity condition is also added so that all isomorphic networks have the same probability under the model. Frank and Strauss (1986) show that homogeneous Markov network models are exactly those having the triangle parameterization, in which θ ∈ Ω = R n and and T 1 (y) = 1 6 n i=1 n j=1 n k=1 y ij y jk y ki .
In this parameterization, S k (y) counts the so-called k-stars for 1 ≤ k ≤ n − 1 and T 1 (y) is a count of triangles (or cyclic triads in the directed case). An equivalent form is the degree distribution parameterization, in which where D k (y) equals the number of individuals with exactly k relationships, 1 ≤ k ≤ n − 1. The degree distribution parameterization has the advantage that the degree statistics are directly interpretable in terms of concurrency of partnerships; i.e., D m (y) for m > 0 counts the number of individuals with m concurrent partners.
In practice, these models have often been simplified further, reducing the terms to edges, two-stars and triangles, and assuming isomorphic homogeneity. Unfortunately, we now know that such simple Markov models rarely produce reasonable networks. The reason has to do with the problem of degeneracy (Handcock 2003a,b), which is discussed in the context of the ergm package by Handcock et al. (2008). For a lengthy case study of degeneracy in a model that contains the triangle term, see Section 5 of Goodreau et al. (2008a). The shortcomings of the simplified Markov model may be addressed by allowing for some heterogeneity via the inclusion of covariate-dependent model terms as described in Subsection 4.3, and by the use of triad-based curved exponential family terms in place of the triangle count as described below.

Curved exponential-family models
This section details some of the technical considerations underlying a recently-developed class of network statistics that has been shown to work well in many social network contexts (Snijders, Pattison, Robins, and Handcock 2006;Robins, Snijders, Wang, Handcock, and Pattison 2007b;Goodreau, Kitts, and Morris 2008b). As an example of this type of statistic, consider the quantity u(y; φ s ) = e φs Evidently, u(y; φ s ) is a scalar for a fixed network y and parameter φ s , obtained by a linear combination of the degree statistics D i (y) that depends on the tuning parameter φ s (Hunter 2007). Because of the geometric series used in the linear weights, u(y; φ s ) is referred to as the geometrically weighted degree (GWD) statistic. If the edges term is also included in g(y), then u(y; φ s ) may be shown to be equivalent in a certain sense to the alternating k-star statistic of Snijders et al. (2006).
The ergm package has the capability of fitting models that include the GWD statistic, via the gwdegree term and several other related terms. In addition, two other geometrically weighted statistics, the geometrically weighted edgewise shared partner (GWESP) and the geometrically weighted dyadwise shared partner (GWDSP) statistics, are also supported by ergm. The first of these, GWESP, is equal to where EP i (y) is the number of edges in y between two nodes that share exactly i neighbors in common, i.e., the number of edges that serve as the common base for exactly i distinct triangles (Hunter 2007). The GWDSP statistic is similar, except EP i (y) is replaced by DP i (y), which is the number of pairs (i, j) such that i and j share exactly i neighbors in common, whether or not y ij = 1.
From a modeling perspective, these geometrically weighted terms are useful because they are not merely counts of local network configurations, like the degree of k-star statistics; instead, they are particular linear combinations of an entire distribution of degree or shared partner statistics. These terms appear very effective at overcoming the problems of degeneracy pointed out for the Markov network models mentioned in Section 4.4. A full discussion of the merits of these terms is beyond the scope of this article. For information on their purpose, see Snijders et al. (2006) and Robins et al. (2007b); for case studies that use them, see Goodreau (2007), Hunter et al. (2008), and Goodreau et al. (2008b); and for a tutorial introduction to their use, see Section 6 of Goodreau et al. (2008a). Here, we focus solely on the technical difficulties accompanying their use in ERGMs.
If φ s in Equation 7 and φ t in Equation 8 are fixed and known, then u(y; φ s ) and v(y; φ t ) present no special difficulties; one or both may easily be included as components of g(y). However, if φ s or φ t is an unknown parameter to be estimated via maximum likelihood, then the terms introduce some formidable technical and computational challenges. In this case, the model resulting from including u(y; φ s ) or v(y; φ t ) in g(y) is not of the standard ERGM form (1). However, it may be shown (Hunter and Handcock 2006) that this model is in fact an example of a curved exponential-family model in the sense of Efron (1975Efron ( , 1978. As an example, we fit two different models to the faux.mesa.high dataset, each involving the edges term, a uniform homophily effect of grade (i.e., an effect of two students being in the same grade), and a GWESP term. In model4, the φ t parameter of Equation 8 is assumed to be fixed at a value of 0.5; this model is therefore a true ERGM of the form (1). The second model, model4a, is a curved exponential family model in which the φ t parameter is to be estimated (the value 0.5 is used only as an initial value in the numerical estimation procedure). Note the difference in output for the two models, the key being that one holds the φ t parameter fixed at 0.5 and the other estimates it along with the other coefficients. The second model, model4a, may take a few minutes to run. , the degeneracy diagnostic suggests that both model4 and model4a may be degenerate models. These models are given here only to illustrate the difference between fixed = TRUE and fixed = FALSE, not to suggest that they fit these data well.

Approximating an MLE
From Equation 1, we obtain the loglikelihood function where y obs denotes the observed network. It is possible to redefine the g(y) vector by subtracting from it the constant vector g(y obs ), thus simplifying expression (9) without changing the model at all. Indeed, this simplification is used by the ergm function. However, we will use expression (9) throughout this article.
Rather than maximize (θ) directly, we will consider instead the log-ratio of likelihood values where θ 0 is an arbitrarily chosen parameter vector. [Note: Previously in this article, we have tended to use the term "coefficient" in situations in which either "coefficient" or "parameter" would technically be correct. To be precise, a coefficient in this context is a specific kind of parameter, namely, one that is multiplied by a statistic as in this case or in the case of regression generally. In this section, we may use the terms "parameter" and "coefficient" interchangeably.] The approximation of ratios of normalizing constants such as the one in expression (10) is a difficult but well-studied problem (Meng and Wong 1996;Gelman and Meng 1998). The main idea we exploit in the ergm function is due to Geyer and Thompson (1992) and may be described as follows: Starting from Equations 1 and 2, a bit of algebra reveals that where E θ 0 denotes the expectation assuming that Y has distribution given by P θ 0 ,Y . Therefore, we may exploit the law of large numbers and approximate the log-ratio by where Y 1 , . . . , Y m is a random sample from the distribution defined by P θ 0 ,Y , simulated using an MCMC routine as described in Section 6.
The stochastic estimation technique described above requires one to select a parameter value θ 0 . While the approximation of Equation 11 may in theory be made arbitrarily precise by choosing the MCMC sample size m to be large enough, in practice it is extremely difficult to use this approximation technique unless the value θ 0 is chosen carefully-namely, θ 0 should be "close enough" to the true maximum likelihood estimatorθ or Equation 11 will fail.
To see why this is so in a simple example, consider model1 from earlier, in which g(y) is just the number of edges in y and for the samplike dataset we foundθ = −0.9072. Suppose that we wanted to use the approximation (11) for (θ) − (θ 0 ), where in this case we take θ 0 = 1 for purposes of illustration. [Note: both g(y) and θ are scalars in this example, so we do not use bold type to write them.] Since θ 0 = 1 corresponds to a network in which each of the 18 × 17 = 306 possible edges occurs independently with probability exp{1}/(1 + exp{1}) = 0.731, we may easily obtain a random sample g(Y 1 ), . . . , g(Y m ) by simulating m draws from a binomial distribution with parameters (306, .731). In this simulation, the probability of obtaining a network with g(Y) < g(y obs ) = 88 is extremely small, roughly 2.3 × 10 −59 . For all practical purposes, such an event will never happen in a simulation even for very large m. Yet a simple derivation shows that the right side of Equation 11 cannot have a maximizer if there is no Y i with g(Y i ) < g(y obs ), a fact that also follows from standard exponential-family theory (Barndorff-Nielsen 1978). Therefore, we conclude that the stochastic algorithm for approximating the MLE will fail if we select θ 0 = 1 since g(Y i ) < g(y obs ) is an extraordinarily rare event in this case. On the other hand, with θ 0 = −1, it is straightforward to obtain an approximate MLE through simulation. These two cases are illustrated by Figure 2. . The true MLE,θ = −0.9072, is easily derived in this case. Note that θ 0 = −1 appears to give a good approximation to the loglikelihood nearθ, whereas θ 0 = 1 gives an approximation that cannot even be maximized.

Pseudolikelihood
The default method used by ergm to choose θ 0 is pseudolikelihood estimation, originally motivated by, and developed for, spatial models (Besag 1974). The idea is to use an alternative local approximation to the likelihood function referred to as the pseudolikelihood. The pseudolikelihood for model (1) is identical to the likelihood for a logistic regression model in which the (binary) response data consist of the off-diagonal elements of y obs and the predictor vectors are given by the change statistics δ g (y obs ) ij of Equation 3. Indeed, this is exactly the likelihood that is obtained if one starts with Equation 3 and then assumes in addition that the Y ij are mutually independent, so that The maximum pseudolikelihood estimator (MPLE) for an ERGM, the maximizer of the pseudolikelihood, may thus easily be found (at least in principle) by using logistic regression as a computational device. As the discussion in Section 4.3 shows, when the ERGM is a dyadic independence model not containing the mutual or asymmetric terms, the true likelihood and the pseudolikelihood are the same, which is to say that the true maximum likelihood estimator may be found via an MPLE computation.
When the ERGM in question is not a dyadic independence model, the statistical properties of pseudolikelihood estimators for social networks are not well understood. Recent work (van Duijn, Gile, and Handcock 2007) recommends strongly against their use as estimators. Nonetheless, it is sometimes helpful to be able to check the value of an MPLE. The ergm function may be used to return an MPLE by setting MPLEonly = TRUE. For instance, we may check that the MLE for the dyadic independence ERGM called model3 fitted earlier coincides with its MPLE by verifying that the difference in their coefficient estimates equals the zero vector: R> model3a <-ergm(faux.mesa.high~edges + nodematch("Grade", diff = TRUE) + + nodefactor("Sex"), MPLEonly = TRUE) R> model3$coef -model3a$coef edges nodematch.Grade.7 nodematch.Grade.8 0 0 0 nodematch.Grade.9 nodematch.Grade.10 nodematch.Grade.11 0 0 0 nodematch.Grade.12 nodefactor.Sex.2 0 0

Profile likelihood computations
It may sometimes be desirable to fix the values of certain parameters in the model at known constants and then maximize the likelihood as a function of the remaining parameters. The resulting maximized value of the likelihood is called the profile likelihood and it is a function of only the parameters that were fixed. In this way, for instance, it is possible to examine a profile likelihood surface for one of more of the parameters. Note that maximizing the profile likelihood function for a subset of the parameters yields the overall MLE.
To achieve this type of profiling using statnet, use offset in an ergm formula. For example, suppose we wish to modify model3, seen previously in this section, so that the edges coefficient is fixed at −6.0 instead of its unconstrained MLE value of −6.248. Then, we would like to maximize the likelihood (i.e., estimate the other coefficients) subject to this constraint. Even though model3 is a dyadic independence model, to carry out this constrained maximization would require a specially modified logistic regression routine that is part of neither ergm nor R; therefore, we will force the ergm function to generate an approximate maximum likelihood estimate using an MCMC algorithm, as follows. The idea is to start with the maximum likelihood found earlier (model3$coef), modify only the "edges" coefficient, and then hold that coefficient constant at -6.0: R> theta0 <-model3$coef R> theta0[1] <--6.0 R> model3b <-ergm(faux.mesa.high~offset(edges) + + nodematch("Grade", diff = TRUE) + nodefactor("Sex"), + theta0 = theta0, control = control.ergm ( Note that we had to explicitly state a starting θ 0 value, theta0, for the entire parameter vector (not merely the edges term). Also, the force.mcmc = TRUE option to the control.ergm function is used to force ergm to use a stochastic approximation algorithm to find the MLE, even in the case of a dyadic independence model.

Simulating random networks from an ERGM
The form of model (1) allows networks to be generated from it using Markov Chain Monte Carlo (MCMC) algorithms. MCMC algorithms have been much studied and are a natural way to simulate social networks (Gilks, Richardson, and Spiegelhalter 1996;Newman and Barkema 1999). The goal is to construct a Markov Chain on Y with P θ 0 ,Y (Y = y) as the equilibrium distribution. This is operationalized by starting from a network in Y and then making a large number of appropriately sampled Markov transitions until approximate convergence to P θ 0 ,Y (Y = y) is reached. Subsequent transitions are sampled and form a (sequentially dependent) sample from the desired model. For details on the general MCMC approach, see the extensive literature cited in the above books.
Many chains of networks are possible for a given ERGM, with vastly different mixing properties. However, convergence is ensured under fairly mild conditions (irreducibility and aperiodicity) on the Markov Chain in the limit as the number of transitions approaches infinity. For the social network representation (1), this process has been studied by Crouch, Wasserman, and Trachtenberg (1998), Corander, Dahmström, and Dahmström (1998), and Snijders (2002.

Different types of Markov chains
A full-conditional MCMC method has a simple form: At each iteration, for some choice of (i, j), Y ij is set to zero or one according to the conditional probabilities P θ 0 ,Y (Y ij = 1|Y c ij = y c ij ) and P θ 0 ,Y (Y ij = 0|Y c ij = y c ij ) implied by Equation 3. This so-called "Gibbs sampling" or "heat bath" algorithm chooses the pairs (i, j) uniformly at random, sequentially, or using some mixture of the two. Each update requires the change statistics δ g (y) ij of Equation 3 to be determined. The speed of the calculation of δ g (y) ij (or δ g (y, X) ij if covariates are involved) is an important factor in the computational quality of the algorithm (i.e., speed of convergence to equilibrium).
As an alternative to Gibbs sampling, Metropolis algorithms propose transitions from y current to y proposed , where at each step of the chain, the algorithm makes a random choice of whether to remain at y current for an additional step or change to y proposed , the latter choice occurring with probability min 1, Still more general, Metropolis-Hastings algorithms choose y proposed from an auxiliary distribution dependent on y current and are aimed at either focusing the transitions or spreading them more broadly throughout Y. Thus, if q(y 1 , y 2 ) denotes the probability that Y proposed = y 1 given that Y current = y 2 under this auxiliary distribution, then the probability (12) is replaced by min 1, Thus, the Metropolis algorithm of Equation 12 may be viewed as a special case in which the auxiliary distribution is symmetric in the sense that q(y 1 , y 2 ) = q(y 2 , y 1 ).
What makes Metropolis and Metropolis-Hastings algorithms (which include Gibbs sampling as a special case) so appealing is that the normalizing constants (2) disappear from the ratio of ERGM probabilities seen in Expressions (12) and (13); indeed, this ratio is simply In fact, if y proposed differs from y current by exactly a single edge toggle, replacing y ij by 1 − y ij , then g(y proposed ) − g(y current ) is just ±δ g (y) ij . On the other hand, if y proposed differs substantially from y current for a particular type of Metropolis-Hastings proposal, then the ratio of Equation 14 can be calculated by considering a sequence of networks, each with one dyad different from the last, starting from the current network and ending at the proposed network. At each step, the ratio is a simple function of the change statistic vector.

Modifying the Metropolis-Hastings algorithm
Metropolis-Hastings algorithms can converge more efficiently than Gibbs sampling to the target distribution when the proposal density q(·, ·) is well-chosen. The behavior of MCMC algorithms is also very dependent on the choice of statistics g(y). Snijders (2002) reports on some odd convergence properties of the MCMC algorithms described here for particular choices of an ERGM and a parameter vector. In some cases, the sequences of realizations transition quickly between very different networks after periods of minor variation that can be extremely long. Other studies using MCMC algorithms to simulate social network models have reported difficulties in obtaining convergence to realistic distributions (Crouch et al. 1998;Corander et al. 1998). A typical occurrence in such cases is for the algorithm to produce networks that are complete, empty, or otherwise extreme in some way. Such behavior is a byproduct of the models themselves, rather than the MCMC algorithms used to simulate from them (Handcock 2003a,b). Corander et al. (1998) considered algorithms that hold the number of edges in the network fixed, which avoids the problem of full or empty graphs. However, in most circumstances the density of the network is a product of the social process that produced it and cannot be assumed to be known in advance. Nonetheless, the ergm package supports many different Metropolis-Hastings constraints that hold various network statistics, such as the overall density, the degree distribution, or the degree of each node, constant. These constraints amount to restricting the class Y of networks that are considered to be possible under model (1). The possible constraints available in the ergm package, along with a couple examples of their use, are described in Section 3 of Morris et al. (2008). Possible modifications to the q(y 1 , y 2 ) proposal distribution are discussed in Section 4 of Morris et al. (2008).
Suppose we wish to use an MCMC idea such as the one described above to simulate a random network according to this model. Here are two equivalent ways to do this using the simulate command: R> net1 <-simulate(model2, verbose = TRUE, seed = 678) R> net1 <-simulate(samplike~edges + sender + receiver + mutual, + theta0 = model2$coef, verbose = TRUE, seed = 678) Note that the second of these commands, using a formula along with the theta0 argument, gives quite a bit of flexibility. For instance, one might wish to fit model2 and then examine the effects of small changes to one of the parameters in the fitted model. To do this, one could make a copy of these fitted coefficients -say, by typing the command mycopy <-model2$coef -then make the small changes to this copy and use the altered copy as the parameter values by substituting theta0 = mycopy into the above expression.
After simulating net1, we may plot it using a command similar to the one used to produce Figure 1, but with net1 in place of samplike; see Appendix A for details. The result of one such experiment is depicted in Figure 3.
Note in particular that the clustering of nodes by (colored) group is no longer evident. This is due to the fact that the ERGM used to generate this simulated network has no terms that represent the effects of the nodal "group" covariate. Since group membership was an endogenous property of the original network, rather than an exogenously defined measure, the inclusion of such terms would raise some interesting theoretical issues that lie outside the scope of this paper. While we cannot use this model to try to reproduce the group membership of specific nodes, we can use it to try to reproduce a network that is isomorphic with respect to the aggregate pattern of clustering. This structural isomorphism is closely related to the concept of "regular equivalence" in the social network literature (Borgatti and Everett 1992).
For a quantitative comparison of the structural similarities in the randomly generated network and the original samplike dataset, we may use the summary.formula capability of ergm, which provides summary statistics for a network. The idegree term above refers to in-degree, and in an ergm formula, idegree(0:3) adds four statistics to the g(y) vector: The number of nodes with 0, 1, 2, and 3 in-edges in y, respectively. Morris et al. (2008) give a list of the various graph statistics that may be used in Figure 3: A randomly generated network according to the ERGM with mutuality and edges terms, fitted to the samplike dataset.
an ergm or summary statement. Furthermore, both Goodreau et al. (2008a) and Morris et al. (2008) contain additional examples of the simulate function applied to networks.

Goodness of fit
Recent work on the quality of certain ERGMs, in particular work on degeneracy in ERGMs (Handcock 2003a,b) underscores the following fact: A maximum likelihood estimatorθ, while providing in some sense the best possible model from the particular class of models defined by Equation 1 for a particular choice of g(y), does not necessarily result in a particularly good model in a practical sense. It is possible that the model class itself is simply incapable of producing a probability distribution on Y such that there is a reasonable probability of obtaining networks that resemble the data y obs . Yet we must specify what is meant by "resemble" in this context.
One way in which one network may "resemble" another, particularly in the context of an ERGM with a particular vector g(y), is that their g(y) vectors may be close together. We know from the theory of exponential family models (Brown 1986) that a particular ERGM (1), with θ set equal to the maximum likelihood estimatorθ, has the property that so that at least we may be assured that the probability mass of the ERGM is centered at Randomly generated networksỸ 1 ,Ỹ 2 ,Ỹ 3 , . . .

g(y obs
). Yet this is not sufficient to imply that a random Y generated from the ERGM will "resemble" y obs . It is in fact quite possible that Equation 15 could be achieved essentially because the MLE model places nearly all of the probability mass on nearly-empty or nearlyfull networks, such that the mean, somewhere in between, is exactly g(y obs ). A striking example of this phenomenon is given in Section 3 of Handcock et al. (2008) in this volume; see also Handcock (2003a).
The intuition of the gof function in the ergm package, illustrated by the cartoon of Figure 4, is to compare the observed y obs with a set of simulated networksỸ 1 ,Ỹ 2 , . . . based on certain network statistics -which may or may not overlap those of g(y) itself. For instance, the code below uses four different sets of statistics, specified by the GOF argument, as a basis for comparison between the faux.mesa.high dataset and a series of 100 randomly generated networks obtained from the fitted model3. The interval = 5e+4 argument specifies the number of MCMC steps (50,000 in scientific notation) between sampled networks. Because of the large amount of simulation and compilation of network statistics necessary, the gof function may take several minutes to run.
R> m3gof <-gof(model3, GOF =~distance + espartners + degree + triadcensus, + verbose = TRUE, interval = 5e+4, seed = 111) The four sets of statistics used for the comparison are as follows: The geodesic distance distribution: The proportion of pairs of nodes whose shortest connecting path is of length k, for k = 1, 2, . . .. Also, pairs of nodes that are not connected are classified as k = ∞.
The edgewise shared partner distribution: The statistics EP 0 , EP 1 , . . . of Equation 8, divided by the total number of edges.
The degree distribution: The statistics D 0 , D 1 , . . . of Equation 6, divided by n.
The triad census distribution: The proportion of 3-node sets having 0, 1, 2, or 3 edges among them. Note: For a directed network, the triad census has 16 categories instead of 4; see the triadcensus term in Section 2.5 of Morris et al. (2008). The par and plot functions below produce the plot shown in Figure 5.
R> par(mfrow = c(2,2)) R> plot(m3gof, cex.lab=1.6, cex.axis=1.6, plotlogodds = TRUE) The upper-right plot of Figure 5 reveals that the ERGM with only an edges term, a differential homophily term for grade and a main effect for sex does a poor job of capturing the edgewise shared partner distribution. But considering that model3 is so simplistic, it does a remarkably good job of producing networks that reflect the degree distribution, the pairwise geodesic distance distribution, and the triad census of the original faux.mesa.high dataset. A better model for the faux.mesa.high dataset would include homophily terms and main effects for grade, sex, and race as well as a GWESP term to capture transitivity. Further details on such models may be found in Hunter et al. (2008), where they are fit to real data similar to faux.mesa.high.

Discussion
There are many features of the ergm package that it is impossible to document here due to space limitations, but we hope that this article, together with its companion articles in this volume, serves as a useful introduction to the capabilities of the package as well as some of the theory behind it. Of course, questions will inevitably arise that are not answered here or in the package documentation. For this reason, we have established a statnet mailing list at statnet_help@u.washington.edu. To subscribe go to https://mailman.u.washington. edu/mailman/listinfo/statnet_help. Further details are available on the statnet project web page at http://statnetproject.org.
The statnet packages are far from finished. For instance, future versions of the ergm package will address the question of how to fit ERGMs to network data that evolve in time. In addition, while the numerical fitting algorithm has come a very long way-and we are nearly at the stage where a reasonable model can be expected to converge "out of the box"-improving the algorithm is still a topic of active research.