Improving ERGM Starting Values Using Simulated Annealing

Much of the theory of estimation for exponential family models, which include exponential-family random graph models (ERGMs) as a special case, is well established, and maximum likelihood estimates in particular enjoy many desirable properties. However, for many ERGMs, direct calculation of MLEs is impossible, and therefore methods for approximating MLEs and/or alternative estimation methods must be employed. Many MLE approximation methods require alternative estimates as starting points; we discuss one class of such alternatives here. Unlike the MLE, the MPLE does not satisfy the so-called "likelihood principle," which means that different networks may have different MPLEs even if they have the same sufficient statistics. We exploit this fact here to search for improved starting values for approximation-based MLE methods. The method we propose has shown its merit in producing an MLE for a network dataset and model that had defied estimation using all other known methods.

Let $A$ denote a random network on $N$ nodes, represented by its binary tie indicators $A_{ij}$, and let $\mathcal{A}$ denote the sample space of possible networks. An ERGM defines a probability distribution on $\mathcal{A}$ via

$$P_\theta(A = a) = \frac{\exp\{\theta^\top T(a)\}}{\kappa(\theta)}, \qquad a \in \mathcal{A}, \tag{1}$$

where $\theta \in \mathbb{R}^q$ is a parameter vector and $T(a)$ is a $q$-dimensional vector of statistics. In the simplest case, $\mathcal{A}$ consists of all undirected networks on $N$ nodes, while in other applications $\mathcal{A}$ may be constrained to be a proper subset of this set. (If we drop the requirement that $A_{ij} = A_{ji}$, then the sample space consists of directed networks.) The sufficient statistics $T(A)$ play a central role in the model, since they enable the inclusion of traditional exogenous covariates, such as a node's age, as well as endogenous statistics, i.e., statistics that allow for inference on the structure of the network. Popular endogenous statistics are a network's number of triangles or its number of pairs of ties that share one common node (two-stars). The normalizing constant

$$\kappa(\theta) := \sum_{a \in \mathcal{A}} \exp\{\theta^\top T(a)\}, \tag{2}$$

a weighted sum over all possible networks in the sample space, ensures that (1) defines a probability model.
Maximizing the log-likelihood function of the ERGM,

$$\ell(\theta) := \theta^\top T(A_{\rm obs}) - \log \kappa(\theta), \tag{3}$$

yields the maximum likelihood estimator, or MLE. The MLE can be difficult to calculate directly due to the required calculation of $\kappa(\theta)$, a sum that is feasible only for particular choices of $T(A)$ for which the sum may be simplified, or for sample spaces small enough to allow direct calculation. Even for a small number of nodes, the sample space can be prohibitively large; for instance, there are over $3.5 \times 10^{13}$ networks on just $N = 10$ nodes, and each additional node inflates this number by a factor of more than 1000. It is therefore often necessary to use an approximation method when an MLE is desired, as exact calculation is not possible.
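To see where these numbers come from, note that an undirected network on $N$ nodes has $\binom{N}{2}$ possible edges, each either present or absent, giving $2^{\binom{N}{2}}$ networks. A quick check in R:

```r
# Number of undirected networks on N nodes: each of the choose(N, 2)
# possible edges is either present or absent.
num_networks <- function(N) 2^choose(N, 2)
num_networks(10)                       # about 3.52e13
num_networks(11) / num_networks(10)    # 2^10 = 1024, the inflation factor
```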
An alternative method of estimation, known as maximum pseudolikelihood estimation (MPLE), has the desirable property that it is relatively easy to calculate even in cases where an exact MLE is elusive. This article first discusses approximate maximum likelihood estimation and then summarizes how MPLE works and explains why, even in cases where an MLE is desired, the approximation method often begins by calculating the MPLE. Then, in Section 4, we explain a novel method to augment the approximation of the MLE by exploiting the failure of the MPLE to satisfy the so-called likelihood principle. We demonstrate the use of this idea in Section 5.

Approximate MLE and Exact MPLE
An expedient to the problem of maximizing the often-intractable log-likelihood function (3) is Markov chain Monte Carlo maximum likelihood estimation (MCMLE), developed by Geyer and Thompson (1992) and then adapted to the ERGM framework by Snijders (2002) and Hunter and Handcock (2006). The MCMLE is based on the idea that for any $\theta_0 \in \mathbb{R}^q$,

$$\ell(\theta) - \ell(\theta_0) = (\theta - \theta_0)^\top T(A_{\rm obs}) - \log E_{\theta_0}\!\left[\exp\{(\theta - \theta_0)^\top T(A)\}\right], \tag{4}$$

where the expectation is taken assuming that the random $A$ has the distribution $P_{\theta_0}$.
For a given $\theta_0$, Equation (4) may be exploited by sampling a large number of networks $A_1, \ldots, A_L$ from the distribution $P_{\theta_0}$. Then one may approximate and optimize the difference of log-likelihoods via

$$\ell(\theta) - \ell(\theta_0) \approx (\theta - \theta_0)^\top T(A_{\rm obs}) - \log\!\left[\frac{1}{L}\sum_{l=1}^{L} \exp\{(\theta - \theta_0)^\top T(A_l)\}\right]. \tag{5}$$

In theory, the MCMLE algorithm works for any starting value $\theta_0$; however, Hummel et al. (2012) show that approximation (5) is best when $\theta$ is close to $\theta_0$, and furthermore that the approximation can degrade badly when these parameter values are not close. For this reason, in practice it is necessary to choose $\theta_0$ close to a maximizer of $\ell(\theta)$, or else the MCMLE idea fails.
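Approximation (5) is simple to code once a sample of simulated statistics is available. The sketch below is not the ergm package's internal implementation; the names `stats` and `obs` and the model formula are illustrative, and the naive `log(mean(exp(...)))` can overflow for large exponents.

```r
library(ergm)

# Approximation (5): estimated log-likelihood difference l(theta) - l(theta0),
# given obs = T(A_obs) and stats = an L x q matrix of simulated T(A_l).
loglik_diff <- function(theta, theta0, obs, stats) {
  d <- theta - theta0
  sum(d * obs) - log(mean(exp(stats %*% d)))
}

# Hypothetical usage: simulate statistics from P_{theta0}, then maximize.
# theta0 <- c(-1, 0.5)   # e.g., an MPLE
# stats  <- simulate(A_obs ~ edges + triangle, coef = theta0,
#                    nsim = 1000, output = "stats")
# obs    <- summary(A_obs ~ edges + triangle)
# optim(theta0, loglik_diff, theta0 = theta0, obs = obs, stats = stats,
#       control = list(fnscale = -1))   # fnscale = -1 turns optim into a maximizer
```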
The most common choice of $\theta_0$ is the maximizer of the so-called pseudo-likelihood function.
To construct the pseudolikelihood, we focus on the individual values of the tie indicators $A_{ij}$. A straightforward calculation from Equation (1) shows that for a particular $i, j$, the conditional distribution of $A_{ij}$ given the rest of the network is

$$P_\theta(A_{ij} = 1 \mid A^c_{ij} = a^c_{ij}) = \mathrm{logit}^{-1}\!\left(\theta^\top [T(a^+_{ij}) - T(a^-_{ij})]\right), \tag{6}$$

where the inverse logit function is defined by $\mathrm{logit}^{-1}(x) = \exp\{x\}/(1 + \exp\{x\})$, $A^c_{ij}$ denotes the network excluding the single tie indicator $A_{ij}$, and the networks $a^+_{ij}$ and $a^-_{ij}$ are formed by setting the value of $a_{ij}$ to one or zero, respectively, while fixing the rest of the network at $a^c_{ij}$. We define $(\Delta a)_{ij} := T(a^+_{ij}) - T(a^-_{ij})$ and refer to $(\Delta a)_{ij}$ as the $i, j$ vector of change statistics. We may therefore rewrite Equation (6) as

$$P_\theta(A_{ij} = 1 \mid A^c_{ij} = a^c_{ij}) = \mathrm{logit}^{-1}\!\left(\theta^\top (\Delta a)_{ij}\right)$$

for all $i, j$. Under the additional assumption that the $A_{ij}$ are all independent of one another, these equations together define a logistic regression model, and the maximum likelihood estimator for this logistic regression is known as the maximum pseudo-likelihood estimator (MPLE). It is only a "pseudo"-likelihood because the assumption of independence is not justified in general, and therefore the logistic regression likelihood function is typically misspecified. Despite this misspecification, the MPLE is frequently used as an estimator of $\theta$ because it is easily obtained using logistic regression software. Indeed, the MPLE has a lengthy history in the literature on ERGMs; see Schmid and Hunter (2020) for more details.
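Because (6) has the form of a logistic regression on the change statistics, the MPLE can be computed with standard software. A minimal sketch using the ergm package's ergmMPLE helper follows; the dataset and model here are merely illustrative.

```r
library(ergm)
data(faux.mesa.high)   # example network shipped with the ergm package

# Extract the dyad-level response, change-statistic matrix, and weights.
mple_data <- ergmMPLE(faux.mesa.high ~ edges + triangle, output = "matrix")

# Fit the corresponding logistic regression; its MLE is the MPLE.
fit <- glm(mple_data$response ~ mple_data$predictor - 1,
           weights = mple_data$weights, family = binomial)
coef(fit)

# Equivalently: ergm(faux.mesa.high ~ edges + triangle, estimate = "MPLE")
```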
There is a substantial literature on the use of the MPLE as an estimator in its own right; see, for example, Schmid and Hunter (2020). One might recall Tukey's dictum: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise" (Tukey, 1962). For the purposes of this article, however, the MPLE will serve merely as a value $\theta_0$ used in Approximation (5), and we assume that the ultimate goal is to obtain an approximate MLE via MCMLE.

Comparison of MLE with MPLE
The theory of exponential family models is well established in general; Barndorff-Nielsen (1978) gives an extensive book-length treatment of this theory. As a particularly useful example in the context of ERGMs, for exponential family distributions the expectation of the $T$ vector under the MLE $\hat\theta$ equals the sufficient statistic on the observed network, i.e., $E_{\hat\theta}[T(A)] = T(A_{\rm obs})$. In other words, the MLE equals the method of moments estimator. This aligns with one's general expectation that the distribution described by an estimate should on average describe the observed network. Furthermore, this fact provides a useful means for checking whether a potential maximizer of the approximate log-likelihood function is in fact close to the MLE: one may simulate networks from the distribution defined by the candidate estimate and check that the sample mean of their statistics is approximately equal to $T(A_{\rm obs})$.
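This moment check is easy to carry out in practice. A sketch with the ergm package, assuming a recent version that supports statistic output from simulate(); the dataset and model are illustrative:

```r
library(ergm)
data(faux.mesa.high)

# Fit a model and simulate statistics from the fitted distribution.
fit  <- ergm(faux.mesa.high ~ edges + nodematch("Grade"))
sims <- simulate(fit, nsim = 500, output = "stats")

# If the estimate is close to the MLE, these two should roughly agree.
colMeans(sims)
summary(faux.mesa.high ~ edges + nodematch("Grade"))
```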

The Likelihood Principle
Another property of the MLE is that it satisfies the likelihood principle (Barnard et al., 1962; Birnbaum, 1962), which states that all information in the data relevant to the model parameters is contained in the likelihood function. In the ERGM context, this means that an estimator should depend on $A_{\rm obs}$ only through $T(A_{\rm obs})$, since the likelihood itself depends on $A_{\rm obs}$ only through $T(A_{\rm obs})$.
Consequently, two networks with the same sufficient statistic vector yield the same MLE. However, as Corander et al. (1998) point out, the MPLE does not satisfy the likelihood principle. This observation forms the basis of the remainder of this article.
The failure of the MPLE to satisfy the likelihood principle means that two networks $A_1$ and $A_2$ may have different MPLEs even if $T(A_1) = T(A_2)$. We see this fact illustrated in Figure 1, which depicts numerous possible MPLE values, each of which results from a network on 9 nodes with the same value of $T$. Also depicted in Figure 1 is the mean value parameter space $C$, an alternative to the natural parameter space $\mathbb{R}^q$, which is defined as the interior of the convex hull of the sample space of the sufficient statistic. Each parameter $\theta$ can be uniquely mapped to a point in the mean value parameter space by a bijective function $\mu : \mathbb{R}^q \to C$. For exponential family distributions this function is defined by

$$\mu(\theta) := E_\theta[T(A)].$$

In other words, a parameter $\theta$'s corresponding point in mean value space is the expectation of the sufficient statistic with respect to the distribution defined by $\theta$. This means that $T(A_{\rm obs})$ is the MLE's projection into the mean value parameter space, which is an interesting fact in its own right: the MLE in the natural parameter space is hard to find, but its counterpart in the mean value space is trivially easy.
More specifically, Figure 1 depicts the MLE and multiple MPLEs, in the natural as well as the mean value parameter space, for networks on $N = 9$ nodes with exactly 18 edges and 13 triangles.
The dashed lines in the right plot visualize the boundary of the convex hull of the mean value parameter space. The networks were sampled from the $(\rho, \sigma, \tau)$-model of Frank and Strauss (1986) with $\sigma$ taken to be zero, which results in an ERGM with a 2-dimensional $T$ vector. We fixed $\rho = -1$ and $\tau = 0.53$, which is equivalent to the ERGM

$$P_\theta(A = a) = \frac{\exp\{\theta_1 E(a) + \theta_2 \Delta(a)\}}{\kappa(\theta)} \tag{7}$$

with $\theta = (-1, 0.53)$, where $E(a)$ and $\Delta(a)$ denote the numbers of edges and triangles of $a$, respectively. With only 9 nodes, it is computationally feasible to enumerate all possible $T(A)$ vectors for $A \in \mathcal{A}$, so it is possible to calculate the MLE exactly in this case. Schmid and Hunter (2020) demonstrate how to achieve this enumeration using the ergm package (Handcock et al., 2019) for the R computing environment (R Core Team, 2020). Although it is possible to enumerate all network statistics according to their multiplicity in this 9-node example, it is not computationally feasible to calculate every possible MPLE resulting from a network with 18 edges and 13 triangles. Thus, Figure 1 uses the simulated annealing method explained in Section 4 to generate multiple such networks randomly.
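For intuition about the mapping $\mu$, one can approximate $\mu(\theta)$ for the parameter value just mentioned by simulation. The snippet below is illustrative and its Monte Carlo settings are arbitrary:

```r
library(ergm)
set.seed(42)

# Approximate mu(theta) = E_theta[T(A)] for theta = (-1, 0.53)
# on undirected 9-node networks, as in Model (7).
g <- network(9, directed = FALSE)   # a 9-node network to set the size
sims <- simulate(g ~ edges + triangle, coef = c(-1, 0.53),
                 nsim = 2000, output = "stats")
colMeans(sims)   # Monte Carlo estimate of the mean value coordinates
```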

Rao-Blackwellization
Another potential way to exploit the failure of the MPLE to satisfy the likelihood principle is via so-called Rao-Blackwellization of the MPLE. The Rao-Blackwell theorem (Lehmann and Casella, 1998) implies that an estimator $\tilde\theta$ that is not a function of the sufficient statistic $T(A)$ can be improved upon: the estimator $\theta^*$ obtained by taking the expectation of $\tilde\theta$ conditional on $T(A)$ has risk no greater than that of $\tilde\theta$ under any convex loss function, e.g., mean squared error.
In principle, computing the Rao-Blackwellized MPLE would require the distribution of the MPLE conditional on $T(A_{\rm obs})$. To put this idea into practice, a sample from this distribution could be obtained using, say, the simulated annealing method of Section 4. However, since the stochastic properties of the simulated annealing algorithm are not well understood, we have not explored this particular idea in the current manuscript.

New Starting Values Via Simulated Annealing
As described earlier, maximum likelihood estimation by MCMC requires the selection of an auxiliary parameter $\theta_0$. Even though in theory the algorithm operates with any choice of $\theta_0$, in practice a parameter close to the MLE is essential for the MCMLE algorithm to work successfully (Hummel et al., 2012). The commonly used starting value for $\theta_0$ is the MPLE, since it is simple to calculate via logistic regression; even though other methods have been proposed (e.g., Krivitsky, 2017), the MPLE remains the predominant choice. As shown in Figure 1, however, the MPLE can in some cases be very different from the MLE, which typically results in the failure of the MCMLE algorithm.
In our 9-node example, the MCMLE algorithm failed for about a third of the MPLE starting values shown in Figure 1.

Failure of the MCMLE algorithm: An Illustrative Example
We demonstrate the problems with the MPLE as a starting value for the MCMLE algorithm by taking a particular network on 9 nodes with 18 edges and 13 triangles, as depicted in Figure 2. The MPLE of this particular network is $\hat\theta_{\rm MPLE} = (-1.3, 0.702)$, while the MLE can be calculated exactly as $\hat\theta_{\rm MLE} = (-0.623, 0.337)$. Table 1 shows MCMLE results for the same network, where the algorithm was initialized with a different $\hat\theta$ value at each trial, drawn from among the values depicted in Figure 1.
In four out of ten trials, the algorithm stopped due to model degeneracy. As first sketched by Handcock (2001) and later studied in detail by Schweinberger (2011), degeneracy occurs when the distribution defined by $P_{\hat\theta}$ (ostensibly the best possible parameter value, in some sense) places most of its probability mass on just a few networks, usually the empty and the full network. In this scenario, the simulated networks are so different from the observed network that the MCMLE algorithm fails. The remaining six trials are shown in Table 1.

[Table 1: MCMLE results for the network of Figure 2 using Model (7) for ten independent trials.]

Simulated Annealing and MPLE
As noted earlier, Hummel et al. (2012) show that a poorly chosen $\theta_0$ can make successful MCML estimation almost impossible. Consequently, the more the MPLE differs from the MLE, the more difficult it is to find the MCMLE, since the MPLE is commonly taken as the algorithm's starting value $\theta_0$ due to its simple and fast calculation. Inspired by Figure 1, we propose a novel approach for finding an improved starting value for the MCMLE algorithm: search for networks that yield the same, or at least nearly the same, statistics as the observed network, then consider these networks' MPLEs as potential starting values.
We propose searching for networks with the same statistics as the observed network using the simulated annealing algorithm (Kirkpatrick et al., 1983). This algorithm was originally inspired by the practice of annealing metal, in which the material is heated and then cooled very slowly, allowing it to be gradually shaped into the desired form. The heating and cooling process translates into the simulated annealing approach as follows: early on (during the "heated" phase), interim results are allowed to be worse than previous results, so the algorithm behaves more stochastically than deterministically. Gradually, the algorithm "cools," becoming more deterministic and less prone to random jumps. The hope is that this combination of an initial random phase followed by increasingly deterministic behavior keeps the algorithm from getting stuck at local optima, the way a purely deterministic optimization algorithm might, and allows it to find and then focus on regions of the search space that contain globally optimal solutions.
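To make the idea concrete, here is a minimal sketch of such a search for the 9-node example of Section 3, under illustrative assumptions: the objective is the squared distance between current and target statistics, one uniformly random dyad is toggled per step, and the step budget and cooling schedule are arbitrary tuning choices (more steps or slower cooling may be needed to hit the target exactly).

```r
library(ergm)   # provides network objects and summary() of ERGM statistics

# Squared distance between a network's statistics and the target vector.
stat_dist <- function(net, target) {
  sum((summary(net ~ edges + triangle) - target)^2)
}

anneal_to_stats <- function(net, target, steps = 50000,
                            temp0 = 1, cooling = 0.9995) {
  dist <- stat_dist(net, target)
  temp <- temp0
  n <- network.size(net)
  for (s in seq_len(steps)) {
    if (dist == 0) break                            # exact match found
    ij <- sample(n, 2)                              # propose toggling one dyad
    net[ij[1], ij[2]] <- 1 - net[ij[1], ij[2]]
    new_dist <- stat_dist(net, target)
    # Accept improvements always; accept worse moves with a probability
    # that shrinks as the temperature cools.
    if (new_dist <= dist || runif(1) < exp((dist - new_dist) / temp)) {
      dist <- new_dist
    } else {
      net[ij[1], ij[2]] <- 1 - net[ij[1], ij[2]]    # reject: undo the toggle
    }
    temp <- temp * cooling
  }
  net
}

set.seed(1)
start <- network(9, directed = FALSE)               # empty 9-node network
found <- anneal_to_stats(start, target = c(18, 13))
summary(found ~ edges + triangle)                   # ideally 18 edges, 13 triangles
```

The MPLE of each network found this way is then a candidate starting value $\theta_0$.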

Applications
We demonstrate the simulated annealing-based approach to MCMLE using the E. coli transcriptional regulation network of Shen-Orr et al. (2002), which is based on the RegulonDB data of Salgado et al. (2001). The nodes in this network represent operons, while an edge from operon i to operon j indicates that i encodes a transcription factor that regulates j. Even though this is originally a directed network containing self-edges, i.e., operons that regulate themselves, we follow Saul and Filkov (2007) and treat the network as undirected and without self-edges. This results in a network with 418 nodes and 519 edges. We study the same ERGM on these data as Hummel et al. (2012). As will become clear below, it is advantageous to find a network that resembles an Erdős-Rényi network and that, in addition, yields the same sufficient statistics as the observed network.
Simulating networks from the probability distribution defined by the obtained MCMLE results in networks that resemble the rightmost network in Figure 4, rather than the observed network. We repeat the simulation study that resulted in Figure 3, except that we now start the simulated annealing algorithm at an Erdős-Rényi-generated network with density similar to that of the original network, i.e., with p defined as the ratio of the number of observed ties to the number of possible ties. With this modification, the simulated annealing algorithm only finds networks whose MPLEs are in the proximity of the MCMLE, meaning that the MPLE of any of these networks is an improved starting value for the MCMLE algorithm compared to the original MPLE.

[Figure 4: Three networks, all with the same sufficient statistics vector; the center network's MPLE falls into one MPLE cluster of Figure 3, and the right network's MPLE falls into the other.]
In summary, we suggest finding an improved starting value $\theta_0$ in the following way (a hypothetical end-to-end sketch follows this list):

1. For a given network $A_{\rm obs}$, compute the sufficient statistics $T(A_{\rm obs})$.
2. Simulate a network $G$ from an Erdős-Rényi model on the same nodes as $A_{\rm obs}$, with $p$ given by the edge density of $A_{\rm obs}$.
3. Starting from $G$, use simulated annealing to search for a network $A^*$ whose statistics $T(A^*)$ equal, or nearly equal, $T(A_{\rm obs})$.
4. Use the MPLE of $A^*$ as the starting value $\theta_0$ for the MCMLE algorithm.
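The following sketch strings these steps together, reusing the illustrative anneal_to_stats() function from the previous section; A_obs stands for an observed network object, and the edges-plus-triangles model is again only an example:

```r
library(ergm)
set.seed(2)

# Step 1: sufficient statistics of the observed network (A_obs assumed given).
obs_stats <- summary(A_obs ~ edges + triangle)

# Step 2: an Erdos-Renyi network with p equal to the observed edge density.
n <- network.size(A_obs)
m <- matrix(0, n, n)
m[upper.tri(m)] <- rbinom(n * (n - 1) / 2, 1, network.density(A_obs))
G <- as.network(m + t(m), directed = FALSE)

# Step 3: anneal G toward the observed statistics (function sketched above).
A_star <- anneal_to_stats(G, target = obs_stats)

# Step 4: use the MPLE of A_star to initialize MCMLE for the observed network.
theta0 <- coef(ergm(A_star ~ edges + triangle, estimate = "MPLE"))
fit <- ergm(A_obs ~ edges + triangle,
            control = control.ergm(init = theta0))
```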

This approach was successfully applied to a citation network based on all US Supreme Court majority opinions written between 1937 and 2015. The majority opinion of each court case is considered a node, and a citation from case i to case j is defined as a directed tie. The network includes 10,020 nodes, and the ERGM used includes 14 endogenous and exogenous statistics, complicated enough to be computationally challenging. Ultimately, that study found that multiple different approaches fail to yield a successful MCMLE, among them the stepping method of Hummel et al. (2012) and the standard MCMLE algorithm using the MPLE as the starting value. Yet the approach described here, which in this application reproduces the observed network statistics only approximately because several continuous variables are included among those statistics, does produce a successful MCMLE; it is the only known method that does so in this example.

Discussion
The basic idea of this article, searching for networks that have the same, or approximately the same, ERGM statistics as the observed network and then using the MPLEs of these networks as starting values for a traditional MCMLE algorithm, is a simple one. Yet it has already demonstrated its value in solving previously unsolvable ERGM-based estimation problems.
There is still much to learn about this methodology. For instance, is there a way to implement Rao-Blackwellization, and does this approach lead to estimates that behave more like the MLE? Also, how important is it to search the sample space of networks thoroughly, and is there an alternative to simulated annealing for this purpose? Figure 3 suggests that initializing the simulated annealing algorithm at the observed network versus at an Erdős-Rényi-generated network produces two wholly distinct clouds of MPLE values. This raises the question of whether the set of all possible MPLE values for a particular vector of statistics is actually bimodal, or whether there is instead a vast, as-yet-unexplored set of MPLE values that could be of use both as starting values for MCMLE and as sample points in a Rao-Blackwellization scheme.
That such a simple idea can prove so effective relative to all other known methods suggests that there exists immense untapped potential for improving upon approximate likelihood-based inference using MPLEs.