The importance sampling technique for understanding rare events in Erd\H{o}s-R\'enyi random graphs

In dense Erd\H{o}s-R\'enyi random graphs, we are interested in the events where large numbers of a given subgraph occur. The mean behavior of subgraph counts is known, and only recently were the related large deviations results discovered. Consequently, it is natural to ask, can one develop efficient numerical schemes to estimate the probability of an Erd\H{o}s-R\'enyi graph containing an excessively large number of a fixed given subgraph? Using the large deviation principle we study an importance sampling scheme as a method to numerically compute the small probabilities of large triangle counts occurring within Erd\H{o}s-R\'enyi graphs. We show that the exponential tilt suggested directly by the large deviation principle does not always yield an optimal scheme. The exponential tilt used in the importance sampling scheme comes from a generalized class of exponential random graphs. Asymptotic optimality, a measure of the efficiency of the importance sampling scheme, is achieved by a special choice of the parameters in the exponential random graph that makes it indistinguishable from an Erd\H{o}s-R\'enyi graph conditioned to have many triangles in the large network limit. We show how this choice can be made for the conditioned Erd\H{o}s-R\'enyi graphs both in the replica symmetric phase as well as in parts of the replica breaking phase to yield asymptotically optimal numerical schemes to estimate this rare event probability.


Introduction
In this paper we study the use of importance sampling schemes to numerically estimate the probability that an Erdős-Rényi random graph contains an unusually large number of triangles. Consider an Erdős-Rényi random graph $G_{n,p}$ on $n$ vertices with edge probability $p \in (0, 1)$. For a simple graph $X$ on $n$ vertices, let $T(X)$ denote the number of triangles in $X$. For $p$ fixed, one can show that $E[T(G_{n,p})] \sim \binom{n}{3} p^3$ as $n \to \infty$. For $t > p$, what is the probability
$$\mu_n = P\left(T(G_{n,p}) \geq \binom{n}{3} t^3\right) \tag{1.1}$$
that $G_{n,p}$ has an atypically large number of triangles? The last few years have witnessed a number of deep results on such questions about upper tails of triangle counts, along with more general subgraph densities (see e.g., [3, 6-9, 13, 16]). In the dense graph case, where the edge probability $p$ stays fixed as $n \to \infty$, Chatterjee and Dey [7] derived a large deviation principle (LDP) for the rare event $\{T(G_{n,p}) \geq \binom{n}{3} t^3\}$, showing that for $t$ within a certain subset of $(p, 1]$,
$$P\left(T(G_{n,p}) \geq \binom{n}{3} t^3\right) = \exp\left(-n^2 I_p(t)\left(1 + O(n^{-1/2})\right)\right) \tag{1.2}$$
where the rate function $I_p(t)$ is given by
$$I_p(t) = \frac{1}{2}\left[t \log\frac{t}{p} + (1 - t)\log\frac{1 - t}{1 - p}\right]. \tag{1.3}$$
More recently, Chatterjee and Varadhan [8] showed a general large deviations principle for dense Erdős-Rényi graphs, using the theory of limits of dense random graph sequences developed by Lovász et al. [3,14,15]. When specialized to upper tails of triangle counts, the large deviation principle shows that for the range of $(p, t)$ considered in (1.2), the Erdős-Rényi graph $G_{n,p}$ conditioned on the rare event $\{T(G_{n,p}) \geq \binom{n}{3} t^3\}$ is asymptotically indistinguishable from another Erdős-Rényi graph $G_{n,t}$ with edge probability $t$, in the sense that a typical graph drawn from the conditioned Erdős-Rényi graph resembles a typical graph drawn from $G_{n,t}$ when $n$ is large. (Asymptotic indistinguishability is explained more precisely at (2.11).) While this seems plausible for any $t > p$ since $E[T(G_{n,t})] \sim \binom{n}{3} t^3$ as $n \to \infty$, it is not always the case.
Depending on $p$ and $t$, it may be that the graph $G_{n,p}$ conditioned on the event $\{T(G_{n,p}) \geq \binom{n}{3} t^3\}$ tends to form cliques and hence does not resemble an Erdős-Rényi graph. When the conditioned graph does resemble an Erdős-Rényi graph, we say that $(p, t)$ is in the replica symmetric phase. On the other hand, when the conditioned graph is not asymptotically indistinguishable from an Erdős-Rényi graph, we say that $(p, t)$ is in the replica breaking phase. (See Definition 2.2.) Our approach to this problem is from a computational perspective: we study the use of importance sampling schemes for numerically estimating the probability $\mu_n$, and we determine the schemes that perform optimally for those $(p, t)$ in the replica symmetric phase as well as in a subset of the replica breaking phase.
The exponential decay of the probability of the event of interest makes it difficult to estimate this probability even for moderately large n. Direct Monte Carlo sampling is obviously intractable. The central strategy of importance sampling is to sample from a different probability measure, the tilted measure, under which the event of interest is no longer rare; one obtains more successful samples falling in the event of interest but each sample must then be weighted appropriately according to the Radon-Nikodym derivative of the original measure against the tilted measure. Importance sampling techniques have been used in many other stochastic systems, such as SDEs and Markov processes and queuing systems, see e.g [2,4,10,12,20] and the references therein. In particular, when a large deviations principle is known for the stochastic system, the tilted measure commonly used is a change of measure arising from the LDP. However, not every tilted measure associated with the LDP works well. It is well known that a poorly chosen tilted measure can lead to an estimator that performs worse than Monte Carlo sampling, or whose variance blows up [11]. Thus, a careful choice of tilted measure is of utmost importance.
Given $(p, t)$, the $G_{n,t}$-measure works as a tilted measure by ramping up the edge probability of the samples; we shall refer to $G_{n,t}$ as an edge tilt. As we will see later on, even when the LDP suggests that $G_{n,t}$ is asymptotically equivalent to the conditioned $G_{n,p}$ graph, the edge tilt is not necessarily a good tilted measure for estimating the probability $\mu_n$. It turns out that the class of measures associated with Erdős-Rényi graphs is too limited, so we must broaden it to the class of exponential random graphs. Exponential random graphs are generally defined via a Gibbs measure. In the context of estimating rare events for triangles, one need only consider Gibbs measures involving edge and triangle counts. Hence, consider the exponential random graphs $G^{h,\beta,\alpha}_n$ defined via the Gibbs measure $Q_n = Q^{h,\beta,\alpha}_n$ on the space of simple graphs on $n$ vertices, where
$$Q_n(X) \propto e^{H(X)}, \qquad H(X) = hE(X) + \frac{\beta}{n}\left(\frac{n^3}{6}\right)^{1-\alpha} T(X)^{\alpha},$$
with parameters $h \in \mathbb{R}$ and $\beta, \alpha > 0$. Here $E(X)$ is the number of edges in the graph $X$. Given $(p, t)$, a special choice of Gibbs measure $Q^{h,\beta,\alpha}_n$ is what we will call a triangle tilt, which works by ramping up the probability of triangles. We defer the full definition of the triangle tilt to Definition 2.12 in Section 2.2. We shall show that in a number of different regimes, the triangle tilt is the best possible tilt, in an asymptotic sense. In this sense, the class of exponential random graphs is sufficiently rich to ensure the existence of an optimal triangle tilt even for a subset of the replica breaking phase.
To understand why the class of exponential random graphs is the right class to consider, we make a digression to mention the connection between exponential random graphs and the conditioned Erdős-Rényi graphs. Exponential random graphs have been studied in [1,5,16,17]. The connection between the "classical" exponential random graphs with $\alpha = 1$ and conditioned Erdős-Rényi graphs was initially observed by Chatterjee and Dey [7] when proving the large deviations principle (1.2), and it was further developed by Lubetzky and Zhao [16] for $\alpha > 0$. An interesting observation in [16], for the case when $(p, t)$ belongs to the replica symmetric phase, is the connection between the free energy of the Gibbs measure and the derivative of the rate function. This connection leads to the following duality relationship between certain parameters $(h, \beta, \alpha)$ of the Gibbs measure and the parameters $(p, t)$ of the conditioned Erdős-Rényi graph: for $(p, t)$ that is replica symmetric and $\alpha \in [2/3, 1]$, the typical exponential random graph resembles the conditioned Erdős-Rényi graph if $h = \log\frac{p}{1-p}$ and the free energy of the Gibbs measure $Q^{h,\beta,\alpha}_n$, expressed in the variational formulation
$$\sup_{0 \leq u \leq 1}\left[\frac{\beta}{6} u^{3\alpha} - I_p(u)\right],$$
where $I_p$ is the rate function at (1.3), is maximized at $t$. One of our main results, Theorem 2.5, extends this duality into the replica breaking phase, and generalizes the way of characterizing when an exponential random graph resembles a conditioned Erdős-Rényi graph. The gist of Theorem 2.5 and its immediate consequence is described, heuristically, as follows: Fix $p \in (0, 1)$ and $t \in [p, 1]$, and let $h_p = \log\frac{p}{1-p}$. Suppose there exist $\beta > 0$ and $\alpha \in [0, 1]$ such that
$$\sup_{0 \leq u \leq 1}\left[\frac{\beta}{6} u^{3\alpha} - \phi_p(u)\right] \text{ is attained uniquely at } u = t, \tag{1.4}$$
where the function $\phi_p(u)$ is a rate function (see (2.24)). Then the exponential random graph $G^{h_p,\beta,\alpha}_n$ and the Erdős-Rényi graph $G_{n,p}$ conditioned on $T(G_{n,p}) \geq \binom{n}{3} t^3$ are asymptotically indistinguishable.
Thus, Theorem 2.5 provides a way to characterize the asymptotic behaviour of the conditioned Erdős-Rényi graph by that of an exponential random graph.
Apart from their independent interest, Theorem 2.5 and the variational form (1.4) are the basis for choosing the parameters of the Gibbs measure that defines the triangle tilt. In essence, the triangle tilt can be defined when there exists $(h_p, \beta, \alpha)$ for which (1.4) holds, which is the case for the replica symmetric phase and at least a nontrivial subset of the replica breaking phase.
Returning to the question of the efficiency of an importance sampling scheme, one measure of efficiency is the magnitude of the variance of the importance sampling estimator. In the presence of a large deviation principle, we appeal to the notion of asymptotic optimality, which is the property that the importance sampling estimator has the smallest attainable variance as $n \to \infty$. (See Definition 2.3.) Our main results pertain to the asymptotic optimality or non-optimality of certain importance sampling estimators. In Proposition 3.1 we prove a necessary condition for asymptotic optimality when the tilt is based on an exponential random graph: the exponential random graph must be asymptotically indistinguishable from the conditioned Erdős-Rényi graph. In particular, if $(p, t)$ belongs to the replica symmetric phase, then the necessary condition is that the exponential random graph is indistinguishable from $G_{n,t}$. On the other hand, Proposition 3.2 shows that this is not a sufficient condition for asymptotic optimality: there is a subregime of the replica symmetric phase for which the edge tilt produces a suboptimal estimator. It is interesting to note that although the LDP suggests that $G_{n,t}$ describes the typical behaviour of the conditioned Erdős-Rényi graph in the replica symmetric phase, directly using $G_{n,t}$ as the importance sampling tilt does not necessarily give an optimal estimator. Instead, we must be careful to use a tilt that not only is indistinguishable from the conditioned Erdős-Rényi graph, but also gives an asymptotically optimal estimator. It turns out that the triangle tilts are the appropriate tilts to use, and this fact is the statement of our main optimality result, which we state here.
Theorem 1.1. Suppose there exists a triangle tilt $Q^{h_p,\beta,\alpha}_n$ with parameter $\alpha > 0$ corresponding to $(p, t)$, as defined in Definition 2.12. Then the importance sampling estimator based on the tilted measure $Q^{h_p,\beta,\alpha}_n$ is asymptotically optimal.
Organization of the paper: We start by giving precise definitions of the various constructs arising in our study in Section 2. This culminates in Theorem 2.5 that characterizes the limiting free energy of the exponential random graph model. The rest of Section 2 is devoted to drawing a connection between the exponential random graph and Erdős-Rényi random graph conditioned on an atypical number of triangles, leading to the derivation of the triangle tilts. Section 3 discusses and proves our main results on asymptotic optimality or non-optimality of the importance sampling estimators. In Section 4, we carry out numerical simulations on moderate size networks using the various proposed tilts to illustrate and compare the viability of the importance sampling schemes.
Acknowledgement This work was funded in part through the 2011-2012 SAMSI Program on Uncertainty Quantification, in which each of the authors participated. JN was partially supported by grant NSF-DMS 1007572. SB was partially supported by grant NSF-DMS 1105581.

Large deviations, importance sampling and exponential random graphs
A simple graph $X$ on $n$ vertices can be represented as an element of the space $\Omega_n = \{0, 1\}^{\binom{n}{2}}$. A graph $X \in \Omega_n$ will be denoted by $X = (X_{ij})_{1 \leq i < j \leq n}$, with the entry $X_{ij}$ indicating the presence or absence of an edge between vertices $i$ and $j$. For a given edge probability $p \in [0, 1]$, an Erdős-Rényi random graph $G_{n,p}$ is a graph on $n$ vertices in which each edge is present independently with probability $p$. We shall use $P_{n,p}$ to represent the probability measure on $\Omega_n$ induced by the Erdős-Rényi graph $G_{n,p}$. The probability of a fixed graph $X$ under the measure $P_{n,p}$ can be explicitly computed as
$$P_{n,p}(X) = p^{E(X)}(1 - p)^{\binom{n}{2} - E(X)} = e^{h_p E(X)}(1 - p)^{\binom{n}{2}}, \tag{2.1}$$
where $h_p := \log\frac{p}{1-p}$, and $E(X) := \sum_{i<j} X_{ij}$ is the number of edges in $X$. Let $T(X)$ denote the number of triangles in the graph $X$:
$$T(X) := \sum_{i<j<k} X_{ij} X_{jk} X_{ik}.$$
Also let the event
$$W_{n,t} = \left\{X \in \Omega_n \;\middle|\; T(X) \geq \binom{n}{3} t^3\right\}$$
denote the upper tails of triangle counts.
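As a concrete companion to these definitions, the following Python sketch stores a graph as a symmetric 0/1 adjacency matrix and computes the edge count, the triangle count, and the log-probability of a fixed graph under $P_{n,p}$, using the exponential form $P_{n,p}(X) = e^{h_p E(X)}(1-p)^{\binom{n}{2}}$ with $h_p = \log\frac{p}{1-p}$. The function names are illustrative only and do not come from the text.

```python
import itertools
import math
import random


def sample_er_graph(n, p, rng):
    """Sample G(n, p) as a symmetric 0/1 adjacency matrix (illustrative helper)."""
    adj = [[0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        if rng.random() < p:  # each edge is present independently with probability p
            adj[i][j] = adj[j][i] = 1
    return adj


def edge_count(adj):
    """E(X): sum of X_ij over pairs i < j."""
    n = len(adj)
    return sum(adj[i][j] for i, j in itertools.combinations(range(n), 2))


def triangle_count(adj):
    """T(X): sum of X_ij * X_jk * X_ik over triples i < j < k."""
    n = len(adj)
    return sum(adj[i][j] * adj[j][k] * adj[i][k]
               for i, j, k in itertools.combinations(range(n), 3))


def log_prob_er(adj, p):
    """log P_{n,p}(X) = h_p * E(X) + C(n, 2) * log(1 - p), for 0 < p < 1."""
    n = len(adj)
    h_p = math.log(p / (1 - p))
    return h_p * edge_count(adj) + math.comb(n, 2) * math.log(1 - p)
```

The brute-force triangle count costs on the order of $n^3$ operations, which is adequate for the moderate graph sizes where such checks are useful.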
If $\{X_k\}_{k=1}^{\infty} \subset \Omega_n$ is a sequence of Erdős-Rényi random graphs generated independently from $P_{n,p}$, then for any integer $K \geq 1$,
$$M_K = \frac{1}{K}\sum_{k=1}^{K} 1_{W_{n,t}}(X_k)$$
is an unbiased estimate of $\mu_n$. By the law of large numbers, $M_K \to \mu_n$ with probability one as $K \to \infty$. Although this estimate of $\mu_n$ is very simple, the relative error is
$$\frac{\sqrt{\mathrm{Var}(M_K)}}{\mu_n} = \sqrt{\frac{1 - \mu_n}{K \mu_n}},$$
which scales like $(K\mu_n)^{-1/2}$ as $\mu_n \to 0$. Hence the relative error may be very large in the large deviation regime where $\mu_n \ll 1$, unless the number of samples $K$ is at least of order $\mu_n^{-1}$. Therefore, it is desirable to devise an estimate of $\mu_n$ which, compared to this simple Monte Carlo estimate, attains the same accuracy with fewer samples or lower computational cost.
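The simple Monte Carlo estimate described here, the fraction of independent $G_{n,p}$ samples whose triangle count reaches $\binom{n}{3}t^3$, can be sketched in a few lines of Python. This is only a minimal illustration for small $n$; the helper names and the `seed` parameter are assumptions introduced for the example.

```python
import itertools
import math
import random


def sample_er_graph(n, p, rng):
    """Sample G(n, p) as a symmetric 0/1 adjacency matrix (illustrative helper)."""
    adj = [[0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        if rng.random() < p:
            adj[i][j] = adj[j][i] = 1
    return adj


def triangle_count(adj):
    """Brute-force count of triangles."""
    n = len(adj)
    return sum(adj[i][j] * adj[j][k] * adj[i][k]
               for i, j, k in itertools.combinations(range(n), 3))


def naive_monte_carlo(n, p, t, num_samples, seed=0):
    """Fraction of G(n, p) samples with T(X) >= C(n, 3) * t^3."""
    rng = random.Random(seed)
    threshold = math.comb(n, 3) * t ** 3
    hits = sum(triangle_count(sample_er_graph(n, p, rng)) >= threshold
               for _ in range(num_samples))
    return hits / num_samples
```

When the target probability is far smaller than one over the number of samples, this estimator will typically return exactly zero, which is the practical face of the $(K\mu_n)^{-1/2}$ relative error.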
Importance sampling is a Monte Carlo algorithm based on a change of measure. Suppose that $P_{n,p}$ is absolutely continuous with respect to another measure $Q$ on $\Omega_n$ with $\frac{dP_{n,p}}{dQ} = Y^{-1} : \Omega_n \to \mathbb{R}$.
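Then we have
$$\mu_n = E_Q\left[1_{W_{n,t}}(\tilde{X})\, Y^{-1}(\tilde{X})\right],$$
where $E_Q$ denotes expectation with respect to $Q$, and we now use $\{\tilde{X}_k\}_{k=1}^{\infty}$ to denote a set of random graphs sampled independently from the new measure $Q$. If we define
$$\tilde{M}_K = \frac{1}{K}\sum_{k=1}^{K} 1_{W_{n,t}}(\tilde{X}_k)\, Y^{-1}(\tilde{X}_k), \tag{2.3}$$
then $\tilde{M}_K$ is also an unbiased estimate of $\mu_n$, and the relative error is now
$$\frac{\sqrt{\mathrm{Var}(\tilde{M}_K)}}{\mu_n} = \frac{1}{\sqrt{K}\,\mu_n}\sqrt{E_Q\left[\left(1_{W_{n,t}} Y^{-1}\right)^2\right] - \mu_n^2}. \tag{2.4}$$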
Formally this is optimized by the choice $Y = \mu_n^{-1} 1_{W_{n,t}}(X)$, in which case the relative error is zero. Such a choice for $Q$ is not feasible, however, since normalizing $Y$ would require a priori knowledge of $\mu_n = P_{n,p}(W_{n,t})$. Intuitively, we should choose the tilted measure $Q$ so that $\tilde{X}_k \in W_{n,t}$ occurs with high probability under $Q$.
We will refer to Y −1 as the importance sampling weights, and Q as the tilted measure, or tilt. If Q arises naturally as the measure induced by a random graph G n , we will also refer to G n as the tilt.
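To make the change of measure concrete, here is a minimal Python sketch of the importance sampling estimator when the tilt is itself an Erdős-Rényi measure $P_{n,q}$ with a different edge probability $q$ (taking $q = t$ gives what the introduction calls the edge tilt). In that case the weight is available in closed form, $\frac{dP_{n,p}}{dP_{n,q}}(X) = \left(\frac{p}{q}\right)^{E(X)}\left(\frac{1-p}{1-q}\right)^{\binom{n}{2}-E(X)}$. Function and parameter names are hypothetical.

```python
import itertools
import math
import random


def sample_er_graph(n, q, rng):
    """Sample G(n, q) as a symmetric 0/1 adjacency matrix (illustrative helper)."""
    adj = [[0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        if rng.random() < q:
            adj[i][j] = adj[j][i] = 1
    return adj


def triangle_count(adj):
    n = len(adj)
    return sum(adj[i][j] * adj[j][k] * adj[i][k]
               for i, j, k in itertools.combinations(range(n), 3))


def importance_sampling_er_tilt(n, p, q, t, num_samples, seed=0):
    """Estimate P_{n,p}(T >= C(n,3) t^3) by sampling from P_{n,q}, 0 < p, q < 1."""
    rng = random.Random(seed)
    num_pairs = math.comb(n, 2)
    threshold = math.comb(n, 3) * t ** 3
    total = 0.0
    for _ in range(num_samples):
        adj = sample_er_graph(n, q, rng)
        if triangle_count(adj) >= threshold:
            edges = sum(sum(row) for row in adj) // 2
            # log of the weight dP_{n,p}/dP_{n,q}, a function of the edge count only
            log_w = (edges * math.log(p / q)
                     + (num_pairs - edges) * math.log((1 - p) / (1 - q)))
            total += math.exp(log_w)
    return total / num_samples
```

For $n = 4$, $p = 0.3$, $t = 0.9$ the event reduces to observing the complete graph, so the exact value is $0.3^6$; sampling with $q = 0.9$ lands in the event more than half the time, whereas sampling from $P_{n,p}$ directly would do so about once in 1400 draws.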
Asymptotic optimality and large deviations. In view of (2.3), one way of comparing the efficiency of importance sampling estimators is to look at which estimator has the smaller second moment. When the family of measures P n,p possesses a large deviations principle, a notion of asymptotic optimality (or asymptotic efficiency) of the estimator M K can be defined with the interpretation that the second moment of the estimator is the smallest possible in the asymptotic sense, as afforded by the large deviation principle ( [4] and see also Definition 2.3). Thus, before defining asymptotic optimality, we first proceed with a description of the large deviations principle for Erdős-Rényi random graphs.
In the context of Erdős-Rényi random graphs, Chatterjee and Varadhan [8] have proved a general large deviation principle which is based on the theory of dense graph limits developed by [3]. In this framework, a random graph is represented as a function $X(x, y) \in \mathcal{W}$, where $\mathcal{W}$ is the set of all measurable functions $f : [0, 1]^2 \to [0, 1]$ satisfying $f(x, y) = f(y, x)$. Specifically, a finite simple graph $X$ on $n$ vertices is represented by the function, or graphon,
$$X(x, y) = X_{\lceil nx \rceil \lceil ny \rceil}.$$
Here we treat $(X_{ij})$ as a symmetric matrix with entries in $\{0, 1\}$ and $X_{ii} = 0$ for all $i$. In general, for a function $f \in \mathcal{W}$, $f(x, y)$ can be interpreted as the probability of having an edge between vertices $x$ and $y$. Then, we define the quotient space $\widetilde{\mathcal{W}}$ under the equivalence relation defined by $f \sim g$ if $f(x, y) = g(\sigma x, \sigma y)$ for some measure preserving bijection $\sigma : [0, 1] \to [0, 1]$. Intuitively, an equivalence class contains graphons that are equal after a relabelling of vertices. (See, e.g., [3,8] for further exploration and properties of the quotient space.) By identifying a finite graph $X$ with its graphon representation, we can consider the probability measure $P_{n,p}$ as a measure induced on $\widetilde{\mathcal{W}}$ supported on the subset of graphons of finite graphs. For $f \in \widetilde{\mathcal{W}}$, denote
$$\mathcal{E}(f) = \int_0^1\!\!\int_0^1 f(x, y)\, dx\, dy \tag{2.6}$$
and
$$\mathcal{T}(f) = \int_0^1\!\!\int_0^1\!\!\int_0^1 f(x, y) f(y, z) f(x, z)\, dx\, dy\, dz. \tag{2.7}$$
We see that $E(X) = \frac{n^2}{2}\mathcal{E}(X)$ and $T(X) = \frac{n^3}{6}\mathcal{T}(X)$, so that $\mathcal{E}$ and $\mathcal{T}$ represent edge and triangle densities of the graph $X$, respectively. Then, rather than considering the event $W_{n,t}$, we shall equivalently consider the upper tails of triangle densities,
$$W_t = \left\{f \in \widetilde{\mathcal{W}} \;\middle|\; \mathcal{T}(f) \geq t^3\right\}.$$
The large deviation principle of Chatterjee and Varadhan [8] implies, for any $p \in (0, 1)$ and $t \in [p, 1]$,
$$\lim_{n \to \infty} \frac{1}{n^2}\log P_{n,p}(W_t) = -\phi(p, t),$$
where $\phi(p, t)$ is the large deviation decay rate given by a variational form,
$$\phi(p, t) = \inf_{f \in W_t} I_p(f).$$
Here,
$$I_p(f) = \int_0^1\!\!\int_0^1 I_p(f(x, y))\, dx\, dy \tag{2.10}$$
is the large deviation rate function, where $I_p : [0, 1] \to \mathbb{R}$ is defined at (1.3). A further important consequence of the large deviation principle concerns the typical behaviour of the conditioned probability measure $P_{n,p}(X \mid W_t) = P_{n,p}(X)\, 1_{W_t}(X)\, \mu_n^{-1}$.
When we refer to $G_{n,p}$ conditioned on the event $\mathcal{T}(f) \geq t^3$, we mean the random graph whose law is given by this conditioned probability measure. The term "asymptotically indistinguishable" in Lemma 2.1 roughly means that the graphon representation of the graph converges in probability, under the cut distance metric, to the constant function $u^*$ at an exponential rate as $n \to \infty$. Intuitively, this means that the typical conditioned Erdős-Rényi graph resembles some graph $f^* \in F^*$ for large $n$. In order to give a more precise definition of asymptotic indistinguishability, we first recall the cut distance metric $\delta_\square$, defined for $f, g \in \widetilde{\mathcal{W}}$ by
$$\delta_\square(f, g) = \inf_{\sigma}\, \sup_{S, T \subset [0, 1]} \left|\int_{S \times T} \left[f(\sigma x, \sigma y) - g(x, y)\right] dx\, dy\right|,$$
where the infimum is taken over all measure-preserving bijections $\sigma : [0, 1] \to [0, 1]$. It is known by [14] that $(\widetilde{\mathcal{W}}, \delta_\square)$ is a compact metric space. We say that a random graph $G_n$ on $n$ vertices is asymptotically indistinguishable from a subset $F \subset \widetilde{\mathcal{W}}$ if for any $\epsilon > 0$,
$$\limsup_{n \to \infty} \frac{1}{n^2}\log P\left(\delta_\square(G_n, F) > \epsilon\right) < 0. \tag{2.11}$$
Further, we say that $G_n$ is asymptotically indistinguishable from the minimal set $F \subset \widetilde{\mathcal{W}}$ if $F$ is the smallest closed subset of $\widetilde{\mathcal{W}}$ that $G_n$ is asymptotically indistinguishable from. Clearly, if $G_n$ is asymptotically indistinguishable from a singleton set $F$, then $F$ is, trivially, minimal. Finally, we say two random graphs $G^1_n$, $G^2_n$ are asymptotically indistinguishable if they are each asymptotically indistinguishable from the same minimal set $F \subset \widetilde{\mathcal{W}}$. Intuitively, this means that the random behaviour, or the typical graphs, of $G^1_n$ resembles that of $G^2_n$ for large $n$. (See [5] and [8] for a wide-ranging exploration of this metric in the context of describing limits of dense random graph sequences.) Using this terminology, we observe that an Erdős-Rényi graph $G_{n,u}$ is asymptotically indistinguishable from the singleton set containing the constant function $f^* \equiv u$. The question of whether the conditioned Erdős-Rényi graph is again an Erdős-Rényi graph leads to the following definition.
Definition 2.2. The replica symmetric phase is the regime of parameters (p, t) for which the large deviations rate satisfies (2.12) and the infimum is uniquely attained at the constant function t.
The replica breaking phase is the regime of parameters (p, t) that are not in the replica symmetric phase.
Hence, the notion of replica symmetry is a property of the rare event problem, where, conditioned on the event $\{\mathcal{T}(f) \geq t^3\}$, the Erdős-Rényi graph $G_{n,p}$ behaves like another Erdős-Rényi graph $G_{n,t}$ with the higher edge density $t$, for large $n$. In contrast, the conditioned graph in the replica breaking phase is not an Erdős-Rényi graph, and has been conjectured to exhibit a clique-like structure with edge density less than $t$. The term "replica symmetric phase" is borrowed from [8], which in turn was inspired by the statistical physics literature. However, we remark that this term has been used by different authors to refer to other instances of graphs behaving like an Erdős-Rényi graph.
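The large deviations principle gives us an estimate of the relative error in the estimate $\tilde{M}_K$. For any fixed $K$, it is clear from (2.4) that minimizing the relative error is equivalent to minimizing the second moment $E_{Q_n}[(1_{W_t} Y^{-1})^2]$. By Jensen's inequality, $E_{Q_n}[(1_{W_t} Y^{-1})^2] \geq (E_{Q_n}[1_{W_t} Y^{-1}])^2 = \mu_n^2$, so we have the following asymptotic lower bound:
$$\liminf_{n \to \infty} \frac{1}{n^2}\log E_{Q_n}\left[\left(1_{W_t} Y^{-1}\right)^2\right] \geq \lim_{n \to \infty} \frac{2}{n^2}\log \mu_n = -2\phi(p, t).$$
This leads to the definition of asymptotic optimality.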
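In contrast, the second moment of each term in the simple Monte Carlo method satisfies
$$\lim_{n \to \infty} \frac{1}{n^2}\log E_{P_{n,p}}\left[\left(1_{W_t}\right)^2\right] = \lim_{n \to \infty} \frac{1}{n^2}\log \mu_n = -\phi(p, t),$$
which is strictly larger than $-2\phi(p, t)$ whenever $\phi(p, t) > 0$. Thus, the simple Monte Carlo method is not asymptotically optimal. Observe that Jensen's inequality for conditional expectation implies
$$E_{Q_n}\left[\left(1_{W_t} Y^{-1}\right)^2\right] = E_{Q_n}\left[Y^{-2} \mid W_t\right] Q_n(W_t) \geq \left(E_{Q_n}\left[Y^{-1} \mid W_t\right]\right)^2 Q_n(W_t) = \frac{\mu_n^2}{Q_n(W_t)}.$$
So, if $Q_n$ is asymptotically optimal, we must have
$$\liminf_{n \to \infty} \frac{1}{n^2}\log Q_n(W_t) = 0,$$
which is consistent with the intuition that a good choice of $Q_n$ should put $\tilde{X}_k \in W_t$ with high probability.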
2.1. Asymptotic behavior of exponential random graphs. To find "good" importance sampling tilted measures, we focus on the class of exponential random graphs. The exponential random graph is a random graph on $n$ vertices defined by the Gibbs measure
$$Q^{h,\beta,\alpha}_n(X) \propto e^{n^2 H(X)} \tag{2.16}$$
on $\Omega_n$, where for given $h \in \mathbb{R}$, $\beta \in \mathbb{R}_+$, $\alpha > 0$, the Hamiltonian is
$$H(X) = \frac{h}{2}\mathcal{E}(X) + \frac{\beta}{6}\mathcal{T}(X)^{\alpha}. \tag{2.17}$$
We will use $\psi_n = \psi^{h,\beta,\alpha}_n$ to denote the log of the normalizing constant (free energy), scaled by $n^{-2}$:
$$\psi_n = \frac{1}{n^2}\log \sum_{X \in \Omega_n} e^{n^2 H(X)}.$$
We denote by $G^{h,\beta,\alpha}_n$ the exponential random graph defined by the Gibbs measure (2.16). The case where $\alpha = 1$ is the "classical" exponential random graph model, which has an enormous literature in the social sciences (see e.g. [18,19] and the references therein) and has been rigorously studied in a number of recent papers (see e.g. [1,5,16,17,21,22]). In this case, the Hamiltonian can be rewritten as $n^2 H(X) = hE(X) + \frac{\beta}{n} T(X)$. We will drop the superscript $\alpha$ in $\psi^{h,\beta}_n$, $Q^{h,\beta}_n$ when $\alpha = 1$. The generalization to the exponential random graph with the parameter $\alpha$ was first proposed in [16].
Observe that the Erdős-Rényi random graph is a special case of the exponential random graph: if β = 0 and h = h p with h p defined by (2.1), then Q hp,0,α n = P n,p for any α > 0 and the edges are independent with probability p. On the other hand, choosing β > 0 introduces a non-trivial dependence between the edges. By adjusting the parameters (h, β, α), the Gibbs measure Q h,β,α n can be adjusted to favor edges and triangles to varying degree.
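For readers who want to experiment with these measures, the sketch below evaluates $n^2 H(X) = hE(X) + \frac{\beta}{6} n^2 \left(\frac{6T(X)}{n^3}\right)^{\alpha}$ from the edge and triangle counts and runs a single-edge-flip Metropolis chain targeting the measure proportional to $e^{n^2 H(X)}$. The Metropolis chain is a standard generic sampler chosen for illustration; it is an assumption of this example and not a procedure described in the text, and all names are hypothetical.

```python
import itertools
import math
import random


def n2_hamiltonian(n, edges, triangles, h, beta, alpha):
    """n^2 * H(X) = h*E + (beta/6) * n^2 * (6T/n^3)^alpha, with alpha > 0."""
    tri_density = 6.0 * triangles / n ** 3
    return h * edges + beta / 6.0 * n ** 2 * tri_density ** alpha


def metropolis_ergm(n, h, beta, alpha, sweeps, rng):
    """Single-edge-flip Metropolis chain started from the empty graph.

    Returns the final adjacency matrix with its edge and triangle counts.
    """
    adj = [[0] * n for _ in range(n)]
    edges, triangles = 0, 0
    pairs = list(itertools.combinations(range(n), 2))
    for _ in range(sweeps * len(pairs)):
        i, j = rng.choice(pairs)
        # flipping edge (i, j) changes T by the number of common neighbours
        common = sum(adj[i][k] * adj[j][k] for k in range(n))
        sign = -1 if adj[i][j] else 1
        new_edges = edges + sign
        new_triangles = triangles + sign * common
        delta = (n2_hamiltonian(n, new_edges, new_triangles, h, beta, alpha)
                 - n2_hamiltonian(n, edges, triangles, h, beta, alpha))
        if delta >= 0 or rng.random() < math.exp(delta):
            adj[i][j] = adj[j][i] = 1 - adj[i][j]
            edges, triangles = new_edges, new_triangles
    return adj, edges, triangles
```

With $\beta = 0$ each edge flip is accepted independently of the rest of the graph, which recovers the Erdős-Rényi special case noted above; how many sweeps are needed for larger $\beta$ is not addressed by this sketch.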
The asymptotic behavior of the exponential random graph measures Q h,β,α n and the free energy ψ h,β,α n is partially characterized by the following result of Chatterjee and Diaconis [5] and Lubetzky and Zhao [16]. In what follows, we will make use of the functions and, for f ∈ W, If the supremum in (2.21) is attained at a unique point u * ∈ [0, 1], then the exponential random graph G h,β,α n is asymptotically indistinguishable from the Erdős-Rényi graph G n,u * .
Our main result in this section, stated next, is the generalization of the variational formulation for the free energy of the Gibbs measure of any exponential random graph. The consequence of this result leads to the connection between the exponential random graph and the conditioned Erdős-Rényi graph. Before stating the result we will need some extra notation. Extend the Hamiltonian defined in (2.17) to the space of graphons in the natural way where recall the definitions for the density of edges and triangles for graphons defined respectively in (2.6) and (2.7). For fixed q ∈ (0, 1) recall the functions I q (f ) from (2.10) and the function I(f ) from (2.19).

Theorem 2.5. Given any Gibbs measure parameters
Then for the exponential random graph G hq,β,α n , the free energy satisfies The supremum, , is attained exactly on the set F * v * , where v * maximizes the RHS of (2.23).
Further, if (q, v * ) belongs to the replica symmetric phase, then the supremum, Proof. The first equality in (2.23) follows from Thm 3.1 in [5]. To show the second equality, suppose f ∈ ∂W u , for u ∈ (0, 1) .
This implies that Let v * maximize the RHS of (2.23). Then v * := arg sup It follows that and moreover, the supremum sup This concludes the proof of (2.23).
Now suppose (q, v * ) belongs to the replica symmetric phase. This implies that the constant function v * is the unique minimizer of the LDP rate function inf f ∈W v * [I q (f )], and by Theorem 4.2(iii) in [8], is a constant function, then it suffices to consider the supremum only over constant functions. But the only constant function in ∂W u is the function u. So where C ⊂ W is the set of constant functions. The proof is complete.
and the set of minimizers that attain the rate φ(q, u) is exactly F * u . So, for any u q, the Erdős-Rényi graph G n,q conditioned on the event T (f ) u 3 is asymptotically indistinguishable from the minimal set F * u , by Lemma  is asymptotically indistinguishable from the minimal set F * v * . We have the following corollary. Corollary 2.7. Let the parameters (h q , β, α), (q, v * ) be as in Theorem 2.5 and Eqn (2.23).
Suppose v * q. Then G hq,β,α n is asymptotically indistinguishable from the conditioned Erdős-Rényi graph, G n,q conditioned on the event T (f ) (v * ) 3 .
In particular, if (q, v * ) belongs to the replica symmetric phase, then G hq,β,α n is asymptotically indistinguishable from the Erdős-Rényi graph G n,v * .
The mean behaviour of the triangle density of an exponential random graph G hq,β,α n can be deduced from the variational formulation in (2.23), and in special instances, so can the mean behaviour of the edge density. This is shown in the next proposition.
Since I p (f ) is the rate function, it is known that φ p (p) = 0, and φ p (u) is continuous and strictly increasing on [p, 1] (Theorem 4.3 in [8]). If φ p (u) is differentiable everywhere, then the extremal points u * of the functionṼ (u) := Then for (2.29) to hold, β must necessarily be given by (2.30) The next lemma shows that, regardless of the differentiability of φ p (t), provided a certain minorant condition holds, we can find a β and a sufficiently small α so that (2.29) holds, and consequently that the exponential graph is asymptotically indistinguishable from the conditioned Erdős-Rényi graph.
We shall say that $(p, t)$ satisfies the minorant condition with parameter $\alpha$ if $(t^{3\alpha}, \phi_p(t))$ lies on the convex minorant of the function $x \mapsto \phi_p(x^{1/(3\alpha)})$. If $(t^{3\alpha}, \phi_p(t))$ lies on the convex minorant of $x \mapsto \phi_p(x^{1/(3\alpha)})$, then subdifferentials of the convex minorant of $x \mapsto \phi_p(x^{1/(3\alpha)})$ always exist and are positive. Recall that the subdifferentials of a convex function $f$ at a point $x$ are the slopes of the lines that lie below $f$ and touch $f$ at $x$. The set of subdifferentials of a convex function is non-empty; if the function is differentiable at $x$, then the set of subdifferentials contains exactly one point, the derivative $f'(x)$. Lemma 2.9. Suppose $(p, t)$ satisfies the minorant condition for $\alpha > 0$ sufficiently small. Let $\frac{\beta}{6}$ be any subdifferential of the convex minorant of $x \mapsto \phi_p(x^{1/(3\alpha)})$ at the point $t^{3\alpha}$.
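Proof. The proof follows a similar technique to [16]. Using the rescaling $u \mapsto x^{1/(3\alpha)}$, the variational form $\sup_{0 \leq u \leq 1}\left[\frac{\beta}{6} u^{3\alpha} - \phi_p(u)\right]$ can be rewritten as
$$\sup_{0 \leq x \leq 1}\left[\frac{\beta}{6} x - \phi_p(x^{1/(3\alpha)})\right].$$
Write $\hat{\phi}_p(x)$ for the convex minorant of $x \mapsto \phi_p(x^{1/(3\alpha)})$. The assumption that $\frac{\beta}{6}$ is a subdifferential of $\hat{\phi}_p(x)$ at $x = t^{3\alpha}$ implies that the maximum of $\sup_x\left[\frac{\beta}{6} x - \hat{\phi}_p(x)\right]$ is attained at $t^{3\alpha}$. Since, for sufficiently small $\alpha$, the point $(t^{3\alpha}, \phi_p(t))$ lies on $\hat{\phi}_p(x)$, we have that $\hat{\phi}_p(t^{3\alpha}) = \phi_p(t)$, and since $\hat{\phi}_p(x) \leq \phi_p(x^{1/(3\alpha)})$ everywhere, the maximum of $\sup_x\left[\frac{\beta}{6} x - \phi_p(x^{1/(3\alpha)})\right]$ is also attained at $t^{3\alpha}$. It follows that the maximum of $\sup_u\left[\frac{\beta}{6} u^{3\alpha} - \phi_p(u)\right]$ is attained at $t$. (However, this maximizer may not be unique. If the subtangent line defined by the subdifferential $\frac{\beta}{6}$ touches $\hat{\phi}_p$ at another point $r^{3\alpha}$, then $r$ is also a maximizer.) To prove the last part of the lemma, if $\phi_p(u)$ is differentiable at $t$, then the subdifferential is simply the derivative. Then we have
$$\beta = \frac{2\phi_p'(t)}{\alpha t^{3\alpha - 1}}.$$
Next, we use the minorant condition and Lemma 2.9 to define a parameterized family of subregimes of the $(p, t)$-phase space.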
Definition 2.10. Fix α > 0. We define the regime S α to be the set of pairs (p, t) for which the minorant condition holds with α and there exists a subdifferential β 6 of the convex minorant of x → φ p (x 1/3α ) such that the variational form sup uniquely maximized at t.
If α ∈ [2/3, 1], the exponential random graph is known to be asymptotically indistinguishable from an Erdős-Rényi graph G n,u for some u ∈ [0, 1]. Recalling Definition 2.2 of the replica symmetric phase, the following statement follows directly from the arguments in [16] and Theorem 4.3 in [8].
Lemma A.2 shows that $S_\alpha \supset S_{\alpha'}$ for $0 < \alpha < \alpha'$. The sets $S_\alpha$ for $\alpha = 2/3, 1$ are shown in Figure 2. By definition, any replica symmetric $(p, t)$ satisfies the minorant condition for any $\alpha \in [2/3, 1]$. Are there any replica breaking $(p, t)$ that satisfy the minorant condition for some $\alpha$? The answer is in the affirmative. To see this, consider $\alpha = 1/3$ and the convex minorant of $x \mapsto \phi_p(x^{1/(3\alpha)}) = \phi_p(x)$. For each $p < p_{\mathrm{crit}}$, there exists an interval $[\underline{r}_p, \overline{r}_p] \subset (p, 1)$ where $(p, t)$ is replica breaking if $t \in [\underline{r}_p, \overline{r}_p]$, and $(p, t)$ is replica symmetric for the other values of $t$. Since $\phi_p(t) < I_p(t)$ if $t \in [\underline{r}_p, \overline{r}_p]$ and $\phi_p(t) = I_p(t)$ for other values of $t$, and since $I_p(u)$ is convex, the convex minorant of $\phi_p(x)$ must touch $\phi_p$ at at least one point $t_p \in [\underline{r}_p, \overline{r}_p]$. So $(p, t_p)$ is replica breaking and satisfies the minorant condition.
The preceding argument shows that $\bigcup_{\alpha > 0} S_\alpha$ is strictly larger than the replica symmetric phase, and contains a nontrivial subset of the replica breaking phase. Using the characterizations of the sets $S_\alpha$, we are now ready to define the triangle tilts. Definition 2.12. Given $(p, t) \in S_\alpha$ for some $\alpha > 0$, a triangle tilt with parameter $\alpha$ corresponding to $(p, t)$ refers to any Gibbs measure $Q^{h_p,\beta,\alpha}_n$ where $h_p = \log\frac{p}{1-p}$, and $\frac{\beta}{6}$ is any subdifferential of the convex minorant of $x \mapsto \phi_p(x^{1/(3\alpha)})$. If $\phi_p(u)$ is differentiable at $t$, then there is exactly one triangle tilt with parameter $\alpha$ corresponding to $(p, t)$, with the parameters $(h_p, \beta^*, \alpha)$ where $\beta^*$ is defined in (2.30).
The triangle tilt with parameter $\alpha$ corresponding to $(p, t)$ is well-defined only when $(p, t) \in S_\alpha$, or, equivalently stated, it is well-defined only when (2.29) holds. In view of Theorem 2.5 and Lemma 2.9, when the triangle tilt with parameter $\alpha$ corresponding to $(p, t)$ is well-defined, it is the measure induced by an exponential random graph which satisfies (2.27) and which is asymptotically indistinguishable from the Erdős-Rényi graph $G_{n,p}$ conditioned on the rare event $\mathcal{T}(f) \geq t^3$. Also, if $(p, t) \in S_\alpha$, then since by Lemma A.2 the sets $S_{\alpha'}$ are increasing as $\alpha'$ decreases, the triangle tilt with parameter $\alpha'$ corresponding to $(p, t)$ is defined for any $\alpha' \leq \alpha$. If $(p, t)$ is in the replica symmetric phase, the triangle tilt can be defined for some $\alpha \in [2/3, 1]$, and since $\phi_p(t) = I_p(t)$ in the replica symmetric phase, from (2.30) the triangle tilt parameters necessarily take on the following explicit expression: $(h_p, \beta^*, \alpha)$, where
$$\beta^* = \frac{2 I_p'(t)}{\alpha t^{3\alpha - 1}} = \frac{h_t - h_p}{\alpha t^{3\alpha - 1}}, \qquad h_t = \log\frac{t}{1 - t}.$$
If $(p, t)$ is in the replica breaking phase, we may need to resort to numerical strategies to find the parameters $\beta$ and $\alpha$.
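In the replica symmetric case the parameter $\beta^*$ can be computed directly. The Python sketch below assumes the rate function has the relative-entropy form $I_p(u) = \frac{1}{2}\left[u\log\frac{u}{p} + (1-u)\log\frac{1-u}{1-p}\right]$ and obtains $\beta^*$ by setting the derivative of $u \mapsto \frac{\beta}{6}u^{3\alpha} - I_p(u)$ to zero at $u = t$, which gives $\beta^* = (h_t - h_p)/(\alpha t^{3\alpha-1})$. The function names are illustrative, and the code does not check whether a given $(p, t)$ is actually replica symmetric.

```python
import math


def logit(x):
    """h_x = log(x / (1 - x)) for 0 < x < 1."""
    return math.log(x / (1 - x))


def rate_I(u, p):
    """Assumed form of I_p(u): half the relative entropy of Bernoulli(u) vs Bernoulli(p)."""
    return 0.5 * (u * math.log(u / p) + (1 - u) * math.log((1 - u) / (1 - p)))


def beta_star(p, t, alpha):
    """Stationarity of beta/6 * u^(3*alpha) - I_p(u) at u = t."""
    return (logit(t) - logit(p)) / (alpha * t ** (3 * alpha - 1))


def objective(u, p, beta, alpha):
    """L(u) = beta/6 * u^(3*alpha) - I_p(u)."""
    return beta / 6 * u ** (3 * alpha) - rate_I(u, p)


def derivative_at(t, p, beta, alpha, eps=1e-6):
    """Central finite difference of L at t, used only as a sanity check."""
    return (objective(t + eps, p, beta, alpha)
            - objective(t - eps, p, beta, alpha)) / (2 * eps)
```

For example, `beta_star(0.3, 0.5, 1.0)` evaluates $4\log(7/3)$, and `derivative_at(0.5, 0.3, beta_star(0.3, 0.5, 1.0), 1.0)` is zero up to finite-difference error. Stationarity alone does not show that $t$ is the unique global maximizer, so membership in $S_\alpha$ still has to be verified separately.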
Remark 2.13. Given any (p, t), if φ p (u) is differentiable at t, then we can define β * in (2.30) regardless of whether (p, t) belongs to S α . In this case, t is a stationary point of the function L(u) : u → β 6 u 3α − φ p (u). If φ p (u) is twice differentiable at t, then since we have that t is a local maximum of L(u) if and only if ∂ t β * > 0. Now note that in the replica symmetric phase where the LDP implies that an Erdős-Rényi random graph G n,p conditioned on the event T (f ) t 3 is indistinguishable from G n,t , we have the obvious edge tilt as follows.
Definition 2.14. Given $(p,t)$, let $h_t = \log\frac{t}{1-t}$. The edge tilt refers to the Gibbs measure $Q_n^{h_t,0,\alpha} = Q_n^{h_t,0} = P_{n,t}$, corresponding to the Erdős-Rényi graph $G_{n,t}$.
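It is also possible to consider tilts that are a hybrid between the edge tilt and triangle tilts and that satisfy (2.27). Such tilts can be constructed explicitly for the replica symmetric phase. Consider the extremal points of the function $V(u) = \frac{h}{2}u + \frac{\beta}{6}u^{3\alpha} - I(u)$, where $I(u) = \frac12\left[u\log u + (1-u)\log(1-u)\right]$ (2.32). If the maximum of $V(u)$ occurs at $u^* \in (0,1)$, we must have $V'(u^*) = \frac{h}{2} + \frac{\alpha\beta}{2}(u^*)^{3\alpha-1} - \frac12\log\frac{u^*}{1-u^*} = 0$ (2.33) and $V''(u^*) = \frac{\alpha(3\alpha-1)\beta}{2}(u^*)^{3\alpha-2} - \frac{1}{2u^*(1-u^*)} \le 0$ (2.34). Using (2.33) we may express $h$ as a function of $\beta$ and $\alpha$: $h(\beta,\alpha) = \log\frac{u^*}{1-u^*} - \alpha\beta\,(u^*)^{3\alpha-1}$ (2.35). The next lemma follows from the continuity of $V$ and the conditions (2.33), (2.34).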
When $(p,t)$ belongs to the replica symmetric phase, we can apply Lemma 2.15 with $u^* = t$ to obtain a family of hybrid tilts with the parameters $(h(\beta,\alpha), \beta, \alpha)$ for $\beta \in [0, \beta_0)$. Due to Theorem 2.4(b), each hybrid tilt satisfies (2.27) and is asymptotically indistinguishable from the Erdős-Rényi graph $G_{n,t}$. Hybrid tilts of this form are considered in the numerical simulations in Section 4.2.

Asymptotic Optimality in the replica symmetric phase
The reason for the names triangle tilt and edge tilt is that the Radon-Nikodym derivative $\frac{dP}{dQ}$ that weights the samples in the importance sampling estimator (2.3) depends only on the number of triangles or the number of edges, respectively, in the samples. That is, $\frac{dP_{n,p}}{dQ_n^{h_p,\beta^*,\alpha}}(X) \propto e^{-n^2\frac{\beta^*}{6}\mathcal{T}(X)^{\alpha}}$ and $\frac{dP_{n,p}}{dQ_n^{h_t,0}}(X) \propto e^{-n^2\frac{h_t-h_p}{2}\mathcal{E}(X)}$. Here recall that $\mathcal{T}(X) = \frac{6}{n^3}T(X)$ is the density of triangles in $X$ and $\mathcal{E}(X) = \frac{2}{n^2}E(X)$ is the density of edges.
In the case of the edge tilt, the fact that the weights depend only on the number of edges has deeper repercussions. Since $E[\mathcal{E}(G_n^{h_t,0})] \sim t$, good samples in the target event $\{T(f) \ge t^3\}$ whose edge density is less than $t$ are over-penalized by the weights. In contrast, the triangle tilt penalizes samples more heavily only when they deviate from a triangle density of $t^3$.
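The edge-tilt weighting described above can be sketched for a small graph as follows. This is an illustrative Python sketch, not the simulation code behind the paper's tables: it draws i.i.d. samples from $G_{n,t}$, keeps those with at least $\binom{n}{3}t^3$ triangles, and weights each by the likelihood ratio $dP_{n,p}/dP_{n,t}$, which depends on the sample only through its edge count. All function names are hypothetical.

```python
import math
import random
from itertools import combinations

def sample_er_graph(n, edge_prob, rng):
    """Adjacency sets of an Erdos-Renyi graph with independent edges."""
    nbrs = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if rng.random() < edge_prob:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def count_edges(nbrs):
    return sum(len(s) for s in nbrs) // 2

def count_triangles(nbrs):
    """Sum common neighbours over edges; each triangle is seen 3 times."""
    total = 0
    for i in range(len(nbrs)):
        for j in nbrs[i]:
            if i < j:
                total += len(nbrs[i] & nbrs[j])
    return total // 3

def log_weight(num_edges, num_pairs, p, t):
    """log of dP_{n,p}/dP_{n,t}; a function of the edge count only."""
    return (num_edges * math.log(p / t)
            + (num_pairs - num_edges) * math.log((1 - p) / (1 - t)))

def edge_tilt_estimate(n, p, t, num_samples, seed=0):
    """Importance sampling estimate of P(T(G_{n,p}) >= C(n,3) t^3)."""
    rng = random.Random(seed)
    num_pairs = math.comb(n, 2)
    threshold = math.comb(n, 3) * t ** 3
    total = 0.0
    for _ in range(num_samples):
        g = sample_er_graph(n, t, rng)
        if count_triangles(g) >= threshold:
            total += math.exp(log_weight(count_edges(g), num_pairs, p, t))
    return total / num_samples

print(edge_tilt_estimate(n=10, p=0.35, t=0.4, num_samples=2000))
```

Because $\log(p/t) < 0$ and $\log\frac{1-p}{1-t} > 0$ when $t > p$, a successful sample with fewer edges receives a larger weight, which is the effect discussed above.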
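To formalize the above heuristic arguments, we study the asymptotic optimality, or non-optimality, of importance sampling schemes based on the tilted measures $Q_n^{h,\beta,\alpha}$. For any admissible parameters $(h,\beta,\alpha)$, the importance sampling estimator based on the tilted measure $Q_n^{h,\beta,\alpha}$ is $\hat M_K = \frac{1}{K}\sum_{k=1}^{K}\hat q_n(\tilde X_k)$ with $\hat q_n = 1_{W_t}\frac{dP_{n,p}}{dQ_n^{h,\beta,\alpha}}$ (3.1), where the $\tilde X_k$ are i.i.d. samples drawn from $Q_n^{h,\beta,\alpha}$. For any $(h,\beta,\alpha)$, $E_{Q_n}[\hat q_n] = \mu_n$, and so $\hat M_K$ is an unbiased estimator for $\mu_n$. We now prove the asymptotic optimality of the triangle tilts, Theorem 1.1.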
Proof. Due to (2.13), it suffices to show that Note that E, T : W → R are bounded continuous mappings [3,Theorem 3.8], and the exponent of the indicator 1 Wt (X) = e −n 2 0 W t (X) , where 0 Wt (X) = 0 if X ∈ W t and 0 Wt (X) = ∞ otherwise, can be approximated by bounded continuous approximations.
Since I p (f ) is the rate function for the family of measures P n,p , ( where, by (2.23),Ṽ for any (h, β, α). The last inequality follows from the fact that T (f ) t 3 for all f ∈ W t . Now, taking the triangle tilt with (h p , β, α), we have by its definition that u * = t. Then Combined with the upper bound for the asymptotic second moment, we conclude that the triangle tilt Q hp,β,α n yields an asymptotically optimal importance sampling estimator.
3.1. Non-optimality. In this section, we show the non-optimality of importance sampling estimators based on certain tilted measures. In the first result, we show that an exponential random graph that is not indistinguishable from the conditioned Erdős-Rényi graph cannot produce an optimal estimator. In the case where $(p,t)$ belongs to the replica symmetric phase, this rules out all exponential random graphs that are indistinguishable from $G_{n,u}$, with $u \ne t$, from being asymptotically optimal, but does not rule out the Erdős-Rényi graph $G_{n,t}$ corresponding to the edge tilt. The second non-optimality result then identifies a non-trivial subset of the replica symmetric phase for which the edge tilt does not produce an optimal estimator.
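Proposition 3.1. Suppose the exponential random graph $G_n^{h,\beta,\alpha}$ is not asymptotically indistinguishable from the Erdős-Rényi graph $G_{n,p}$ conditioned on $\{T(f) \ge t^3\}$. Then the importance sampling estimator based on $Q_n^{h,\beta,\alpha}$ is not asymptotically optimal.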
Proposition 3.2. Let $0 < p < \frac{e^{-1/2}}{1+e^{-1/2}}$ and $t \in (p, 1)$. If $t$ is sufficiently close to 1 and $(p,t)$ belongs to the replica symmetric phase, then the importance sampling scheme based on the edge tilt $Q_n^{h_t,0}$ is not asymptotically optimal.
Proof: Starting from (3.4), we have where h(β) = h(β, 1) as in (2.35) with α = 1. Because (p, t) is in the replica symmetric phase, I p (f ) is minimized by the constant function On the other hand, E is minimized by This f 1 represents a graph with a large clique, in which there is a complete subgraph on a fraction t of the vertices. Let us define and (Recall h(β) = h t here.) From (3.5) we see that We claim that for p < e −1/2 /(1 + e −1/2 ) and t sufficiently close to 1, we have Γ(1) < Γ(t). Indeed, let g(t) = Γ(1) − Γ(t): Observe that g(1) = 0 and So, if p < e −1/2 /(1 + e −1/2 ), we have g ′ (1) > 0. So, for t sufficiently close to 1, we have Γ(1) < Γ(t). Therefore, and we conclude that Since the strict inequality holds, the importance sampling scheme associated with Q h,β n cannot be asymptotically optimal.
Remark 3.3. The critical point in the proposition, $\tilde p = \frac{e^{-1/2}}{1+e^{-1/2}} \approx 0.3775$, corresponds to $h_{\tilde p} = -1/2$. In consideration of (2.11) and Figure 2.1, we see that the conditions of the proposition are attainable: if $p < \tilde p$ and $t$ is sufficiently close to 1, then $(p,t)$ will be in the replica symmetric phase. For example, when $p = 0.35$, we can numerically approximate the threshold value $\tilde t \approx 0.948$, so that whenever $t \in (\tilde t, 1]$, the edge tilt for $(0.35, t)$ is not asymptotically optimal.
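The numerical value of the critical edge probability in the remark can be checked directly. The short Python sketch below only evaluates the closed-form expression and its log-odds; the variable names are chosen for this example.

```python
import math

# Critical edge probability from Proposition 3.2: e^{-1/2} / (1 + e^{-1/2}).
p_crit = math.exp(-0.5) / (1.0 + math.exp(-0.5))

# Its log-odds h = log(p / (1 - p)) should equal -1/2.
h_crit = math.log(p_crit / (1.0 - p_crit))

print(f"p_crit = {p_crit:.4f}, h = {h_crit:.4f}")
```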

Numerical simulations using importance sampling
We implement the importance sampling schemes to show the optimality properties of the Gibbs measure tilts in practice. Although we have thus far been considering importance sampling schemes that draw i.i.d. samples from the tilted measure $Q$, in practice it is very difficult to sample independent copies of exponential random graphs. This is because of the dependencies between the edges in the exponential random graph, unlike the situation with an Erdős-Rényi graph where the edges are independent. Thus, to implement the importance sampling scheme, we turn to a Markov chain Monte Carlo method known as the Glauber dynamics to generate samples from the exponential random graph. The Glauber dynamics refers to a Markov chain whose stationary distribution is the Gibbs measure $Q_n^{h,\beta,\alpha}$. The samples $\tilde X_k$ from the Glauber dynamics are used to form the importance sampling estimator $\hat M_K$ in (3.1). The variance of $\hat M_K$ clearly also depends on the correlation between successive samples. However, in this paper, rather than focus on the effect of correlation on the variance of $\hat M_K$, we instead investigate and compare the optimality of the importance sampling schemes, and show that importance sampling is a viable method for moderate values of $n$.

Glauber dynamics.
For the exponential random graph G h,β,α n , the Glauber dynamics proceeds as follows.
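Suppose we have a graph $X = (X_{ij})_{1 \le i < j \le n}$. The graph $\tilde X$ is generated from $X$ via the following procedure.
1. Choose an edge $X_{ij}$, for some $(i,j)$, from $X$ uniformly at random.
2. For the new graph $\tilde X$, fix all other edges: $\tilde X_{kl} = X_{kl}$ for $(k,l) \ne (i,j)$.
3. Conditioned on all other edges being fixed, pick $\tilde X_{ij} = 1$ with probability $\varphi$ and $\tilde X_{ij} = 0$ with probability $1 - \varphi$, where $\varphi$ is the conditional probability of the edge under $Q_n^{h,\beta,\alpha}$. It depends on $X$ through $L_{ij}$, the number of 2-stars in $X$ with a base at the edge $X_{ij}$, and through the number of triangles in $X$ not involving the edge $X_{ij}$.
4. If conditioning on $A_J$ is used, check if $\tilde X$ is in $A_J$. If not, revert to $X$.
The conditioning of the Gibbs measure used in step 4 is discussed in Section 4.3.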
For the classical exponential random graph with $\alpha = 1$, the probability $\varphi$ in the Glauber dynamics has a neater expression, $\varphi = \frac{e^{h + \beta L_{ij}/n}}{1 + e^{h + \beta L_{ij}/n}}$.
At each MCMC step, if $X_{ij} \ne \tilde X_{ij}$, then $E(\tilde X)$ differs from $E(X)$ by one edge, and $T(\tilde X)$ differs from $T(X)$ by $L_{ij}$ triangles. The stationary distribution of the Glauber dynamics is the Gibbs measure $Q_n^{h,\beta,\alpha}$ that defines the exponential random graph $G_n^{h,\beta,\alpha}$. Regarding the mixing time of the Markov chain, [1] showed that if $(h,\beta,1)$ has the property that the unique global maximum of the function $V(u)$, defined in (2.32), is also the unique turning point, then the mixing time for the Glauber dynamics is $O(n^2\log n)$. For other values of $(h,\beta,1)$, the mixing time is $O(e^{n})$.
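The update rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used for the reported simulations. It assumes the Hamiltonian $H(X) = n^2[\frac{h}{2}\mathcal{E}(X) + \frac{\beta}{6}\mathcal{T}(X)^{\alpha}]$, so that the log-odds of including edge $(i,j)$ is $h + \frac{n^2\beta}{6}[(\frac{6(M+L_{ij})}{n^3})^{\alpha} - (\frac{6M}{n^3})^{\alpha}]$, where $M$ (a name introduced here) is the number of triangles not involving the edge. For $\alpha = 1$ this reduces to $h + \beta L_{ij}/n$. The optional conditioning step is omitted.

```python
import math
import random
from itertools import combinations

def glauber_step(nbrs, tri_count, h, beta, alpha, rng):
    """Resample one uniformly chosen edge; return the updated triangle count."""
    n = len(nbrs)
    i, j = rng.sample(range(n), 2)          # uniform unordered pair
    L = len(nbrs[i] & nbrs[j])              # 2-stars with base (i, j)
    M = tri_count - (L if j in nbrs[i] else 0)  # triangles avoiding (i, j)
    scale = 6.0 / n ** 3
    log_odds = h + (n ** 2 * beta / 6.0) * (
        (scale * (M + L)) ** alpha - (scale * M) ** alpha)
    phi = 1.0 / (1.0 + math.exp(-log_odds))
    if rng.random() < phi:
        nbrs[i].add(j)
        nbrs[j].add(i)
        return M + L
    nbrs[i].discard(j)
    nbrs[j].discard(i)
    return M

def run_chain(n, h, beta, alpha, num_steps, seed=0):
    """Run the chain from the empty graph; return adjacency sets and triangle count."""
    rng = random.Random(seed)
    nbrs = [set() for _ in range(n)]
    tri = 0
    for _ in range(num_steps):
        tri = glauber_step(nbrs, tri, h, beta, alpha, rng)
    return nbrs, tri

def count_triangles(nbrs):
    """Brute-force triangle count, used only to check the running count."""
    return sum(1 for a, b, c in combinations(range(len(nbrs)), 3)
               if b in nbrs[a] and c in nbrs[a] and c in nbrs[b])

nbrs, tri = run_chain(n=12, h=-0.6, beta=1.0, alpha=1.0, num_steps=3000, seed=1)
print(tri, count_triangles(nbrs))
```

Tracking the triangle count incrementally avoids recounting triangles at every step, since a flip of edge $(i,j)$ changes the count by exactly $L_{ij}$.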

4.2. Numerical simulations in the replica symmetric phase. The importance sampling scheme was performed for $p = 0.35$, $t = 0.4$, in the replica symmetric phase, using the Glauber dynamics. The simulations used the Gibbs measure tilts $Q_n^{h(\beta),\beta}$, with $\alpha = 1$ and $\beta$ of the form $\beta_q = \frac{h_t - h_q}{t^2}$, where $h_q = \log\frac{q}{1-q}$, for $q = 0.35, 0.36, \ldots, 0.4$ (4.1), and $h(\beta_q)$ is given by (2.35). Each of these exponential random graphs $G_n^{h(\beta_q),\beta_q}$ is indistinguishable from the Erdős-Rényi graph $G_{n,t}$, and $q = p = 0.35$ is the triangle tilt while $q = t = 0.4$ is the edge tilt.

For the values $p \le q \le t$, Table 4.1 verifies the accuracy of the importance sampling estimates for $\mu_n := P(G_{n,p} \in W_t)$ using the tilts $Q_n^{h(\beta_q),\beta_q}$, where the parameters $\beta = \beta_{q,t}$ are defined in (4.1). Also shown is the estimate for the log probability, $\frac{1}{n^2}\log P(G_{n,p} \in W_t)$ (the lower number in each entry). Since $(p,t)$ is replica symmetric, the LDP rate is $\lim_{n\to\infty}\frac{1}{n^2}\log P(G_{n,p} \in W_t) = -I_p(t) = -0.002694$. The value of the log probability is seen to approach the LDP rate as $n$ is increased.

Table 4.2 shows the estimated values of the variance of the estimator, $\mathrm{Var}_{Q_n}(\hat q_n)$, where $\hat q_n = 1_{W_t}\frac{dP_{n,p}}{dQ_n^{h,\beta}}$, as well as the log second moment $\frac{1}{n^2}\log E_{Q_n}[\hat q_n^2]$. The variance of the estimator appears to be comparable across all the tilts, and the log second moment likewise appears to converge towards $-2I_p(t) = -0.0053869$. In this case, $p = 0.35$, $t = 0.4$ does not belong to the regime described in Proposition 3.2, and the numerical results suggest that all the tilts of this form, including the edge tilt, are close to asymptotically optimal.
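The tilt parameters and the LDP rate quoted here can be reproduced with a few lines of Python. The sketch assumes that, for $\alpha = 1$ and $u^* = t$, the stationarity condition $h + \beta t^2 = h_t$ with $h = h_q$ determines the family, giving $\beta_q = (h_t - h_q)/t^2$, and that $I_p$ has the relative-entropy form used for constant graphons. Function names are hypothetical.

```python
import math

def logit(x):
    return math.log(x / (1.0 - x))

def rate_I_p(u, p):
    """Assumed LDP rate for constant graphons: relative entropy divided by 2."""
    return 0.5 * (u * math.log(u / p) + (1 - u) * math.log((1 - u) / (1 - p)))

def hybrid_tilt_params(q, t):
    """(h_q, beta_q) with alpha = 1, chosen so that u* = t is stationary."""
    h_q = logit(q)
    beta_q = (logit(t) - h_q) / t ** 2
    return h_q, beta_q

p, t = 0.35, 0.40
for k in range(6):
    q = (35 + k) / 100          # q = 0.35, 0.36, ..., 0.40
    h_q, beta_q = hybrid_tilt_params(q, t)
    print(f"q={q:.2f}  h={h_q:+.4f}  beta={beta_q:.4f}")

print("LDP rate          -I_p(t)  =", -rate_I_p(t, p))
print("second-moment rate -2I_p(t) =", -2 * rate_I_p(t, p))
```

At $q = p$ the sketch returns the triangle tilt parameter $\beta^* = (h_t - h_p)/t^2$, and at $q = t$ it returns $\beta = 0$, the edge tilt.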
For $n = 16, 32, 64$, the number of MCMC samples used was $5 \times 10^4\, n^2\log n$, while for $n = 96$, the number of MCMC samples used was $10^5\, n^2\log n$.
The random graphs corresponding to both the triangle and edge tilts are expected by (2.27), (2.28) to have $\binom{n}{3}t^3$ triangles and $\binom{n}{2}t$ edges on average. However, the triangle and edge tilts differ in the way they produce events in $\{T(f) \ge t^3\}$, namely in the number of edges in the successful samples that fall in the rare event. The distribution of the edge count of successful samples from the triangle tilt has a larger proportion with fewer than $\binom{n}{2}t$ edges, as compared to the edge tilt. This is shown in Figure 4.1.

4.3. Importance sampling with conditioned Gibbs measures. Quite a different issue from the asymptotic optimality of the importance sampling estimator is the question of the efficiency of the Glauber dynamics in drawing samples from the tilted measure. The efficiency of using MCMC to draw samples is subject to the mixing time of the Markov chain. In the case of the exponential random graph, the mixing time of some such graphs is known to be exponentially long, $O(e^{n})$, due to the fact that the Hamiltonian $H(f)$ has multiple local maxima [1]. In this section, we propose a way to sidestep this issue by using a conditioned version of the Gibbs measure, in which the sampling from the exponential random graph is restricted to an appropriate subregion of the state space $\Omega_n$. Conditioning the Gibbs measure on the desired subregion serves to focus the sampling on the region of the state space that really matters, and may also improve the mixing time of the Markov chain.
The conditioned Gibbs measure is particularly apt in the following scenario. Suppose, for given $(p,t)$, the variational form in (2.29) is locally, but not globally, maximized by $t$ (cf. Figure 4.2). If $u^* \ne t$ is the global maximum of (2.29), then $G_n^{h,\beta,\alpha}$ is indistinguishable from the Erdős-Rényi graph $G_{n,p}$ conditioned on $\{T(f) \ge (u^*)^3\}$. Compared to our target of exceeding $\binom{n}{3}t^3$ triangles, the samples from $G_n^{h,\beta,\alpha}$ will have an over- or under-abundance of triangles, leading to a poor estimator with very large variance. Recall that Proposition 3.1 shows that the importance sampling estimator based on $Q_n^{h,\beta,\alpha}$ is non-optimal. The conditioned Gibbs measure mitigates this problem by restricting the exponential random graph to having just the "right" number of triangles.
Conditioned Gibbs measure. Given a set A ⊂ W, the exponential random graph conditioned on A, denoted G h,β,α n,A has the conditional Gibbs measureQ h,β,α n,A = Q h,β,α n |A, where the Hamiltonian H(X) is defined in (2.16). The free energyψ n,A =ψ h,β,α n,A is The following proposition describes the asymptotic behaviour of the free energy, which is analogue of Theorems 3.1 and 3.2 in [5]. Proof. This follows from a simple modification of the proof of Theorem 3.1 and 3.2 in [5] to restrict to the set A.
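The importance sampling scheme based on the conditioned Gibbs measure $\tilde Q_{n,A}^{h,\beta,\alpha}$ gives the estimator $\hat\nu_A = \frac{1}{K}\sum_{k=1}^{K} 1_{W_t}(\tilde X_k)\,\frac{d\tilde P_{n,p,A}}{d\tilde Q_{n,A}^{h,\beta,\alpha}}(\tilde X_k)$, where $\tilde P_{n,p,A} = P_{n,p}|A$. Note that $\hat\nu_A$ is an unbiased estimator for $\nu_{n,p} = \tilde P_{n,p,A}(W_t)$. The estimator $\hat\mu_n$ for $\mu_n = P_{n,p}(W_t)$ can be obtained from $\hat\nu_A$ by $\hat\mu_n = \hat\nu_A \cdot P_{n,p}(A) + P_{n,p}(W_t \cap A^c)$.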
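The relation between $\hat\mu_n$ and $\hat\nu_A$ is the law of total probability applied to the partition $\{A, A^c\}$. The short derivation below spells this out, writing $\nu_{n,p} = P_{n,p}(W_t \mid A)$ and assuming $P_{n,p}(A) > 0$.

```latex
\begin{align*}
\mu_n = P_{n,p}(W_t)
  &= P_{n,p}(W_t \cap A) + P_{n,p}(W_t \cap A^c) \\
  &= P_{n,p}(W_t \mid A)\, P_{n,p}(A) + P_{n,p}(W_t \cap A^c) \\
  &= \nu_{n,p}\, P_{n,p}(A) + P_{n,p}(W_t \cap A^c).
\end{align*}
```

Replacing $\nu_{n,p}$ by its unbiased estimator $\hat\nu_A$ gives $\hat\mu_n$, which is therefore unbiased for $\mu_n$ whenever the two probabilities on the right-hand side are known exactly.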
Since the two probabilities on the right-hand side, particularly the second term, may not be easily computed or estimated, we may alternatively take $\hat\nu_A$ as a biased estimator for $\mu_n$. By an appropriate choice of the set $A$, we can ensure the bias is small and vanishes exponentially faster than the small probability we are trying to estimate (see Lemma A.4(ii)). In our application, the conditioning of the Gibbs measure is applied to control the number of triangles that the sampled graphs are allowed to have. Thus, it is natural to choose a set $A = A_J$ (4.4) determined by a closed interval $J \subset [0,1]$ and a closed subset $A_0 \subset W$ containing all the constant functions in $J$. The set $A_J \subset W$ is then closed because $T$ is continuous in the cut distance metric $\delta_\square$. A consequence of Proposition 4.1 is a variational formulation similar to (2.23).
Proposition 4.2. Let A J be defined in (4.4). Given any Gibbs measure parameters (h, β, α), assume wlog that h = h q = log q 1−q for some q ∈ (0, 1). For u ∈ [0, 1], denote ∂W u := {f ∈ W | T (f ) = u 3 } and let F * u ⊂ W be the set of minimizers of The supremum sup f ∈A J [H(f )−I(f )] is attained exactly on the set F * v * , where v * maximizes the RHS of (4.5).
The proof is identical to the proof of Theorem 2.5 and is left to the appendix. Combining Propositions 4.1 and 4.2, the exponential random graph conditioned on $A_J$ is asymptotically indistinguishable from the graphs in the set $F^*_{v^*}$. The case when $J = [0,1]$ and $A_0 = W$, which is when there is no conditioning, coincides with Theorem 2.5.
Using Proposition 4.2, the notion of the triangle tilt can be extended to the importance sampling schemes using the conditioned Gibbs measures, in a similar way as in Section 2.2 for the full Gibbs measure, as follows. Given (p, t) and the set A J , suppose there exists parameters (h p , β, α) such that The conditioned Gibbs measure G hp,β,α n,A J is a (conditioned) triangle tilt with parameter α corresponding to (p, t).
Under some mild conditions on the set $A_J$, Lemma A.4 establishes (4.7), thanks to which the notion of asymptotic optimality for conditioned tilts is unchanged. As a corollary, we have that the conditioned triangle tilt also yields an asymptotically optimal importance sampling scheme. The proof is left to the appendix. We remark here that if the Glauber dynamics is used to generate samples from $Q_{n,A_J}^{h_p,\beta,\alpha}$, we must require that $A_J$ is connected. A sufficient condition for $A_J$ to be connected is that $J$ is an interval of the form $[0,r]$ or $[r,1]$.

Numerical illustration of a conditioned Gibbs measure. We illustrate the conditional Gibbs measure tilt with an example. For concreteness, let us set $p = 0.2$ and $t = 0.3$, and for the Gibbs measure parameters, set $h = h_p$ and $\alpha = 1$, and vary $\beta \ge 0$. We will study how the asymptotic second moment changes as $\beta$ varies. The pair $(p,t) = (0.2, 0.3)$ is in the replica symmetric phase, and hence in $S_{2/3}$. For the triangle tilt with $\alpha = 1$, we have from (2.31) $\beta = \beta^* = (h_t - h_p)/t^2$. The variational form $V(u; h_p, \beta^*) = \frac{h_p}{2}u + \frac{\beta^*}{6}u^3 - I(u)$ has a local maximum at $t = 0.3$ but is maximized at a value $u^* \approx 0.989$. (See Figure 4.2.) So $(p,t) \notin S_1$. The exponential graph $G_n^{h_p,\beta^*}$ will produce on average $\binom{n}{3}(u^*)^3$ triangles; this is too many triangles, and the variance of the importance sampling estimator will blow up.
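The location of the global maximizer in this example can be checked numerically. The Python sketch below evaluates $V(u; h_p, \beta^*)$ on a grid, assuming $I(u) = \frac12[u\log u + (1-u)\log(1-u)]$; the grid resolution and function names are choices made for this illustration.

```python
import math

def logit(x):
    return math.log(x / (1.0 - x))

def V(u, h, beta, alpha=1.0):
    """Variational form V(u) = h/2 u + beta/6 u^(3 alpha) - I(u)."""
    entropy_term = 0.5 * (u * math.log(u) + (1 - u) * math.log(1 - u))
    return 0.5 * h * u + beta / 6.0 * u ** (3 * alpha) - entropy_term

p, t = 0.2, 0.3
h_p = logit(p)
beta_star = (logit(t) - h_p) / t ** 2      # triangle tilt with alpha = 1

# Grid search over (0, 1) for the global maximizer of V.
grid = [k / 10000 for k in range(1, 10000)]
u_star = max(grid, key=lambda u: V(u, h_p, beta_star))

print(f"beta* = {beta_star:.4f}")
print(f"V(t)  = {V(t, h_p, beta_star):.4f}  (local maximum at t = {t})")
print(f"V(u*) = {V(u_star, h_p, beta_star):.4f}  at u* = {u_star:.4f}")
```

A grid search is used instead of a root finder because $V$ has two local maxima here, and the point of the example is to compare their heights.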
To avoid getting samples with too many triangles, let us restrict the state space to a set $A_r$ that caps the number of triangles and edges. When the tilt with $\beta = \beta^*$ is conditioned on $A_r$, it gives the best estimator and is asymptotically optimal by Corollary 4.3. This is corroborated by the numerical simulations, which suggest that the triangle tilt performs significantly better than crude Monte Carlo sampling, and also outperforms the edge tilt. In contrast, when no conditioning is performed, the importance sampling estimator exhibits a sharp decline in performance when $\beta$ is increased beyond the transition point at $\beta \approx 4.76$ (cf. Figure 4.2). This transition point coincides with the phase transition at which the exponential graph $G_n^{h_p,\beta}$ changes from a graph with low edge density to one with high edge density. As mentioned above, the graph with high edge density overproduces triangles, causing the estimator to have a large variance.
Triangle tilt with parameter α and conditioned Gibbs measure. The importance sampling scheme was next performed for p = 0.2 and t = 0.3, in the replica symmetric phase. We now consider the following tilted measures, all of whose exponential random graphs are indistinguishable from the Erdős-Rényi graph G n,t .

Appendix A. Auxiliary lemmas and proofs
We collate a number of lemmas and proofs in this section, roughly in the order that they appear in the paper.
Lemma A.1. (i) Given $(p,t)$, let $F^*$ be the set of functions that attain the infimum of the LDP rate function, $\inf_{f \in W_t}[I_p(f)]$, in (2.9). Then $F^*$ is the minimal set that the Erdős-Rényi graph $G_{n,p}$ conditioned on $\{T(f) \ge t^3\}$ is asymptotically indistinguishable from. (ii) Given $(h,\beta,\alpha)$, let $F^*$ be the set of functions that attain the supremum $\sup_{f \in W}[H(f) - I(f)]$. Then $F^*$ is the minimal set that the exponential random graph $G_n^{h,\beta,\alpha}$ is asymptotically indistinguishable from.
Proof. Asymptotic indistinguishability from $F^*$ was shown in [8, Theorem 3.1] for (i) and [5, Theorem 3.22] for (ii). The proofs naturally extend to give the minimality of $F^*$, and we state them here for the record.
Observe that for any random graph G n that is asymptotically indistinguishable from a set F * , to show that F * is minimal, it suffices to show that, for any relatively open non-empty subset F 0 ⊂ F * such that F * \ F 0 is non-empty, there exists ǫ > 0 such that lim inf n→∞ 1 n 2 log P(δ (G n , F * \ F 0 ) > ǫ) = 0. (A.1) Let F 0 ⊂ F * be any relatively open non-empty subset, with F * \F 0 non-empty. Denote, for ε > 0, we have that lim inf n→∞ 1 n 2 log P(G n ∈ F ε ) − sup The proof is complete.
Proof of Proposition 2.8.