Heterogeneous clustered random graphs

A graph Hamiltonian is proposed that allows creation of random networks close to specified connectivity, degree heterogeneity, and clustering.

model and swapping edges [15][16][17], although this often fails to produce networks of the desired clustering coefficient [18], and graph mobility can bias results if not adjusted for [19].
In this letter, a graph Hamiltonian is derived that enables creation of random graphs very close to specified connectivity, degree heterogeneity and clustering, and which, at the same time, attributes non-zero weight to any observed network. This is used to simulate a set of networks with similar properties to the classic power grid network of Watts and Strogatz [8].
Graph Hamiltonian. -Let G be the set of all undirected graphs with N nodes. We consider a graph G ∈ G with nodes indexed i, j, . . ., and links defined by the adjacency matrix G = (G ij ). Here, G ij = G ji = 1 if i and j are connected and G ij = 0 otherwise; by convention we set G ii = 0.
We write k i for the degree of node i, which also lets us define expectations of functions of degree: Analogously, we define the number of triangles node i participates in, t i , and related expectations, as Other relevant quantities, defined at the level of the whole network, are the number of links, number of lines and 68006-p1 T. House number of triangles, which are given, respectively, by Note that this method of counting is more common for directed than undirected graphs, but will yield simpler expressions later. Our aim is to obtain a way to generate networks with specified values of these quantities. We will define a probability measure over G in a Hamiltonian framework so that the probability given to G ∈ G is Here H is called the Hamiltonian, θ = (θ a ) is a vector of model parameters, and Z is the partition function. We will also writeπ for the ensemble probability of a network having properties Standard exponential random graph models invoke maximum entropy at given mean to motivate a linear Hamiltonian [2,12,13], taking the form where the parameter vector θ is chosen to give the desired mean values to the network properties (F a ). For the simplest of these, the Bernoulli random graph (H = θM ) we can impose mean link density ρ and use the independence of links in the model to solve analytically, giving where θ = ln((1 − ρ)/ρ). The difference between these distributions arises due to combinatorial factors (i.e. the binomial coefficient) and is shown graphically in fig. 1. One known pathology of linear Hamiltonians is that models based on them often exhibit degeneracies [2,14], leading Park and Yook [20] to propose a non-linear optimisation Hamiltonian for degree and clustering at an individual level: This approach avoids the degeneracy problems of linear Hamiltonians. Other approaches make use of a non-linear  Hamiltonian for the overall number of triangles in the network [15,17], where Θ is the desired number of triangles in the network. Such a model is designed to work with a sampling strategy based on applying edge swaps to a network that already has the desired degree distribution. Both of these approaches require a large number of parameters to work: (8) has 2(N + 1) explicit model parameters and (8) has 2 explicit parameters, but with the N node degrees constrained or known. Our aim here is to derive a graph Hamiltonian that will appropriately constrain network connectivity, heterogeneity and transitivity, with O(1) rather than O(N ) parameters.
To do this, let us first consider what might be a natural functional form, since there is no a priori reason why the absolute value should be used. If we want to adopt an optimising approach, then in place of (6) we would write  where the constraint on Φ a is that it should take its minimum at a specified value, and hence we constrain the mode rather than the mean of the network statistic F a [G] under the probability measure. If Φ a are quadratic functions, then we are aiming for a sampling regime such that in the thermodynamic limit the marginal distributions for each network property be the desired network clustering coefficient [2]. Putting these together with the quadratic form for (10) above gives the following six-parameter Hamiltonian: It then remains to motivate values for the parameters β.
In general, higher values of these parameters will attribute lower probabilities to networks further from the mode. From the form of (11) we have natural choices, There is then the question of what value of the standard deviation in the number of links, σ, is most appropriate.
Simulation. -In general, the sum over the set of all graphs, G, needed to calculate the partition function in (4) involves 2 N (N −1)/2 possibilities making this impractical. Monte Carlo methods -in particular the Metropolis-Hastings algorithm [21,22]-can be used to overcome this difficulty. Here a current graph G is stored in memory, and a new graph G proposed with probability e q(G ;G) . Then the new graph is accepted if Here, rand is a random number picked from the uniform distribution U(0, 1), and ΔH is the proposed change in the Hamiltonian. For most moves in graph space (e.g. adding/deleting links and edge swaps) the probability of a proposal is equal to the probability of proposing its reversal and so the q terms cancel. The stationary distribution of the Markov chain defined by (13) will be (4) provided the process is ergodic. As we discussed above for the Bernoulli random graph, this is no guarantee that the measureπ over network properties will be as desired.
Suppose we could calculate the probability eq (F ;F ) that a graph with properties F will be proposed from one with F , and consider the Markov chain defined by the acceptance criterion The stationary distribution of this chain will not have stationary distribution π, but will have the distributioñ π over network properties that we have designed our Hamiltonian to produce. Exact calculation of Q will in general involve extremely complex combinatorial considerations, making it computationally unfeasible. It is possible, however, to make a judicious choice of β in (11) so that ΔH Q, meaning that we can approximate Where this approximation is good, the criteria for acceptance (13) and (14) are equivalent, and we expect the network measure defined by (11) to yield distributions over network properties that are approximately normal as desired. The approximation (15) will be more accurate for larger values of the precision parameters β, since as these parameters are increased the (tractable) differences in the Hamiltonian will become larger while the (intractable) combinatorial factors remain constant. On the other hand, overly large differences will worsen the computational performance of the Metropolis-Hastings scheme [23]. This means that there is a trade-off between precision and the ability to sample networks efficiently. If we were to implement a Monte Carlo scheme of choosing distinct pairs (i, j = i) and then proposing G ij → 1−G ij , G ji → 1−G ji , then each proposal involves a change in M of 2. Therefore, a natural scale for the standard deviation associated with M is σ = O (1), and the precise choice β m = 1/4, with other parameter values following from (12), is used here since these were found to perform well. Figure 3 shows 4 × 10 3 samples from the model for three different parameter choices, which were generated in 90 seconds on a modern laptop and have properties close to those desired with errors arising due to the approximation (15). In general, convergence of the algorithm is smooth and it does not become trapped in local minima. Figure 4 shows that as would be expected the mixing times starting with the empty graph are between the order of pairs of nodes, O(N 2 ), and the next polynomial order O(N 3 ). The question of whether better performance could be obtained using advanced Monte Carlo methods [23] remains open.  (11) and parameters of the power grid network considered by Watts and Strogatz [8]; the red dashed line is the target value and the starting point is the original graph, which is first randomised by accepting all edge swaps then -after the time point shown with a blue dotted line-evolved according to (11). Bottom: statistical properties of 100 realisations of this graph model (black histogram outlines) compared to the Gaussian density implicit in (11).
Power grid randomisation. -To apply the approach suggested to a real-world dataset, parameters were taken from the power grid network considered by Watts and Strogatz [8]: N = 4941, θ m = 5.4 × 10 −4 , θ l = 1.2, θ t = 0.10. One hundred simulations were run starting from the graph G i,j = 0, ∀i, j until convergence. The results of these are shown in fig. 5, showing that the approach is slightly biased due to the approximation (15), but for practical purposes delivers random samples of networks with the desired statistics to high accuracy.
Note that for the original power grid network, the mean shortest path length is 19 and the network consists of one giant component. For the simulated networks, the mean finite shortest path length is 7.5, but on average 26% of shortest paths are infinite (i.e. the nodes are in different components). This indicates how the real network is structured to preserve connectivity at the expense of path length when compared to a random graph, even when adjusting for connectivity, heterogeneity and clustering.

68006-p5
T. House To emphasise the usefulness of the random graph model independently of the simulation method used, the initial network was also randomised using edge swaps, and then edge swaps were used to reintroduce clustering using the Hamiltonian (11), with results shown in fig. 6, which are very similar to those of fig. 5, but with degree distribution unchanged compared to the original network.
Conclusions. -The approach presented here gives a general recipe for complex random graph models. First, decide which network properties are to be specified and constrained for the scientific problem in hand. Secondly, determine what conditional relationships hold between these properties and add any that are missing: in the model considered above, both triangles and lines are included since triangles are closed lines. Thirdly, write down an appropriate optimising Hamiltonian. Finally, tune the precision parameters to obtain the appropriate balance between accuracy and sampling efficiency. * * * Work funded by the Engineering and Physical Sciences Research Council. I would like to thank two anonymous referees for constructive comments that have improved this manuscript.