Classical Information Theory of Networks

Heterogeneity is among the most important features characterizing real-world networks. Empirical evidence in support of this fact is unquestionable. Existing theoretical frameworks justify heterogeneity in networks as a convenient way to enhance desirable systemic features, such as robustness, synchronizability and navigability. However, a unifying information theory able to explain the natural emergence of heterogeneity in complex networks does not yet exist. Here, we fill this knowledge gap by developing a classical information-theoretic framework for networks. We show that among all degree distributions that can be used to generate random networks, the one emerging from the principle of maximum entropy is a power law. We also study spatially embedded networks, finding that the interactions between nodes naturally lead to nonuniform distributions of points in the space. The pertinent features of real-world air transportation networks are well described by the proposed framework.

The principle of maximum entropy states that the unique probability distribution encoding all the information available about a system, but no other information, is the one with the largest information entropy [1]. Available information about the system corresponds to constraints under which entropy is maximized. The principle of maximum entropy has found applications in many different disciplines, including physics [2], computer science [3], geography [4], finance [5], molecular biology [6], neuroscience [7], learning [8], and deep learning [9]. Applications can also be found in network science [10][11][12][13][14][15][16][17][18], where the maximum entropy argument is usually applied to the distribution of probabilities $P(G)$ of observing a given graph $G$ of finite size $N$ in an ensemble of random graphs [11]. Different sets and types of entropy-maximization constraints lead to different network models. For example, if the constraints are soft, i.e., if they deal with expected values of network properties, then $P(G)$ is a Gibbs-like distribution corresponding to exponential random graph models (ERGMs) [11,19]. The simplest example is the Erdős-Rényi model [20], obtained by constraining the expected total number of links. Another example is the ERGM with given expected degrees of all nodes [11]. From the point of view of statistical mechanics, ERGMs are canonical ensembles, conjugated to microcanonical ensembles that enforce sharp constraints on the exact values of network properties rather than on their expected values [12][13][14]. For instance, the standard (sharp) configuration model is the microcanonical network ensemble conjugated to the canonical ensemble given by the ERGM with fixed expected degrees. The conjugated microcanonical and canonical network ensembles are not thermodynamically equivalent, and their nonequivalence is due to an extensive number of constraints [13,14,21].
The maximum entropy principle applies to microcanonical ensembles as well, where P(G) is the entropy-maximizing uniform distribution on the set of graphs G satisfying sharp constraints.
In all the examples above, the maximum entropy principle is de facto applied to network adjacency matrices $A$, whose elements are understood as sets of edge variables correlated by the imposed constraints. Calculations generally lead to the derivation of the probabilities $\pi_{ij} = P(A_{ij} = 1)$ for the pair of nodes $i$ and $j$ to be connected. If networks are undirected, then $A_{ij} = A_{ji}$ and $\pi_{ij} = \pi_{ji}$. This approach is very similar to the one used in statistical mechanics to describe systems of noninteracting particles, whose role is played by network edges, while particle states are enumerated by node pairs $\{i, j\}$ [11,12]. In binary networks, where $A_{ij}$ is either 0 or 1, $\pi_{ij}$ takes the Fermi-Dirac form; if multiple edges are allowed between the same pair of nodes, then the system is described by the Bose-Einstein statistics [22].
Here we take advantage of the principle of maximum entropy in a different way. Instead of dealing with all elements of the adjacency matrix, including its null entries, we look directly at network edges. Specifically, we consider the probability $P(\ell = (\ell_1, \ell_2))$ that, by picking a random edge $\ell$, nodes $\ell_1$ and $\ell_2$ are found at its ends. The distribution $P(\ell)$ that describes the ensemble is then found using the maximum entropy principle. Different constraints in the entropy maximization problem lead to different distributions $P(\ell)$. Since the distributions in the ensemble are exponential, we refer to it as the classical network ensemble, differentiating it from previously explored maximum entropy ensembles obeying quantum statistics [11]. We note that the framework we consider here allows for multiedges and tadpoles (self-loops), as in similar approaches [4,18]. This makes all edges uncorrelated variables, allowing for greater simplicity and flexibility.
arXiv:1908.03811v2 [physics.soc-ph] 14 Aug 2019

As a first, very basic example we consider the classical network ensemble in which we constrain the expected degrees. That is, we require that the probability to find node $i$ at one of the ends of a randomly chosen link is $k_i/L$,

$$\sum_{\ell} P(\ell)\,\big[\mathbb{1}(\ell_1 = i) + \mathbb{1}(\ell_2 = i)\big] = \frac{k_i}{L}, \tag{1}$$

where $\{k_i\}$ is any given degree sequence, $L$ is a fixed number of links in the network, which is assumed to be consistent with the $k_i$ via $2L = \sum_i k_i$, and where $\mathbb{1}(x = y)$ is the indicator function: $\mathbb{1}(x = y) = 1$ if $x = y$ and $\mathbb{1}(x = y) = 0$ otherwise.
The constraint in Eq. (1) is required to hold for all nodes $i = 1, \ldots, N$. The maximum entropy distribution $P(\ell)$ is found by maximizing the functional

$$\mathcal{G} = -\sum_{\ell} P(\ell)\ln P(\ell) - \sum_{i=1}^{N}\psi_i\Big[\sum_{\ell} P(\ell)\big[\mathbb{1}(\ell_1 = i) + \mathbb{1}(\ell_2 = i)\big] - \frac{k_i}{L}\Big] - \mu\Big[\sum_{\ell} P(\ell) - 1\Big] \tag{2}$$

using the method of Lagrange multipliers. The first term on the rhs of Eq. (2) is the Shannon entropy of the distribution $P(\ell)$, while the other two terms represent the constraints under which this entropy is maximized. They introduce the Lagrange multipliers $\psi_i$ and $\mu$ associated with the constraint in Eq. (1) and the normalization of $P(\ell)$, respectively. The solution of this maximization problem leads to the expression for the probability $\pi_{ij}$ that a given link connects node $i$ at one end to node $j$ at the other end, that is

$$\pi_{ij} = P(\ell = (i, j)) = e^{-\mu}\, e^{-\psi_i - \psi_j},$$

where a constant factor $e^{-1}$ has been absorbed into $\mu$, and where the Lagrange multipliers $\psi_i$, $\mu$ are the solutions of the constraint equations $k_i/L = 2k_i/(\langle k\rangle N) = 2e^{-\mu} e^{-\psi_i}\sum_{j=1}^{N} e^{-\psi_j}$, with $\langle k\rangle$ the average degree. Therefore, $e^{-\psi_i} = k_i$ and $e^{\mu} = (\langle k\rangle N)^2$, from which we obtain

$$\pi_{ij} = \frac{k_i k_j}{(\langle k\rangle N)^2}. \tag{3}$$

Notice that $\pi_{ij}$ is a probability distribution, thus obeying the normalization condition $\sum_{i,j}\pi_{ij} = 1$. Since there are $L = \langle k\rangle N/2$ links in the network, and a link joins $i$ and $j$ in either of its two orientations, the average number of links that connect node $i$ to node $j$ is given by

$$\langle A_{ij}\rangle = 2L\,\pi_{ij} = \frac{k_i k_j}{\langle k\rangle N}. \tag{4}$$

This is the average number of links between nodes of degrees $k_i$ and $k_j$ in the uncorrelated random networks [23]. Equation (4) is the starting point of many calculations in network science that use the uncorrelated random networks as a null model. A popular example is the modularity function used in community detection [24]. The derivation above provides a theoretical ground for such an interpretation of the model.

We now turn to more sophisticated outcomes of the considered framework. In our classical network ensemble, the degree distribution $P(k)$ is an input parameter that we can set to whatever we wish. Among all possible choices of the degree distribution, which one corresponds to maximal randomness?
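The degree-constrained ensemble can be checked numerically. The following is a minimal sketch, assuming the link probability takes the form $\pi_{ij} = k_i k_j/(\langle k\rangle N)^2$ over ordered node pairs (the form quoted in the coarse-graining discussion). The degree sequence and all function and variable names are hypothetical and chosen only for illustration.

```python
def link_probabilities(degrees):
    """Return pi[i][j] = k_i * k_j / (<k> N)^2 for a given degree sequence."""
    total = sum(degrees)  # equals <k> N, i.e. twice the number of links L
    return [[ki * kj / total**2 for kj in degrees] for ki in degrees]


# Hypothetical degree sequence, used only for illustration (sum = 8 = 2L).
degrees = [3, 2, 2, 1]
n_nodes = len(degrees)
n_links = sum(degrees) / 2

pi = link_probabilities(degrees)

# Normalization: probabilities over ordered node pairs should sum to one.
normalization = sum(sum(row) for row in pi)

# Degree constraint: node i appears at either end of a random link
# with total weight k_i / L.
end_weights = [
    sum(pi[i][j] + pi[j][i] for j in range(n_nodes)) for i in range(n_nodes)
]

# Expected number of links joining i and j, counting both orientations.
expected_links = [
    [2 * n_links * pi[i][j] for j in range(n_nodes)] for i in range(n_nodes)
]

print(normalization)
print(end_weights)
print(expected_links[0][1])
```

Working with ordered pairs keeps the two link ends distinguishable, which is why the expected number of links between two distinct nodes carries a factor of two.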
To answer this question, we start with Eq. (4) and define the classical network entropy as

$$S = -L\sum_{i,j}\pi_{ij}\ln\pi_{ij} = -N\sum_{k} k P(k)\ln k + \langle k\rangle N\ln(\langle k\rangle N). \tag{5}$$

Equation (5) quantifies the entropy associated with all edges in the network. It is given by the sum of $L = \langle k\rangle N/2$ identical terms, each corresponding to the entropy associated with the typical number of ways in which we can choose two nodes $(i, j)$ to be connected by a single link of the classical network ensemble. This entropy $S$ quantifies the amount of information encoded in the classical network ensemble with $N$ nodes, $L$ edges, and degree distribution $P(k)$. Any given $P(k)$ uniquely determines the value of $S$ via Eq. (5), yet the same value of $S$ may correspond to different $P(k)$s.

Suppose we now want to deal with a coarse-grained network model where all nodes in the same degree class are indistinguishable. This network ensemble is used to compress the information of the original network, retaining only the information regarding the degrees of the linked nodes. Clearly, in this model we aim at observing the same expected number of links $L_{k,k'}$ between nodes of degree $k$ and nodes of degree $k'$ as in the classical network ensemble. If we indicate with $N_k = N P(k)$ the number of nodes in degree class $k$ and use the notation $\pi_{k,k'} = kk'/(\langle k\rangle N)^2$ for the probability that a link of the classical network ensemble has a given node of degree $k$ at one end and a given node of degree $k'$ at the other end, it is easy to show that $L_{k,k'} = L\,\pi_{k,k'} N_k N_{k'}$ and that $\sum_{k,k'} L_{k,k'} = L$. Every link of the coarse-grained ensemble has probability $\Pi_{k,k'}$ to connect the super-nodes corresponding to degree classes $k$ and $k'$, where

$$\Pi_{k,k'} = \frac{L_{k,k'}}{L} = \pi_{k,k'} N_k N_{k'} = \frac{kk' P(k) P(k')}{\langle k\rangle^2}. \tag{6}$$

Therefore the entropy $H$ associated with the compressed model is

$$H = -L\sum_{k,k'}\Pi_{k,k'}\ln\Pi_{k,k'}. \tag{7}$$

We thus have two representations of the network ensemble, at the node level and at the compressed level, whose information content is quantified by the $S$-entropy and the $H$-entropy, respectively. Suppose we now want to deal with the classical network ensemble with different $P(k)$s that have the same information content or "explicative power" at the node level.
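The node-level and compressed entropies can be compared on small examples. The sketch below assumes $S = -L\sum_{i,j}\pi_{ij}\ln\pi_{ij}$ with $\pi_{ij} = k_i k_j/(\langle k\rangle N)^2$, and $H = -L\sum_{k,k'}\Pi_{k,k'}\ln\Pi_{k,k'}$ with $\Pi_{k,k'} = \pi_{k,k'}N_kN_{k'}$, as the text describes. The degree sequences and function names are hypothetical, and all degrees are assumed to be positive.

```python
import math
from collections import Counter


def classical_entropy(degrees):
    """S-entropy: L times the Shannon entropy of pi over ordered node pairs."""
    total = sum(degrees)  # <k> N = 2L
    n_links = total / 2
    entropy = 0.0
    for ki in degrees:
        for kj in degrees:
            p = ki * kj / total**2
            entropy -= n_links * p * math.log(p)
    return entropy


def compressed_entropy(degrees):
    """H-entropy: same quantity after merging nodes of equal degree."""
    total = sum(degrees)
    n_links = total / 2
    class_sizes = Counter(degrees)  # N_k for each degree class k
    entropy = 0.0
    for k, n_k in class_sizes.items():
        for kp, n_kp in class_sizes.items():
            p = k * kp * n_k * n_kp / total**2
            entropy -= n_links * p * math.log(p)
    return entropy


# Hypothetical degree sequences used only for illustration.
regular = [2, 2, 2, 2]   # one degree class: nothing left to encode in H
distinct = [1, 2, 3]     # every node is its own class: H equals S
mixed = [3, 2, 2, 1]

for seq in (regular, distinct, mixed):
    print(seq, classical_entropy(seq), compressed_entropy(seq))
```

A regular network gives $H = 0$ because all links join the same super-node to itself, whereas a sequence with all degrees distinct gives $H = S$, since coarse-graining then discards nothing.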
To this end, we impose the constraint $S = S^\star$, where $S^\star$ is a given positive real number. To find the typical degree distribution $P(k)$ under this constraint, we maximize the randomness of the coarse-grained model quantified by the $H$-entropy.
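Clearly, $P(k)$ must also satisfy the constraints $\sum_k k P(k) = \langle k\rangle$ and $\sum_k P(k) = 1$. Combining all together, we thus have to maximize the functional

$$\mathcal{F} = H + \lambda\,[S - S^\star] - \mu N\Big[\sum_k k P(k) - \langle k\rangle\Big] - \nu N\Big[\sum_k P(k) - 1\Big], \tag{8}$$

from which we obtain

$$P(k) = \langle k\rangle\, e^{-(\mu+1)}\, k^{-(\lambda+1)}\, e^{-\nu/k}. \tag{9}$$

The Lagrange multipliers $\lambda$, $\mu$, and $\nu$ are determined as the solutions of the constraint equations

$$\frac{e^{\mu+1}}{\langle k\rangle} = \sum_k k^{-(\lambda+1)} e^{-\nu/k}, \qquad e^{\mu+1} = \sum_k k^{-\lambda} e^{-\nu/k},$$

together with the condition $S = S^\star$ evaluated through Eq. (5). These solutions always exist as long as $\lambda > 1$.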
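The constraint equations can be solved numerically. The sketch below is illustrative only: it assumes the stationary distribution has the form $P(k) \propto k^{-(\lambda+1)} e^{-\nu/k}$, truncates the support at a hypothetical maximum degree `k_max` (a finite-size assumption not made in the text), fixes $\lambda$ by hand, and finds $\nu$ by bisection so that the mean degree matches a target. It relies on the mean degree growing monotonically with $\nu$.

```python
import math


def degree_distribution(lam, nu, k_max):
    """Normalized P(k) proportional to k^-(lam+1) * exp(-nu/k), k = 1..k_max."""
    weights = [k ** (-(lam + 1.0)) * math.exp(-nu / k) for k in range(1, k_max + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def mean_degree(p):
    """Mean of a distribution stored as p[0] = P(1), p[1] = P(2), ..."""
    return sum(k * pk for k, pk in enumerate(p, start=1))


def solve_nu(lam, target_mean, k_max, lo=0.0, hi=200.0, iters=100):
    """Bisection on nu; assumes the target mean is bracketed by [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_degree(degree_distribution(lam, mid, k_max)) < target_mean:
            lo = mid  # cutoff too weak: mean still below target
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical parameter values chosen only for illustration.
lam, target_mean, k_max = 1.5, 4.0, 1000
nu = solve_nu(lam, target_mean, k_max)
p = degree_distribution(lam, nu, k_max)

print(nu, mean_degree(p), sum(p))
```

Normalizing the weights directly plays the role of the multiplier $\mu$, so only $\nu$ needs to be searched for once $\lambda$ is fixed.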
Equation (9) shows that the typical degree distribution $P(k)$ with a given value of the classical entropy in Eq. (5) is a power law. To be precise, the power-law decay holds for large degrees $k$, while in the low-$k$ region there is an exponential cutoff that affects the mean of the distribution. In Figure 1(a) we show the entropy $H$ as a function of $S$ for different values of the average degree $\langle k\rangle$. The lower the $S$, and consequently the lower the power-law exponent $\lambda$, the higher the entropy $H$. This is because, even though the number of networks with a given degree sequence decreases as $S$ and $\lambda$ go down, the number of ways to split $L$ links into classes of links connecting nodes of degrees $k$ and $k'$ increases. This result therefore highlights the entropic benefit of networks with broader, lower-$\lambda$ degree distributions, which correspond to lower values of the $S$-entropy but to higher values of the $H$-entropy. Interestingly, the same result could be obtained by maximizing the randomness of the classical network ensemble, and therefore the $S$-entropy, while keeping fixed the informative power of its compressed description, i.e., the $H$-entropy.

We now discuss how our approach can be used to predict the most likely distribution of the nodes in space when pairs of nodes have a given space-dependent linking probability. Our approach reveals that if nodes are interacting in a network, then interactions induce a natural tendency of the nodes to be distributed inhomogeneously in space. The finding is consistent with the so-called "blessing of non-uniformity" of data, i.e., the fact that real-world data typically do not obey uniform distributions [25]. We first consider spatial networks without any degree constraints, and then combine spatial and degree-based information in heterogeneous spatial networks.
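Let $\delta_{ij}$ be the distance between nodes $i$ and $j$ in some embedding space, and $\omega(\delta)$ be the distance distribution between all the $\binom{N}{2}$ pairs of nodes, which we also call the correlation function: the number of pairs of nodes at distance $\delta$ is $\binom{N}{2}\omega(\delta) \simeq N^2\omega(\delta)/2$. We define a spatial classical network ensemble by imposing the constraint

$$\sum_{\ell} P(\ell)\, F(\delta_{\ell_1\ell_2}) = \bar{F}, \tag{10}$$

where $F(\delta)$ is a function of the distance and $\bar{F}$ is its prescribed average over links. Different functions correspond to different ensembles. For example, in the ensemble with a fixed average length of links, this function is $F(\delta) = \delta$. If it is $F(\delta) = \ln\delta$, then the average value of the order of magnitude of link lengths is fixed. The maximum entropy principle dictates the maximization of the functional

$$\mathcal{G} = -\sum_{\ell} P(\ell)\ln P(\ell) - \alpha\Big[\sum_{\ell} P(\ell)\, F(\delta_{\ell_1\ell_2}) - \bar{F}\Big] - \mu\Big[\sum_{\ell} P(\ell) - 1\Big], \tag{11}$$

leading to

$$\pi_{ij} = \frac{f(\delta_{ij})}{N^2}, \tag{12}$$

with $f(\delta) = g(\delta)/z$, $z = \int d\delta\,\omega(\delta)\, g(\delta)$, and $g(\delta) = e^{-\alpha F(\delta)}$. Therefore if $F(\delta) = \ln\delta$, then the linking probability decays with the distance as a power law, $g(\delta) = \delta^{-\alpha}$. If $F(\delta) = \delta$, then this decay is exponential, $g(\delta) = e^{-\alpha\delta}$. Fixing the number of links in the network to $L$ as before, the classical entropy of the ensemble is

$$S = -L\sum_{i,j}\pi_{ij}\ln\pi_{ij} = -L\int d\delta\,\omega(\delta)\, f(\delta)\ln f(\delta) + 2L\ln N, \tag{13}$$

which is the spatial analogue of the classical entropy in Eq. (5).

We now ask: what is the most suitable distribution of nodes in the space at parity of explicative power of the network model? That is, what is the most likely correlation function $\omega(\delta)$ such that $S = S^\star$? To answer this question, we define the entropy $H$ of the compressed model, in which we consider only the number of ways to distribute $L$ links such that every link connects two nodes at distance $\delta$ with probability density $\phi(\delta)$,

$$H = -L\int d\delta\,\phi(\delta)\ln\phi(\delta), \tag{14}$$

where $\phi(\delta) = L_\delta/L$ and $L_\delta = L\,\omega(\delta) f(\delta)$ is the expected number of links between nodes at distance $\delta$ in the classical network ensemble (which clearly satisfies the normalization condition $\int d\delta\, L_\delta = L$). The maximum entropy value of $\omega(\delta)$ is then found by maximizing the functional

$$\mathcal{F} = H + \lambda\,[S - S^\star] - \mu L\Big[\int d\delta\,\omega(\delta) f(\delta) - 1\Big] - \nu L\Big[\int d\delta\,\omega(\delta) - 1\Big], \tag{15}$$

where $\lambda$, $\mu$, $\nu$ are the Lagrange multipliers coupled to the constraints.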
The solution reads

$$\omega(\delta) = e^{-(\mu+1)}\,[f(\delta)]^{-(\lambda+1)}\, e^{-\nu/f(\delta)}, \tag{16}$$

so that $L_\delta$ is given by

$$L_\delta = L\, e^{-(\mu+1)}\,[f(\delta)]^{-\lambda}\, e^{-\nu/f(\delta)}. \tag{17}$$

The Lagrange multipliers are then found as the solutions of the constraint equations. In Figure 1(b), we show the entropy $H$ as a function of $S$ for a power-law decaying linking probability $f(\delta) = \delta^{-\alpha}/z$.

We now make several important observations. First, if the space is infinite, isotropic, and homogeneous, then the networks are homogeneous as well, since any two points in the space are equivalent and the linking probability depends only on the distance between pairs of points. The degree distribution is thus the Poisson distribution with the mean equal to the average degree $\langle k\rangle = 2L/N$. Second, Eq. (16) says that the maximum entropy distribution $\omega(\delta)$ of distances $\delta$ between the nodes in the space is uniquely determined by the linking probability $f(\delta)$. Third, if this probability decays as a power law, $f(\delta) = \delta^{-\alpha}/z$, then the framework describes the natural emergence of power-law pair correlation functions. Specifically, the solution in Eq. (16) scales as a power law at small distances $\delta$, while at large distances the decay is exponential due to the finiteness of the system.

If the embedding space is Euclidean of dimension $d$, then points are scattered in the space according to a fractal distribution. Define the node pair density function by

$$\rho(\delta) = \frac{\omega(\delta)}{\Omega_\delta}, \tag{18}$$

where $\Omega_\delta$ is the volume element at distance $\delta$ from an arbitrary point. In the $d$-dimensional Euclidean space, $\Omega_\delta$ is the volume of the $(d-1)$-dimensional spherical shell, scaling with $\delta$ as $\Omega_\delta \propto \delta^{d-1}$. Therefore for a power-law linking probability $f(\delta) = \delta^{-\alpha}/z$, we get

$$\rho(\delta) \propto \delta^{\beta}\, e^{-\nu z \delta^{\alpha}}, \tag{19}$$

where $\beta = (\lambda+1)\alpha - (d-1)$. Therefore, the embedding in $d$ dimensions is possible only if $\beta < 0$. Finally, the distribution of nodes in the space is fractal and therefore highly nonuniform, as the uniform distribution would correspond to $\rho(\delta) = \mathrm{const}$.
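The small-distance scaling of the pair density can be checked numerically. The sketch below assumes, as one reading of the maximization described above, that the correlation function has the form $\omega(\delta) \propto f(\delta)^{-(\lambda+1)} e^{-\nu/f(\delta)}$ with $f(\delta) = \delta^{-\alpha}/z$, and that $\rho(\delta) = \omega(\delta)/\delta^{d-1}$ up to constants. All parameter values and function names are hypothetical.

```python
import math


def omega_unnormalized(delta, alpha, lam, nu, z=1.0):
    """Assumed shape of the pair correlation function, up to a constant factor."""
    f = delta ** (-alpha) / z  # power-law linking probability
    return f ** (-(lam + 1.0)) * math.exp(-nu / f)


def local_slope(func, x, eps=1e-4):
    """Logarithmic derivative d ln(func) / d ln(x) by central differences."""
    upper = math.log(func(x * (1.0 + eps)))
    lower = math.log(func(x * (1.0 - eps)))
    return (upper - lower) / (math.log(1.0 + eps) - math.log(1.0 - eps))


# Hypothetical parameters chosen only for illustration.
alpha, lam, nu, dim = 0.3, 0.5, 1.0, 2
beta = (lam + 1.0) * alpha - (dim - 1)


def rho(delta):
    """Pair density: correlation function divided by the shell volume."""
    return omega_unnormalized(delta, alpha, lam, nu) / delta ** (dim - 1)


slope_small = local_slope(rho, 1e-12)  # deep in the power-law regime
slope_large = local_slope(rho, 1e6)    # dominated by the exponential cutoff

print(beta, slope_small, slope_large)
```

Under these assumptions the log-log slope at small distances approaches $\beta$, while at large distances the cutoff makes the slope much steeper.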
As the last example we consider the classical network ensemble of spatial heterogeneous networks combining the degree and spatial constraints in Eq. (1) and Eq. (10), respectively. The probability $\pi_{ij}$ that a random link connects nodes $i$ and $j$ is given in this case by

$$\pi_{ij} = \frac{\kappa_i\kappa_j\, f(\delta_{ij})}{N^2}, \tag{20}$$

where $f(\delta) = e^{-\alpha F(\delta)}/z$, with $\alpha$ the Lagrange multiplier coupled to the constraint in Eq. (10), $z$ the normalization constant enforcing $\sum_{i,j}\pi_{ij} = 1$, and $\kappa_i$ the hidden variable of node $i$ given by $\kappa_i = e^{-\psi_i}\langle k\rangle N$, with $\psi_i$ the Lagrange multiplier coupled to the constraint in Eq. (1). If there are no correlations between the positions of the nodes in the space and their degrees, then the probability $\pi_{ij}$ can be written as

$$\pi_{ij} = \frac{k_i k_j\, f(\delta_{ij})}{N^2}, \tag{21}$$

meaning that $\kappa_i = k_i$, so that $\kappa_i$ can be interpreted as the expected degree $k_i$ of node $i$. Using the same approximation as in Ref. [26] for a power-law decaying function $f(\delta_{ij}) = \delta_{ij}^{-\alpha}/z$, we can write $\pi_{ij}$ as $\pi_{ij} \propto e^{-r_{ij}}$, where $r_{ij} = \ln\kappa_i + \ln\kappa_j - \alpha\ln\delta_{ij}$ is approximately the hyperbolic distance between nodes $i$ and $j$ located at radial coordinates $\ln\kappa_i$ and $\ln\kappa_j$ and at the angular distance proportional to $\delta_{ij}$. Parameter $\alpha$ can then be related to the hyperbolic space curvature. The classical entropy of this ensemble is given by

$$S = -L\sum_{i,j}\pi_{ij}\ln\pi_{ij} = -L\int d\kappa\, d\kappa'\, d\delta\;\omega(\kappa, \kappa', \delta)\,\kappa\kappa' f(\delta)\ln[\kappa\kappa' f(\delta)] + 2L\ln N, \tag{22}$$

where $\omega(\kappa, \kappa', \delta)$ is the density of pairs of nodes with hidden variables $\kappa$ and $\kappa'$ at distance $\delta$.
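What is the most likely pair correlation function $\omega(\kappa, \kappa', \delta)$ for a fixed value of the classical entropy $S = S^\star$? To answer this question, we maximize the $H$-entropy of the compressed model,

$$H = -L\int d\kappa\, d\kappa'\, d\delta\;\phi(\kappa, \kappa', \delta)\ln\phi(\kappa, \kappa', \delta), \tag{23}$$

where $\phi(\kappa, \kappa', \delta) = L_{\kappa,\kappa',\delta}/L$ is the probability density that a link connects two nodes with hidden variables $\kappa$ and $\kappa'$ at distance $\delta$. Note that $L_{\kappa,\kappa',\delta} = L\,\omega(\kappa, \kappa', \delta)\,\kappa\kappa' f(\delta)$ indicates the expected number of links between pairs of nodes with hidden variables $\kappa$ and $\kappa'$ at distance $\delta$ in the classical network ensemble. The maximization of $H$ respecting the constraint $S = S^\star$, as well as the normalization of $L_{\kappa,\kappa',\delta}$, $\int d\kappa\, d\kappa'\, d\delta\, L_{\kappa,\kappa',\delta} = L$, and the normalization of $\omega(\kappa, \kappa', \delta)$, $\int d\kappa\, d\kappa'\, d\delta\,\omega(\kappa, \kappa', \delta) = 1$, yields the answer

$$\omega(\kappa, \kappa', \delta) = e^{-(\mu+1)}\, w^{-(\lambda+1)}\, e^{-\nu/w}, \qquad w = \kappa\kappa' f(\delta), \tag{24}$$

where $\lambda$, $\mu$, $\nu$ are the Lagrange multipliers coupled to the $S = S^\star$ constraint, the normalization of $L_{\kappa,\kappa',\delta}$, and the normalization of $\omega(\kappa, \kappa', \delta)$, respectively. Observe that the pair correlation function $\omega(\kappa, \kappa', \delta)$ depends on its arguments only via $w = \kappa\kappa' f(\delta)$, and for $w \gg \nu$ it decays as a power-law function of $w$. If $f(\delta) = \delta^{-\alpha}/z$, then $\omega(\kappa, \kappa', \delta)$ can also be written in terms of the approximate hyperbolic distance $r = \ln w = \ln\kappa + \ln\kappa' - \alpha\ln\delta$ as

$$\omega(r) = e^{-(\mu+1)}\, e^{-(\lambda+1) r}\, e^{-\nu e^{-r}}. \tag{25}$$

As in the homogeneous spatial case, here we also observe that the most likely distribution of nodes in the space is not uniform.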
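The shape of the pair correlation function can be explored with a few lines of code. The sketch below assumes, as one reading of the maximization described above, that $\omega$ depends on $w = \kappa\kappa' f(\delta)$ as $w^{-(\lambda+1)} e^{-\nu/w}$ up to a constant. Under that assumption the function peaks at $w = \nu/(\lambda+1)$, which separates the exponentially suppressed region from the power-law tail. The multiplier values are those quoted for the American Airlines network in the caption of Figure 2; the function names and the grid are hypothetical.

```python
import math


def omega_shape(w, lam, nu):
    """Assumed pair correlation function of w, up to a constant factor."""
    return w ** (-(lam + 1.0)) * math.exp(-nu / w)


def omega_shape_from_r(r, lam, nu):
    """The same function written in terms of r = ln(w)."""
    return math.exp(-(lam + 1.0) * r) * math.exp(-nu * math.exp(-r))


def crossover_scale(lam, nu):
    """Location of the maximum, from d ln(omega) / dw = 0."""
    return nu / (lam + 1.0)


lam, nu = 1.2, 120.0  # values quoted for AA in the Figure 2 caption
w_peak = crossover_scale(lam, nu)

# Logarithmic grid from 1 to 1e5, used to locate the maximum numerically.
grid = [10 ** (e / 100.0) for e in range(0, 501)]
w_best = max(grid, key=lambda w: omega_shape(w, lam, nu))

print(w_peak, w_best)
print(omega_shape(200.0, lam, nu), omega_shape_from_r(math.log(200.0), lam, nu))
```

The two functions are the same curve in different variables, so agreement between them is a consistency check rather than an independent result.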
In Figure 2 we apply the considered information-theoretic framework to real-world air transportation networks, in which nodes are airports and edges between pairs of nodes indicate the existence of at least one flight connecting the two airports. Specifically, we consider three networks corresponding to flights operated in different geographic areas by three air carriers. The distances $\delta_{ij}$ between airports $i$ and $j$ are their geographic distances. The linking probability $f(\delta)$ is computed from the data as the empirical connection probability, and the hidden variables $\kappa_i$ are set to the actual degrees of the airports in the networks. We note that the empirical connection probabilities $f(\delta)$ decay as power laws, and that the pair correlation functions $\omega(w)$ are well described by Eq. (24).
In summary, this work illustrates a classical information-theoretic approach to the characterization of random networks. According to our theory, network inhomogeneities in the distribution of node degrees and/or node positions in space both emerge from the general principle of maximizing randomness at parity of explicative power. The framework provides theoretical foundations for a series of models often encountered in network science, and can likely be extended to generalized network models such as multilayer networks and simplicial complexes [29,30] or to information theory approaches based on the network spectrum [31]. In applications to real-world networks, the framework provides a theoretical explanation of the nontrivial inhomogeneities that are a ubiquitous feature of real-world complex systems.

Figure 2. The application of the information-theoretic framework to real-world airport networks. The networks correspond to the flights operated by American Airlines (AA) during January-April 2018 between US airports [27], and by Lufthansa (LU) and Ryanair (RY) during the year 2011 between European airports [28]. For each air carrier, a separate air transportation network is built, in which nodes are airports and two airports are connected if at least one flight between the two airports is present in the data. Using the network topology and the geographic locations of the airports, the empirical linking probability $f(\delta)$ (panel (a)) and the density distribution $P(\delta) = \int d\kappa\, d\kappa'\,\omega(\kappa, \kappa', \delta)$ (panel (b)) are computed for the three networks, where the distance $\delta$ is geographic and is measured in kilometers. Panel (c) shows the pair correlation functions $\omega(\kappa, \kappa', \delta) = \omega(w)$, where $w = \kappa\kappa' f(\delta)$, for the three networks. Points represent empirical densities, while the full lines are theoretical predictions according to Eq. (24). Values of the Lagrange multipliers are: $\lambda = 1.2$ and $\nu = 120$ for AA, $\lambda = 1.3$ and $\nu = 5$ for LU, and $\lambda = 0.45$ and $\nu = 8$ for RY.