Size reduction of complex networks preserving modularity

The ubiquity of modular structure in real-world complex networks is the focus of many attempts to understand the interplay between network topology and functionality. The best approaches to the identification of modular structure are based on the optimization of a quality function known as modularity. However, this optimization is a hard task because the problem belongs to the non-deterministic polynomial-time hard (NP-hard) class. Here we propose an exact method for reducing the size of weighted (directed and undirected) complex networks while maintaining their modularity. This size reduction allows heuristic algorithms that optimize modularity to explore the modularity landscape more thoroughly. We compare the modularity obtained in several real complex networks using the extremal optimization algorithm, before and after the size reduction, and show the improvement obtained. We speculate that the proposed analytical size reduction could be extended to an exact coarse graining of the network in the scope of real-space renormalization.


Introduction
The study of the community structure in complex networks is becoming a classical subject in the area because several aspects of the problem are both challenging and interesting. The challenge comes from the difficulty of unveiling the best partition of the network in terms of communities, understood as groups of nodes that are more densely connected among themselves than with the rest of the network [1]. The interest comes from the fact that this level of description could help to elucidate an organization of the network prescribed by functionalities [2,3], and also because it resembles the coarse graining process used in statistical physics to describe systems at the mesoscale.
The most successful solutions to the community detection problem, in terms of accuracy and computational cost, are those based on the optimization of a quality function called modularity, proposed by Newman [4], which allows the comparison of different partitions of the network. Given a network partitioned into communities, with $C_i$ the community to which node $i$ is assigned, the mathematical definition of modularity is expressed in terms of the weighted adjacency matrix $w_{ij}$, which represents the weight of the link between $i$ and $j$ (0 if no link exists), and the strengths $w_i = \sum_j w_{ij}$, as
$$Q = \frac{1}{2w} \sum_i \sum_j \left( w_{ij} - \frac{w_i w_j}{2w} \right) \delta(C_i, C_j), \tag{1.1}$$
where the Kronecker delta function $\delta(C_i, C_j)$ takes the value 1 if nodes $i$ and $j$ are in the same community and 0 otherwise, and the total strength is $2w = \sum_i w_i$. The modularity of a given partition is then the probability of having edges falling within groups in the network minus the expected probability in an equivalent (null case) network with the same number of nodes, and edges placed at random preserving the nodes' strengths. The larger the value of modularity, the better the partition, because it deviates more from the null case. Several authors have attacked the problem by proposing different optimization heuristics [6,7,8,9,10,11], since the number of different partitions is equal to the Bell [12] or exponential numbers, which grow at least exponentially in the number of nodes $N$. Indeed, optimization of modularity is an NP-hard (non-deterministic polynomial-time hard) problem [13].
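The definition of modularity can also be extended, preserving its semantics in terms of probability, to the scenario of weighted directed networks as follows:
$$Q = \frac{1}{2w} \sum_i \sum_j \left( w_{ij} - \frac{w^{\mathrm{out}}_i w^{\mathrm{in}}_j}{2w} \right) \delta(C_i, C_j), \tag{1.2}$$
where $w^{\mathrm{out}}_i = \sum_j w_{ij}$ and $w^{\mathrm{in}}_j = \sum_i w_{ij}$ are respectively the output and input strengths of nodes $i$ and $j$, and the total strength is $2w = \sum_i w^{\mathrm{out}}_i = \sum_j w^{\mathrm{in}}_j = \sum_i \sum_j w_{ij}$. The input and output strengths are equal ($w_i = w^{\mathrm{out}}_i = w^{\mathrm{in}}_i$) if the network is undirected, thus recovering the standard definition of strength. Furthermore, if the network is unweighted and undirected, $w_i$ represents the degree of the $i$-th node, i.e. the number of edges attached to it, and $w$ is the total number of links of the network.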
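As an illustration, the following minimal Python sketch evaluates the weighted directed modularity, $Q = \frac{1}{2w}\sum_{ij}\left(w_{ij} - w^{\mathrm{out}}_i w^{\mathrm{in}}_j/2w\right)\delta(C_i, C_j)$, directly from a weight matrix. The function name `modularity`, the list-of-lists representation and the toy network are assumptions made for this example and are not taken from the original work.

```python
def modularity(w, communities):
    """Modularity of a partition of a weighted, possibly directed network.

    w           -- square weight matrix as a list of lists, w[i][j] >= 0
    communities -- communities[i] is the community label of node i
    """
    n = len(w)
    w_out = [sum(w[i]) for i in range(n)]                       # output strengths
    w_in = [sum(w[i][j] for i in range(n)) for j in range(n)]   # input strengths
    total = sum(w_out)                                          # total strength 2w
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += w[i][j] - w_out[i] * w_in[j] / total
    return q / total


# Toy network (hypothetical): two triangles joined by a single link.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
w = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    w[a][b] = w[b][a] = 1.0   # undirected network: symmetric weights

q_two = modularity(w, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_one = modularity(w, [0] * 6)              # all nodes in a single community
print(q_two, q_one)
```

For this toy network the partition into the two triangles gives $Q = 5/14$, whereas placing all nodes in a single community gives $Q = 0$. The double loop costs $O(N^2)$ operations, which is adequate for a sketch but not for large networks.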
The challenge of optimizing modularity has attracted considerable effort from the scientific community in recent years. Because the problem is NP-hard, only optimization heuristics have proved capable of finding suboptimal solutions of $Q$ in feasible computational time. Nevertheless, when decomposing very large networks into communities, optimality is usually sacrificed in favor of computational time.
Our goal here is to demonstrate that it is possible to reduce the size of complex networks while preserving the value of modularity, independently of the partition under consideration. The systematic use of this reduction allows a more exhaustive search of the space of partitions, which usually results in improved values of modularity compared to those obtained without the size reduction. The paper is organized as follows. In the next section we present the basics of the size reduction process. After that, we provide analytic proofs for specific reductions. Finally, we exploit the reduction process based on these properties and compare the modularity results with those obtained without size reduction in several real networks, using the Extremal Optimization heuristic [8].

Reduced graph
Let $G$ be a weighted complex network of size $N$, with weights $w_{ij} \geq 0$, $i, j \in \{1, \ldots, N\}$. If the network is unweighted, the weights matrix becomes the usual connectivity matrix, with value 1 for connected pairs of nodes and 0 otherwise. We will assume that the network may be directed, i.e. represented by a non-symmetric weights matrix.
Any grouping of the $N$ nodes of the complex network $G$ in $N'$ parts may be represented by a surjective function $R : \{1, \ldots, N\} \longrightarrow \{1, \ldots, N'\}$ which assigns a group index $R_i \equiv R(i)$ to every $i$-th node in $G$. The reduced network $G'$, in which each of these groups is replaced by a single node, may be easily defined in the following way: the weight $w'_{rs}$ between the nodes which represent groups $r$ and $s$ is the sum of all the weights connecting vertices in these groups,
$$w'_{rs} = \sum_i \sum_j w_{ij}\, \delta(R_i, r)\, \delta(R_j, s), \qquad r, s \in \{1, \ldots, N'\},$$
where the sums run over all the $N$ nodes of $G$. For unweighted networks the value of $w'_{rs}$ is just the number of arcs from the first to the second group of nodes. It must be emphasized that a node $r$ of the reduced network $G'$ acquires a self-loop if $w'_{rr} \neq 0$, which summarizes the internal connectivity of the nodes of $G$ forming this group.
The output and input strengths of the reduced network $G'$ are
$${w'}^{\mathrm{out}}_r = \sum_s w'_{rs} = \sum_i w^{\mathrm{out}}_i\, \delta(R_i, r), \qquad {w'}^{\mathrm{in}}_s = \sum_r w'_{rs} = \sum_j w^{\mathrm{in}}_j\, \delta(R_j, s),$$
and its total strength $2w'$ is equal to the total strength $2w$ of the original network:
$$2w' = \sum_r \sum_s w'_{rs} = \sum_i \sum_j w_{ij} = 2w.$$
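The construction of the reduced network can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `reduce_network`, the list-of-lists weight matrix and the convention that group labels are arbitrary hashable values (renumbered in sorted order) are assumptions of this example.

```python
def reduce_network(w, groups):
    """Weight matrix of the reduced network G'.

    w      -- square weight matrix of G as a list of lists
    groups -- groups[i] is the group label R_i of node i
    Returns an N' x N' matrix whose entry [r][s] is the sum of all
    weights w[i][j] with node i in group r and node j in group s.
    """
    labels = sorted(set(groups))
    index = {label: r for r, label in enumerate(labels)}  # label -> reduced node
    m = len(labels)
    w_red = [[0.0] * m for _ in range(m)]
    for i, row in enumerate(w):
        for j, weight in enumerate(row):
            w_red[index[groups[i]]][index[groups[j]]] += weight
    return w_red


# Toy example (hypothetical): an unweighted, undirected path 0 - 1 - 2,
# in which nodes 0 and 1 are grouped together.
w = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
w_red = reduce_network(w, [0, 0, 1])
print(w_red)
```

In this example the grouped node acquires a self-loop of weight 2 (the link between nodes 0 and 1 counted in both directions), and the total strength, 4, is the same before and after the reduction.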

Modularity preservation
The main property of the reduced network is the preservation of modularity (1.1) or (1.2), i.e. the modularity of any partition of the reduced graph is equal to the modularity of its corresponding partition of the original network. More precisely, let $C' : \{1, \ldots, N'\} \longrightarrow \{1, \ldots, M\}$ be a partition in $M$ clusters of the reduced network $G'$. Its corresponding partition $C : \{1, \ldots, N\} \longrightarrow \{1, \ldots, M\}$ of the original graph is given by the composition of the reducing function $R$ with the partition $C'$, i.e. $C = C' \circ R$. Therefore, the statement of the previous paragraph becomes
$$Q'(C') = Q(C' \circ R). \tag{2.5}$$
The proof is straightforward:
$$\begin{aligned}
Q'(C') &= \frac{1}{2w'} \sum_r \sum_s \left( w'_{rs} - \frac{{w'}^{\mathrm{out}}_r {w'}^{\mathrm{in}}_s}{2w'} \right) \delta(C'_r, C'_s) \\
&= \frac{1}{2w} \sum_r \sum_s \sum_i \sum_j \left( w_{ij} - \frac{w^{\mathrm{out}}_i w^{\mathrm{in}}_j}{2w} \right) \delta(R_i, r)\, \delta(R_j, s)\, \delta(C'_r, C'_s) \\
&= \frac{1}{2w} \sum_i \sum_j \left( w_{ij} - \frac{w^{\mathrm{out}}_i w^{\mathrm{in}}_j}{2w} \right) \delta(C'_{R_i}, C'_{R_j}) = Q(C).
\end{aligned}$$
We have found a relevant property of modularity, namely that the nodes forming a community in the optimal partition can be represented by a single node in the reduced network. Each node in the reduced network summarizes the information necessary for the calculation of modularity in its self-loop (which accounts for the intraconnectivity of the community) and its arcs (which account for the total strengths with the rest of the network). The question now is: how can we determine which nodes will belong to the same community in the optimal partition before this partition is obtained? The answer will provide a size reduction method for complex networks that preserves modularity.
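The preservation property can also be checked numerically. The sketch below, which is an illustration under stated assumptions rather than the procedure used in this work, builds a small weighted directed network, reduces it with a grouping $R$, and compares the modularity of a partition $C'$ of the reduced network with that of $C = C' \circ R$ in the original one. The helper names `modularity` and `reduce_network`, and the particular weights, are hypothetical.

```python
def modularity(w, communities):
    """Weighted directed modularity of a partition (list-of-lists matrix)."""
    n = len(w)
    w_out = [sum(w[i]) for i in range(n)]
    w_in = [sum(w[i][j] for i in range(n)) for j in range(n)]
    total = sum(w_out)  # total strength 2w
    q = sum(w[i][j] - w_out[i] * w_in[j] / total
            for i in range(n) for j in range(n)
            if communities[i] == communities[j])
    return q / total


def reduce_network(w, groups):
    """Reduced weight matrix; groups[i] must be an index in 0..N'-1."""
    m = max(groups) + 1
    w_red = [[0.0] * m for _ in range(m)]
    for i, row in enumerate(w):
        for j, weight in enumerate(row):
            w_red[groups[i]][groups[j]] += weight
    return w_red


# Hypothetical weighted directed network with a self-loop on node 2.
w = [[0, 2, 0, 1],
     [1, 0, 3, 0],
     [0, 0, 1, 2],
     [4, 0, 0, 0]]
R = [0, 0, 1, 2]            # nodes 0 and 1 are grouped into reduced node 0
C_reduced = [0, 1, 1]       # a partition C' of the reduced network
C_original = [C_reduced[R[i]] for i in range(len(w))]   # C = C' o R

q_reduced = modularity(reduce_network(w, R), C_reduced)
q_original = modularity(w, C_original)
print(q_reduced, q_original)
```

The two printed values should coincide up to floating-point rounding, for any choice of $R$ and $C'$.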

Analytic reductions
Here we give proofs for certain analytic size reductions of weighted networks, both undirected and directed.

Reductions for undirected networks
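The modularity of an undirected network may be written as
$$Q = \sum_i q_i,$$
where
$$q_i = \frac{1}{2w} \sum_j \left( w_{ij} - \frac{w_i w_j}{2w} \right) \delta(C_i, C_j)$$
is the contribution to modularity of the $i$-th node. If we allow this node to change community, the value of $C_i$ becomes a parameter, and it is therefore useful to define
$$q_{i,r} = \frac{1}{2w} \left[ w_{ii} - \frac{w_i^2}{2w} + \sum_{j \neq i} \left( w_{ij} - \frac{w_i w_j}{2w} \right) \delta(C_j, r) \right],$$
which accounts for the contribution of the $i$-th node to modularity if it were in community $r$. The separation of the self-loop term, which does not depend on which community node $i$ belongs to, leads to the definition of
$$\tilde{q}_{i,r} = \frac{1}{2w} \sum_{j \neq i} \left( w_{ij} - \frac{w_i w_j}{2w} \right) \delta(C_j, r)$$
and
$$\tilde{q}_{i,i} = \frac{1}{2w} \left( w_{ii} - \frac{w_i^2}{2w} \right),$$
satisfying
$$q_{i,r} = \tilde{q}_{i,r} + \tilde{q}_{i,i}$$
and
$$q_i = q_{i,C_i}.$$
The role of these individual node contributions to modularity becomes evident in the expression of the change of modularity when node $i$ goes from community $r$ to community $s$:
$$\Delta Q = 2 \left( \tilde{q}_{i,s} - \tilde{q}_{i,r} \right).$$
As a particular case, a node that forms its own community, i.e. an isolated node $i$, which moves to any community $s$ produces a change in modularity
$$\Delta Q = 2 \tilde{q}_{i,s}. \tag{3.9}$$
Therefore, if there exists a community $s$ for which $\tilde{q}_{i,s} > 0$, node $i$ cannot be isolated in the partition of optimal modularity. This existence is easily proved by considering the sum of $\tilde{q}_{i,r}$ over all communities:
$$\sum_r \tilde{q}_{i,r} = \frac{1}{2w} \sum_{j \neq i} \left( w_{ij} - \frac{w_i w_j}{2w} \right) = -\frac{1}{2w} \left( w_{ii} - \frac{w_i^2}{2w} \right), \tag{3.10}$$
where we have made use of the definitions of strength $w_i$ and total strength $2w$ to simplify the expression. The right-hand side is positive unless the self-loop $w_{ii}$ is large enough to offset $w_i^2/2w$, so at least one of the terms of the sum must be positive. This completes the proof that there are no isolated nodes in the configuration which maximizes modularity, unless they have a big enough self-loop §. There remains the problem of determining an acquaintance (node $j$) of node $i$ in its optimal community, in order to group them ($R_i = R_j$) in a single equivalent node with a self-loop, as explained above. If we know that nodes $i$ and $j$ share the same community at maximum modularity, the reduced network will be equivalent to the original one as regards modularity: no information is lost, and the size is smaller. Taking into account that $\tilde{q}_{i,r}$ can only be positive if there is a link between node $i$ and another node in community $r$, the only candidates to be the right acquaintance of any node are its neighbors in the network.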
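The expression for the change of modularity when a single node moves between communities, $\Delta Q = 2(\tilde{q}_{i,s} - \tilde{q}_{i,r})$ with $\tilde{q}_{i,r} = \frac{1}{2w}\sum_{j \neq i}(w_{ij} - w_i w_j/2w)\,\delta(C_j, r)$, can be verified numerically. The following Python sketch does so on a toy undirected network; the function names `modularity` and `q_tilde` and the example network are assumptions introduced only for this check.

```python
def modularity(w, communities):
    """Modularity of a partition of a weighted undirected network."""
    n = len(w)
    strength = [sum(row) for row in w]
    total = sum(strength)  # total strength 2w
    q = sum(w[i][j] - strength[i] * strength[j] / total
            for i in range(n) for j in range(n)
            if communities[i] == communities[j])
    return q / total


def q_tilde(w, communities, i, r):
    """Contribution of node i to community r, excluding its self-loop term."""
    n = len(w)
    strength = [sum(row) for row in w]
    total = sum(strength)
    return sum(w[i][j] - strength[i] * strength[j] / total
               for j in range(n)
               if j != i and communities[j] == r) / total


# Toy network (hypothetical): two triangles joined by the link 2 - 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
w = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    w[a][b] = w[b][a] = 1.0

before = [0, 0, 0, 1, 1, 1]
after = [0, 0, 1, 1, 1, 1]      # node 2 moves from community 0 to community 1

delta_direct = modularity(w, after) - modularity(w, before)
delta_formula = 2 * (q_tilde(w, before, 2, 1) - q_tilde(w, before, 2, 0))
print(delta_direct, delta_formula)
```

Both values agree up to rounding and are negative here, reflecting that node 2 is better placed with its own triangle.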
The simplest particular cases are hairs, i.e. nodes connected to the network by only one link. Since its single neighbor is the only candidate, a hair $i$ can be analytically grouped with its neighbor $k$ if
$$w_{ii} \leq \frac{w_i^2}{2w},$$
producing a self-loop for node $k$ of value
$$w'_{kk} = w_{ii} + 2 w_{ik}$$
(plus $w_{kk}$ if node $k$ already carries a self-loop, as follows from the definition of $w'_{rs}$). When node $i$ has no self-loop ($w_{ii} = 0$) this condition is always fulfilled, see figure 1a.
§ Note that some authors [14] have used the fact that no isolated nodes are obtained at the partition of maximum modularity to reduce the network size, simply by disregarding these nodes. This approach clearly fails to reproduce the modularity of the original network and provides misleading results; it should be avoided.
Another solvable structure is the triangular hair, in which two nodes $i$ and $j$ have only one link connecting them, two more links from $i$ and $j$ to a third node $k$, and possibly self-loops. In this case, if
$$w_{ii} \leq \frac{w_i^2}{2w} \quad \text{and} \quad w_{jj} \leq \frac{w_j^2}{2w},$$
nodes $i$ and $j$ share the same community in the optimal partition and therefore may be grouped as a single node $h$. Moreover, the resulting structure becomes a simple hair, which can be grouped with node $k$ if
$$w'_{hh} \leq \frac{{w'_h}^2}{2w},$$
where
$$w'_{hh} = w_{ii} + w_{jj} + 2 w_{ij}, \qquad w'_h = w_i + w_j.$$
In the particular case of nodes $i$ and $j$ without self-loops ($w_{ii} = w_{jj} = 0$), the triangular hair can always be reduced to a single hair with a self-loop $w'_{hh} = 2w_{ij}$, see figure 1b.
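The hair reduction for undirected networks can be sketched as a single pass over the nodes. The Python code below is a minimal illustration, assuming a symmetric list-of-lists weight matrix whose diagonal stores the self-loops, and assuming the grouping condition $w_{ii} \leq w_i^2/2w$; the function name `hair_groups` and the example network are hypothetical. It returns a grouping in which every reducible hair takes the group label of its only neighbor.

```python
def hair_groups(w):
    """Group every reducible hair of an undirected network with its neighbor.

    w -- symmetric weight matrix (list of lists); w[i][i] is the self-loop.
    Returns a list `groups` with groups[i] equal to the group label of node i.
    """
    n = len(w)
    strength = [sum(row) for row in w]
    total = sum(strength)                 # total strength 2w
    groups = list(range(n))               # initially every node is its own group
    for i in range(n):
        neighbors = [j for j in range(n) if j != i and w[i][j] > 0]
        # A hair has exactly one neighbor; the self-loop must be small enough.
        if len(neighbors) == 1 and w[i][i] <= strength[i] ** 2 / total:
            groups[i] = groups[neighbors[0]]
    return groups


# Toy network (hypothetical): a triangle 0-1-2 with hairs 3 (on node 0)
# and 4 (on node 1).
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 4)]
w = [[0.0] * 5 for _ in range(5)]
for a, b in edges:
    w[a][b] = w[b][a] = 1.0

groups = hair_groups(w)
print(groups)
```

The resulting labels can be used to build the reduced weight matrix, after which the same pass may be repeated, since grouping hairs can expose new hairs or triangular hairs. This sketch handles only simple hairs; triangular hairs would need an additional test on pairs of linked nodes sharing a single common neighbor.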

Reductions for directed networks
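The treatment of directed networks requires the distinction between the nodes' output and input contributions to modularity:
$$Q = \frac{1}{2} \sum_i \left( q^{\mathrm{out}}_i + q^{\mathrm{in}}_i \right),$$
where
$$q^{\mathrm{out}}_i = \frac{1}{2w} \sum_j \left( w_{ij} - \frac{w^{\mathrm{out}}_i w^{\mathrm{in}}_j}{2w} \right) \delta(C_i, C_j), \qquad q^{\mathrm{in}}_i = \frac{1}{2w} \sum_j \left( w_{ji} - \frac{w^{\mathrm{out}}_j w^{\mathrm{in}}_i}{2w} \right) \delta(C_i, C_j).$$
The separation of the self-loop term follows the same pattern as for undirected networks:
$$\tilde{q}^{\mathrm{out}}_{i,r} = \frac{1}{2w} \sum_{j \neq i} \left( w_{ij} - \frac{w^{\mathrm{out}}_i w^{\mathrm{in}}_j}{2w} \right) \delta(C_j, r), \qquad \tilde{q}^{\mathrm{in}}_{i,r} = \frac{1}{2w} \sum_{j \neq i} \left( w_{ji} - \frac{w^{\mathrm{out}}_j w^{\mathrm{in}}_i}{2w} \right) \delta(C_j, r),$$
$$\tilde{q}_{i,i} = \frac{1}{2w} \left( w_{ii} - \frac{w^{\mathrm{out}}_i w^{\mathrm{in}}_i}{2w} \right),$$
satisfying
$$q^{\mathrm{out}}_i = \tilde{q}^{\mathrm{out}}_{i,C_i} + \tilde{q}_{i,i}$$
and
$$q^{\mathrm{in}}_i = \tilde{q}^{\mathrm{in}}_{i,C_i} + \tilde{q}_{i,i}. \tag{3.25}$$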
With these definitions at hand, the change of modularity when node $i$ goes from community $r$ to community $s$ becomes
$$\Delta Q = \left( \tilde{q}^{\mathrm{out}}_{i,s} + \tilde{q}^{\mathrm{in}}_{i,s} \right) - \left( \tilde{q}^{\mathrm{out}}_{i,r} + \tilde{q}^{\mathrm{in}}_{i,r} \right), \tag{3.26}$$
and the change when an isolated node $i$ moves to any community $s$ is
$$\Delta Q = \tilde{q}^{\mathrm{out}}_{i,s} + \tilde{q}^{\mathrm{in}}_{i,s}. \tag{3.27}$$
The first difference between directed and undirected networks comes from the fact that this time we cannot prove the nonexistence of isolated nodes in the partition of optimal modularity. The previous argument was based on the use of (3.10), which now splits into two relationships:
$$\sum_r \tilde{q}^{\mathrm{out}}_{i,r} = -\frac{1}{2w} \left( w_{ii} - \frac{w^{\mathrm{out}}_i w^{\mathrm{in}}_i}{2w} \right), \qquad \sum_r \tilde{q}^{\mathrm{in}}_{i,r} = -\frac{1}{2w} \left( w_{ii} - \frac{w^{\mathrm{out}}_i w^{\mathrm{in}}_i}{2w} \right).$$
The next step is the same: provided the self-loop is small enough, there exist communities $s_1$ and $s_2$ such that
$$\tilde{q}^{\mathrm{out}}_{i,s_1} > 0,$$
$$\tilde{q}^{\mathrm{in}}_{i,s_2} > 0. \tag{3.31}$$
Since communities $s_1$ and $s_2$ need not be the same, the change of modularity (3.27) is not guaranteed to be positive, and thus isolated nodes are possible in the partition which maximizes modularity. Nevertheless, there exist three kinds of nodes for which we can prove they cannot be isolated in the optimal partition, provided their self-loops are not too large: hairs, sinks (nodes with only input links) and sources (nodes with only output links).
Directed hairs, i.e. nodes connected to only one other node, through an input link, an output link, or both, necessarily have $s_1 = s_2$. Therefore, it is safe to group them in the same way as undirected hairs if
$$w_{ii} \leq \frac{w^{\mathrm{out}}_i w^{\mathrm{in}}_i}{2w}.$$
In particular, this condition is always fulfilled if the hair has no self-loop ($w_{ii} = 0$), see figure 2a. Whenever a self-loop is present, both input and output links are needed to counterbalance it. The resulting self-loop $w'_{kk}$ of the grouped node has value
$$w'_{kk} = w_{ii} + w_{ik} + w_{ki}. \tag{3.33}$$
Sink nodes $i$ are characterized by null output strengths, $w^{\mathrm{out}}_i = 0$, which imply $\tilde{q}^{\mathrm{out}}_{i,r} = 0$ for all communities $r$. Thus, the change of modularity (3.27) only depends on the value of $\tilde{q}^{\mathrm{in}}_{i,s}$, and (3.31) tells us that they can always be grouped with an increase of modularity. The same property applies to sources, which are defined as nodes with null input strengths, $w^{\mathrm{in}}_i = 0$. Note that sinks and sources cannot have self-loops, since this would be in contradiction with their null output and input strengths respectively.
A triangular hair formed by a source node $i$ and a sink node $j$ behaves exactly as the undirected triangular hair: the two nodes can be grouped in a single node $h$ with a self-loop, see figure 2b, where
$$w'_{hh} = w_{ij}.$$
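The three kinds of directed nodes discussed above (hairs, sinks and sources) can be identified by direct inspection of the weight matrix. The Python sketch below is only an illustration: the function name `classify_nodes`, the string labels and the example network are assumptions of this example, and the self-loop condition for hairs is not tested here.

```python
def classify_nodes(w):
    """Label each node of a directed network as hair, sink and/or source.

    w -- weight matrix (list of lists); w[i][j] is the weight of the arc i -> j.
    Returns a dict mapping each node index to a list of labels.
    """
    n = len(w)
    labels = {}
    for i in range(n):
        out_nb = {j for j in range(n) if j != i and w[i][j] > 0}
        in_nb = {j for j in range(n) if j != i and w[j][i] > 0}
        w_out = sum(w[i])                          # output strength
        w_in = sum(w[j][i] for j in range(n))      # input strength
        kinds = []
        if len(out_nb | in_nb) == 1:               # connected to a single node
            kinds.append("hair")
        if w_out == 0 and w_in > 0:                # only input links
            kinds.append("sink")
        if w_in == 0 and w_out > 0:                # only output links
            kinds.append("source")
        labels[i] = kinds
    return labels


# Toy directed network (hypothetical): a cycle 0 -> 1 -> 2 -> 0, a source hair 3,
# a sink hair 4, and a source 5 pointing to two nodes.
arcs = [(0, 1), (1, 2), (2, 0), (3, 0), (1, 4), (5, 0), (5, 2)]
w = [[0.0] * 6 for _ in range(6)]
for a, b in arcs:
    w[a][b] = 1.0

labels = classify_nodes(w)
print(labels)
```

A node can carry more than one label: in this example node 3 is both a hair and a source, and node 4 is both a hair and a sink.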

Results and discussion
The above proofs put the problem of size reduction in complex networks on a firm basis. In particular, this size reduction preserving modularity ensures that the structural mesoscale found by maximizing modularity will be invariant under these transformations. The natural question at this point is: what percentage of size reduction can be achieved in networks using the previous rules? To answer this question it is necessary to estimate the number of hairs, and triangular hairs, that we might expect in complex networks. In real networks this calculation can be performed by direct enumeration; however, an estimate can also be made on general grounds from the degree distribution $P(k)$.
Here we provide some rough estimates for the most widespread degree distributions in natural and artificial networks: scale-free and exponential. For scale-free networks one usually assumes $P(k) = \alpha k^{-\gamma}$, with $\gamma \in [2, 3]$ for most real scale-free complex networks. The normalization condition provides the value of $\alpha$. As a first approximation, neglecting the structural cut-off of the network, we can write
$$\sum_{k=1}^{\infty} \alpha k^{-\gamma} = \alpha\, \zeta(\gamma) = 1,$$
where $\zeta$ is the Riemann zeta function, so that $P(1) = \alpha = 1/\zeta(\gamma)$. That means that, roughly speaking, the number of hairs, which corresponds to $P(1)$, is about 83% of the nodes in a scale-free network with $\gamma = 3$ and 61% when $\gamma = 2$, although this value is slightly reduced when considering the cut-offs of the real distributions.

Table 1. Results for the optimal partition obtained using the EO algorithm [8] for several real networks before and after applying the size reduction. We present the number of nodes, modularity, number of communities and speed-up of the algorithm after reduction.
An equivalent estimate can be made for exponential degree distributions of the type $P(k) = \alpha e^{-\beta k}$, with $\beta > 0$. In this case, normalization implies that
$$\sum_{k=1}^{\infty} \alpha e^{-\beta k} = \frac{\alpha}{e^{\beta} - 1} = 1,$$
and then $\alpha = e^{\beta} - 1$. The percentage of hairs in this case is $P(1) = 1 - e^{-\beta}$, which, for example, for plausible values of $\beta \in [0.5, 1.5]$ provides a reduction between roughly 39% and 78% respectively. In the light of these estimates, the size reduction process provides an interesting technique for the analysis of community structure in networks by maximizing modularity, with a substantial advantage in computational cost and without sacrificing any information. We have checked our size reduction process, followed by optimization of modularity using Extremal Optimization (EO) [8], in several real networks. To enhance the accuracy of the EO algorithm, we perform a last optimization step that consists of merging communities whenever modularity increases, and rearranging the borders (moving the nodes with the lowest modularity values and testing them in the neighboring communities) until no higher modularity can be obtained by moving a single node. The results obtained improve those obtained using spectral optimization [11] and simulated annealing [9].
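The estimates of the fraction of hairs can be reproduced with a short Python sketch. It assumes the normalizations $\sum_{k \geq 1} \alpha k^{-\gamma} = 1$ and $\sum_{k \geq 1} \alpha e^{-\beta k} = 1$ without cut-offs, and it approximates the Riemann zeta function by a truncated sum plus an integral estimate of the tail; the function names are hypothetical.

```python
import math


def zeta(gamma, terms=1000):
    """Approximate the Riemann zeta function for gamma > 1.

    The first `terms` terms are summed explicitly and the remainder is
    estimated by the integral of x**(-gamma) from terms + 1/2 to infinity.
    """
    head = sum(k ** -gamma for k in range(1, terms + 1))
    tail = (terms + 0.5) ** (1.0 - gamma) / (gamma - 1.0)
    return head + tail


def hair_fraction_scale_free(gamma):
    """P(1) for P(k) = alpha * k**(-gamma), k = 1, 2, ..."""
    return 1.0 / zeta(gamma)


def hair_fraction_exponential(beta):
    """P(1) for P(k) = alpha * exp(-beta * k), k = 1, 2, ..."""
    return 1.0 - math.exp(-beta)


for gamma in (2.0, 3.0):
    print("scale-free, gamma =", gamma, "->", round(hair_fraction_scale_free(gamma), 3))
for beta in (0.5, 1.5):
    print("exponential, beta =", beta, "->", round(hair_fraction_exponential(beta), 3))
```

These values are upper estimates of the fraction of degree-one nodes, since real degree distributions have cut-offs and do not follow the assumed form exactly at small $k$.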
The networks analyzed are: Zachary's karate club network [15], the Jazz musicians network [16], the e-mail network of the University Rovira i Virgili [17], the airports network, with data about passenger flights operating in the period from November 1, 2000 to October 31, 2001, compiled by OAG Worldwide (Downers Grove, IL) and analyzed in [18], the network of users of the PGP algorithm for secure information transactions [19], and the Internet at the autonomous system (AS) level as it was in 2001 and 2006, reconstructed from BGP tables posted by the University of Oregon Route Views Project. The results obtained are reported in Table 1.
We observe that the reduction process allows a more exhaustive search of the space of partitions, as expected. The speed-up of the algorithm after reduction gives an indication of the effectiveness of the process. This is also corroborated by an improvement in modularity. We present in Table 1 the values of modularity for the different networks analyzed up to order $10^{-6}$. In general, the numerical resolution of modularity is of order $\min_i \{w_i\}/2w$, which represents the minimal possible change in the structure of the partitions. This means that every digit in our values of modularity is significant for comparison purposes.
Particularly illustrative is the analysis of the airports network. We have constructed different networks from the raw data: the undirected unweighted network previously used in [18], the undirected weighted network (where the weights reflect the number of passengers using the connection in the period of study), and the most realistic case, corresponding to the weighted directed network of airport connections. These networks allowed us to check our techniques (reduction and optimization algorithm) in all the possible scenarios. Note that the results obtained for the weighted directed and undirected networks in terms of modularity are very close. An explanation of this fact, which is ubiquitous in the analysis of directed networks, can be found in the Appendix.
Summarizing, we have proposed an exact procedure for size reduction in complex networks preserving modularity. The direct consequence of its application is an improvement in the computational cost, and hence in the accuracy, of any heuristic designed to optimize modularity. We think that the idea of the exact reduction could be extended to other specific motifs (building blocks) in the network, although their analytical treatment may be more difficult. The reduced network is also an appealing concept for renormalizing dynamical processes in complex networks (in the sense of real-space renormalization). With this reduction it is plausible to perform a coarse graining of the dynamic interactions between the groups formed; we will explore this connection in future work.

Appendix A. Relationship between directed and undirected modularities
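Let us suppose that $w_{ij}$ are the weights of a directed weighted network, and that we define its corresponding symmetrized (undirected) network by adding the weights matrix to its transpose:
$$\bar{w}_{ij} = w_{ij} + w_{ji}, \qquad \forall i, j.$$
The modularity $Q_D$ of the directed network is invariant under transposition of the weights matrix, since the input (output) strengths of the transposed network are equal to the output (input) strengths of the original one:
$$Q_D(w^{T}) = Q_D(w). \tag{A.4}$$
The relationship between the modularity $Q_D$ of the directed network and the modularity $Q_S$ of its symmetrization is obtained by simple calculations, using the strengths $\bar{w}_i = w^{\mathrm{out}}_i + w^{\mathrm{in}}_i$ and the total strength $2\bar{w} = 4w$ of the symmetrized network:
$$Q_S = \frac{1}{2\bar{w}} \sum_i \sum_j \left( \bar{w}_{ij} - \frac{\bar{w}_i \bar{w}_j}{2\bar{w}} \right) \delta(C_i, C_j) = Q_D - \frac{1}{(4w)^2} \sum_i \sum_j \left( w^{\mathrm{out}}_i - w^{\mathrm{in}}_i \right) \left( w^{\mathrm{out}}_j - w^{\mathrm{in}}_j \right) \delta(C_i, C_j).$$
This result can also be expressed as a sum over communities:
$$Q_S = Q_D - \frac{1}{(4w)^2} \sum_s \left( w^{\mathrm{out}}_s - w^{\mathrm{in}}_s \right)^2,$$
where $w^{\mathrm{out}}_s = \sum_i w^{\mathrm{out}}_i\, \delta(C_i, s)$ and $w^{\mathrm{in}}_s = \sum_i w^{\mathrm{in}}_i\, \delta(C_i, s)$ are the total output and input strengths of community $s$. The contributions of the links to the input and output strengths cancel if they fall within the communities. Therefore, if most links do not cross the boundaries of the communities, it follows that $Q_S \approx Q_D$ even if the network is highly asymmetric.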
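The relationship between the directed and symmetrized modularities can be checked numerically. The Python sketch below assumes the identity $Q_S = Q_D - \frac{1}{(4w)^2}\sum_s (w^{\mathrm{out}}_s - w^{\mathrm{in}}_s)^2$, which follows from the definitions of modularity and of the symmetrized weights $\bar{w}_{ij} = w_{ij} + w_{ji}$; the function names and the example weights are hypothetical.

```python
def modularity(w, communities):
    """Weighted directed modularity (reduces to the undirected one if w is symmetric)."""
    n = len(w)
    w_out = [sum(w[i]) for i in range(n)]
    w_in = [sum(w[i][j] for i in range(n)) for j in range(n)]
    total = sum(w_out)  # total strength 2w
    q = sum(w[i][j] - w_out[i] * w_in[j] / total
            for i in range(n) for j in range(n)
            if communities[i] == communities[j])
    return q / total


def symmetrize(w):
    """Add the weight matrix to its transpose."""
    n = len(w)
    return [[w[i][j] + w[j][i] for j in range(n)] for i in range(n)]


def asymmetry_term(w, communities):
    """Sum over communities of (w_out_s - w_in_s)^2, divided by (4w)^2."""
    n = len(w)
    total = sum(sum(row) for row in w)  # 2w, so that 4w = 2 * total
    imbalance = {}
    for i in range(n):
        w_out = sum(w[i])
        w_in = sum(w[j][i] for j in range(n))
        s = communities[i]
        imbalance[s] = imbalance.get(s, 0.0) + (w_out - w_in)
    return sum(x * x for x in imbalance.values()) / (2 * total) ** 2


# Hypothetical asymmetric weighted network and partition.
w = [[0, 3, 0, 0],
     [0, 0, 1, 0],
     [2, 0, 0, 5],
     [0, 0, 1, 0]]
c = [0, 0, 1, 1]

q_d = modularity(w, c)
q_s = modularity(symmetrize(w), c)
print(q_d, q_s, q_d - asymmetry_term(w, c))
```

The second and third printed values should coincide up to rounding. The correction term depends only on the imbalance between the output and input strengths of each community, so it vanishes whenever every community sends out as much weight as it receives.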

Figure 1. Analytic reductions for undirected networks. (a) Example of a hair reduction; (b) example of a triangular hair reduction (see text for details). In the widespread case of unweighted networks, where all weights are equal to 1, the reduction (a) gives $w'_{kk} = 2$, and the reduction (b) gives $w'_{hh} = 2$ and $w'_{hk} = 2$.

Figure 2. Analytic reductions for directed networks. (a) Example of a hair reduction; (b) example of a triangular hair reduction (see text for details).