Optimal map of the modular structure of complex networks

Modular structure is pervasive in many complex networks of interactions observed in natural, social and technological sciences. Its study sheds light on the relation between the structure and function of complex systems. Generally speaking, modules are islands of highly connected nodes separated by a relatively small number of links. Every module can have contributions of links from any node in the network. The challenge is to disentangle these contributions to understand how the modular structure is built. The main problem is that the analysis of a certain partition into modules involves, in principle, as many data as the number of modules times the number of nodes. To confront this challenge, we first define the contribution matrix, the mathematical object containing all the information about the partition of interest, and then use a Truncated Singular Value Decomposition to extract the best representation of this matrix in a plane. The analysis of this projection allows us to scrutinize the skeleton of the modular structure, revealing the structure of individual modules and their interrelations.


Introduction
The concept of modular structure in real complex networks [1] is revolutionizing the understanding of the evolution of complex systems [2]. Many efforts have been devoted to its automatic detection [3,4,5]; however, very little is yet known about the actual skeleton of the detected modules that build the network. This skeleton promises to be relevant to understand why physical processes in complex networks, such as synchronization [6], present emergent phenomena that are affected by the existence of topological barriers between modules. We still lack fundamental tools to anticipate these phenomena from a topological perspective. The current work is intended to provide network scientists with novel tools to screen the modular structure. The comprehension of modular structure in networks necessarily demands the analysis of the contribution of each one of its constituents (nodes) to the modules. Recently, Guimerà et al. [7,8] advanced on this issue by proposing two descriptors to characterize the modular structure: the z-score (a measure of the number of standard deviations a data point is from the mean of a data set) of the internal degree of each node in its module, and the participation coefficient (P), which describes how the node is positioned in its own module and with respect to other modules. Given a certain partition, the plot of nodes in the z-P plane admits a heuristic tagging of nodes' roles. The success of this representation relies on a consistent interpretation of the topological roles of nodes according to the specific data analyzed.
Here we introduce a formalism to reveal the characteristics of networks at the topological mesoscale, where the network is viewed as a set of interconnected modules. We propose a method, based on linear projection theory, to study the modular structure in networks that enables a systematic analysis and elucidation of its skeleton. First, we construct a matrix containing all the information about the modular structure, and second, we find an optimal dimensional reduction of the information contained in it. In particular, we present the optimal mapping of the information of the modular structure (in the sense of least squares) in a two-dimensional space. The method has been applied to synthetic and real networks. The statistical analysis of the geometrical projections allows us to characterize the structure of individual modules and their interrelations in a unified framework.
The paper is structured as follows. In section 2, we present the motivation of the method and the main findings to interpret the outcome. In section 3, the method is illustrated with synthetic networks whose structure is controlled. Finally, in section 4, the method is tested in real networks and an explanation of the results is offered.

Projection of the modular structure
A complex network (weighted or unweighted, directed or undirected) can be represented by its graph matrix W, whose elements W ij are the weights of the connections from any node i to any node j. Assuming that a certain partition of the network into modules is available, we plan to analyze this coarse-grained structure. Note that the partition can be obtained by any method; optimizing modularity [3], as we do here, is only one possibility. The main object of our analysis, and the focus of this work, is the contribution matrix C, of N nodes to M modules. The rows of C correspond to nodes, and the columns to modules. The elements C iα are the number of links that node i dedicates to module α, and can be easily obtained as the matrix multiplication between W and the partition matrix S:
$$C_{i\alpha} = \sum_{j=1}^{N} W_{ij} S_{j\alpha},$$
where S jα = 1 if node j belongs to module α, and S jα = 0 otherwise. The goal is to reveal the structure of individual modules, and their interrelations, from the matrix C. To this end, we propose to deal with the high dimensionality of the original data by constructing a two-dimensional map of the contribution matrix, minimizing the loss of information in the dimensional reduction, and making it more amenable to further investigation.
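The product above can be illustrated with a minimal sketch, assuming an undirected, unweighted toy network of two triangles joined by a single link. The function name `contribution_matrix`, the integer label encoding (modules numbered 0 to M-1) and the toy graph are assumptions made for this example only.

```python
import numpy as np

def contribution_matrix(W, labels):
    """Return C = W S, where S[j, a] = 1 if node j belongs to module a."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    n_nodes = W.shape[0]
    n_modules = int(labels.max()) + 1      # assumes labels 0..M-1
    S = np.zeros((n_nodes, n_modules))
    S[np.arange(n_nodes), labels] = 1.0    # partition matrix
    return W @ S

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5}, bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

labels = [0, 0, 0, 1, 1, 1]
C = contribution_matrix(W, labels)
print(C)
```

Each row of `C` sums to the degree (or strength) of the corresponding node, since every link of a node is assigned to exactly one module.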

Singular Value Decomposition of the modular structure
The approach developed here consists in the analysis of C using Singular Value Decomposition [9] (SVD). SVD is the factorization of a rectangular N-by-M real (or complex) matrix as follows:
$$C = U \Sigma V^{\dagger},$$
where U is a unitary N-by-N matrix, Σ is a diagonal N-by-M matrix and V † denotes the conjugate transpose of V, an M-by-M unitary matrix. This decomposition corresponds to a rotation or reflection around the origin, a non-uniform scaling represented by the singular values (diagonal elements of Σ) with a (possible) change in the number of dimensions, and finally again a rotation or reflection around the origin. This approach and its variants have been extraordinarily successful in many applications [9], in particular for the analysis of relationships between a set of documents and the words they contain. In this case, the decomposition yields information about word-word, word-document, and document-document semantic associations; the technique is known as Latent Semantic Indexing [10] or Latent Semantic Analysis [11]. Our scenario is quite similar, with nodes playing the role of words and modules the role of documents. We expect that a similar approach will help to unravel the relations between nodes' contributions and modules of a certain partition.
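As a rough illustration of the factorization, the sketch below decomposes a small contribution matrix with NumPy and rebuilds it from its factors. The matrix `C` is a hypothetical example (two triangles joined by one link); NumPy returns the singular values as a vector, so the rectangular Σ is assembled explicitly.

```python
import numpy as np

# Hypothetical 6-node, 2-module contribution matrix.
C = np.array([[2, 0], [2, 0], [2, 1], [1, 2], [0, 2], [0, 2]], dtype=float)
N, M = C.shape

# Full SVD: U is N-by-N, Vh (= V dagger) is M-by-M, s holds singular values.
U, s, Vh = np.linalg.svd(C, full_matrices=True)

# Build the rectangular N-by-M diagonal matrix Sigma.
Sigma = np.zeros((N, M))
k = len(s)
Sigma[:k, :k] = np.diag(s)

C_rebuilt = U @ Sigma @ Vh
print(np.allclose(C_rebuilt, C))
```

The singular values in `s` are returned in non-increasing order, which is the ordering assumed when the decomposition is later truncated.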

An optimal 2D map of the modular structure of networks
A practical use of SVD is dimensional reduction approximation, also known as Truncated Singular Value Decomposition (TSVD). It consists in keeping only some of the largest singular values to produce a least-squares optimal, lower-rank approximation (see Appendix). In the following we will consider the best approximation of C by a matrix of rank r = 2. The main idea is to compute the projection of the contribution of nodes to a certain partition (rows of C, namely n i for the i-th node) into the space spanned by the first two left singular vectors, the projection space U 2 (see Appendix). We denote the projected contribution of the i-th node as ñ i. Given that the transformation is information preserving [12], the map obtained gives an accurate representation of the main characteristics of the original data, one that can be visualized and is, in principle, easier to scrutinize. Note that the approach we propose has essential differences with classical pattern recognition techniques based on TSVD such as Principal Components Analysis (PCA) or, equivalently, Karhunen-Loeve expansions. Our data (columns of C) cannot be independently shifted to mean zero without losing their original meaning. This restriction prevents the straightforward application of the mentioned techniques, and also differentiates our work from the modern techniques for the analysis of gene expression patterns [13,14].
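The rank-2 truncation can be sketched as follows, assuming a hypothetical contribution matrix for three triangles joined in a line (9 nodes, 3 modules). The sketch keeps the two largest singular values, applies no mean-centering (in line with the remark on PCA above), and checks the standard least-squares property that the Frobenius error of the truncation equals the norm of the discarded singular values.

```python
import numpy as np

# Hypothetical contribution matrix: triangles {0,1,2}, {3,4,5}, {6,7,8},
# joined by links 2-3 and 5-6.
C = np.array([[2, 0, 0], [2, 0, 0], [2, 1, 0],
              [1, 2, 0], [0, 2, 0], [0, 2, 1],
              [0, 1, 2], [0, 0, 2], [0, 0, 2]], dtype=float)

U, s, Vh = np.linalg.svd(C, full_matrices=False)

r = 2
C_r = U[:, :r] @ np.diag(s[:r]) @ Vh[:r, :]   # best rank-r approximation
coords = U[:, :r]                             # node coordinates in U_2

# Frobenius error of the truncation vs. discarded singular values.
err = np.linalg.norm(C - C_r, "fro")
print(err, np.sqrt(np.sum(s[r:] ** 2)))
```

Here `coords[i]` is taken as the projected contribution of node i, which is the reading suggested by the definition of the projection space U 2.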
The main problem when using SVD always lies in the interpretation of its outcome. The combination of data in the process makes a direct comparison between input and output difficult. To overcome this problem, we point out the following geometrical properties of the projection of the rows of C we have defined (see Appendix for a mathematical description):
(i) Every module α has an intrinsic direction ẽ α in the projection space U 2 corresponding to the line of the projection of its internal nodes (those that have links exclusively inside the module). We call these directions intramodular projections. This property is essential to distinguish modules that are cohesive, in the sense that the majority of nodes project in this direction, from those that are not.
(ii) Every module α has a distinguished direction m̃ α in the projection space U 2 corresponding to the vector sum of the contributions of all its nodes. We call these directions modular projections. The modular projection is relevant when compared to the intramodular projection because their deviations inform about the tendency to connect with other modules. Note that ẽ α and m̃ α are equal only if the module is disconnected from the rest of the network.
(iii) Any node contribution projection ñ i is a linear combination of intramodular projections, the coefficient of each one being proportional to the original contribution C iα of links of the node i to each module α. This property comes from the linearity of the projection, and expresses the contribution of nodes to the modules to which they are connected.
Consequently, from (i) and (iii), we can classify nodes. Nodes with only internal links have a distance to the origin proportional to their degree (or strength). Nodes with internal and external links separate from the intramodular projection proportionally to their contributions to other modules. From (ii) we can classify modules. Modules that

Structure of individual modules
To study the structure of individual modules we concentrate on the analysis of the projection of nodes' contributions in the plane U 2. Keeping in mind the geometrical properties (i) and (iii) exposed above, we propose to extract structural information relative to each module by comparing the map of nodes' contributions to the intramodular projection directions. To this end it is convenient to change to polar coordinates, where for each node i the radius R i measures the length of its contribution projection vector ñ i, and θ i the angle between ñ i and the horizontal axis. We also define φ i as the absolute distance in angle between ñ i and the intramodular projection ẽ α corresponding to its module α, i.e. φ i = |θ i − θ ẽ α|.
Using these coordinates R-φ we find a way to interpret correctly the map of the contribution matrix in U 2: (i) R int = R cos φ informs about the internal contribution of nodes to their corresponding module, as well as about the contribution to their own module made by connecting to others. To clarify the latter assertion, let us assume a node i belonging to a module β has connections with the rest of modules in the network. Given that this connectivity pattern is a linear combination of intramodular directions ẽ α, the vector sum implies that connecting with modules α having |θ ẽ β − θ ẽ α| > π/2 decreases the modulus R, and vice versa. (ii) R ext = R sin φ informs about the deviation (as the orthogonal distance) of each node from the contribution to its own module, see Fig. 2. It is also possible to study the spreading of φ by using other descriptors proposed in the context of synchronization [15].
Figure 2. Schematic plot of the coordinates proposed to study the structure of individual modules. The relative distance of a node from its module is captured by the angle φ. The respective components R int and R ext are depicted.
We explore the internal structure of modules using the values of R int, and the boundary structure of modules using R ext. Using descriptive statistics one can reveal and compare the structure of individual modules. Given that the distribution of contributions is not necessarily Gaussian, an exploration in terms of z-scores is not convenient. Instead we use box-and-whisker charts for the variables, depicting the principal quartiles and the outliers (defined as having a value more than 1.5 IQR lower than the first quartile or 1.5 IQR higher than the third quartile, where IQR is the Inter-Quartile Range).
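The polar descriptors R, θ, φ, R int and R ext defined above can be sketched as follows. The function name `polar_descriptors`, the toy matrix and the wrapping of the angular difference into [0, π] are assumptions of this example; the text defines φ simply as |θ i − θ ẽ α|, and the wrapping only guards against angles that straddle ±π.

```python
import numpy as np

def polar_descriptors(coords, e_tilde, labels):
    """Return R, theta, phi, R_int, R_ext for each projected node."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    R = np.hypot(coords[:, 0], coords[:, 1])
    theta = np.arctan2(coords[:, 1], coords[:, 0])
    own = e_tilde[labels]                      # direction of each node's module
    theta_e = np.arctan2(own[:, 1], own[:, 0])
    diff = theta - theta_e
    # Assumption: wrap the angular distance into [0, pi].
    phi = np.abs(np.arctan2(np.sin(diff), np.cos(diff)))
    return R, theta, phi, R * np.cos(phi), R * np.sin(phi)

# Hypothetical example: two triangles joined by the link 2-3.
C = np.array([[2, 0], [2, 0], [2, 1], [1, 2], [0, 2], [0, 2]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

U, s, Vh = np.linalg.svd(C, full_matrices=False)
coords = U[:, :2]                              # node projections
e_tilde = (Vh[:2, :] / s[:2, None]).T          # intramodular projections

R, theta, phi, R_int, R_ext = polar_descriptors(coords, e_tilde, labels)
print(np.round(phi, 3))
```

In this toy case the four purely internal nodes have φ numerically equal to zero, while the two bridge nodes (2 and 3) deviate from their intramodular direction.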
The boxplots for the data of each module in the variable R int allow for a visualization of the heterogeneity in the contribution of nodes building their corresponding modules, and an objective determination of distinguished nodes in their structure (outliers). Likewise, the boxplots in R ext inform about the heterogeneity in the boundary connectivity. Nodes with links in only one module are not considered in these statistics because they do not provide relevant information about the boundaries (they have φ = 0); only nodes that act as bridges between modules are taken into account. Considering internal nodes in these statistics would eventually produce a collapse of the quartiles to zero. Assuming that every module devotes some external links (otherwise they would be disconnected), the width of the boxes in this plot is proportional to the heterogeneity of such efforts. If only one node makes external connections, then the boxplot has zero width. Moreover, given two boxes equally wide, their position (median) determines which module contributes more to keeping the whole network connected.
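The outlier rule used in the box-and-whisker charts (1.5 IQR beyond the first or third quartile) can be sketched per module as below. The function names, the optional `keep` mask (which could be used to exclude nodes with φ = 0 from the R ext statistics) and the sample values are hypothetical; quartile conventions differ slightly between software packages, and this sketch uses NumPy's default linear interpolation.

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Boolean mask of values beyond whisker * IQR from the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - whisker * iqr) | (values > q3 + whisker * iqr)

def outliers_by_module(values, labels, keep=None):
    """Map each module to the indices of its outlier nodes."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    if keep is None:
        keep = np.ones(len(values), dtype=bool)
    keep = np.asarray(keep, dtype=bool)
    result = {}
    for module in np.unique(labels):
        idx = np.flatnonzero((labels == module) & keep)
        if idx.size:
            result[int(module)] = idx[iqr_outliers(values[idx])].tolist()
    return result

# Hypothetical R_int values for two modules.
r_int = [1, 2, 2, 3, 3, 3, 4, 4, 20, 5, 5, 6, 6, 7]
labels = [0] * 9 + [1] * 5
found = outliers_by_module(r_int, labels)
print(found)
```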

Interrelations between modules
The analysis of the interrelations between modules is performed at the coarse-grained level of their modular projections. The modular projections m̃ α are aggregated measures of the nodes' contribution to their particular module. The normalized scalar product of modular projections provides a measure of the interrelations (overlapping) between

Application to synthetic networks
We start by applying the methodology of analysis to synthetic networks, having control of the whole network structure. First, we analyze a network built up from cliques of different sizes: we consider a line of cliques from size 3 to 10, joined only by a unique link between them. We will consider two different partitions to test the method. The first partition consists of a module containing the larger clique, and another containing the rest of the cliques, see corresponding to the internal nodes of the clique and another corresponding to the node that acts as a connector with the following clique. The connectors towards the preceding clique (of lower size) are indistinguishable at the resolution of the plot, but also lie in a different direction.
Next, we apply the method to a network model with a well-defined community structure that has been used as a benchmark for different community detection algorithms [5], proposed by Girvan and Newman [3]. In that model the authors construct a network of 128 nodes as a set of 4 communities, each one formed by 32 nodes. Fixing the mean number of links per node at a value of 16, the parameter describing the sharpness of the community distribution is z in, the average number of links within the community. A generalization of this model was proposed in [16] to include several hierarchical levels of communities. The hierarchy is defined as follows: we take a set of N nodes and divide it into n 1 groups of equal size; each of these groups is then divided into n 2 groups and so on up to a number of steps k which defines the number of hierarchical levels of the network. Then we add links to the network in such a way that each node is assigned at random z 1 neighbours within its group at the first level, z 2 neighbours within its group at the second level, and so on. The remaining links that each node has to the rest of the network are denoted by z out. We construct a network with N = 128 nodes, two hierarchical levels with n 1 = 2, n 2 = 2, z 1 = 5, z 2 = 10 and z out = 1. Again the method resolves the modular structure and individual contributions in the correct way, see Fig. 5. In Appendix D we also test the sensitivity and robustness of the method to slight changes in the predefined partition.
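A two-level hierarchical benchmark of this kind can be generated with the minimal sketch below. It rests on two assumptions that are not stated in the text: links are drawn independently so that z 1, z 2 and z out are matched only on average, and the z 1 links of a node fall inside its first-level group but outside its own second-level subgroup. The function name and signature are hypothetical, and the sketch requires n1 > 1 and n2 > 1.

```python
import numpy as np

def hierarchical_benchmark(N=128, n1=2, n2=2, z1=5.0, z2=10.0,
                           z_out=1.0, seed=0):
    """Random two-level hierarchical network (expected degrees only)."""
    rng = np.random.default_rng(seed)
    size1 = N // n1                    # nodes per first-level group
    size2 = N // (n1 * n2)             # nodes per second-level group
    coarse = np.arange(N) // size1     # first-level labels
    fine = np.arange(N) // size2       # second-level labels

    same_fine = fine[:, None] == fine[None, :]
    same_coarse = coarse[:, None] == coarse[None, :]

    # Link probabilities chosen so that expected degrees match the z's.
    p2 = z2 / (size2 - 1)
    p1 = z1 / (size1 - size2)
    p_out = z_out / (N - size1)
    P = np.where(same_fine, p2, np.where(same_coarse, p1, p_out))

    upper = np.triu(rng.random((N, N)) < P, k=1)   # no self-loops
    W = (upper | upper.T).astype(float)            # undirected network
    return W, fine, coarse

W, fine, coarse = hierarchical_benchmark()
print(W.sum(axis=1).mean())
```

With the default parameters the expected degree is z 1 + z 2 + z out = 16; the realized mean degree fluctuates around this value from one random seed to another.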

Application to real networks
The first network analyzed is Zachary's karate club network [17], which records a two-year study of the friendships between 34 members of a karate club at a US university in 1970. The network in question was divided, at the end of the study period, into two groups after a dispute between the club's administrator (node 1) and the club's instructor (node 34), which ultimately resulted in the instructor leaving and starting a new club, taking about half of the original club's members with him. The partition we have used in our study corresponds to four modules resulting from optimizing modularity [3] using Extremal Optimization [18] and refined with Tabu search [19], providing a value of modularity Q = 0.420. After the projection, see Fig. 6, we observe that nodes 1 and 3 in the green module and nodes 33 and 34 in the blue module are clearly distinguished by their value of R, denoting their important role in supporting the structure of both modules; however, they are not the nodes that connect with other modules. It is also remarkable that node 10 lies halfway between the modular directions of the larger modules, attesting to its unclassifiable nature (this node has been persistently misclassified by most community detection algorithms).
The proposed mapping is also applied to two other real networks, the worldwide air transportation network, and the AS-P2P Internet network. The airports network data set is composed of passenger flights operating in the time period November 1, 2000, to October 31, 2001 compiled by OAG Worldwide (Downers Grove, IL) and analyzed previously by Prof. Amaral's group [8]. It consists of 3618 nodes (airports) and 14142 links; we used the weighted network in our analysis. Airports corresponding to a metropolitan area have been collapsed into one node in the original database. The AS-P2P Internet data set considered is composed of autonomous systems (AS) [20] in the peer-to-peer (P2P) category, where two ASs freely exchange traffic between themselves and their customers, but do not exchange traffic from or to their providers or other peers [21]. We complemented this data set with the geographic localization of the ASs, resulting in 1217 nodes and 4058 links. We have optimized modularity [3] to find good partitions of the networks into modules. We have used the partition corresponding to 26 modules and modularity Q = 0.649 for the airports network, and 12 modules and Q = 0.387 for the AS-P2P network. Note that any partition, not necessarily the one corresponding to optimal modularity, can be analyzed as described.
The interest of applying the analysis to these two data sets is twofold: first, since both are geo-referenced, it is possible to assign a tag to each module corresponding to geographic areas, and second, the modular structure of both networks is substantially different. While the evolution of the airports network has been mainly shaped by two well-defined continental blocks (USA and W Europe), the AS-P2P network has been built in a more homogeneous way. It is very interesting to observe how the AS-P2P network, following a sort of "wiring optimization", presents a community structure evenly distributed in areas covering a worldwide belt.
In Fig. 7a,b, we plot the structure of the networks partitioned in modules; these form the original data that compose our contribution matrices. The geographical location has been added to the plot for visualization purposes but it has not been used in the analysis. The plots in Fig. 7c,d (left) show the projections of the nodes' contributions following the same structure as the preceding plots. The differences between both modular structures clearly emerge in this projection: the airports network is basically polarized in two geographical areas, whereas in the AS-P2P network this polarization does not exist. We also see how different airports and ASs excel in their values of R largely over the rest. This effect can be further developed by studying the structure of modules and their interrelations in each case.
Figure 7. Optimal map of the modular structure for the optimal partition of the airports network (a) and the AS-P2P network (b); each color corresponds to a different module of the given partition. In (c) and (d) we plot the projected space spanned by the two left singular vectors of the TSVD, U 2 (left), and its transformation to polar coordinates R-θ (right), for each network. Dashed lines mark the directions of intramodular projections of each module. Nodes whose contribution is totally internal to a module project exactly on its corresponding dashed line. In the R-θ plot we have labelled certain distinguished nodes that also correspond to very important airports and ASs in the world. For the airports network we have magnified the area over 10^−1 to identify the more important nodes in R. The loss of information associated to the two-dimensional projection is 18.2% for the airports network and 15.8% for the AS-P2P network.

Figure 8. Box-and-whisker plots of R int and R ext respectively, for the two networks depicted in Fig. 7: a) Airports, b) AS-P2P. Modules are sorted according to medians in increasing order. We label the horizontal axis using names for the modules assigned according to the geographical location of at least 75% of their nodes. We highlight whiskers and outliers in both networks. Only those modules whose structure is significant (more than 10 nodes) are represented in the plot.
The structure of modules is scrutinized in Fig. 8, where we depict the box-and-whisker plots of the internal contributions R int and external contributions R ext. The results show the heterogeneity of each module of the partition. Remarkably, the method reveals outliers distinguished by their capability to support the internal structure of modules and also to cross-connect them. In Fig. 8a (top), we observe that the USA and W Europe modules have medians greater than the 75th percentiles of the rest of the modules. This fact points out the extreme internal cohesion of both sites. We also observe that the lowest value of the R int median corresponds to Alaska; however, Anchorage leads the internal cohesion orders of magnitude beyond the core. In Fig. 8a (bottom) Canada, W Europe and C America provide the highest profile of boundary connectivity. Nevertheless, the role played by USA is still very significant because of its high percentiles and outliers. On the other hand, Africa, Russia and China are less connected to the world than the rest of the modules. For the AS-P2P network, the box-and-whisker plots in R int, Fig. 8b (top), inform about a slight dominance of 3 modules: E Europe, W Europe and the module containing USA and Japan. Here E Europe does not correspond to the political area but to a tag we use to represent a geographical area lying further east than the one denoted as W Europe. In R ext, Fig. 8b (bottom), the similarity in range and medians reveals the homogeneity of the mesoscale of this network. Significantly, some highlighted ASs in the plot do not belong geographically to the assigned tag, although the main proportion of nodes in that module do (see E Europe, W Europe and Russia).
Finally, we plot the interrelations between modules in Fig. 9 by computing the scalar product of their respective modular projections. The labels of the matrix are chosen in decreasing order of the modular projection's angle θ m̃ α. For the airports network (Fig. 9a) we observe a clearly polarized structure in two main blocks, with a more diffuse central part overlapping both (corresponding to the communities mainly composed of nodes in Canada, Central America, Japan and South America). Japan is especially interesting because it maintains no preference in overlapping with any specific module in the network. In the AS-P2P network (Fig. 9b) we observe four groups, where neighbors in the analysis are in accordance with geographical neighbors. We remark that geographical information is not included in any part of the analysis; it simply emerges from the projection of the contribution matrix. The geographical correlation in the AS-P2P network may seem surprising given that communities of use in P2P networks are related to contents or topics; however, many ASs have to pay other ASs to provide the connection between peers, and then geopolitical constraints are revealed.
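The interrelation matrix described above can be sketched as the matrix of normalized scalar products (cosines) between modular projections, with modules ordered by decreasing angle. The function name `module_overlap` and the toy contribution matrix (three triangles joined in a line) are assumptions of this example, and node projections are taken as the rows of the first two left singular vectors.

```python
import numpy as np

def module_overlap(coords, labels, n_modules):
    """Cosine overlap between modular projections, plus their angles."""
    coords = np.asarray(coords, dtype=float)
    m = np.zeros((n_modules, coords.shape[1]))
    np.add.at(m, np.asarray(labels), coords)   # vector sum per module
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    theta = np.arctan2(m[:, 1], m[:, 0])
    return unit @ unit.T, theta

# Hypothetical contribution matrix: three triangles in a line.
C = np.array([[2, 0, 0], [2, 0, 0], [2, 1, 0],
              [1, 2, 0], [0, 2, 0], [0, 2, 1],
              [0, 1, 2], [0, 0, 2], [0, 0, 2]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

U, s, Vh = np.linalg.svd(C, full_matrices=False)
overlap, theta = module_overlap(U[:, :2], labels, n_modules=3)

order = np.argsort(-theta)                     # decreasing modular angle
print(np.round(overlap[np.ix_(order, order)], 3))
```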

Conclusions
Summarizing, we have reformulated the analysis of the modular structure by first defining the object that contains all this information, and second applying Singular Value Decomposition (SVD) to this object. Dimensional reduction follows in a natural way from the properties of the truncation of SVD; in particular we concentrate on the truncation of rank 2, with the idea of having a map of the modular structure amenable to analysis by any scientist. The approach is very simple and can be understood using basic algebra notions. The computational implementation is also affordable given the multiple software packages that include an automatic SVD (R and Matlab among others). The result is a formalism to study the skeleton of networks at the modular level. The most important problem we have faced in the current research was the interpretation of the outcome in terms of the original data. We have made a breakthrough on this interpretation by focusing our attention on the particular resulting geometry of the projected contribution of nodes. We also present a statistical analysis of the resulting map using box-and-whisker plots based on percentiles, more appropriate than the use of z-scores that must assume a Gaussian distribution of values. Finally, we find the map of interrelations of the modular skeleton. The method proposed might be very useful for scholars in different disciplines who want access to an easy and tractable map of empirical complex network data according to a biological, functional or topological partition. We expect that the analysis of this map will be very helpful to anticipate the scope of dynamic emergent phenomena that depend on the structure and relations between modules. Spreading of viruses or synchronization processes are natural candidates to be analyzed considering the organization of the map.
Moreover, we expect that the method can be used for graph bipartitioning by adaptively changing nodes between two modules while maximizing the angle in the R − θ plane between them. Further studies of the similarities between nodes' contribution projections can also help to classify networks according to the role profiles of nodes [22] and/or modules.

second, C r is also the best approximation in the sense of statistics: it maintains the most significant information portion of the original matrix [12]. The left and right singular vectors (from matrices U and V respectively) capture invariant distributions of values of the contribution of nodes to the different modules. In particular, the larger the singular value, the more information is represented by its corresponding left and right singular vectors. We have used the LAPACK-based implementation of SVD in MATLAB. We warn that some numerical implementations of SVD suffer from a sign indeterminacy; in particular, the one provided by MATLAB is such that the first singular vectors from an all-positive matrix always have all-negative elements, whose sign obviously should be switched to positive [23].
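The sign indeterminacy mentioned above can be handled, for a non-negative contribution matrix, with a simple convention such as the one sketched below: if the first left singular vector comes out negative, the first left and right singular vectors are flipped together, which leaves the product U Σ V † unchanged. This is an illustrative heuristic with a hypothetical function name, not the procedure of [23].

```python
import numpy as np

def fix_first_sign(U, s, Vh):
    """Flip the first singular pair so the first left vector is non-negative."""
    U, Vh = U.copy(), Vh.copy()
    if U[:, 0].sum() < 0:
        U[:, 0] *= -1.0     # flip the left vector ...
        Vh[0, :] *= -1.0    # ... and its right partner, keeping U S Vh fixed
    return U, s, Vh

# Hypothetical non-negative contribution matrix.
C = np.array([[2, 0], [2, 0], [2, 1], [1, 2], [0, 2], [0, 2]], dtype=float)

U, s, Vh = np.linalg.svd(C, full_matrices=False)
U_fixed, s_fixed, Vh_fixed = fix_first_sign(U, s, Vh)

print(np.round(U_fixed[:, 0], 3))
```

Because both vectors of the pair are flipped, angles between projected nodes and intramodular directions are unaffected; only the orientation of the plot changes.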

Appendix B. Projection using TSVD of rank 2
In the case of a rank r = 2 approximation, the uniqueness of the rank-two decomposition is ensured [9] if the ordered singular values σ i of the matrix Σ satisfy σ 1 > σ 2 > σ 3. This dimensional reduction is particularly interesting to depict results in a two-dimensional plot for visualization purposes. In the new space there are two different sets of singular vectors: the left singular vectors (columns of matrix U), and the right singular vectors (rows of matrix V †). Given that we truncate at r = 2, we fix our analysis on the first two columns of U; we call this the projection space U 2. The coordinates ñ i of the projection of the contributions n i of node i are computed as follows:
$$\tilde{n}_i = \Sigma_2^{-1} V^{\dagger} n_i$$
Here Σ 2 −1 denotes the pseudo-inverse of the diagonal rectangular matrix Σ 2 (singular values matrix truncated in 2 rows), simply obtained by inverting the values of the diagonal elements. It is possible to assess the loss of information of this projection compared to the initial data by computing the relative difference between the Frobenius norms:
$$E_r = \frac{\|C\|_F - \|C_r\|_F}{\|C\|_F}$$

Appendix C. Geometrical properties of the projection of C
The intramodular projection ẽ α corresponding to module α is defined as the projection of the cartesian unit vector e α = (0, . . . , 0, 1, 0, . . . , 0) (the α-th component is 1, the rest are zero), i.e.
$$\tilde{e}_\alpha = \Sigma_2^{-1} V^{\dagger} e_\alpha \qquad \text{(C.1)}$$
Any node in the original contribution matrix can be represented as
$$n_i = \sum_{\alpha=1}^{M} C_{i\alpha} e_\alpha$$
Its projection gives the node contribution projection
$$\tilde{n}_i = \sum_{\alpha=1}^{M} C_{i\alpha} \tilde{e}_\alpha,$$
a linear combination of intramodular projections. In particular, a node i whose contribution is totally internal to a module α is projected as ñ i = k i ẽ α, where k i is the node degree. The modular projections m̃ α are computed as the vector sum of all the projections of nodes' contributions, for those nodes belonging to module α, i.e.
$$\tilde{m}_\alpha = \sum_{i \in \alpha} \tilde{n}_i$$
Figure D1. Robustness of the method to noise in the partition. Panels: original partition; 1 node moved (com1 » com2); 2 nodes moved (com1 » com2); 8 nodes moved (com1 » com2). We show the separation from the intramodular directions of modules 1 to 4 (top to bottom) of all nodes; in particular we track the deviation of the nodes when some of them have been assigned to the incorrect module. The nodes that have been moved are those that deviate more from the intramodular projection of module 2.
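The projections of Appendices B and C can be sketched as follows for a hypothetical contribution matrix (three triangles joined in a line). The variable names are illustrative, and the loss of information is computed as the relative difference of Frobenius norms between C and its rank-2 truncation, which is our reading of the description in Appendix B.

```python
import numpy as np

# Hypothetical contribution matrix: triangles {0,1,2}, {3,4,5}, {6,7,8}.
C = np.array([[2, 0, 0], [2, 0, 0], [2, 1, 0],
              [1, 2, 0], [0, 2, 0], [0, 2, 1],
              [0, 1, 2], [0, 0, 2], [0, 0, 2]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
M = C.shape[1]

U, s, Vh = np.linalg.svd(C, full_matrices=False)

# Projection operator Sigma_2^{-1} V^dagger (2-by-M).
P = np.diag(1.0 / s[:2]) @ Vh[:2, :]

n_tilde = C @ P.T     # row i: projection of node i's contribution n_i
e_tilde = P.T         # row alpha: intramodular projection of module alpha

# Modular projections: vector sum of node projections per module.
m_tilde = np.zeros((M, 2))
np.add.at(m_tilde, labels, n_tilde)

# Relative loss of information of the rank-2 truncation.
C_2 = U[:, :2] @ np.diag(s[:2]) @ Vh[:2, :]
loss = (np.linalg.norm(C) - np.linalg.norm(C_2)) / np.linalg.norm(C)

print(np.allclose(n_tilde, U[:, :2]), round(loss, 3))
```

The check printed at the end illustrates that the projected contributions coincide with the rows of the first two left singular vectors, and that a purely internal node such as node 0 (degree 2) projects onto twice the intramodular projection of its module.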
Appendix D. Effect of noise on C
The method presented is fairly robust to perturbations in the partition or, equivalently, in the contribution matrix C. To support this claim we perform the following experiment: using the benchmark network proposed by Newman and Girvan [1], see section 3, with 128 nodes, z in = 15 and z out = 1, we make slight changes to the predefined partition by moving nodes from module 1 to module 2. First we move only one node, then two nodes, and finally 8 nodes. This changes matrix C, which must in turn affect the TSVD output. Fig. D1 contains the nodes' projections as the mentioned movements take place (squares, triangles and diamonds respectively). Consistently, the projections of module 1's nodes progressively decrease in R. Module 2 balances this fact: it retains the weight leaving module 1. Sensitivity to inter-modular connections is also evidenced: when a single new node appears in module 2 (Fig. D1, squares), its φ i has an outstanding value compared to the rest; this is also evident when two nodes enter group 2 (Fig. D1, triangles). When moving 8 nodes, the effect is less drastic for the deviations in θ and more drastic in R. Unsurprisingly, modules 3 and 4 remain mostly unchanged: the interplay between modules 1 and 2 (nodes leaving one group for the other) does not drastically affect their internal characteristics, nor their importance in the whole structure.