Effective Augmentation of Complex Networks

Network science plays an enormous role in many aspects of modern society, from distributing electrical power across nations to spreading information and social networking amongst global populations. While modern networks constantly change in size, few studies have sought methods for the difficult task of optimising this growth. Here we study theoretical requirements for augmenting networks by adding source or sink nodes without requiring additional driver nodes to accommodate the change, i.e., conserving structural controllability. Our “effective augmentation” algorithm takes advantage of clusters intrinsic to the network topology, and permits rapid and efficient augmentation of a large number of nodes in one time-step. “Effective augmentation” is shown to work successfully on a wide range of model and real networks. The method has numerous applications (e.g., the study of biological, social, power and technological networks) and is potentially of significant practical and economic value.

Network expansion or augmentation is a ubiquitous feature in our rapidly growing technological society. It manifests in numerous and diverse scientific fields that include indexing and searching of the internet 1 , information dissemination on social networks 2,3 , financial and banking networks 4,5 , power distribution networks 6 , epidemic outbreaks in human populations 7 , and a vast range of computer applications from virus spread 8,9 to cyber warfare 10 . Expansion often occurs naturally, as in social networks, where new users may join and instantly create new connections. For other networks, increases in demand lead to growth. For example, higher electricity usage requires the connection of new power stations and loads to the existing power network to balance the extra demand. In biological contexts, where many exciting frontier network projects are now appearing, questions arise such as how one can design a "brain on a chip" by engineering neuronal network growth 11 . An immediate and important question, then, is how new connections should be added to the existing network so that a desired performance is achieved, such as maintaining or enhancing overall controllability. Despite the widespread necessity to augment networks, and the intense scientific interest in network science, very little is known about how one might expand a network subject to controllability constraints. Here we provide the first systematic study that sets out to solve some of these problems.
To address the question, a necessary stepping stone is to understand the controllability of complex networks 12-19 . According to control theory 20,21 , a linear time-invariant system that is controllable can be driven from any initial state to any desired final state within finite time. It is well known that controllability can be achieved by modifying the state of a small number of nodes in the network, denoted as driver nodes 12,22,23 . Recent advances in applying control theory to complex networks show that the minimum number of drivers (N D ) necessary to yield control over the whole network can be determined by the graph theoretic technique of maximum matching 12,24 , i.e., a largest set of edges in which no two edges share a start node or an end node 25 . A node is defined to be 'matched' if one matching edge points to it. N D equals the number of unmatched nodes. The set of unmatched nodes is referred to as the Minimum Driver node Set (MDS) 12,26 . Correspondingly, we denote the Minimum Terminal node Set (MTS) as the set of unmatched nodes in the maximum matching of the transpose network (Supplementary Fig. S1).
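The maximum-matching computation just described can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names `max_matching` and `driver_nodes` and the toy edge lists are assumptions, and the matching is found with a plain augmenting-path search. The floor of one driver for a perfectly matched network follows the usual structural controllability convention.

```python
def max_matching(nodes, edges):
    """Return {target: source} for one maximum matching of a directed graph."""
    out = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
    match = {}  # matched node -> start node of the matching edge pointing to it

    def augment(u, seen):
        # Try to give u an outgoing matching edge, re-routing earlier choices if needed.
        for v in out[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match or augment(match[v], seen):
                match[v] = u
                return True
        return False

    for u in nodes:
        augment(u, set())
    return match


def driver_nodes(nodes, edges):
    """Return (N_D, unmatched nodes) for one maximum matching."""
    match = max_matching(nodes, edges)
    unmatched = [v for v in nodes if v not in match]
    # A perfectly matched network still needs one driver, hence the floor of 1.
    return max(len(unmatched), 1), unmatched


# Hypothetical toy networks (not taken from the figures).
chain = driver_nodes([1, 2, 3], [(1, 2), (2, 3)])   # chain 1 -> 2 -> 3
star = driver_nodes([1, 2, 3], [(1, 2), (1, 3)])    # node 1 points to 2 and 3
cycle = driver_nodes(["a", "b"], [("a", "b"), ("b", "a")])
print(chain, star, cycle)
```

The maximum matching is generally not unique, so the unmatched set returned here is only one of possibly several minimum driver node sets; N D itself does not depend on which matching is found.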
We study node augmentation of arbitrary directed networks while insisting that N D remain unchanged. In the present work we analyse the addition of nodes having only a single edge, so that they are either source nodes (having zero in-degree, k in = 0) or sink nodes (having zero out-degree, k out = 0). Figure 1a provides an example of a 57-node network that can be fully controlled with N D = 11. In Fig. 1b, the network of Fig. 1a is augmented using our method by adding 10 new nodes (shown as triangles, comprising 8 source nodes and 2 sink nodes) while N D = 11 remains unchanged. Had these ten nodes been added at random, it is highly unlikely that N D would have remained constant, and N D could conceivably have increased by as much as ten.
It is always possible to connect a new source node to an existing driver node without changing N D . This is a basic result from structural control theory based on the stem/cactus structure of the network 12 . Thus adding source nodes to the current driver nodes is considered a trivial solution to the augmentation problem, and therefore one that we do not discuss in depth or make use of in this paper. Similarly adding a sink node to a terminal node should be considered as a trivial solution to the augmentation problem. Hence, the main goal of the present paper is encapsulated in the following questions: How can one add source nodes to an existing network other than to the current driver nodes (the trivial solution), while ensuring N D remains unchanged? Similarly how can one add sink nodes in a non-trivial way under the same constraint? And is it possible to find multiple solutions that can accomplish the same goal?
In general we are interested in determining many possible solutions to the augmentation problem. The main advantage of having multiple solutions to choose from is that, in different applications, new nodes might need to be connected at quite different positions in the network in order to take spatial constraints into account. The position may need to be selected from a set of possible solutions based on the modified network's reliability, stability, vulnerability, etc. For example, one might wish to add ten new nodes (substations, loads and buses) to a power network in a way that minimises cost and maximises system reliability while ensuring that N D is unchanged. Our methodology finds multiple combinations of nodes, other than the trivial existing drivers, to connect to. Naturally, we also acknowledge that there is always the option of connecting to the existing drivers (i.e., the trivial solution).
Network augmentation via the addition of edge connections rather than nodes has previously been examined from the controllability perspective 27-29 . In these studies, augmentation was implemented by adding only new edges, with the aim of decreasing N D and therefore enhancing controllability. However, the growth of networks involves increasing not only the number of edges but also the number of nodes. For example, in a supply chain network 30 , new storage warehouses (nodes) may need to be built in order to meet gradually increasing commodity demands. To our knowledge, there are currently no publications that address the augmentation of nodes to a network subject to controllability constraints, as is the goal of our work here.

Results
Node classification. To explore source augmentation, we examine the consequence of connecting a single source test node to any node x in an arbitrary directed network G(A), where $A \in \mathbb{R}^{N \times N}$ is the state matrix and N is the total number of nodes. We define the node x to be: i) Source Invariant (SI) if N D does not change, and ii) Source Redundant (SR) if N D increases. Obviously all driver nodes in a network are SI. Thus, in the example of Fig. 2a, nodes 1, 2, 3 are SI. Adding a source node to any of these nodes leaves N D unchanged (Fig. 2b). Nodes 4 and 5 are SR (Fig. 2c).
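The SI/SR test can be written directly from its definition: attach one test source to node x, recompute N D , and compare. The sketch below does this by brute force on a small hypothetical network (not the network of Fig. 2); the names `n_drivers`, `classify_source` and the label `__test_source__` are illustrative assumptions.

```python
def n_drivers(nodes, edges):
    """N_D = max(N - |maximum matching|, 1), using augmenting paths."""
    out = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
    match = {}  # matched node -> start node of its matching edge

    def augment(u, seen):
        for v in out[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    # Each successful augmentation enlarges the matching by exactly one edge.
    matched = sum(augment(u, set()) for u in nodes)
    return max(len(nodes) - matched, 1)


def classify_source(nodes, edges):
    """Label each node 'SI' or 'SR' by attaching a single test source to it."""
    base = n_drivers(nodes, edges)
    test = "__test_source__"  # hypothetical label, assumed unused by real nodes
    labels = {}
    for x in nodes:
        n_d = n_drivers(nodes + [test], edges + [(test, x)])
        labels[x] = "SI" if n_d == base else "SR"
    return labels


# Hypothetical example: 1 -> 2, 2 -> 3, 2 -> 4 (N_D = 2).
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (2, 4)]
labels = classify_source(nodes, edges)
print(labels)
```

In this toy network node 2 is SR because its only incoming position is already claimed by the edge from node 1, so a new source attached to it must itself become an extra driver. Each test recomputes a full matching, which is why repeating the classification after every single augmentation becomes expensive on large networks.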
Similarly, for sink augmentation, it is useful to examine what happens when a sink test node is added to any node x in the network (Fig. 2d). We define node x to be: i) Sink Invariant (KI) if N D remains unchanged (e.g., nodes 2, 3, 5 of the example network), and ii) Sink Redundant (KR) if N D increases.
Effective source augmentation. According to the definitions of node classification, it is always possible to augment a network by adding a single source node to an SI node, or a single sink node to a KI node, without changing N D . However, the method is problematic when more than one node needs to be augmented in parallel, since the classifications of many nodes in the network change every time a new node is added. In theory, nodes can be augmented serially one after the other, but the node classification procedure would then need to be performed for every node of the network at each step. When the network is large, this becomes extremely inefficient. However, there is a way around this problem. We find that SI nodes are not isolated but are "correlated" and may be grouped into distinct clusters such that connecting a new node to an SI node affects all others in the cluster in the same way. As such, it is possible to identify a set of clusters in the network, each cluster consisting of a set of correlated SI nodes. The network shown in Fig. 3a contains 10 SI nodes (red and blue) and 8 SR nodes (yellow). The SI nodes are further grouped into three clusters, {14, 1, 5, 7}, {12, 3, 15}, {4, 6} (shaded areas in Fig. 3c). The unique features of these clusters are that:
i) Each cluster contains only one driver node (blue).
ii) The number of clusters is exactly the number of sources that can be added to the network in one time-step without increasing N D . If there are k clusters, it is possible to augment the network with k source nodes in parallel.
iii) Once a source node is connected to any node in an SI cluster, all nodes in the cluster become SR and they can be ignored from this point in the augmentation process.
iv) A newly connected source node replaces the driver node in a cluster, and thereby results in the network having a new MDS.
Figure 3d shows that when source nodes (S 1 , S 2 , S 3 ) are connected to the three different clusters, all nodes in the clusters change to SR but N D = 4 remains unchanged, which verifies the properties of clusters. Since all nodes apart from the drivers have become SR nodes, from this point no new augmentation is possible.

Identifying clusters.
SI-network. To locate all clusters, it is necessary to first simplify the network by forming the so-called 'SI-network', which is easier to work with. The SI-network is obtained by first locating all SR nodes and then removing all of their incoming edges (i.e., setting $k_{\mathrm{in}}^{\mathrm{SR}} = 0$). This procedure fragments the original network. The connected network that remains is defined as the SI-network, since it contains all SI nodes, as shown in Fig. 3b. The following discussion concerning clusters refers only to the SI-network.
V-motif. The simplest and most elementary cluster of SI nodes is referred to as a V-motif and is the basis for clusters of all types. Figure 4a displays a typical V-motif in the SI-network for the nodes R 1 → {L 1 , L 2 }, where root node R 1 points to two and only two leaf nodes L 1 and L 2 . The nodes {L 1 , L 2 } form an elementary SI cluster as long as one of the nodes is a driver.
This two-node cluster has the fundamental property that its two constituent nodes L 1 and L 2 cannot both be drivers and thus can never participate in the same MDS. The property can be inferred from a simple maximum matching of the V-motif. (We see that if an independent source node 'S' is augmented to L 2 , then every node of the SI cluster becomes matched (Fig. 4c) and thus all nodes in the cluster {L 1 , L 2 } change to SR nodes (Fig. 4b). The new node 'S' becomes unmatched and replaces L 1 as the new driver node. Hence N D is conserved.) By contrast, for other isolated motifs of the form, say, R 1 → {L 1 , L 2 , L 3 } (Fig. 4d), which is not a V-motif, maximum matching shows that the nodes L 1 , L 2 , L 3 lack this fundamental property.
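The contrast between a V-motif and a three-leaf motif can be checked numerically on the isolated motifs. The sketch below is illustrative only: `n_drivers` and `with_sources` are assumed helper names, and the node labels simply mirror those used in the text.

```python
def n_drivers(nodes, edges):
    """N_D = max(N - |maximum matching|, 1), using augmenting paths."""
    out = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
    match = {}

    def augment(u, seen):
        for v in out[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    matched = sum(augment(u, set()) for u in nodes)
    return max(len(nodes) - matched, 1)


def with_sources(graph, targets):
    """Attach one new single-edge source node to each node in `targets`."""
    nodes, edges = graph
    new = [f"S{i + 1}" for i in range(len(targets))]
    return nodes + new, edges + list(zip(new, targets))


# Isolated motifs: R1 -> {L1, L2} (V-motif) and R1 -> {L1, L2, L3}.
v_motif = (["R1", "L1", "L2"], [("R1", "L1"), ("R1", "L2")])
star3 = (["R1", "L1", "L2", "L3"],
         [("R1", "L1"), ("R1", "L2"), ("R1", "L3")])

results = {
    "v_base": n_drivers(*v_motif),
    "v_one_source": n_drivers(*with_sources(v_motif, ["L2"])),
    "v_two_sources": n_drivers(*with_sources(v_motif, ["L1", "L2"])),
    "star_base": n_drivers(*star3),
    "star_two_sources": n_drivers(*with_sources(star3, ["L1", "L2"])),
}
print(results)
```

For the V-motif, one source leaves N D at 2 but a second source on the other leaf raises it to 3, so the two leaves behave as a single cluster. For the three-leaf motif, two sources attached to two different leaves leave N D at 3, which is the sense in which its leaves lack the V-motif property.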
Building on the example of the V-motif, it is possible to obtain three-node clusters, multiple-node clusters and so on, each of which has the same fundamental property: there can be only one driver node within any cluster. Figure 4e gives an example of a three-node cluster (grey region) obtained from two intersecting V-motifs. Here root node R 1 points to leaves L 1 and L 2 while root node R 2 points to leaves L 2 and L 3 . The set of SI nodes L 1 , L 2 and L 3 forms a cluster if any one of these nodes is a driver. Figure 4f gives an example of a four-node cluster obtained from three intersecting V-motifs. For networks of different types and varying degree distributions, V-motifs can become entangled in the SI-network but may nevertheless be identified by these methods (Supplementary Fig. S2 and Supplementary Note 2).
Sink augmentation can be treated analogously using clusters of KI nodes (Supplementary Fig. S3), and after augmentation, all nodes in the cluster become KR. Furthermore, we find that sink augmentation has no effect on the current MDS (Supplementary Note 3). Thus source and sink augmentation can be implemented in parallel without interference. These features make network augmentation very efficient and flexible. We have developed an algorithm that searches for all possible V-motifs in both the SI and KI networks, and then determines all possible SI and KI clusters in linear time (see Methods).

Effective augmentation in synthetic networks.
We have illustrated the importance of clusters in network augmentation. As the minimum number of new source (or sink) nodes that can be augmented by the effective augmentation method depends on the number of SI (or KI) clusters existing in the network, there is a need to explore the latter empirically. Denote the number of SI clusters in a network as N SIC and the number of KI clusters as N KIC . Based on their properties, we understand that SI clusters never contain source nodes and each cluster has exactly one non-source driver node. Thus for an arbitrary network G(A) the upper limit of N SIC must be N MAS = N D − S, where N MAS is the maximum number of augmentable source nodes and S is the number of source nodes in A. Similarly, N KIC is bounded by N MAK = N D − K, where N MAK is the maximum number of augmentable sink nodes and K is the number of sink nodes in A.
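As a small worked illustration of these bounds, the following sketch counts source and sink nodes from the degrees and returns N MAS and N MAK . The function name `augmentation_bounds` and the toy network are assumptions; N D is passed in as an argument and would in practice come from a maximum matching.

```python
def augmentation_bounds(nodes, edges, n_d):
    """Upper limits N_MAS = N_D - S and N_MAK = N_D - K on cluster counts."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    # S: nodes with zero in-degree; K: nodes with zero out-degree.
    # An isolated node would be counted in both; the networks studied
    # here are assumed to contain no disconnected nodes.
    n_sources = sum(1 for v in nodes if indeg[v] == 0)
    n_sinks = sum(1 for v in nodes if outdeg[v] == 0)
    return {"S": n_sources, "K": n_sinks,
            "N_MAS": n_d - n_sources, "N_MAK": n_d - n_sinks}


# Hypothetical example: 1 -> 2, 2 -> 3, 2 -> 4, for which N_D = 2.
bounds = augmentation_bounds([1, 2, 3, 4], [(1, 2), (2, 3), (2, 4)], n_d=2)
print(bounds)
```

In this example the single source (node 1) uses up one of the two drivers, leaving room for at most one SI cluster, while the two sinks (nodes 3 and 4) leave no room for a KI cluster.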
We first study scale-free (SF) networks 31,32 (Supplementary Note 4) of different average degree but having identical input and output degree exponents (γ in = γ out = 3). Without loss of generality, we consider here networks with no disconnected nodes. For the purposes of exploring the cluster distributions over different network sizes, we normalise the parameters by the network's total number of nodes, N, i.e., (n SIC , n KIC , n D ) = (N SIC , N KIC , N D )/N. Figure 5a demonstrates that n SIC (blue dots) and n KIC (red squares) follow the same trend: both decrease as the mean degree of the network increases. For networks with low average degree, e.g., 〈k〉 = 4, the minimum fraction of new nodes that can be augmented in parallel is approximately 9% of the total number of nodes (n SIC = 9%). For networks with high mean degree, augmentation is limited independent of the method. For example, in a fully connected network with N D = 1, only one new source node can be augmented and n SIC = 0 since no V-motifs exist.
To investigate the capability of the method in capturing a high proportion of augmenting solutions, the number of clusters N SIC (blue dots) and N MAS (yellow triangles; the upper limit of augmentation) are plotted as a function of average degree 〈k〉 in Supplementary Fig. S5a. At 〈k〉 = 4, approximately 65% (N SIC /N MAS ) of the total possible source nodes N MAS can be augmented simultaneously via our cluster-based method in SF networks. Again, Supplementary Fig. S5a shows that the higher the mean degree of the network, the fewer new nodes can be augmented non-trivially. To explore n SIC and n KIC over different network types, we further investigate the parameters in Erdős-Rényi (ER) networks 33 (Fig. 5b, Supplementary Fig. S5b, Supplementary Note 4). We find that ER networks are less augmentable than SF networks in absolute terms, even though their clusters capture a larger share of the upper limit (N SIC /N MAS ≈ 90% at 〈k〉 = 4, compared with 65% for SF networks at the same 〈k〉). This is due to the fact that ER networks in general require fewer driver nodes to achieve full control than SF networks of identical size and mean degree 12 . Intriguingly, we find that in ER networks the decrease of n SIC and n KIC as a function of 〈k〉 is sharper than in SF networks. When the average degree exceeds a critical value k c ≈ 8, the networks become almost non-augmentable. This is because the maximum number of source nodes that can be augmented, N MAS , reaches approximately zero at k c ≈ 8 (Supplementary Fig. S5b). On the other hand, N MAS approaches zero only at k c ≈ 13 in SF networks. Thus SF networks can be augmented over a broader range of average degree than ER networks. Furthermore, we find that the effective augmentation method can be applied more efficiently in ER networks than in SF networks. This can be observed clearly in Supplementary Fig. S5, which shows that the difference between N SIC and N MAS is much smaller for ER networks than for SF networks (compare Supplementary Fig. S5a and S5b).

Effective augmentation in real networks.
To demonstrate the feasibility of the augmentation algorithm, we apply the tools developed above to several real networks from a variety of different applications (Table 1, Supplementary Tables S1 and S2). Figure 6a shows the fraction of clusters (both SI and KI clusters) determined from real networks versus mean degree. Overall, both n SIC and n KIC decrease as 〈k〉 increases, in agreement with what was found from simulations of ER and scale-free networks. For example, networks with relatively low average degree, such as cellular, electronic circuit and power grid networks, have many more clusters and are thus much more augmentable than networks with large average degree, such as social networks.
It is also important to note that some networks exhibit different proportions of SI and KI clusters. We find that electronic networks can be readily augmented with sink nodes (n KIC ≈ 20%) but are limited with source nodes (n SIC ≈ 3%). To further investigate this property, we plot n SIC versus n KIC in Fig. 6b and n SIC , n KIC versus n D in Supplementary Fig. S6. We find that some networks, such as cellular, food web, power grid and social networks, have approximately the same fraction of SI and KI clusters, while other networks, such as electronic circuits (more augmentable with sink nodes than source nodes) and transcription networks (yeast networks are more augmentable with source nodes than sink nodes), lie away from the diagonal line. This may be caused by different in- and out-degree distributions and degree asymmetries 18,34,35 . Furthermore, large numbers of SI clusters and KI clusters are found in cellular networks, power grids and cortical networks, indicating that these sorts of networks can be augmented efficiently with the clustering method used in our effective augmentation. Finally, we also observe that networks from the same category have similar n SIC and n KIC , which means they have similar augmentation properties.

Discussion
The cluster method proposed here allows us to generate a relatively large number of workable solutions in one time-step. For our method, the number of ways of adding m new nodes to the network non-trivially with n clusters (assuming each cluster has k nodes) is $C_n^m (k-1)^m$, where m ≤ n, which is usually quite considerable. Furthermore, the solutions allow multiple source and sink nodes to be connected in parallel without conflict, which increases the effectiveness of the method. It might seem a limitation that the newly connected source and sink nodes are required to be of degree one (i.e., have one edge). However, by taking advantage of the clusters present, it is a straightforward extension of the method to augment new nodes having higher degree. Each new edge can be attached to a different cluster.
Our approach is based on the presence of V-motifs in both the SI and KI networks. When the method is applied to random networks with varying average degree, we find that the ability to augment a network decreases as its average degree increases. This is attributed to the fact that the network clusters reduce in number as network connectivity and average degree increase. Networks having large average degree tend to have few clusters, and thus are hard to augment in practice, as was found for social networks. For real networks, networks from the same category tend to have similar fractions of clusters. In summary, we have formulated a successful and efficient procedure for augmenting arbitrary directed networks while keeping unchanged the minimum number of driver nodes required to fully control the network.

Methods
Identifying minimum driver node set and minimum terminal set. The MDS and MTS of a directed network G(V, E) can be identified by the following steps. 1) Bipartite representation (Supplementary Fig. S1): divide node set V into two disjoint node sets $V^+$ and $V^-$, such that any directed edge (V i , V j ) ∈ E can be represented as $(V_i^+, V_j^-)$. 2) Maximum matching: find a maximum matching of the resulting bipartite graph. 3) The nodes with no matching edge pointing to them form the MDS; the MTS is obtained in the same way from the transpose network.
Identifying SI and KI clusters. Nodes in a directed network G(V, E) are first classified into two sets: SI and SR. To expedite finding SI clusters it is helpful to first form the corresponding SI network, defined as the remaining connected sub-graphs found after removing all incoming edges to all SR nodes. The identification of SI clusters is then achieved through a sequential process of searching for V-motifs. The search begins with low out-degree nodes and ends with the nodes having the highest out-degree. The process makes use of cluster merging. Two clusters (sets of nodes) can be merged into one if: i) the two clusters appear as two distinct leaves of a V-motif; or ii) the two clusters contain one or more common nodes. The detailed procedure is as follows:
(1) Search the SI network for all nodes having out-degree equal to two (k out = 2). If there are no nodes with k out = 2 in the SI network there will be no clusters. Each identified node is the root node of a prototypical V-motif, and the two nodes connected to the root are leaves. Being part of the SI network, they must both be SI. From here on, treat both of these SI nodes as a merged cluster. Multiple clusters are merged if they meet the above merging conditions i) or ii). For example, from the V-motifs in Fig. 3c, the sets of SI nodes {14, 1}, {1, 5} and {5, 7} are considered as three clusters (merge condition i). The clusters can be further merged together as {1, 5, 7, 14} (merge condition ii).
(2) Search for all nodes with out-degree equal to three. First, the three leaves of each of these root nodes need to be merged into clusters where possible. A V-motif is formed if a root node points to two and only two distinct clusters; e.g., in Supplementary Fig. S2a, R 2 → {L 1 , L 2 , L 3 } is a V-motif, since L 1 and L 2 belong to the same cluster. Finally, all clusters identified should be merged where possible before moving to the next step.
(3) Repeat procedure (2) iteratively, increasing k out by one at each step. The process terminates when no V-motifs can be found or, in the worst case, at the nodes with the highest possible k out value. Finally, the desired SI clusters are those clusters which contain more than one node, one of which must be a driver node. Distinct SI clusters do not share common nodes.
(4) Due to the complexity of the network, the first pass of the above procedure will not identify all possible SI clusters in the network. However, the remaining SI clusters can be identified by eliminating the existing SI clusters from the SI network (deleting all nodes in the SI clusters along with their edges) and repeating procedures (1)-(3) until no further V-motifs can be found. This will eventually reveal all possible SI clusters.
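A minimal sketch of the first stage of this procedure is given below, assuming the SR nodes are already known. It forms the SI network and then applies step (1) only, merging the leaves of every out-degree-two root with a union-find structure; steps (2)-(4) and the final check that each cluster contains a driver are not implemented. The function names, the root labels `r1`-`r3` and the SR node `x` are hypothetical; only the leaf labels 14, 1, 5, 7 follow the example quoted from Fig. 3c.

```python
from collections import defaultdict


def si_network(edges, sr_nodes):
    """Form the SI network by dropping every edge that points into an SR node."""
    return [(u, v) for u, v in edges if v not in sr_nodes]


def merge_v_motif_leaves(si_edges):
    """Step (1) only: merge the two leaves of every root with k_out = 2."""
    out = defaultdict(set)
    for u, v in si_edges:
        out[u].add(v)

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for root, leaves in out.items():
        if len(leaves) == 2:                 # prototypical V-motif
            a, b = leaves
            parent[find(a)] = find(b)        # merging via shared leaves

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical roots r1..r3; 'x' is a hypothetical SR node.
edges = [("r1", 14), ("r1", 1), ("r2", 1), ("r2", 5),
         ("r3", 5), ("r3", 7), ("r3", "x")]
clusters = merge_v_motif_leaves(si_network(edges, sr_nodes={"x"}))
print(clusters)
```

Removing the edge into the SR node reduces the out-degree of `r3` from three to two, so its remaining leaves {5, 7} form a V-motif and chain together with {14, 1} and {1, 5} into the single cluster {1, 5, 7, 14}. Union-find is one convenient way to realise the merging conditions; it is a design choice of this sketch rather than something specified by the procedure.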
The identification of KI clusters follows the same procedures except that it is necessary to begin with the KI network, which, by symmetry with the SI network, is determined by removing the outgoing edges from KR nodes. A standard V-motif in the KI network is defined as one leaf node pointed to by two KI root nodes (Supplementary Fig. S4). KI cluster identification then follows the same procedures, except that searching begins with low in-degree (k in = 2) nodes. The conditions for merging two clusters in source augmentation hold for sink augmentation as well. SI (or KI) cluster determination is based on the degree of individual nodes in the SI (or KI) network. Thus, the complexity is proportional to the number of nodes and edges in the SI (or KI) network. Therefore, the complexity of the algorithm is linear in the number of nodes N and the number of edges L in the network.