Symmetry-driven network reconstruction through pseudobalanced coloring optimization

Symmetries found through automorphisms or graph fibrations provide important insights in network analysis. Symmetries identify clusters of robust synchronization in the network which improves the understanding of the functionality of complex biological systems. Network symmetries can be determined by finding a {\it balanced coloring} of the graph, which is a node partition in which each cluster of nodes receives the same information (color) from the rest of the graph. In recent work we saw that biological networks such as gene regulatory networks, metabolic networks and neural networks in organisms ranging from bacteria to yeast and humans are rich in fibration symmetries related to the graph balanced coloring. Networks based on real systems, however, are built on experimental data which are inherently incomplete, due to missing links, collection errors, and natural variations within specimens of the same biological species. Therefore, it is fair to assume that some of the existing symmetries were not detected in our analysis. For that reason, a method to find pseudosymmetries and repair networks based on those symmetries is important when analyzing real world networks. In this paper we introduce the {\it pseudobalanced coloring} \eqref{eq:mainip} problem, and provide an integer programming formulation which (a) calculates a pseudobalanced coloring of the graph taking into account the missing data, and (b) optimally repairs the graph with the minimal number of added/removed edges to maximize the symmetry of the graph. We apply our formulation to the {\it C. elegans} connectome to find pseudocoloring and the optimal graph repair. Our solution compares well with a manually curated ground-truth {\it C. elegans} graph as well as solutions generated by other methods of missing link prediction.

symmetries in these networks. Existence of these synchronous solutions can considerably augment our understanding of the functionality of the modeled system [14,15].
Symmetry and synchronization are important concepts in the field of physics. Symmetry lays in the foundation of the standard model and has broad applications in other parts of physics from Lagrangian mechanics to crystallography and nuclear spectroscopy. One of the fundamental problems in chaos theory is the search for synchronous solutions in dynamical systems [25]. In the context of biological networks, in particular in neural networks, complete synchronization is often studied by considering the dynamics of networks of coupled oscillators using the Kuromoto model [26,27]. Methods of statistical mechanics have broad applications in the studies of networks of coupled oscillators [27][28][29]. For example, the transition of a network of coupled oscillators to the synchronous state, called the synchronization transition, can be thought of as a phase transition (bifurcation) in the ensemble of nodes of the network of oscillators [29], which then permits using the tools of statistical mechanics developed for study of phase transitions.
Despite these advances, modeling biological systems as networks with symmetries pose additional challenges. Firstly, experimental data on networks collected by different techniques are never complete due to experimental errors and the impossibility to measure every possible link. The missing links include protein-protein interactions, binding between transcription factors and DNA, neural synaptic connections in the brain or even metabolic reactions. Secondly, natural variability across individuals of the same species results in different networks making interpretation of the data across specimens more difficult [5]. Thirdly, organisms adjust to the changing habitat using neural plasticity, epigenetics and other adaptations. For example, even two organisms that were identical at a given moment of time may become different in the future. Organisms survive and benefit from these small variations, so evolution through natural selection occurs. Therefore, networks that model these systems exhibit not only differences across individuals, but also have missing experimental links that can mask their underlying regularities. In particular, the missing experimental interactions and natural variations make the existence of perfect symmetries difficult to find in biological networks. Thus, symmetries in biological networks have been very hard to find since the time of Monod's first formulation of the problem [30]. The difficulty persisted despite the ubiquitous existence of synchronization across all biological networks, which is a manifestation of symmetries in the underlying networks supporting the biological activity the probability of the existence of a link between a pair of nodes in the network that is missing from the data [32][33][34][35][36]. Previous work proposed several approaches to find missing links based on different statistical and Markov chain models, see review in [32]. However functional and dynamical features of the incomplete networks, such as synchronizability, are rarely taken into consideration. The approach presented here employs network symmetry as an organizing principle of the network structure to guide the link prediction and reconstruct the network from its incomplete state.
Traditionally, network symmetry is defined by using graph automorphisms or symmetry permutations. An automorphism is a permutation of nodes that preserves the adjacency structure of the graph [37,38]. That is, before and after the permutation the nodes are connected to the same nodes. The set of all automorphisms of a graph form the symmetry group of the graph. Automorphisms are, however, sensitive to small perturbations in the graph, i.e., the removal of one link can drastically change the automorphism groups [11,14,39]. For an explicit example, consider the two graphs in Fig. 1 and their corresponding automorphisms presented on the right side. Graphs in figures 1a and 1b differ only by link (1,3). Missing this link has a global effect on the symmetry of the network decreasing the size of the automorphism group from five elements (σ 1 , . . . , σ 5 ) to only two (σ 1 and σ 2 ).
Therefore, the symmetry of the graph can be substantially reduced by missing just a few links in the topology of a graph. Inversely, the graph in Fig. 1b can be restored to be a perfectly symmetric graph in Fig. 1a by adding just one link.
The automorphisms of the graphs are important since they lead to robust cluster synchronization of the nodes activity. The symmetry group of the network leads to synchronization of the nodes associated by the orbits of the graph. An orbit of a node is the set of nodes that are obtained by the application of all the automorphisms of the graph. The orbits are non-overlapping clusters and each orbit can sustain an independent cluster synchronization, although its stability is not guaranteed.
The pseudosymmetry formulation proposed in [11] attempts to reconstruct the graph based on the automorphism group of the network. However, automorphisms cannot predict all the possible symmetries and cluster synchronization present in the network. Instead, balanced coloring [17] and associated symmetry fibrations [14,15] provide a more general type of symmetry. Since all orbits of the graph are also balanced colored, but not the opposite [40], a balanced coloring-based formulation captures more cluster synchronizations than of links to every other color, but graph (a) is much 'more symmetric' than (b) and requires a fewer non-trivial colors. On the contrary, missing a single link makes the graph in (b) less symmetric as seen from the abundance of trivial colors (colors assigned to just one node) needed to color the graph. The purpose of this work is to develop a formulation that would identify this missing link that makes the graph more symmetric, in a precise way.
those predicted by automorphisms. In this paper, we derive the symmetry reconstruction of the graph by finding the ideal balanced coloring of the graph rather than the ideal automorphisms as done in [11]. Thus, we will reconstruct the network based on pseudobalanced coloring. Reconstructing the network has two objectives: (1) determine the "best" number of clusters, i.e., determine K, and (2) perform the minimum number of perturbations to perform on the graph. In general, we use perturbation to refer to any change to a link weight with the convention that reducing a link weight to zero removes the link altogether.
For this paper, we focus on the case of unweighted links and only adding links to the graph.
In particular, we are interested in the following optimization problem: find the minimum number of new links to add to a graph so that it possesses a balanced K-coloring. After solving this problem for the allowable values of K, we use graph functions to determine the "best" choice of K.
A balanced coloring with the minimal number of colors gives rise to a symmetry fibration, which is a transformation that collapses all the colors of the graph into a representative node of the base. Thus, the present formulation based on pseudobalanced coloring is related to finding a quasi-fibrations of the graphs [41]. Ref. [41] deals with this analogous graph reconstruction formulation based on quasi-fibrations.
For smaller networks, adding links manually to find symmetry is possible [11], but such a method is computationally impractical for larger networks. Moreover, such ad hoc methods do not have an optimality guarantees. Thus, an important endeavor is to develop models and algorithms that can find symmetry in the presence of small perturbations in the underlying input data. In this paper, we describe an integer programming formulation that finds the minimum number of edges to add to a given network so it has a balanced coloring [16] of a given size (balanced colorings are also referred to as equitable partitions [12]). Our complexity results in Section IV show that the directed variant of our problem is weakly NP-Hard and we conjecture the same is true of the undirected version so an efficient algorithm is unlikely to exist for our problems of interest. Therefore, integer programming is a natural method to use to solve our problem. We are able to solve the integer program for modestly sized instances (20-30 nodes and 100-120 edges) of interest using Gurobi 9.11 [42]. Most instances required seconds to solve although there were cases where hours were needed. In preliminary tests, we found that solving larger instances with approximately 280 nodes and 3000 edges required minutes to hours to solve. We emphasize that we do not draw conclusions from these results as we did not conduct a full computational study. For future work, we also describe an algorithm based on a Bender's decomposition of the integer program in Appendix XI B to help improve the computational efficiency of solving the integer program for larger scale instances.
The paper is organized as follows. In Section II, we review the previous pseudosymmetry formulation [11] and provide intuition and motivation for our method. In Section III, we give formal definitions for the problems of interest. In Section IV, we prove that a simplified variant of our problem is weakly NP-Hard. In Section V, we describe the integer linear programming formulation of our problem. In Section VI, we present results from our integer linear programming approach applied to experimental data on the C. elegans connectome.
In Section VII, we show how a repaired graph can be partitioned into co-functioning clusters.
In Section VIII, we compare results obtained by using different objectives with some of the traditional link prediction methods. In Section IX, we discuss implications from our results and outline future work. Appendix XI B provides a Bender's decomposition and algorithm for our integer linear program.

II. PREVIOUS WORK ON PSEUDOSYMMETRY
Consider for the moment an undirected graph G = (V, E) and let π denote a permutation of the node labels, i.e. π : V → V . Then π is an automorphism if, and only if, for all u, v ∈ V , if uv ∈ E then π(u)π(v) ∈ E. The automorphism group, denoted Aut(G), is the set of all automorphisms of G along with function composition as the multiplication.
A useful characterization of automorphisms involves permutation matrices. Let A be a node-node adjacency matrix of G and P denote the permutation matrix associated with a permutation π. Then π ∈ Aut(G) if and only if where 0 denotes the matrix of all zeros. Let P denote the set of |V | × |V | matrices. If the permutation matrix P ∈ P satisfies Eq. (1), we say that P is a perfect symmetry. The trivial symmetry is the identity matrix and is shared by all graphs.
In order to study graphs that slightly vary from those with perfect symmetry, Morone and Makse [11] suggest looking for pseudosymmetries defined by a small parameter ε ∈ R: where ||X|| is the Frobenius norm of the matrix X = (x ij ), i.e., ||X|| = n,m i=1,j=1 x 2 ij . For ε = 0 this definition is the same as perfect symmetry. Note, the set of permutations associated with P ε doesn't necessarily form a group. The naive way to find elements of P ε is to consider all possible permutations and determine whether Eq. (2) is satisfied. Because |P| = (|V |)! this naive approach is computationally intractable and finding a more efficient method is desirable. Moreover, for a given permutation matrix that satisfies Eq. (2), we would like a method that finds a related graph that is symmetric. Thus, we want an alternative approach that also finds a way to "repair" G, i.e., minimally perturbs G and results in a symmetric permutation matrix.
Because the Frobenius norm is unitarily invariant i.e. X = U XV for all matrices X and unitary matrices U and V , We focus on the case of adding edges to an undirected graph G and show that, in this case, ε is bound by four times the number of added edges. Suppose that we have a graph, G, with added edges so that P ε is a symmetry and let the adjacency matrix of this graph (3), and the triangle inequality imply that: The difference matrix (A ε − A) is the adjacency matrix corresponding to the graph that just has the repaired edges. Therefore, [P ε , A] is less than the number of repaired edges multiplied by 4 (2 because the graph is undirected and A is symmetric and another 2 because of the coefficient in front of the norm). Hence, a reasonable approach to repair G is to add the minimum number of edges required for the graph to have a symmetry.
To summarize this approach, finding pseudosymmetries in a graph using brute force is computationally intractable and not guaranteed to find an optimally repaired graph with minimal modifications. Instead, in this paper, we formulate the problem in terms of the balanced coloring problem. It is known that balanced coloring is a pre-processing step in finding the automorphisms of the network as it is used in the popular McKay's algorithm Nauty [12]. This is because all orbits are balanced colored (but not the opposite). Thus, finding first a minimal balanced coloring of the graph (which can be found in quasilinear time) leads to the orbits of the graphs, which can then be used to find the generators of the automorphism group. In Section III, we make this idea more precise.
Based on these ideas, rather than searching for pseudo automorphisms like in [11], here we search for pseudobalanced coloring [14,15], which are more general symmetries than automorphisms. We use an integer linear program that adds the minimum number of edges to a graph so that we make the graph "more symmetric" in terms of balanced coloring.

III. DEFINITIONS AND PROBLEMS
In this section, we describe both a general version of the pseudobalanced K-coloring problem and the specific case that we are interested in. We believe the general version contains many interesting variants that pose important challenges to be solved as the graph can be undirected or directed and weighted or unweighted. In addition, the perturbations on the graph in the most general version are permitted to have very general restrictions or lack thereof. The specific version of our problem is on an undirected, unweighted graph and the only perturbations allowed are the addition of edges.
Let G = (V, E) be a given graph. We let A ∈ R |V |×|V | denote the weighted nodenode adjacency matrix associated with G. We also define the directed complement of E to be E C = {(i, j) ∈ V × V : ij ∈ E, i = j} and the undirected complement of E to be E = {ij : i, j ∈ V, ij ∈ E}. We require both E C and E for our formulation when the graph is undirected. When the graph is directed, E C = E . To help emphasize the graph type, we use the convention that consecutive node indices, e.g., ij, represent an undirected edge.
An ordered pair of node indices, e.g., (i, j), represent a directed edge. We also recall that a partition of a given set S, is a collection of pairwise disjoint subsets of S whose union is all of S, i.e., if C is a partition of S then C∈C C = S and, for all C, D ∈ C, C ∩ D = ∅.
Balanced coloring has been defined by several authors, e.g., see [12,16,43,44] for a definition corresponding to the one we use where it is sometimes referred to as an equitable partition. A stricter definition of balanced coloring is given and used in [14,18] although their definition is the same as ours for unweighted, i.e., binary graphs. There are different definitions including ones that are unrelated and more similar to traditional graph coloring, e.g., [45].
In our definition, we fix K, the number of colors, i.e., clusters, as the objectives of our eventual problem are to both determine the minimum number of links to add to a graph so that a balanced coloring exists, but also to determine the "best" number of colors, i.e., the best K.
Definition III.1. Balanced K-coloring. A balanced K-coloring of G is a partition, C, of the node set V which satisfies the following: • The cardinality of C is K.
• In the case that G is an undirected graph, the condition is as follows. For all C ∈ C, all pairs of distinct nodes p, q ∈ C, and all D ∈ C, j∈D:pj∈E • In the case that G is a directed graph, Eq. (6) corresponds to two separate conditions resulting in three kinds of balanced coloring. The two conditions are as follows. For all C ∈ C, all pairs of distinct nodes p, q ∈ C, and all D ∈ C, j∈D:(p,j)∈E and j∈D:(j,p)∈E A directed out-balanced coloring is if only Eq. We also define the minimal balanced coloring which, intuitively, corresponds to the minimum number of colors K to color the graph so that no nodes with different colors are balanced. Because the number of colors, K, to color a graph is unique [46], we omit K in this definition.
Definition III.2. Minimal balanced coloring. A minimal balanced coloring is a balanced K-coloring, C, where the following is also true. For every p ∈ V , let C p ∈ C be the unique set that contains p. For every distinct p, q ∈ V such that q ∈ C p there exists D ∈ C such that Eq. For example, in the undirected case, the following must be true for some set D ∈ C. j∈D:pj∈E In other words, in an in-balanced coloring partition, two nodes with the same color 'receive' the same colors from connected nodes. A minimal balanced coloring is a balanced coloring with the minimal number of colors. We call a color trivial if there is only one node that belongs to this color. A non-trivial color is a color that is not trivial. A coloring in which each color is trivial (corresponding to the discrete partition) is called discrete. We refer to a node as trivial if it possesses a trivial color and call a node symmetric if its color is non-trivial. Abusing this definition, we also say a node is symmetric to another if they are both the same color.
Relation to graph fibrations. In the framework of graph fibrations [14], a cluster of nodes with the same balanced color is equivalent to a fiber (see Methods in ref. [24]). Nodes in a fiber have isomorphic input trees. When an admissible set of dynamical equations is attached to the graph, nodes of the same balanced color (or fiber) are predicted to be synchronous (cluster synchronization). A symmetry fibration is a transformation that collapses nodes in a minimal balanced color cluster into a single representative node in the base of the graph. We note that, while nodes belonging to the same orbit are also synchronous, nodes of the same balanced color do not necessarily belong to the same orbit of the automorphism group. However, all nodes in each orbit do have the same balanced color [40,47]. Thus, balanced coloring, fibers and symmetry fibrations represent more general symmetries than automorphisms, and reveal cluster synchronization in the graph that is not captured by orbital partitions, see [14] for more details.
Relation to graph automorphisms. The minimal balanced coloring and automorphisms are fundamentally related. In this case, the number of colors found is a lower bound on the number of automorphism orbits. Moreover, the nodes in each nontrivial color found corresponds to potential automorphisms of the graph.
To explain the connection between coloring and automorphisms further, we describe how graph automorphisms and isomorphisms are found. (Note that the two problems are equivalent [48]) The graph isomorphism and automorphism algorithms we are aware of [43,[49][50][51][52] all use a search tree to find isomorphisms/automorphisms. For a given graph G, each node of this search tree represents a balanced coloring of G. The root of the tree corresponds to the minimal balanced coloring. Each child in the search tree is found using the individualization-refinement process. In this process a child's coloring is obtained by refining its parent's coloring by choosing a node of a non-trivial color (in a parent), assigning it with a unique color and obtaining a new balanced coloring of the graph while keeping this node individualized (having a unique (trivial) color). This individualization-refinement process is repeated until the coloring is discrete, hence all leaves of the search tree correspond to the discrete coloring. In the last step, strings corresponding to all leaves are constructed by combining the rows of the adjacency matrix in the order defined by the coloring of each leaf. Due to the fact that the search tree is isomorphism-invariant, whenever strings corresponding to two leaves are the same, colored graphs in these two leaves are isomorphic and hence a color-preserving mapping between these two graphs is an automorphism of G.
The size of graph's search tree has a deep connection with the number of non-trivial colors in the minimal balanced coloring of the graph. Each non-trivial color requires refinement resulting in more children nodes and thereby increasing the search tree size. Therefore, repairing the graph to a version with similar topology but fewer nodes trivially colored so that more nodes have non-trivial colors than in the original graph. Repairing graphs in this way creates a "more symmetric" version of the graph, that is, a version of the graph that has (or at least may have) more elements in the automorphism group. Any graph could be be repaired to a complete graph, i.e., a graph with a link between every pair of nodes. The automorphism group of the complete graph is a symmetric group, but naturally is unlikely to be similar topologically to the original graph. In particular, we wish to use the minimum number of repairs necessary to find a more symmetric version of the graph so that the essential network topology of the original graph is maintained as much as possible. Thus, the repairing process must balance the trade-off between making the graph more symmetric (by increasing the number of non-trivial colors) versus making minimal changes to the graph topology.
Relation to stochastic block models. It is worth mentioning a parallel to stochastic block models (SBMs) introduced in the context of social networks by Holland et al. [54]. SBM is a graph generative model in which the graph is partitioned into groups and edges between nodes are placed with the probability defined by the clusters these nodes belong to. In other words, SBMs are the stochastic version of a graph partition (different from the equitable partition), where the analogous constraints of Eq. 6, 7 and 8 hold only in expectation. We refer readers to ref. [55] (section IIA and figure 4) for a more rigorous discussion of the connection between SBMS and equitable partitions (i.e., balanced coloring, fibers).
We now define the general version of this problem. In this version, existing edge weights are allowed to be perturbed, i.e., changed, and non-existent edges are allowed to be introduced with weights restricted in some manner (or not). We still wish for a minimum amount of perturbations so that the balanced conditions are enforced. We first formally define the set of generalized pseudobalanced colorings of a graph and then the corresponding optimization problem which occurs over this set. We also add the option that the colorings considered must adhere to some prior knowledge about the given graph. In particular, we allow that some nodes' colors are already known. Such an option corresponds to prior expert knowledge about the graph structure and our methods permit this restriction.
Definition III.3. Generalized pseudobalanced K-coloring. Let G = (V, E) denote a graph with edge weight matrix A ∈ R |V |×|V | and R ⊆ R |E|+|E | given. Let F denote a collection of disjoint subsets of V . A generalized pseudobalanced coloring of G is an ordered triple (C, Φ, Ω) with the following true.
• C, the coloring, is a partition of the node set V with |C| = K, i.e, C satisfies Eq. (5).
• If F = ∅ then for all pair of distinct sets C, D ∈ F, there must exist a pair of distinct sets S, T ∈ C such that C ⊆ S and D ⊆ T . If C satisfies this condition, then we say that C respects F.
is a vector of perturbations on the edge weights.
• Ω = (ω ij ) ∈ R |E | is a vector of new weights that do not exist in G.
• The perturbation vectors Φ and Ω are related and restricted through the set R, i.e., • Appropriate balanced equations are satisfied, i.e., Eq. (6), Eq. (8), and/or Eq. (7) are satisfied by (C, Φ, Ω) depending on whether G is undirected or directed. If undirected, then for all C ∈ C, all pairs of distinct nodes p, q ∈ C and all D ∈ C, j∈D:pj∈E If G is directed, then one or both of the following two conditions replaces Eq. (11).
Recall that E and E are ordered pairs of nodes in what follows. For all C ∈ C, all pairs of distinct nodes p, q ∈ C and all D ∈ C, For a given positive integer K ∈ {1, . . . , |V |}, we let G G,K,F ,R denote the set of all generalized pseudobalanced K-colorings on G with fixed nodes F, i.e., |C| = K, C respects the subsets of F, and edge and non-edge perturbations restricted to R. In particular, (10), and ( * ) are satisfied.} Here, ( * ) represents the appropriate choice(s) of (11), (12), and/or (13).
Definition III.4. The Optimal Repair Problem (ORP). The optimization problem of interest is then find the minimum amount of perturbations to the edges (addition/removal/weight changes) so that a balanced K-coloring exists on the perturbed graph: There are many interesting cases of ORP, Eq. (14), involving the particular restrictions to R, e.g., R are box constrained by an appropriate small constant or R is an elliptical set. In this paper, we focus on the following restricted case: We consider undirected and unweighted graphs where links are permitted to be added but not removed from the graph. We note, however, that our methods can be extended to directed graphs and also weighted edges.
We call a restricted pseudobalanced K-coloring of a given graph as the case of the generalized pseudobalanced K-coloring where the graph is unweighted, link perturbations and removals are not allowed, and only new unweighted links are allowed to be added. We also allow for some of the nodes to have fixed colors in advance, e.g., by running a balanced coloring algorithm in advance of repairs or leveraging expert knowledge. Formally, we define this as follows and we call it pseudobalanced K-coloring for short: Definition III.5. Pseudobalanced K-coloring (PBC). Let G = (V, E) be a given undirected graph with node-node adjacency matrix A ∈ {0, 1} |V |×|V | , and K ∈ {1, . . . , |V |} and F to be a given collection of disjoint subsets of V . Denote the set of pseudobalanced K-colorings As we show in the subsequent section, solving the in-balanced ORP on directed graphs is unlikely to be efficiently solvable. We conjecture the same is true of the (PBC).

IV. PROBLEM COMPLEXITY
The first step of many algorithms for graph isomorphism is to find the minimal balanced coloring, which is polynomial time solvable [43,[49][50][51][52]. In our paper, we use the algorithm of Kamei and Cock [18]. The complexity of the variants we have described of the balanced coloring problem are, to the best of our knowledge, unknown. In this section, we prove that finding an in-balanced K-coloring on a weighted graph is weakly NP-Hard when F = ∅. We note that the four conditions that the proof relies on are (1) the restriction that exactly K colors must be used, (2) the allowance of weighted edges, (3) that the graph is directed, and (4) the inclusion of fixed nodes, i.e., F = ∅. As a corollary, the in-balanced ORP on directed weighted graphs is also weakly NP-Hard.
Our proof reduces the weighted, directed in-balanced K-coloring problem to the partition problem. In this problem, a list of positive integers is given and the decision problem is to determine if there is a way to partition the list into 2 sets elements so that the sums of each set is equal. Formally, this problem can be stated as follows.
Definition IV.1. Let n ∈ Z + , and a set of positive integers, denoted X = {x 1 , . . . , x n }, be given whose sum is even. The decision problem is to determine if there is a partition of X into 2 sets, denoted by S and T so that The partition problem is weakly NP-Hard [48] and we can show that being able to solve the weighted directed in-balanced K-coloring problem solves the partition problem.
Theorem IV.2. Finding a weighted directed in-balanced K-coloring with fixed nodes is weakly NP-Hard.
Proof. Let n and X = {x 1 , . . . , x n be a given instance of partition. The graph we construct consists of K −3 nodes in a cycle and two in-stars. The first in-star has 3 nodes and the other has n + 1 nodes. Let ω be equal to the sum of the weights in W . The weights on the first instar will all be identically ω/2 and the weights on the second in-star will be x 1 , . . . , x n . The weights on the edges in the cycle with K −3 nodes are ω, 2ω, . . . , Note that this forces any in-balanced coloring to have ω/2 weight going from nodes in a partition with u 1 to v 0 as well as the same ω/2 weight going from nodes in a partition with u 2 to v 0 . Note that if K = 3, then both C and E C are empty sets. If not then, then any balanced coloring must assign the nodes of C distinct colors from all others, i.e., the nodes of C are all isolated and use K − 3 colors. Thus, the nodes v 1 , . . . , v n must be in a partition either with u 1 or u 2 as there are edges that enter v 1 , . . . , v n .
Thus, a partition exists if and only if there is a balanced coloring or corresponds exactly to the color v 1 , . . . , v n are assigned.
We have an immediate Corollary.
Corollary IV.3. The in-balanced ORP is weakly NP-Hard.
Proof. Note that if any instance of ORP is solved with a zero optimal objective, then the original graph and has a balanced K-coloring.
We are unaware of any other complexity results involving variants of balanced coloring/equitable partition.
Interesting conjectures and open questions include the following.
• We conjecture that finding an undirected weighted balanced coloring with fixed nodes is NP-Hard.
• Famously, the complexity of graph isomorphism is unknown although many variants are efficiently solvable, including when the input graph has bounded degree [44]. Given that finding a minimal balanced coloring is a preprocessing step for the graph isomorphism algorithms [12,[49][50][51][52], is there a connection between Theorem IV.2 and graph isomorphism?

V. LINEAR INTEGER PROGRAM FORMULATION
In this section, we formulate Eq. (15) as an integer linear program. We also note that our formulation extends to the directed case as well as to the addition of positive or negative weights. Modifying the formulation to account for edge perturbations is possible but would require the addition of a new family of variables. The latter changes could require the formulation to become a mixed-integer linear program although this would depend on whether the perturbation restriction was to a discrete set or not.
We consider a problem of form Eq. (15), i.e., we are given an undirected graph G = (V, E) and a positive integer K which denotes the number of colors we require for our pseudobalanced coloring. We let M = |V | − 1.
We use three sets of decision variables indexed over V , E, and K := {1, . . . , K}. For every p, q ∈ V , k ∈ K, and (i, j) ∈ E C , let 1 if nodes p and q are the same color 0 otherwise 1 if the pseudoedge of (i, j) exists and j is color k 0 otherwise Note that z ijk is an indicator of a pseudoedge that also models the color of the node j. The natural objective function is to minimize the sum of pseudoedges. However, there are edges that we wish to incentify more than others, so we minimize the weighted sum of pseudoedges.
Setting c ij = 1 minimizes the sum of pseudoedges. For the graphs we solved, we found setting where d i is the degree of node i, for (i, j) ∈ E C produced the best results. We discuss this objective in more detail below.
We must ensure that every color is used at least once.
We must ensure that every node is assigned a color.
We must ensure that a color can not be assigned to two distinct nodes unless they are indeed the same color.
Note that Eq. (21) only prevents the same color from being assigned to two nodes that are different colors. Nodes p and q with x pq = 1 could be assigned different colors. Thus, to prevent this, we also need the following constraints.
We must enforce that pseudoweights are only assigned if the color is permitted.
We must ensure that the edge and pseudoedge weights from any pair of nodes that are the same color agree to every other color.
Note that this inequality is not symmetric and must be posed twice for every pair of nodes.
We must ensure that the sum of pseudoweights of the forward and backwards edges agree.
In order to improve the formulation, we add the following antisymmetry constraints [56].
This imposes an arbitrary ascending order on the size of the color sets.
The complete formulation for the instances we solve is then as follows.
s.t. (19), (20), (21), (23), (24), (25), (26) (PBCIP) For some experimental samples, it is also possible to have advance knowledge about nodes that are of the same color. For instance, many biological networks studied in [14] contain clusters of nodes in perfect balanced coloring already. These perfect colors are found by the algorithm of Kamei and Cock [18]. In these cases, it is important to keep these perfect colorings, and search for repairs that will not break these balanced colors. In such a case, as a pre-processing step, for all nodes in perfect balanced coloring u, v ∈ V , we add the constraint:

VI. COMPUTATIONAL RESULTS FOR REPAIRING A GRAPH
Solving the integer linear program (PBCIP) from Section V finds the minimum number of edges to add to a given graph to ensure a balanced coloring for a fixed number of colors.
This is an important step towards our goal of finding pseudosymmetries of a graph. Our overall repair method is as follows and takes a candidate input graph with n nodes.
1. Pre-processing: Use the algorithm of Monteiro, et al. [19] or Kamei and Cock [18] to find a minimal balanced coloring present in the graph and identify the initial non-trivial colors. For the colored nodes, set Eq. (27). Let C denote the number of non-trivial colors and T the number of trivial colors in this initial coloring.
2. For k = C to C + T do the following: (a) Solve (PBCIP) for k colors, with constraints of form Eq. (27) added to fix the nodes that were already one of the C non-trivial colors.
(b) Construct the resulting graph with the added edges dictated by (PBCIP) and color the nodes using the algorithm of Monteiro, et al. [19] again.
3. Evaluate the resulting T graphs to identify what is the number of colors of the best repaired graph employing indices characterizing graph topology and stability analysis, see below.
In this section, we investigate different heuristics using graph topology and stability to accomplish step 3. We are aware of the methods presented in [11], whose method uses pseudosymetries (Section II) and applies repairs manually using expert knowledge, in [41], whose method uses quasifibrations and applies repairs based on the similarity between nodes' input trees in directed graphs, in [39], who devises a Monte-Carlo randomized method that minimizes Eq. (1), [P, A] , and [57], whose method applies repairs in order to create an idealized graph. In our approach we generate a series of potential solutions by repeatedly solving (PBCIP) as described in steps 1 and 2. In order to accomplish step 3, a repaired graph is selected from this series of solutions by analyzing functions that characterize the underlying graph topology and stability of the solution. We call these graph functions indices. We consider the same networks analyzed in the pseudosymmetry analysis done by Morone et al. [11] from the connectome of C. elegans [13] obtained from the curated graph studied in [5]. The graphs studied are induced subgraphs of the connectome composed of neurons that are known to be involved in the locomotion function of the worm. These neurons have been identified in many experiments [13,31] and are classified into command interneurons (like AVBL, AVBR or AVAL, AVAR) and motor neurons (the series of neurons denoted by VA, DA, and VB, DB, for dorsal and ventral neurons). The connectome is composed of two types of connections, gap junctions and chemical synapse. Around 10% of the synapses are gap junctions. We test the algorithm only with undirected networks, so we use the connectome of gap-junctions connections between those forward and backward neurons. The connectome of chemical synapses is made of directed edges, and it is not used here, but in the test of the repaired algorithm of [41] which uses quasifibrations and directed graphs. We study two undirected gap junction networks referred to as the forward circuit and the backward circuit. These networks represent the neural circuitry responsible for forward and backward movement of the worm C. elegans.
Following [11], we obtained the data from [5]. The undirected edges of this connectome are weighted since, generally, there are more than one gap junction connection between a pair of neurons. Typically, a neuron has a long axon which touches another neuron in different places along axon-to-axon or axon-to-soma contacts. The weight of the gap junctions is the number of those contacts. Here we will ignore the weight of the gap junctions and consider a binary adjacency matrix of 0 and 1 representing the absence or the present of a link. The neural network of C. elegans is known to have a 25% variability between specimens [5,13].
Morone et al. [11] presented a repair of these networks done manually by completing the network with the fewest possible links chosen by taking into account biological knowledge on the functions of the neurons. Although this manually repair network is not optimal, nor the true one, we will refer to this network as the 'ground truth' graph (or 'manual') with the ideal symmetry, so it provides an opportunity to test our more automated repair method.
The original (unrepaired) backward and forward circuits obtained from [5] are shown in Figure 2a and 3a, respectively. The original backward circuit contains 17 colors: 11 trivial and 6 non-trivial and forward circuit contains 15 colors: 13 trivial and 2 non-trivial. These colors are obtained using Kamei and Cock algorithm [18] for perfect balanced coloring.
Solving (PBCIP) results in 12 possible repaired graphs for the backward circuit with the total number of colors ranging from 6 to 17 and 14 possible repaired graphs for the forward circuit with the total number of colors ranging from 2 to 15. In section VI A we describe these graphs and in section VI B we justify our selection of the best repaired graphs with 12 colors and 9 colors for the backward and forward circuit, respectively. Our selected solution graphs are shown in Figures 2b and 3b.

A. Backward and forward circuit repairs
We first describe the repairs of the backward circuit. We observe that the addition of just a few edges can drastically change the coloring. For example, adding the edge (AVAL, RIML) (as shown in Fig. 4b) will make nodes RIML, RIMR, AIBR and AIBL symmetric.
All repairs of the backward circuit can be decomposed into more simple repairs applied to the different part of the graph which are shown in figures 4 and 5. Using this simple decomposition of repairs we demonstrate the way our method combines them to obtain compound repairs in Table I. Note, repair Fig. 4d is the combination of repair Fig. 4a Table I corresponding to the number of colors between 11 and 9 for the simplicity of the interpretation. The repair with 9 colors shown in Fig. 5c is a combination of all the simple repairs and all the repairs with 17-9 colors can be visualized using Figure 2a and Table I.
Thus far, we have described the repairs with the number of colors between 9 and 17. The 8-color graph is shown in Fig. 5d. This solution combines all the repairs above additionally symmetrizing nodes DA08 and DA09 with the cluster DA01, DA02, DA04 and VA01 and nodes AIBL and AIBR with nodes DA03 and VA12 in order to symmetrize hubs AVAL and AVAR. Repairs with less than 8 colors have a very complicated structure that is hard to visualize, and, as we will see later, the number of colors of the best solution lies far above this range, therefore we omit these repairs for simplicity.
Note that the 9 color repair in Fig. 5c    or merged together with the pink color. To obtain the coloring in Fig. 5c in step 2(b) our method obtains coloring using the minimal balanced coloring algorithm [19], which combines all these nodes.
Another thing to notice is that the solutions with 14 and 13 colors are the same. This happens due to the fact that the solution obtained from solving (PBCIP) with 14 colors assigns DA04 and DA01 a color that is different from the color of VA01 and DA02 even though the repair in Fig. 4a is applied. In the 13 color solution obtained by solving (PBCIP) all these nodes are assigned the same color as non-minimal balanced coloring is permitted and the minimum number of repaired edges required to obtain 13 colors is the same as that for 14 colors.
Next, we present the repairs of the forward circuit.  is quite different from the topology of the backward circuit. As we saw above, the backward circuit can be decomposed into a few independent parts that can be repaired separately. The forward circuit, on the other hand, can only be decomposed into two parts: the bottom part, namely nodes DB05, DB06, DB07, VB10, VB11, VB08, VB09, RIBL and RIBR that are symmetric and are therefore fixed by the formulation and the top part consisting of the rest of the nodes. The top part has high connectivity and the optimal repairs obtain solutions with symmetries that act on most of the nodes. Instead of studying repairs as combinations as we did in the backward circuit, we examine all repairs separately. Figure 6 shows the obtained repairs starting from the original circuit with 15 colors and ending with 7 colors. Table II shows the incidence matrix of the edges used in these repairs. As before, we omit solutions with 2-6 colors since it will be shown that the best solution is not in this range.
Observing solutions in Fig. 6 and Table II   Morone et al. [11] showed the importance of circulant structures in the locomotion networks. In particular, a circulant matrix in the adjacency matrix corresponds to cycles in the network. A cycle is a path in a network plus an edge from the last node to the first, e.g., in Figure 3b, VB02-DB03-DB02-VB01-VB02 or VB04-DB01-VB06-VB05-VB04.
Cycles are important to generate oscillatory behavior [11,15,58] and, as was mentioned before, the C. elegans locomotion process follows oscillatory patterns. Therefore we look for circulant submatrices in the ideal solution for the forward circuit. In general, finding circulant submatrices in a given adjacency matrix is a problem on its own, and we are not aware of an algorithm that could do this automatically. For the small matrices considered here, we resort to find this circulant matrices by inspection, as done in [11]. Figure 7 g -9 colors  shows the adjacency matrix of the optimal forward circuit in Fig. 6g and Fig. 11a and a matrix B in the off-diagonal, such that the full 8 by 8 matrix is represented as: The transpose is needed due to the undirected character of the graph. The same circulant structure, with exact the same loops represented by F, has been found in the manually curated solution in [11]. This result is encouraging since it implies that the formulation is able to find not only close to optimal balance colorings but also by-product structures like circulant cycles in the network, which, in the particular case of locomotion, are necessary for the oscillatory motion of the animal.  Fig. 6g and Fig. 11a. This circulant structure is the same as found in the manually crafted solution in [11].

B. Indices on the graph
To identify the best repair over the number of K colors, we consider a set of indices that characterize the graph topology and the stability of the dynamical solution imposed on the graph. We are looking for indices that have a qualitative change in their behavior that can indicate the most optimal solution of K colors.
First, we list a few indices characterizing the partition induced by the balanced coloring.
We consider the number of trivial colors and the number of non-trivial colors (NNTC) as a function of the K colors. Each of these count the numbers of colors that are trivial and non-trivial, respectively. We also consider the number of nodes in non-trivial colors which calculates the total number of nodes that belong to non-trivial colors.  Additionally, we can gain some insight by examining the limits of high number of colors and low number of colors. For a high number of colors, all non-fixed nodes will be in the separate classes, hence NNTC will be low. For a low number of colors a lot of non-fixed nodes will merge with fixed colors, thus NNTC will be low as well. We see in Fig. 8 that NNTC possesses two local minima for high and low numbers of colors and has a local maximum in between them. We use the solution corresponding to the local maximum as our best solution to pick the optimal number of colors which we call K opt .
Using the NNTC index we choose the best solutions for the forward and backward circuit to be at K opt = 9 and 12 colors, respectively. The best solution is shown in green and the manual solution obtained in [11] is shown in red. Index values for the manual solution are shown with red horizontal lines. The number of colors for the best solution of the forward circuit was the same as the manual solution and the repairs used are similar. The number of colors for the backward circuit is different from the manual solution. We believe this is because the 12 color solution does not restore the symmetry between hubs creating more colors as shown by the dark and light blue colors in Fig. 2b. We will further discuss these differences in Section VIII. The number of trivial colors stays approximately constant below the best number of colors and starts increasing above it. The number of nodes in non-trivial colors behaves similarly, but decreases instead of increasing above the chosen solution.
The graph indices above characterize the coloring of the graph, but we cannot limit our consideration to the topology of the graph since biological networks and neural networks model systems that perform functions vital for the organism's or ecosystem's survival. Therefore the dynamics of the network also needs to be accounted for. Next, we search for the optimal graph over K by looking for the repair that could provide the largest stability to a given dynamical solution on the graph. In particular, we look for a graph that could provide a larger synchronizability, as defined by a larger stability of the cluster synchronization solution predicted by the balanced coloring on the graph.
Consider the eigenspectrum of the graph Laplacian. For a graph, G = (V, E) with nodenode binary adjacency matrix A, the random walk graph Laplacian [59] is the matrix, where D is a |V | × |V | diagonal matrix with the D ii equaling the degree of node i and L N G is the normalized graph Laplacian. The matrix L R G is symmetric, positive semidefinite, and singular so its eigenvalues are all nonnegative and real with at least one equaling zero.
The eigenvalues of the graph Laplacian are well known to be associated with both the combinatorial and dynamical properties of the graph [60][61][62]. In particular we consider the normalized Fiedler value defined as the second smallest eigenvalue of the random walk Laplacian. This has been shown to provide a measure of graph complete synchronizability [63,64] for networks in which the response of the individual nodes is adjusted based on the number of incoming connections, which is consistent with the case of neural networks [59].
Complete synchronization of a graph is the state in which for all i, j ∈ V : x i (t) = x j (t).
Complete synchronizability, then, refers to the ability of the graph to achieve stable complete synchronization.
Different clusters of a biological network, e.g., clusters in the backward and forward circuits, perform different biological functions through cluster synchronization from the balanced colorings. These clusters are able to perform independent synchronized functions, and still be integrated in the network. Thus, a functional biological network requires cluster synchronization. Complete synchronization, instead, is deleterious for the organism. Therefore repairs that increase complete synchronizability of the graph are undesirable. We observe that the normalized Fiedler value of the found best solution K opt is in the range of the lowest values of all solutions. This behavior suggests that the best solution is one of the solutions that is less prone to the complete synchronizability, which is desirable for the dynamics of the system.
To summarize, we saw that the number of non-trivial colors can be used as an indicator for the best solution K opt . Therefore, in this paper we identify the best solution as the one with the highest NNTC. We also examined other indices such as the mean color size (the number of nodes divided by the number of colors), average clustering coefficient, average path length, the Randic index, the number of repaired edges and normalized Fiedler value, but they exhibited erratic behavior that did not seem indicative of any critical point on the studied graphs to choose the optimal one based on these indices. However, the normalized Fiedler value presents a plateau with minimum near K opt (Figure 8) indicating that there is a range of graphs with colors around K opt that minimize the stability of the complete synchronization state which is beneficial for the biological network.
We believe that further work needs to be done in order to identify the best index, in particular, we believe the ideal index would need to more accurately characterize the synchronizability and stability of the dynamics on the graph. For instance, it would be desirable to analyze the stability of the cluster synchronization solution, rather than the instability of the complete synchronization solution as captured by the Fiedler value. The theory of cluster synchronizability is being developed in [65] and could provide a good index candidate to choose the optimal coloring solution. Here we only consider two networks that are fairly symmetric from the start and, in order to avoid overfitting, we choose NNTC as the graph function to select our solutions. An interesting open question is then to find the best index or indices to use to select the best candidate solution.

PAIRED GRAPHS
In this section, we describe how the factorization into normal subgroups of the automorphism group of a graph [66] performed in the C. elegans connectome in [11] partitions the graph into functional clusters, and how this partition can be obtained from (PBCIP). We emphasize that a repair of the graph guided by balanced coloring provides a less stringent condition on the number of repaired edges than a repair following the restoration of the full symmetry automorphism group as done in [11]. That is, repairing the network to achieve perfect balanced coloring will always require less (or at most equal) number of edges than repairing the network with automorphisms. This is because, automorphisms impose more strict conditions on the structure of the network than balanced colorings and fibrations. This is particularly true when the graph is directed and one is only interested in cluster synchronization. Cluster synchronization is exclusively determined by the information that a node receives through its in-degree from the entire network. This information is taken into account either by the input tree of the node as in the fibration formalism of [14] or by the analogous in-balanced coloring as considered here. The out-balanced coloring is irrelevant for cluster synchronization (see [14] for further details).
Instead, the symmetries imposed by automorphisms require invariance of the full adjacency, including the in and out-degree. Thus, automorphisms require more stringent conditions than what is required to achieve cluster synchronization. Nevertheless, Ref. [11] showed that automorphisms exist in the connectome, in particular in the undirected gap junction connectome where the distinction between in and out balanced is meaningless.
These symmetries pose the particular property of factorization of the symmetry group, and this factorization separates the neurons into known classes like interneurons, motor and touch neurons. Thus, this particular symmetry group has a functional connotation. Therefore, next, we relate the pseudobalanced coloring formalism with the pseudosymmetry formalism of [11].
For a graph, G = (V, E), let Aut(G) denote the automorphism group of G. For any permutation p ∈ Aut(G), the support of p [11,12,66], denoted by supp(p) is defined as the nodes which do not have the same labels after the permutation p has operated on G.
Formally, this can be written Two permutations, p, q ∈ Aut(G), are said to be support-disjoint if the node labels they change are completely different, i.e., if supp(p) ∩ supp(q) = ∅. Two sets of permutations P, Q ⊆ Aut(G), are support-disjoint if every permutation in P is support disjoint with every permutation in Q. Note that any two support-disjoint permutations commute.
Let S denote the set of generators of an automorphism group Aut(G). Partition S into n support-disjoint subsets S = S 1 ∪ S 2 ∪ · · · ∪ S n such that none of the S i can be further decomposed into smaller support-disjoint subsets. Each S i generates a group H i that is a subgroup of Aut(G). Since all S i are support-disjoint, H i commute with each other and, since S is a union of all S i , Aut(G) is a direct product of all H i : All H i are normal subgroups, since they commute with the rest of the group. Therefore, partitioning the set of generators of an automorphism group into irreducible support-disjoint subsets creates a factorization of this group. The uniqueness and irreducibility of this representation has been shown in [66]. Note, H i acts on the separate non-intersecting subsets of nodes defining a partition (also referred to as a factorization) on the nodes of a graph.
We call the equivalence classes of nodes created by this partition sectors [11].
Therefore, we have a way to factorize the graph via the automorphism group associated with it. We describe how to classify the obtained factors. Each factor is vertex-transitive, but vertex-transitive graphs are too broad a class to admit a complete classification. Nevertheless, we apply a simple computational approach that allows us to classify most of the obtained subgroups. In order to classify each normal subgroup we determine whether it is isomorphic to a symmetry group from this list: a symmetric group, a dihedral group, a cyclic permutation group, or an alternating group of sizes 1 to n, where n is the number of nodes in the graph. If the normal subgroup is isomorphic to one of these classes, then we assign this class to it. The implementation uses NAUTY [67], Sympy [68] and Sage [69] and is available at https://github.com/makselab/PseudoBalancedColoring.
To illustrate the decomposition process, we apply the algorithm to the repaired backward circuit. The automorphism group is decomposed into We apply this algorithm to the repaired graphs obtained by our algorithm on the backward and forward circuits. The result of this analysis for the backward circuit is shown in Fig. 9. Similar to the earlier presented conclusions from Table I, we see here that between 17 and 9 colors, the formulation (PBCIP) combines different simple repairs to obtain a symmetrized version of the network. This is evident by the fact that almost all new subgroups for K − 1 colors are created by merging some nodes (or subgroups) in K colors and nodes never move between subgroups. Note, our computational approach was able to classify all of the subgroups aside from the subgroup marked as "UNKNOWN" in the backward circuit.
The results for the forward circuit are shown in Fig. 10. As observed before, the forward circuit is decomposed into two parts: a top part and a bottom part. As expected, the bottom part of the graph represented by the bottom 9 nodes of Fig. 10 stays unchanged. The top part, as seen before, does not have a consistent partition and an increase in the number of colors combines different nodes in normal subgroups. Two things can be observed: 1) as the number of colors decreases, the number of symmetric nodes increases, and 2) nodes are mostly combined to exhibit mirror symmetry (D 1 ). Note that groups D 1 and S 2 are isomorphic and we use them interchangeably, D 1 is more appropriate in a case of mirror symmetry and is therefore used here. For example, in Fig. 6g there is mirror symmetry in the top part between the nodes on the left and the nodes on the right.
To complete our analysis of the repaired networks, we find ε described in Section II. ε represents the cutoff of the norm in the equation (2) i.e.
where P ε represent all permutations in automorphism group of the repair of a graph G. We denote the repaired graph G ε ; ε then can be found using Aut(G ε ) as: Due to the size of the automorphism group, finding ε using the entire automorphism group is computationally intractable. To give an approximation, we find ε on the set of generators of the normal subgroups. That is, we first calculate the maximum ε on generators of each normal subgroup and then we take the resulting ε as their maximum. To simplify the interpretation of the obtained ε, we define where M is the total number of edges in the amended graph and 4 is the constant appearing due to the circumstances described in Section II. Then, is the maximum of the number of edges that needs to be added to the original graph in order to make the permutation P ε a part of the automorphism group of the amended graph normalized by the total number of edges in the graph. Figure 8 (last row) shows the values of for the obtained repairs. We see that all obtained repair graphs are well below 0.25 (25%, except the backward repair with 5 colors) which is the value found in [5] to be associated with the natural variability in the number of links that are different across the five animals that have been used to build the connectome in [13]. Therefore, from the point of view of pseudosymmetries, each of the repaired networks can be a candidate for the "blueprint" ideal symmetry network of C.
elegans. Note that Fig. 8 shows the maximum values of as defined in Eq. (35).

VIII. COMPARISON WITH OTHER LINK PREDICTION METHODS
To complete our analysis, we compare the results obtained by the formulation (PBCIP) against the results of the manual repairs obtained in [11] and some of the traditional link prediction methods.
We begin by comparing our results with the manual repair 'ground truth' of [11]. Figure   11 shows . Colorings corresponding to these decompositions are the same and therefore, because the objective of (PBCIP) is to minimize the weighted sum of edges added, adding two extra edges is not optimal. However, the logical assumption that the function of the hubs is different from the rest of the nodes led [11]   in the manual solution. The blue edges were added to the manual solution in [11] to factorize the command interneurons AVB from the motor sector, but the formulation found another more optimal solution.
hubs in a bigger network in order to symmetrize other parts of the network connected to them. Figure 12a shows the ideal solution obtained by our formulation and Fig. 12b shows the manual solution for the backward circuit. We can separately compare the repairs applied to different parts of the circuit. Firstly, repairs (AVAL, RIML) and (VA01, AVAR) were obtained in both solutions. Secondly, nodes DA08 and DA09 were repaired to be symmetric similar to the manual version, but links between these two nodes and AVAL needed to symmetrize them with the rest of orange nodes weren't repaired. Lastly, hubs AVAL and  AVAR along with the manually restored nodes with S 12 symmetry in pink were not repaired.
The solution with 9 colors in Fig. 5c does symmetrize AVAL-AVAR, all of the pink nodes and DA08-DA09, but creates an extra symmetry between nodes DA05, VA04 and VB05.
Hence, this repair can be obtained, but it wasn't chosen due to the fact that the maximum of NNTC corresponds to the solution with 12 colors. This provides evidence that further analysis of indices is needed to improve the way to identify the best solution.
Now we take the manual repair as the ground truth and compare the performance of the formulation with some of the traditional link prediction methods. The link prediction problem is a problem of predicting the probability of the existence of the edge between two not connected nodes given the observed data. This probability is defined as the similarity between considered nodes obtained using a topological measure of the graph [32,34,36].
To repair a graph using a link prediction approach, an empirically chosen number of top ranked edges is predicted to exist [35]. Commonly used measures of similarity are divided into local, global and quasi-local classes. Further, we will define some of them following [32] and propose a way to estimate their performance.
First, we introduce a few indices characterizing the local topology of the graph. Let S xy be the distance between nodes x and y, Γ(x) be the neighborhood of node x (set of nodes connected to x), | X | be the cardinality of set X and k x be the degree of node x. Preferential attachment index calculates the similarity between two nodes based on their degree: This measure is widely used in scale-free networks and is generalized by the Randic index [70,71]. In particular, s(G) = x,y∈V S xy is called a scale-free metric and is used as a measure of scale-freeness of the graph [72]. A degree based metric is also used while generating a random scale-free network with the Barabasi-Albert model [73]. In this model, preferential attachment implies that each new added node in a generated graph is connected to other nodes with the probability proportionate to their degree.
Common Neighbors similarity metric is based on the assumption that nodes that have a lot of common neighbors are likely to be neighbors themselves. This assumption has had success in social networks [74,75], however, as will be discussed in more detail later, it does not increase the symmetry of the network. Common Neighbors (CN) is defined as: We also use two differently normalized versions of the Common Neighbors metric: Salton Index normalized by the degree of the nodes and Jaccard similarity normalized by the total size of the node neighborhood: An example of the similarity based on the global topology of the graph is Katz Index.
This index counts the number of walks (paths that can visit nodes and edges more than once) between considered nodes of increasing length weighted by the coefficient β < 1. (A) k xy is the number of walks of length k between nodes x and y [76]. Katz index is defined as: This series converges for β less than the inverse of the largest eigenvalue of A to the expression: where I is an identity matrix.
To analyze the accuracy of each method, we used the traditional hypothesis testing performance metrics. The confusion matrix in Table III introduces    In this paper, we use two approaches to choose the number of top edges to be repaired.
First, as shown in Table IV,   circuit and 3 in the backward). Second, we use the process described in Section VI B, apply it to a given reconstruction method and choose the number of edges corresponding to the maximum number of non-trivial colors of the obtained graphs. Table V shows results obtained by using this approach. Since the adjacency matrix is fairly sparse, the TN value is very high for all the methods. Therefore the best metric for the overall performance assessment is the F-measure that is proportional to the number of edges guessed correctly and inversely proportional to the sum of missed and falsely identified edges.
The formulation (PBCIP) with sum and multiplication objectives, c ij = 1 d i +d j and c ij = neighbors. Therefore, measures that are based on the number of common neighbors will often rank nodes of the same color as likely to have a connection, which is likely to break some symmetry in the network rather than restoring it. The possible alternative symmetryrestoring assumption may be: nodes that have a lot of common neighbors are likely to have more common neighbors, which will keep the symmetry between the nodes with a lot of common neighbors and increase the symmetry in their neighborhood.

IX. CONCLUSION
In summary, we described a method that allows a user to repair a graph to a more symmetric version and compared our results with the manual repair obtained in [11] as well as showing better performance than any of the traditional link prediction methods. We conclude with observations and ideas for future work.
1. The pseudobalanced coloring obtained by solving (PBCIP) formulation sometimes is not minimal as the objective minimizes the weighted sum of links added. So if a pseudobalanced K-coloring had the same optimal objective as a pseudobalanced (K + 1)-coloring, the pseudobalanced K-coloring could be returned for the K + 1 case with one nontrivial node made trivial. This occurred five times and these solutions were not chosen. Moreover, our problem had two objectives: (1) find the pseudobalanced K-coloring with the most non-trivial nodes, and (2) minimize the weighted sum of links added. Thus, our results can be interpreted as determining the best solution to use along the pareto optimal tradeoff curve between the number of colors, K, and the optimal weighted sum of added links. Under this interpretation, the aforementioned balanced (K + 1)-coloring would not be on the pareto optimal curve as the K-coloring would be considered more optimal on both objectives.
2. During preprocessing some of the nodes in the network are assigned fixed colors using constraints of the form Eq. (27). As a result, sets of nodes assigned different nontrivial colors in the original partitioning can't be merged together. Therefore, repairs like the pink S 12 group in Fig. 5c are not obtainable by solving (PBCIP) and need to be obtained as an additional step. This can be solved by implementing a two-step process as described in steps 2(a) and 2(b) of Section VI.
3. Removal of edges or reduction of their is not allowed due to the fact that pseudoedges in Definition III.5 are constrained to non-negative weights on non-edges. This choice is appropriate when studying neural networks of gap junctions in C. elegans. The reconstruction of this network is done using images of cross-sectional areas of the worm obtained using electron microscopy [77]. Each connection is followed from the neuron through all the layers leading to another neuron, therefore it's much more likely to miss a link than to add a non-existent one. When working on a different biological network, the possibility of the removal of edges or edge weight reduction may be required.
4. The two graphs studied in this paper have a high degree of symmetry. We deduced that, for these graphs, the number of non-trivial colors is a good indicator of the best solution and found the best solution using this index. However further investigation on a greater variety of graphs, including those with directed and weighted edges and those of bigger size, is required to make final conclusions. Biological networks present real dynamical systems that need to be able to maintain certain stable synchronization patterns, therefore indices characterizing stability and synchronizability of the cluster synchronization solution predicted by the balanced coloring are likely to give a better indication of ideal repairs, at least in biological networks, which are expected to produce stable cluster synchronization of their units [15].
5. There are two potential ways of how this problem can be formulated: 1 -find the minimal number of edges to add to find a graph with a given number of colors, 2find the repair with minimal number of colors given the amount of edges that can be added. Both formulations make sense and both formulations would produce a number of graphs that will need to be analyzed in order to choose the best solution. The number of colors in a graph is no more than n, the number of nodes, therefore the 1st formulation will generate no more than n graphs. The number of edges in a graph is no more than n 2 , therefore the 2nd formulation can generate up to n 2 graphs. In the interest of decreasing the search space, we chose the 1st formulation and the 2nd formulation was not explored.
6. Solutions to optimization problems can often be improved by using expert insights about the object of optimization. For example, in Section VIII we saw that the result obtained with c ij = 1 can be improved by using an assumption that edges between nodes with higher degrees are easier to repair by setting c ij = 1 d i d j . Additional insight based on the biological considerations can provide further improvements to the result. 7. SBMs have important applications to network reconstruction methods [78][79][80]. In particular, Guimerà and Sales-Pardo [78] describe how the reliability of each potentially missing or spurious link can be calculated using graph generative models.
As such, a generative model promotes certain topological features in the repaired graph, i.e. if certain features appear more often in generated graphs, the corresponding links are considered more reliable. We speculate that if a generative model for random equitable graphs (as described for example in ref. [81]) is chosen with the proper set of parameters, the reconstruction will promote the higher degree of symmetry in the reconstructed network. Parameters of such model can be identified from the large scale analysis of the equitable partitions (balanced colorings) in networks performed in ref. [14] using the fibrationSymmetries library (available at https://github.com/makselab/fibrationSymmetries). Additionally, one of the fundamental problems in network reconstruction via SBMs, and also in general, is the choice of the number of edges to be added/removed. We believe that indices derived from synchronizability and stability as discussed in section VIB can provide a novel approach to this problem. 8. Our method can be applied to directed graphs by modifying the constraints Eq. (24) in (PBCIP) suitably to account for the version of the directed balanced coloring problem that is being solved. In particular, the edge set E must be modified so that only direct edges are used and the number of constraints reduced to reflect whether both directions of the balancing is required. Future work will include a formulation to repair directed networks and its comparison with the quasifibration formalism developed in [41]. 9. In this paper we applied our method to the unweighted graphs as a simplest case, however this method can be applied to weighted graphs without further modification by relaxing the integrality of the z ij variables. This would change (PBCIP) from an integer linear program to a mixed-integer linear program. 10. Our method is reasonably fast for small graphs. In order to be applied to graphs of bigger size, a more computationally efficient methods such as the Bender's decomposition method described in Appendix XI B needs to be implemented.
Overall, we presented a formulation of the ORP following the PBC of the graph. Our analysis shows encouraging results for the case of binary unweighted undirected networks as compared to a manually curated graph and other methods for missing link prediction. More work needs to be done to extend the formulation for large scale networks with weighted and directed edges, including negative weights which are important in all biological networks which contain inhibitory interactions.

A. Additional complexity comments
Here, we show that balanced coloring is hard when the balanced condition is only enforced between nodes of the same color versus the original version where the balanced condition is enforced between nodes of different colors as well. Such a restriction implies the directedversion of the balanced K-coloring problem is NP-Hard when only a subset of the constraints are enforced. This further means the balanced coloring problem is easier when additional constraints are added as the directed-version of the balanced K-coloring problem is polynomial time solvable. Formally, we define this variant as follows.
Definition XI.1. Let K be a given positive integer and G = (V, E) a given directed graph.
The in-directed self-balanced K-coloring problem is to determine whether there exists a partition, C, of V which satisfies the following. For all C ∈ C, and all pairs of distinct nodes p, q ∈ C, j∈C:(j,p)∈E A jp = j∈C:(j,q)∈E A jq .
Note how Eq. (44) is a subset of Eq. (8) restricted to the sum of edge weights within the color set of p and q. We show that finding such a coloring is NP-Hard. We show this by reducing traditional vertex K-coloring to restricted directed in-balanced K-coloring.
Traditional vertex K coloring is the problem of coloring a graph so that no two vertices of the same color share an edge. Formally, we can define it as follows.
Definition XI.2. Let K be a given positive integer and G = (V, E) a given undirected graph with edge weights w e for e ∈ E. The traditional vertex K-coloring problem is to determine whether there exists a partition, C, of V such that |C| = K and for all C ∈ C, if u, v ∈ C then uv ∈ E.
For K ≥ 3, traditional vertex K-coloring is NP-Complete as shown by Karp [82]. By augmenting the input graph to K-vertex coloring with the appropriate subgraph, we can perform the reduction. First, note that each of node in W must be assigned a different color or else the balanced condition would be violated. This implies that no node can have any incoming edge from a node of the same color. Thus, the coloring on V must be a K-vertex coloring of the original graph.

B. A Bender's decomposition approach
For the size instances we report on, the runtime to solve the integer linear program was acceptable. However, we suspect runtimes will scale poorly on larger instances. Because of this, we describe an adaptation of a Bender's decomposition approach due to Codato and Fischetti [83] that accelerate solution times for integer programs with big-M constraints such as Eq. (24) for future implementation. In this approach, we define L V = {s = (x, y) ∈ {0, 1} |V | 2 × {0, 1} |V |×K : (19), (20), (21), (26), V} where V denotes a set of constraints of form Eq. (52) as defined below. The leader problem is to find a solution s * = (x * , y * ) ∈ L V . Given such an s * , we calculate EK * = {(i, j, k) : y * jk = 1, (i, j) ∈ E C } and V * 2 = {(p, q) : x * pq = 1}.
Note that EK * represent the set of all non-edges that the solution s * permit to exist and that V * 2 is the set of node pairs which must be balanced based on the solution s * . Then, given an upper bound U * on B * , the follower problem constraints are as follows. We have an objective bound constraint.
The constraints in Eq. (24) are modified as s * indicates which nodes must be balanced.
Then, we define and the follower problem is to find a solution in F s * . If F s * is empty then an Irreducible Inconsistent Subsystem (IIS) is found [84]. An IIS is a subset of the constraints of F s * such that • the system defined by the subset of constraints are infeasible; and • removing any one of the constraints results in a feasible system.
Note that the constraints in the IIS are indexed by a union of a subset of V * 2 × K and a subset of E . Using the IIS found, let V K I 2 and E I denote these subsets, respectively. Note that each of the constraints in the IIS corresponds to several binary variables in Eq. (45). First, consider an arbitrary element (p, q, c) ∈ V K I 2 . In this case, the binary variable x * pq = 1 must be true for the equality constraint to be a part of the formulation. In addition, the constraint could be violated some z pjc and z qjc are not in the formulation as y jc = 0, i.e., (p, j, c) ∈ EK * or (q, j, c) ∈ EK * and pj ∈ E or qj ∈ E . Let V K * C = {(j, c) : y jc = 0}. Then, the indices of the y jc set to zero that could potentially cause the infeasiblity are described by the set {(j, c) ∈ V K * C : pj ∈ E or qj ∈ E }. Now consider an arbitrary pq ∈ E I . In this case, the cause of the violation must be due to one or more z pjc or z qjc being held to zero. So, again, the set of indices of y jc that could cause the infeasibility is {(j, c) ∈ V K * C : pj ∈ E or qj ∈ E }. Indices of the binary variables that could be causing the infeasibility are then I x = (p, q) : ∃c ∈ K, (p, q, c) ∈ V K I 2 I y = (j, c) ∈ V K * C : (pj ∈ E or qj ∈ E ) and (∃c ∈ K, (p, q, c) ∈ V K I 2 or pq ∈ E I ) .
Note that this inequality may not be the strongest possible as the cause of the infeasibility in any particular constraint could be from one or more z pqc variables or, in the case of the constraints in V K I 2 , from the setting that x pq is one (which causes the existence of the constraint). Thus, we must have a systematic way of adding the z pqc variables back to Denote the optimal solution to Eq. (53) by z * . The objective value U * is used to update the upper bound in used in Eq. (47) and Eq. (50) is then resolved.
The precise description of the method is as follows. We note that δ has not been specified and is dependent on the weights of the edges in the input graph. For an unweighted graph, we can use δ = 1. Algorithm 1: Bender's decomposition of (PBCIP) Result: An optimal solution, (s * , z * ) = (x * , y * , z * ), to (PBCIP) Initialize V to the empty set; Use Eq. (45) to form L V ; Set U * to an upperbound, e.g., the maximum possible edge weight times |V | 2 ; Initialize z * to zero.; Use I x and I y to add a valid inequality to V via Eq. (52); end end return (s * , z * ) = (x * , y * , z * )