Clustering matrices through optimal permutations

Matrices are two-dimensional data structures allowing one to conceptually organize information. For example, adjacency matrices are useful to store the links of a network; correlation matrices are simple ways to arrange gene co-expression data or correlations of neuronal activities. Clustering matrix values into geometric patterns that are easy to interpret helps us to understand and explain the functional and structural organization of the system components described by matrix entries. Here we introduce a theoretical framework to cluster a matrix into a desired pattern by performing a similarity transformation obtained by solving a minimization problem named the optimal permutation problem. On the computational side, we present a fast clustering algorithm that can be applied to any type of matrix, including non-normal and singular matrices. We apply our algorithm to the neuronal correlation matrix and the synaptic adjacency matrix of the Caenorhabditis elegans nervous system by performing different types of clustering, including block-diagonal, nested, banded, and triangular patterns. Some of these clustering patterns show their biological significance in that they separate matrix entries into groups that match the experimentally known classification of C. elegans neurons into four broad categories, namely: interneurons, motor, sensory, and polymodal neurons.


I. INTRODUCTION
We formulate the optimal permutation problem (OPP) in a pragmatic way by considering the correlation matrix B shown in Figure 1a, which describes the neuronal activity of N = 33 neurons of the nematode C. elegans measured during three locomotory tasks of the animal (forward, backward, turn) in ref. [3]. The choice of this particular dataset is useful to get a concrete view of the optimal permutation problem and how it relates to real-world data, although we could formulate our mathematical theory entirely in abstracto as well. Thus, we emphasize that the optimal permutation theory and algorithm we present here can be applied to any square matrix of the most general form.
The entries B_ij of the correlation matrix in Figure 1a ought to be clustered into two blocks. To identify the two groups, we consider a matrix A, called the filter, with a two-block shape, as seen in Figure 1b. In other words, we use A to guide the clustering process in order to obtain a clustered matrix as 'similar' as possible to A. Mathematically, this can be achieved by means of the objective function E(P) defined as

E(P) = \frac{1}{2} \| A - P^t B P \|^2 ,   (1)

where \|A\|^2 = tr(A^t A) is the squared Frobenius norm and P is a permutation matrix, whose entries P_ij ∈ {0, 1} satisfy the constraints \sum_i P_ij = \sum_j P_ij = 1. The objective function in Equation (1) appeared for the first time in the formulation of the quadratic assignment problem (QAP) [5], which is one of the most important problems in the field of combinatorial optimization. The QAP was initially introduced in economics to find the optimal assignment (= optimal permutation in our language) of N facilities to N locations, given the matrix of distances between pairs of facilities (= filter matrix A) and a weight matrix quantifying the amount of goods flowing between firms (= correlation matrix B).
To better explain the meaning of the objective function in Equation (1), let us suppose that we could find a permutation matrix P_* such that E(P_*) = 0. This means that matrix A is permutation similar to matrix B via the transformation A = P_*^t B P_*. That is, A itself is the desired clustering of the matrix B. But the equation E(P) = 0 has no solution almost always (it admits solutions only for special choices of the filter A). Therefore, in general, A is not permutation similar to B and is not itself a clustering of B. What we can do in this situation is to look for a permutation matrix P_* that minimizes the cost function, so that only the weaker condition E(P_*) ≥ 0 holds true, as seen in Figure 1d. That is, P_* is the solution of the following optimization problem:

P_* = \arg\min_P E(P) .   (2)

We call P_* the optimal permutation of matrix B (given the filter matrix A) and we show it in Figure 1e. Once we obtain P_*, we can proceed to cluster matrix B by performing a similarity transformation that brings B into its clustered form B':

B' = P_*^t B P_* .   (3)

The result is shown in Figure 1f. The two-block clustering (left panel in Figure 1f) identifies two clusters separating two groups of neurons: one group contains neurons driving backward locomotion, and the other contains those regulating forward locomotion. By using the three-block filter shown in Figure 1c we obtain a clustering of the correlation matrix B into 3 clusters: two of them are each a subset of the backward and forward locomotion groups defined previously.
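For small matrices the OPP of Equation (2) can be solved exactly by exhaustive search, which makes the definitions above concrete. The following is a minimal brute-force sketch (ours, for illustration only; it is not the paper's algorithm and scales as N!): enumerate all permutation matrices P, evaluate the squared Frobenius distance between A and P^t B P, and keep the minimizer.

```python
# Brute-force illustration of the optimal permutation problem (OPP).
# Function names are ours; the cost is E(P) = ||A - P^T B P||^2 (up to the
# constant normalization, which does not affect the minimizer).
from itertools import permutations
import numpy as np

def energy(A, B, P):
    """Squared Frobenius distance between the filter A and the permuted B."""
    D = A - P.T @ B @ P
    return float(np.trace(D.T @ D))

def optimal_permutation_bruteforce(A, B):
    """Enumerate all N! permutation matrices and return the minimizer."""
    N = A.shape[0]
    best_P, best_E = None, np.inf
    for perm in permutations(range(N)):
        P = np.eye(N)[list(perm)]           # permutation matrix for this perm
        E = energy(A, B, P)
        if E < best_E:
            best_P, best_E = P, E
    return best_P, best_E

# Toy example: B is a scrambled two-block matrix; A is the two-block filter.
A = np.kron(np.eye(2), np.ones((2, 2)))     # two 2x2 blocks on the diagonal
rng = np.random.default_rng(0)
P_r = np.eye(4)[rng.permutation(4)]
B = P_r @ A @ P_r.T                         # B is permutation similar to A
P_star, E_star = optimal_permutation_bruteforce(A, B)
print(E_star)  # 0.0: the search recovers a permutation with E(P*) = 0
```

Since B is permutation similar to A by construction, the minimum energy is exactly zero here; for a generic B only E(P_*) ≥ 0 can be achieved, as discussed above.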
The third cluster occupies the middle block in the right panel of Figure 1f. Neurons belonging to this cluster are classified by the Wormatlas database [6] as: ring interneurons (RIVL/RIVR) regulating reversals and deep omega-shaped turns; motor neurons (SMDVR/SMDVL, RMEV) defining the amplitude of omega turns; labial neurons (OLQDR/OLQVL) regulating nose oscillations in local search behavior; and a high-threshold mechanosensor (ALA) responding to harsh-touch mechanical stimuli. We term "Turn" the third block in the clustered correlation matrix shown in Figure 1f.
Having formulated the optimal permutation problem, we move now to explain the algorithm to solve it, along with several interesting applications to the C. elegans whole-brain connectome.

II. SOLUTION TO THE OPTIMAL PERMUTATION PROBLEM
In order to determine the solution to the optimal permutation problem (OPP) given in Equation (2) we use statistical mechanics methods [7]. The quantity which plays the fundamental role in the resolution of the OPP is the partition function Z(β), defined as

Z(β) = \sum_P e^{-β E(P)} ,   (4)

where the sum is over all permutation matrices P. The statistical physics interpretation of the problem thus follows. The parameter β in Equation (4) represents the inverse of the 'temperature' of the system; the cost function E(P) defined in Equation (1) becomes the 'energy' function. The global minimum of the energy function corresponds to the 'ground state' of the system. Since a physical system goes into its ground state only at zero temperature (by the third law of thermodynamics), the exact solution to the OPP corresponds to the zero temperature limit of the partition function in Equation (4):

\min_P E(P) = -\lim_{β→∞} \frac{1}{β} \ln Z(β) .   (5)

In this limit the partition function can be evaluated by the steepest descent method, which leads us to the following saddle point equations (see SI Sec. IV for the detailed calculations):

X_{ij} = U_i \, Y_{ij}(X) \, V_j ,   (6)

where X_ij are the entries of a doubly stochastic matrix X, which take values in the interval X_ij ∈ [0, 1] and satisfy normalization conditions on row and column sums: \sum_i X_ij = \sum_j X_ij = 1 for all i and j (the space of all X's is also called the Birkhoff polytope [8]). The matrix Y(X) contains the information about matrices A and B and its components are explicitly given by

Y_{ij}(X) = e^{β W_{ij}(X)} ,  with  W(X) = \frac{1}{2}\left( B X A^t + B^t X A \right) .   (7)

The vectors U and V are needed to ensure the row and column normalization conditions, and we compute them by solving the Sinkhorn-Knopp equations [9][10][11]:

U_i = \Big[ \sum_j Y_{ij}(X) \, V_j \Big]^{-1} ,   V_j = \Big[ \sum_i U_i \, Y_{ij}(X) \Big]^{-1} .   (8)

Equations (6) and (7) represent our main result. Equations similar to (8) have already been derived in ref. [12] to relate the fitness of countries to their economic complexity. Note that the solution X_* to the saddle point equations (6) is not a permutation matrix for β < ∞.
To find the optimal permutation matrix P_* defined in Equation (2) we have to take the zero temperature limit by sending β → ∞: in this limit the solution matrix X_*(β) is projected onto one of the N! vertices of the Birkhoff polytope,

P_* = \lim_{β→∞} X_*(β) ,   (9)

which is the optimal permutation matrix P_* that solves the OPP [13] (details in Section IV). The implementation of the algorithm to solve the saddle point equations (6) is described in detail in SI Sec. V. Next, we use our optimal permutation algorithm to perform three types of clustering of the C. elegans connectome.
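The Sinkhorn-Knopp balancing of Equations (8) can be sketched directly: given a strictly positive kernel Y, alternate the U and V updates until X_ij = U_i Y_ij V_j is doubly stochastic. The function below is a minimal sketch under that assumption (variable names follow the text; the tolerance and iteration cap are ours).

```python
# Sinkhorn-Knopp balancing: rescale a strictly positive matrix Y so that
# X_ij = U_i * Y_ij * V_j has unit row and column sums (Equations (8)).
import numpy as np

def sinkhorn(Y, tol=1e-10, max_iter=10_000):
    N = Y.shape[0]
    U = np.ones(N)
    V = np.ones(N)
    for _ in range(max_iter):
        U_new = 1.0 / (Y @ V)           # U_i = 1 / sum_j Y_ij V_j
        V_new = 1.0 / (Y.T @ U_new)     # V_j = 1 / sum_i U_i Y_ij
        converged = (np.max(np.abs(U_new - U)) < tol
                     and np.max(np.abs(V_new - V)) < tol)
        U, V = U_new, V_new
        if converged:
            break
    return (U[:, None] * Y) * V[None, :]   # X with ~unit row and column sums

rng = np.random.default_rng(1)
Y = rng.random((5, 5)) + 0.1                # strictly positive entries
X = sinkhorn(Y)
print(X.sum(axis=0), X.sum(axis=1))         # both ~ [1, 1, 1, 1, 1]
```

Positivity of Y (guaranteed here by Y_ij = e^{βW_ij} > 0) is what ensures the Sinkhorn-Knopp iteration converges to a unique doubly stochastic rescaling.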
III. CLUSTERING THE C. ELEGANS CONNECTOME THROUGH OPTIMAL PERMUTATIONS
We consider the neuronal network of the hermaphrodite C. elegans comprising N = 253 neurons interconnected via gap junctions (we consider only the giant component of this network). We use the most up-to-date connectome of gap junctions from Ref. [14]. We represent the synaptic connectivity structure via a binary adjacency matrix B, with B_ij = 1 if neuron i connects (i.e. forms gap junctions) to j, and B_ij = 0 otherwise, as shown in Figure 2a (\sum_ij B_ij = 2M = 1028, so that the mean degree is ⟨k⟩ = 2M/N ≈ 4). Gap junctions are undirected links, hence B is a symmetric matrix. We emphasize that our framework is not limited to symmetric matrices and can be equally applied to asymmetric adjacency matrices representing directed chemical synapses.
We perform a clustering experiment by using a filter A whose shape is shown in Figure 2b. We call it the 'nestedness filter'. This nomenclature is motivated by ecological studies of species abundance showing nested patterns in the community structure of mammals [15] and plant-pollinator ecosystems [16,17]. Nestedness is also found in interbank, communication, and socio-economic networks [12,18,19]. Remarkably, it has been shown recently that behavioral control in C. elegans occurs via a nested neuronal dynamics across motor circuits [20]. A nested connectivity structure implies the existence of two types of nodes, be they animal species, neurons, or firms, that are called 'generalists' and 'specialists'. Generalists are ubiquitous species with a large number of links to other species that are quickly reachable by the other nodes; specialists are rare species with a small number of connections, occupying peripheral locations of the network and having a higher likelihood of going extinct [21].
The entries of A are defined by:

A_{ij} = 1  if  j ≤ f(i; p),  and  A_{ij} = 0  otherwise,   (10)

where p is the nestedness exponent controlling the curvature of the function f(i; p) separating the filled and empty parts of the matrix A, as seen in Figure 2b. By solving the OPP we obtain the optimal permutation matrix P_* shown in Figure 2c, by means of which we cluster the adjacency matrix via the similarity transformation B' = P_*^t B P_*, as depicted in Figure 2d. In order to measure the degree of nestedness of the connectome we introduce the quantity φ(p), defined as the fraction of elements of B' comprised in the nested region j ≤ f(i; p) by the following formula:

φ(p) = \frac{1}{2M} \sum_{i=1}^{N} \sum_{j \le f(i;p)} B'_{ij} .   (11)

We call φ(p) the 'packing fraction' of the network. The profile of φ(p) as a function of p is shown in Figure 2e, comparing the C. elegans connectome to a randomized connectome having the same degree sequence but neurons wired at random through the configuration model [22]. Figure 2e shows that the C. elegans connectome is 10% more packed than its random counterpart over almost the entire range of p. Last but not least, we present three more types of clustering performed on the C. elegans connectome, using three more filters: the bandwidth filter [23], shown in Figure 3a; the triangular filter; and the square (or box) filter [4], whose mathematical properties are discussed in SI Sec. VI.
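The packing fraction of Equation (11) is straightforward to compute once the clustered matrix and the boundary f(i; p) are in hand. The sketch below (ours) keeps f generic by passing it in as a callable, since its exact form depends on the chosen nestedness filter; the toy boundary used in the example is purely illustrative.

```python
# Sketch of the packing fraction phi(p): the fraction of nonzero entries of the
# clustered matrix that fall inside the nested region j <= f(i; p).
import numpy as np

def packing_fraction(B_clustered, f, p):
    """B_clustered: clustered adjacency matrix; f(i, p): boundary (1-based i)."""
    N = B_clustered.shape[0]
    mask = np.zeros((N, N), dtype=bool)
    for i in range(1, N + 1):                  # 1-based matrix indices
        jmax = int(np.floor(f(i, p)))
        mask[i - 1, :max(jmax, 0)] = True      # columns j = 1 .. f(i; p)
    total = B_clustered.sum()
    return float(B_clustered[mask].sum() / total)

# Toy check with an illustrative triangular boundary f(i, p) = N - i + 1:
N = 6
B = np.tril(np.ones((N, N)))[::-1]             # perfectly "nested" toy matrix
phi = packing_fraction(B, lambda i, p: N - i + 1, p=1.0)
print(phi)  # 1.0: every nonzero entry sits inside the boundary
```

A well-clustered nested matrix pushes φ(p) toward 1, which is how the comparison against the randomized connectome in Figure 2e is quantified.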
Furthermore, we notice that if A itself represents the graph of a network, then the OPP is equivalent to the graph isomorphism problem [24], as exemplified in Figure 4, which becomes the graph automorphism problem in the special case A = B. In this latter case, the OPP is equivalent to the problem of minimizing the norm of the commutator E(P) = ||[A, P]||^2. Then, the optimal permutation P_* is called a 'symmetry of the network' if E(P_*) = 0, or a 'pseudosymmetry' if only the weaker condition E(P_*) > 0 holds true [25].
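The A = B special case is easy to check numerically: a permutation S is an exact symmetry (automorphism) of the graph precisely when the commutator [B, S] = BS − SB vanishes. A minimal sketch (ours), using a 4-cycle, which is invariant under cyclic relabeling of its nodes:

```python
# Symmetry test via the commutator norm E(S) = ||[B, S]||^2.
import numpy as np

def commutator_norm_sq(B, S):
    """Squared Frobenius norm of the commutator [B, S] = BS - SB."""
    C = B @ S - S @ B
    return float(np.trace(C.T @ C))

# Adjacency matrix of the 4-cycle 0-1-2-3-0.
B = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
S = np.roll(np.eye(4), 1, axis=0)      # cyclic shift permutation
print(commutator_norm_sq(B, S))        # 0.0: S is an exact symmetry of B
```

Any permutation for which this quantity is small but nonzero would be a pseudosymmetry in the sense of ref. [25].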
In conclusion, our analytical and algorithmic results for clustering a matrix by solving the optimal permutation problem reveal their importance in that their essential features are not contingent on a special form of the matrix, nor on special assumptions concerning the filters involved. We may well expect that it is this part of our work which is most certain to find important applications, not only in natural science but also in the understanding of artificial systems.

Data availability
Data that support the findings of this study are publicly available at the Wormatlas database at https://www.wormatlas.org.
The author declares no competing interests.
Correspondence and requests for source codes should be addressed to F. M. at: flaviomorone@gmail.com

Figure 2d: The clustered adjacency matrix obtained from B by applying a similarity transformation with the optimal permutation matrix P_* found in c, that is P_*^t B P_* (left side). Right side: clustered adjacency matrices obtained with nine different filters having nestedness exponents 0.1 ≤ p ≤ 1.0.
Figure 2e: The packing fraction φ(p) of the C. elegans connectome (red dots), defined by Equation (11).

Figure 4: We consider a matrix B' obtained from B by a similarity transformation B' = P_r B P_r^t, where P_r is a random permutation matrix with no fixed points (also called a derangement). In other words, B' is permutation similar to B, hence the two graphs represented by B and B' are isomorphic by construction. Of course, this information is not exploited by our algorithm, which has, a priori, no clue about how the matrix B' has been generated, nor whether an isomorphism between B and B' exists at all. The fact that the minimum of the energy function (red dots) goes to zero as the number of iterations t of the algorithm increases means that our algorithm is able to determine that the graphs B and B' are indeed isomorphic. Moreover, the optimal permutation P_* returned by the algorithm is an explicit example of a graph isomorphism. We note that P_* and P_r need not be the same permutation matrix. This is due to the existence of symmetries (i.e. automorphisms) of the matrix B. These symmetries are permutation matrices S that commute with B, so that we can write B = S B S^t for any S such that [B, S] = 0. Therefore, there are as many solutions P_* to the graph isomorphism problem as there are symmetries S of the connectome B. Thus, we can retrieve the original permutation P_r only up to an automorphism of B, i.e., P_* = P_r S. The blue dots represent the maximum difference between the entries X_ij(t) at iteration t and X_ij(t − 1) at the previous step t − 1.
IV. DERIVATION OF THE SADDLE POINT EQUATIONS

We call A the 'filter' (or 'template') matrix, and B the 'input' matrix, and we quantify their mismatch through the difference matrix

Δ(P) = A − P^t B P .   (12)

We define the inner product ⟨L|R⟩ between two matrices L, R ∈ M_{N×N} by means of the following formula:

⟨L|R⟩ = tr(L^t R) ,   (13)

where tr indicates the trace operation: tr(A) = \sum_{i=1}^N A_ii. Then, the norm of Δ(P) can be computed as:

||Δ(P)||^2 = ||A||^2 + ||B||^2 − 2⟨P|Q(P)⟩ ,   (14)

where we have introduced the 'overlap' matrix Q(P), which is defined as follows:

Q(P) = B P A^t .   (15)

The quantity we want to optimize over is precisely the inner product ⟨P|Q(P)⟩. Therefore, we define the objective (or energy) function E(P) of our problem as

E(P) = −\frac{1}{2} ⟨P|Q(P)⟩ ,   (16)

where the factor 1/2 has been chosen for future convenience. The optimal permutation problem (OPP) is defined as the problem of finding the global minimum (or ground state) of the energy function in Equation (16):

P_* = \arg\min_P E(P) .   (17)

In order to determine the solution to the OPP given by (17) we take a statistical physics approach by introducing a fundamental quantity called the partition function Z(β), defined by the following summation:

Z(β) = \sum_{P ∈ \mathcal{P}} e^{−β E(P)} ,   (18)

where the notation P ∈ \mathcal{P} indicates the sum over all N×N permutation matrices P. The parameter β in Equation (18) represents, in the statistical physics interpretation of the problem, the inverse of the 'temperature' of the system. We notice that the sum in Equation (18) involves N! terms, and thus grows as the factorial of the system size, Z ∼ O(N!), rather than displaying the peculiar exponential growth, Z ∼ O(e^N), that appears in the study of the thermodynamic limit of many-body classical and quantum systems.
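The expansion of the norm above can be verified numerically. The snippet below (a sanity check of ours, not part of the paper) confirms that ||A − P^t B P||^2 = ||A||^2 + ||B||^2 − 2⟨P|Q(P)⟩ with Q(P) = B P A^t, using ⟨L|R⟩ = tr(L^t R) and the fact that a permutation similarity preserves the Frobenius norm of B.

```python
# Numerical check of ||A - P^T B P||^2 = ||A||^2 + ||B||^2 - 2 tr(P^T Q(P))
# with the overlap matrix Q(P) = B P A^T.
import numpy as np

rng = np.random.default_rng(2)
N = 5
A = rng.random((N, N))
B = rng.random((N, N))
P = np.eye(N)[rng.permutation(N)]       # a random permutation matrix

lhs = np.sum((A - P.T @ B @ P) ** 2)    # ||Delta(P)||^2
Q = B @ P @ A.T                         # overlap matrix Q(P)
rhs = np.sum(A**2) + np.sum(B**2) - 2.0 * np.trace(P.T @ Q)
print(abs(lhs - rhs))  # ~ 0 (floating-point roundoff)
```

Because the first two terms are constants, minimizing ||Δ(P)||^2 is the same as maximizing ⟨P|Q(P)⟩, which is why E(P) in Equation (16) is defined as the (halved, sign-flipped) overlap.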
The global minimum of the objective function in Equation (16) corresponds, physically, to the 'ground state' of the system. But a physical system exists in its ground state only at zero temperature (by the third law of thermodynamics), and thus the exact solution to our optimization (i.e. minimization) problem can be computed by taking the zero temperature limit (which is mathematically tantamount to sending β → ∞) of the partition function defined by Equation (18).
Specifically, the minimum of E(P) is given by:

\min_P E(P) = −\lim_{β→∞} \frac{1}{β} \ln Z(β) .   (19)

Since the partition function in Equation (18) can be easily calculated when all P_ij appear linearly in the argument of the exponential, a good idea is to write the quadratic term in the energy, which couples two variables P_ij and P_kℓ on different links i → j and k → ℓ, as an integral over disconnected terms. In order to achieve this result, we insert the δ-function

δ(X_{ij} − P_{ij}) = \int \frac{dJ_{ij}}{2\pi i}\, e^{−J_{ij}(X_{ij} − P_{ij})} ,   (20)

where the integration over J_ij runs along the imaginary axis, into the representation of the partition function [7]. To proceed further in the calculation, we enforce the constraint on the column sums,

\sum_i P_{ij} = 1 ,   (22)

by inserting N δ-functions into the partition function, so that the remaining sum runs over the space \mathcal{P} of N×N right stochastic matrices P with integer entries P_ij ∈ {0, 1}, that is, matrices with each row summing to one: \sum_{j=1}^N P_ij = 1 (but no constraint on the column sums). Then, summation over the variables P_ij is straightforward (each row sums independently), and we find:

Z(β) = \int dX\, dJ\, dz\; \exp\Big[ −βE(X) − \sum_{ij} J_{ij} X_{ij} + \sum_j z_j \Big( 1 − \sum_i X_{ij} \Big) \Big] \prod_{i=1}^N \sum_{j=1}^N e^{J_{ij}} .   (24)

Introducing the function F(X, J, z) that collects all the terms in the exponent,

F(X, J, z) = −E(X) − \frac{1}{β}\sum_{ij} J_{ij} X_{ij} + \frac{1}{β}\sum_j z_j \Big( 1 − \sum_i X_{ij} \Big) + \frac{1}{β}\sum_i \ln \sum_j e^{J_{ij}} ,

we can write the partition function of Equation (24) as

Z(β) = \int dX\, dJ\, dz\; e^{β F(X, J, z)} ,   (27)

which can be evaluated by the steepest descent method in the limit of zero temperature (i.e. β → ∞). The saddle point equations (28) are obtained by differentiating F with respect to X_ij, J_ij, and z_j. The solution to the saddle point Equations (28) is given by:

X_{ij} = \frac{e^{J_{ij}}}{\sum_k e^{J_{ik}}} .   (29)

Notice that the solution X_ij automatically satisfies the condition of having each row summing to one, \sum_j X_ij = 1 ∀i, as it should. Opposed to this are the constraints on the column sums, \sum_i X_ij = 1 ∀j, which are taken into account by the Lagrange multipliers z_j.
Next, we eliminate J_ij in favor of X_ij and make the constraints on the row and column normalizations manifest in the final solution. Introducing the matrix W(X) defined by

W(X) = \frac{1}{2}\left( B X A^t + B^t X A \right) ,   (30)

we can write J_ij as J_ij = βW_ij(X) − z_j. Thus, X_ij in Equation (29) takes the form

X_{ij} = \frac{e^{βW_{ij}(X) − z_j}}{\sum_k e^{βW_{ik}(X) − z_k}} .   (31)

We notice that Equation (31) is invariant under global translations of the form

z_j → z_j + ζ ,   (32)

for arbitrary values of ζ. This symmetry is not unexpected and can be traced back to the fact that, out of the 2N constraints on the row and column normalizations, only 2N − 1 are linearly independent, since the sum of all entries must be equal to N, i.e., \sum_{ij} X_ij = N. This translational symmetry can be eliminated, for example, by choosing ζ in such a way that:

\sum_j z_j = 0 .   (33)

In general, the Lagrange multipliers z_j in Equation (31) can be eliminated, in principle, by imposing the constraints \sum_i X_ij = 1, i.e., by solving the following system of equations:

\sum_i \frac{e^{βW_{ij}(X) − z_j}}{\sum_k e^{βW_{ik}(X) − z_k}} = 1 ,  ∀j .   (34)

We define, just for future notational convenience, the variable Y_ij(X) as follows:

Y_{ij}(X) = e^{βW_{ij}(X)} .   (35)
Then, we can make the normalization constraints manifest by defining two vectors: a right vector with components V_j = e^{−z_j} and a left vector with components U_i = [\sum_k Y_{ik}(X) e^{−z_k}]^{−1}, whereby we can rewrite Equation (31) as

X_{ij} = U_i \, Y_{ij}(X) \, V_j ,   (37)

where V_j(X) can be calculated consistently with U_i(X) using the following equations:

U_i = \Big[ \sum_j Y_{ij}(X) \, V_j \Big]^{−1} ,   V_j = \Big[ \sum_i U_i \, Y_{ij}(X) \Big]^{−1} .   (39)

We notice that Equations (37) and (39) are nothing but the Sinkhorn-Knopp equations [9,10] used to rescale all rows and all columns of a matrix with strictly positive entries (as is indeed each term Y_ij(X) = e^{βW_ij(X)} > 0 in the present case) to sum to one.

V. ALGORITHM TO SOLVE THE SADDLE POINT EQUATIONS AND FIND THE OPTIMAL PERMUTATION
In order to find the matrix X_* that solves Equations (28) we set up an algorithm defined by the following iterative procedure. First of all, we need to introduce a regularized kernel Y(X; ε), with ε > 0 being a smoothing parameter to be sent eventually to zero. In all our experiments we set ε = 10 at the start, and then decrease it by one, ε → ε − 1, until ε = 0, after each completion of the following routine.
• 1) Initialize X_ij^(0) at time zero to a uniform matrix: X_ij^(0) = 1/N.
• 2) Calculate the regularized kernel Y_ij(X^(t); ε). We choose α = 10^{−3} and β = 10 (the parameter α is a 'damping' factor which helps the convergence of the algorithm).
• 3) Calculate U_i and V_j as follows:
  a) initialize V_j^(0) = 1;
  b) compute U_i^(1) = [\sum_j Y_{ij} V_j^(0)]^{−1};
  c) compute V_j^(1) = [\sum_i U_i^(1) Y_{ij}]^{−1} and the maximum change δ between two successive iterates;
  d) If δ > 10^{−5} then start over from step b); otherwise return U_i and V_j.
• 4) Update X by computing X_ij^(t+1) = (1 − α) X_ij^(t) + α U_i Y_ij(X^(t); ε) V_j.
• 5) Calculate the quantity Δ = max_{ij} |X_ij^(t+1) − X_ij^(t)|.
• 6) If Δ > 10^{−5} then start over from step 2); otherwise return X_ij.
The output matrix (X_*)_ij is not a permutation matrix, but only a doubly stochastic matrix, that is, a matrix whose entries are real numbers X_ij ∈ [0, 1] satisfying the double normalization condition on row and column sums:

\sum_i X_{ij} = \sum_j X_{ij} = 1 .

To find the solution of the OPP we should take the zero temperature limit by sending β → ∞: in this limit the solution matrix X_* is projected onto one of the N! vertices of the Birkhoff polytope, which represents the optimal permutation matrix P_* that solves the OPP. In order to find P_* numerically, we use a simple method which consists in finding a solution X_* at large, but finite, β (we use β = 10), followed by a hard thresholding of the matrix entries (X_*)_ij.
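The routine above can be condensed into a short sketch, with several simplifications that are ours rather than the paper's: a much larger damping (α = 0.5 instead of 10^{−3}) so the toy example converges quickly, no ε-regularization schedule, fixed iteration counts in place of the δ and Δ stopping tests, and a row-wise argmax as the hard thresholding step. The kernel W(X) = (B X A^t + B^t X A)/2 follows Equation (30).

```python
# Compact sketch of the mean-field OPP solver (our simplifications; see text).
import numpy as np

def sinkhorn_balance(Y, n_iter=300):
    """Rescale the positive kernel Y toward a doubly stochastic matrix."""
    U = np.ones(Y.shape[0])
    V = np.ones(Y.shape[1])
    for _ in range(n_iter):
        U = 1.0 / (Y @ V)        # U_i = 1 / sum_j Y_ij V_j
        V = 1.0 / (Y.T @ U)      # V_j = 1 / sum_i U_i Y_ij
    return (U[:, None] * Y) * V[None, :]   # X_ij = U_i Y_ij V_j

def solve_opp(A, B, beta=10.0, alpha=0.5, n_iter=300):
    """Damped fixed-point iteration followed by hard thresholding."""
    N = A.shape[0]
    X = np.full((N, N), 1.0 / N)                # step 1: uniform start
    for _ in range(n_iter):
        W = 0.5 * (B @ X @ A.T + B.T @ X @ A)   # mean-field kernel W(X)
        Y = np.exp(beta * (W - W.max()))        # shift to avoid overflow
        X = (1 - alpha) * X + alpha * sinkhorn_balance(Y)
    P = np.zeros((N, N))                        # hard thresholding: largest
    P[np.arange(N), X.argmax(axis=1)] = 1.0     # entry per row -> 1
    return P

# Toy input with A = B and distinct diagonal weights: by the rearrangement
# inequality the optimal permutation is the identity, which the damped
# iteration should recover here.
B = np.diag([1.0, 2.0, 3.0, 4.0])
P_star = solve_opp(B, B)
print(P_star)
```

Subtracting W.max() before exponentiating only rescales Y by a constant, which the Sinkhorn balancing absorbs, so it changes nothing except numerical stability.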

VI. FILTER MATRICES
Having discussed how to implement the algorithm, we present, next, several types of filter matrices A that we used in our clustering experiments.

A. Nestedness filter
The nestedness filter is described by a matrix A whose nonzero entries A_ij are equal to 1 when the condition j ≤ j_max(i, p) is satisfied (Equation (47)), where the profile j_max(i, p) separates the filled and empty parts of A, as shown in Fig. 5a. The parameter p ∈ [0, 1] quantifies the nestedness of the matrix A. Specifically, low values of p correspond to a matrix A with a highly nested structure. Opposite to this, large values of p, i.e. p ∼ 1, describe profiles of low nestedness, as depicted in Fig. 5a,c. The density ρ(p) of the filter matrix A defined by Equation (47) is the fraction of its nonzero entries. The finite-N behavior of ρ(p), together with its N → ∞ limit, is shown in Fig. 5b.

B. Band filter
The band filter is a matrix A whose entries A_ij ∈ {0, 1} are defined by

A_{ij} = 1  if  j_min(i, p) ≤ j ≤ j_max(i, p),  and  A_{ij} = 0  otherwise,   (51)

where j_min(i, p) = 1 + (i − 1)^{1/p} (N − 1)^{1−1/p}, and p is a parameter that controls the width of the band, hence we call p the bandwidth exponent. The band filter in Equation (51) has nonzero entries comprised in a band delimited by j_min(i, p) and j_max(i, p) for i = 1, ..., N. The density ρ(p) of A is defined as the fraction of entries contained inside the band. For N → ∞, the density evaluates to

ρ(p) = \frac{1 − p}{1 + p} .   (54)
The finite-N behavior of ρ(p), along with the N → ∞ limit given by Equation (54), is shown in Figure 6b.
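The band filter can be built directly from these formulas. The sketch below uses the j_min given in the text and takes j_max(i, p) = 1 + (i − 1)^p (N − 1)^{1−p}, which is our reading of the rescaled form j_max − j_min = (N − 1)(x^p − x^{1/p}) with x = (i − 1)/(N − 1); the density should then approach (1 − p)/(1 + p) for large N.

```python
# Construct the band filter of Equation (51) and check its density.
import numpy as np

def band_filter(N, p):
    """Ones between j_min(i, p) and j_max(i, p); p in (0, 1]."""
    A = np.zeros((N, N))
    for i in range(1, N + 1):
        x = (i - 1) / (N - 1)                   # rescaled coordinate in [0, 1]
        j_min = 1 + (N - 1) * x ** (1.0 / p)    # lower band boundary
        j_max = 1 + (N - 1) * x ** p            # upper band boundary (assumed)
        for j in range(1, N + 1):
            if j_min <= j <= j_max:
                A[i - 1, j - 1] = 1.0
    return A

A = band_filter(400, 0.5)
print(A.mean())  # ~ (1 - 0.5) / (1 + 0.5) = 1/3 for large N
```

Since x^{1/p} ≤ x^p for x ∈ [0, 1] and p ≤ 1, the band is well defined (j_min ≤ j_max) across the whole matrix.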
A useful quantity to characterize the shape of the band filter is the bandwidth b(p). Let us define the rescaled coordinate x, taking values in the range x ∈ [0, 1], as

x = \frac{i − 1}{N − 1} ,

whereby we can write the difference j_max − j_min as

j_max(i, p) − j_min(i, p) = (N − 1)\left( x^p − x^{1/p} \right) .
Next we define the rescaled bandwidth b(p) as the maximum over x ∈ [0, 1] of the rescaled width (j_max − j_min)/(N − 1) = x^p − x^{1/p}.

C. Square filter

The square filter A is shown in Figure 8a and is parameterized by a number Q representing the number of blocks the matrix A is divided into. The size of each block is N/Q × N/Q. The Q square blocks are arranged along the main diagonal. Mathematically, the entries A_ij of A are defined to be 1 if i and j belong to the same diagonal block, and 0 otherwise. The density is easy to calculate and evaluates to ρ(Q) = 1/Q.

D. Triangle filter

The triangle filter is shown in Figure 8b and is parameterized by a number Q representing the number of equal-sized triangular blocks arranged along the main diagonal of the matrix.
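The square filter described above has a one-line construction (a sketch of ours, assuming N is divisible by Q so the blocks are equal sized): a Kronecker product of the Q×Q identity with an all-ones block, whose density is Q (N/Q)^2 / N^2 = 1/Q.

```python
# Block-diagonal (square) filter: Q blocks of size (N/Q) x (N/Q).
import numpy as np

def square_filter(N, Q):
    assert N % Q == 0, "N must be divisible by Q for equal-sized blocks"
    block = np.ones((N // Q, N // Q))
    return np.kron(np.eye(Q), block)   # A_ij = 1 iff i, j in the same block

A = square_filter(12, 3)
print(A.mean())  # 1/3: density rho(Q) = 1/Q
```

This is the filter used for the block-diagonal clustering patterns discussed in the main text.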