Kemeny Constant-Based Optimization of Network Clustering Using Graph Neural Networks

The recent trend in using network and graph structures to represent a variety of different data types has renewed interest in the graph partitioning (GP) problem. This interest stems from the need for general methods that can both efficiently identify network communities and reduce the dimensionality of large graphs while satisfying various application-specific criteria. Traditional clustering algorithms often struggle to capture the complex relationships within graphs and generalize to arbitrary clustering criteria. The emergence of graph neural networks (GNNs) as a powerful framework for learning representations of graph data provides new approaches to solving the problem. Previous work has shown GNNs to be capable of proposing partitionings using a variety of criteria. However, these approaches have not yet been extended to Markov chains or kinetic networks, which arise frequently in the study of molecular systems and are of particular interest to the biomolecular modeling community. In this work, we propose several GNN-based architectures to tackle the GP problem for Markov chains described as kinetic networks. This approach aims to maximize the Kemeny constant, a variational quantity that represents the sum of the time scales of the system. We propose using an encoder-decoder architecture and show how simple GraphSAGE-based GNNs with linear layers can outperform much larger and more expressive attention-based models in this context. As a proof of concept, we first demonstrate the method's ability to cluster randomly connected graphs. We also use a linear chain architecture corresponding to a 1D free energy profile as our kinetic network. Subsequently, we demonstrate the effectiveness of our method through experiments on a data set derived from molecular dynamics. We compare the performance of our method to other partitioning techniques, such as PCCA+.
We explore the importance of feature and hyperparameter selection and propose a general strategy for large-scale parallel training of GNNs for discovering optimal graph partitionings.


INTRODUCTION
Markov chains (MCs), and their extensions, have become a key tool across multiple domains to model dynamical systems [1,2]. In their simplest form, MCs incorporate sequential data as a Markovian process into a stochastic model, where the probability of observing the next event depends solely on the preceding event. By eliminating the long timescale dependencies that often occur in complex systems, MCs offer a robust and formal method to interpret high-dimensional data sets in the emergent era of big data. The simplicity, transferability, and capability of MCs reduce the difficulties associated with modelling and interpreting ordered dynamics, and have led to a wide range of use cases. These applications encompass modeling character sequences in computer science and genomics [3,4], predicting web-user activity [5], and modelling time-dependent processes in physics and chemistry [6,7], to name just a few.
An application that has widely adopted these techniques is biomolecular simulation, where the Markov state model (MSM) formalism has emerged as a strategy for interpreting the results of molecular dynamics (MD) simulations as transitions between aggregated conformational states [8][9][10]. The surge in popularity of MSMs mirrors the increased ability to quickly generate large quantities of MD data, providing a useful tool to condense massive data sets into a small and intuitive probabilistic format. MSMs offer intuitive descriptions of complex dynamic systems as transitions between metastable minima and saddle points serving as transition states [11], helping to elucidate the underlying functional kinetics of interest [12,13]. Over the past few decades, several domain-specific advances led to the development of techniques to automatically generate MSM representations of molecular systems. Simultaneously, these techniques have been employed to further sample the multitude of different configurations of biomolecular systems [14,15]. Despite being smaller than the huge data sets from which they originate, many MSMs remain high-dimensional, potentially consisting of hundreds of states. This large number of states is often unmanageable for downstream tasks such as drug design and biomolecular engineering [16]. As size often becomes the limiting factor to establishing an intuitive understanding of systems, many different approaches have been proposed to reduce the size of MCs and generate optimal ones [17][18][19][20]. However, the ability to generate large data sets is only increasing, as modern coarse-graining protocols [21] and new developments in machine learning [22] allow for larger time scales to be probed. Traditional graph partitioning (GP) algorithms, such as spectral clustering, k-means, or multilevel methods, and their derivatives, have been widely employed to address this problem [23][24][25]. However, these algorithms often face limitations in capturing the complex and intricate dependencies on global structure in MCs. Additionally, the generality of these methods means that they often disregard the critical dynamic information held within MCs. Alternative Markov-chain Monte Carlo (MCMC) approaches have been applied for dimensionality reduction on MCs, which can be used with most clustering criteria [26][27][28]. Nevertheless, these approaches encounter difficulties in efficiently sampling the most relevant regions of partitioning space, resulting in impractical computation times and frequently overlooking the optimal solutions.
In this paper, we introduce machine learning (ML) models based on graph neural networks (GNN) to optimize dynamical network clusterings. We build upon previous work [29] for reducing the dimensionality of MCs, and propose ML architectures leveraging recent developments in using GNNs to solve the more general GP problem [30][31][32]. Although several existing approaches use GNNs to minimize partitioning cuts [33] or modularity [34], these applications do not extend to kinetic graphs or optimization based on dynamical criteria. Additionally, these approaches mostly focus on providing approximate partitionings for large graphs rather than exploring their potential for identifying optimal partitionings in smaller networks. We extend these methods by representing MCs as kinetic networks and show how performing gradient descent on a large number of randomly initialized small GNNs can yield optimal network partitionings using elaborate optimization criteria. As a metric to quantitatively assess the degree to which a clustering preserves the underlying dynamics of the MSM, we use the change of Kemeny constant (KC). KC is an inherently dynamical quantity derived from mean first passage times (MFPTs) associated with any kinetic network, including those used within biomolecular simulations [35]. We also compare our clustering results with a commonly used kinetic network clustering method for MSMs, the robust Perron-cluster analysis (PCCA+), which does not use the KC as the optimization criterion [36].
The proposed MC clustering method can produce a faithful low-dimensional replication of a system's underlying dynamics, while also providing a quantifiable measurement related to the amount of kinetic information lost. Along with outlining a general strategy for the large-scale training of GNNs for graph partitioning, we show our method's performance when minimizing the change in KC on several different kinds of synthetic graphs and an example derived from MD simulations. We also test more complex GNN architectures, showing how large-scale training of simple GNNs is the most effective for the task and hypothesizing why this is the case.

Graphs
A graph G with N nodes can be described as a set of vertices V = {v_1, ..., v_N} and edges e_ij ∈ E, which are often represented as a binary adjacency matrix, A, where:

A_ij = 1 if e_ij ∈ E, and A_ij = 0 otherwise. (1)

Each node, v_i, can be associated with a d_feat-dimensional feature vector, F_i, similarly admitting a matrix representation as an (N × d_feat) matrix, F. These features can be derived from the underlying data or generated at run time in the case of a featureless graph. This makes it possible to define a general graph structure as:

G = (A, F). (2)

Therefore, in practice, a complete specification of the graph structure can be achieved solely using the matrices A and F.
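As a minimal illustration (our own sketch, not the paper's code), the (A, F) specification of a small undirected graph can be set up as follows:

```python
import numpy as np

# A is the binary adjacency matrix of Eq. (1); F is the (N x d_feat) feature matrix.
N, d_feat = 4, 2
edges = [(0, 1), (1, 2), (2, 3)]          # e_ij pairs for a 4-node path graph
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0               # undirected graph: symmetric adjacency
F = np.ones((N, d_feat))                  # placeholder features (featureless graph)
```

Any graph used below reduces to such a pair of matrices.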
In what follows, the terms clustering and partitioning are used interchangeably to refer to the GP problem. The general partitioning problem consists of dividing the nodes of G into M discrete non-overlapping subsets, L ∈ {1, ..., M}, such that specific criteria are met. These objectives can be general and applicable to nearly all graphs, including, e.g., optimizing connectivity, minimizing inter-cluster edges, or maximizing intra-cluster similarities. Furthermore, these objectives can also be domain-specific, based on the underlying model the network represents, such as how the KC represents the dynamical information of the system. The (N × M) partitioning assignment matrix, S, assigns the nodes of G into M distinct states, such that:

S_iL = 1 if node v_i belongs to cluster L, and S_iL = 0 otherwise. (3)

Since the discrete nature of the assignment matrix poses difficulties for gradient-based methods, it is common to relax this constraint and work with continuous probabilities instead. In practice, this means introducing a soft assignment matrix, where S_iL now denotes the probability of node v_i belonging to cluster L, subject to the conservation of probability:

Σ_L S_iL = 1,

where L runs over all M possible clusters. Some of the most commonly used clustering criteria for the GP problem are the modularity, minimum cut, and Davies-Bouldin index (DBI), which are defined as follows:

Modularity: The modularity, Q, measures how the connectivity of a graph partitioning differs from what would be expected in a purely random graph with the same node degree distribution. Consequently, this metric provides a quantitative assessment of how statistically surprising a proposed partitioning is, with higher values implying the presence of underlying communities within the data [37]. The modularity is calculated using the equation:

Q = (1/2m) Σ_{i,j} [A_ij − d_i d_j / (2m)] Σ_L S_iL S_jL,

where m is the total number of edges in the graph, d_i and d_j are the degrees of nodes v_i and v_j, respectively, A_ij is given by Eq. (1), and S_iL and S_jL are given by Eq. (3). Networks characterized by high modularity exhibit dense intra-cluster node connections, but sparse inter-cluster node connections. While there have been criticisms of the underlying assumptions for modularity, it is a widely used and useful partitioning criterion in many areas [38][39][40].
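For concreteness, the modularity of a (possibly soft) assignment matrix can be evaluated in a few lines (our own illustration, not the paper's implementation):

```python
import numpy as np

def modularity(A, S):
    """Q = (1/2m) * sum_ij [A_ij - d_i d_j / (2m)] * sum_L S_iL S_jL."""
    d = A.sum(axis=1)                 # node degrees d_i
    m = A.sum() / 2.0                 # total number of edges
    B = A - np.outer(d, d) / (2 * m)  # modularity matrix
    return np.trace(S.T @ B @ S) / (2 * m)

# Two disconnected triangles split into their natural communities:
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
S = np.zeros((6, 2))
S[:3, 0] = S[3:, 1] = 1.0
```

For this example the partition recovers the two triangles exactly and yields Q = 0.5.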
Minimum Cut: The minimum cut, C, of a given partitioning is the number of inter-cluster edges, which can be calculated using:

C = (1/2) Σ_{i,j} A_ij (1 − Σ_L S_iL S_jL),

where all indices and matrix values are given as previously defined. The simplicity of this criterion has led to its popularity across multiple domains, spurred on by its proven equivalence to the max-flow criterion [41]. It stands as one of the most studied partitioning criteria [42][43][44], alongside its further generalisations commonly used to obtain more balanced clusters with minimal cuts [45,46]. Intuitively, the minimum cut partitioning of a network corresponds to the partitioning with the fewest inter-cluster edges.
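A direct translation of the cut count for a crisp assignment matrix (the explicit matrix form here is our reading of the criterion):

```python
import numpy as np

def cut_size(A, S):
    """Inter-cluster edge count: (1/2) sum_ij A_ij (1 - sum_L S_iL S_jL)."""
    same = S @ S.T                    # (S S^T)_ij = 1 iff i and j share a cluster
    return 0.5 * float(np.sum(A * (1.0 - same)))

# Path graph 0-1-2-3 cut between {0, 1} and {2, 3} crosses exactly one edge:
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
S = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
```

The factor 1/2 corrects for each undirected edge appearing twice in the symmetric adjacency matrix.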
Davies-Bouldin index: The DBI quantifies the average similarity between clusters [47], calculated based on the distance between clusters relative to the size of the clusters themselves, such that the DBI formula reads:

DBI = (1/M) Σ_{K=1}^{M} max_{L≠K} (σ_K + σ_L) / Δ_KL,

where K and L run over all M clusters, Δ_KL denotes the distance between cluster centroids, i.e. Δ_KL = ∥B_K − B_L∥, and σ_L represents the cluster diameter, calculated as the average distance between the feature vectors and the centroid of the respective cluster:

σ_L = (1/|L|) Σ_{i∈L} ∥F_i − B_L∥,

where F_i represents the feature vector of node i, |L| is the number of nodes in cluster L, and the centroid B_L is computed by averaging the feature vectors of all nodes within cluster L. A lower DBI value (bounded from below by 0) indicates better partitioning, as it suggests higher inter-cluster dissimilarity in comparison to intra-cluster similarity. We employed the DBI score as a metric for assessing clustering performance.
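The DBI definition above can be sketched directly (illustrative only; the paper uses the scikit-learn implementation):

```python
import numpy as np

def davies_bouldin(F, labels):
    """DBI = (1/M) sum_K max_{L != K} (sigma_K + sigma_L) / ||B_K - B_L||."""
    ids = np.unique(labels)
    B = np.array([F[labels == L].mean(axis=0) for L in ids])            # centroids
    sigma = np.array([np.linalg.norm(F[labels == L] - B[k], axis=1).mean()
                      for k, L in enumerate(ids)])                      # diameters
    M = len(ids)
    return np.mean([max((sigma[K] + sigma[L]) / np.linalg.norm(B[K] - B[L])
                        for L in range(M) if L != K) for K in range(M)])

# Two tight, well-separated clusters give a small DBI:
F = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 0.0], [10.0, 0.1]])
labels = np.array([0, 0, 1, 1])
```

Here each cluster has diameter 0.05 and the centroids sit 10 apart, so the DBI is 0.01.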

Markov Chains
The dynamical clustering criterion based on KC we use in this paper is derived from the kinetic properties of an MC. In previous work, we explored the relationship between the MFPTs and KC, providing thorough derivations of all the quantities presented below. Here we briefly summarise the key results required and direct the reader to our previous paper for a detailed reference [48].
In a discrete-time, homogeneous MC comprising n states, the probability vector of a random walk occupying any of the n states at time t, p(t), can be expressed using a stochastic rate matrix, K, and a set of initial occupation probabilities, p(0), such that:

p(t) = e^{Kt} p(0) = Q(τ)^l p(0),

where t = lτ, since in the discrete case transitions between states occur at integer multiples of a given time interval τ. The quantity Q(τ) = e^{Kτ} is known as the Markov matrix and is commonly used as the fundamental representation of a Markov chain. When using this representation, time propagation corresponds simply to repeatedly multiplying Q by the current state of the system. To ensure the conservation of probability, we have:

Σ_i K_ij = 0, or equivalently Σ_i Q_ij(τ) = 1, for all j.

The matrix K, or similarly Q, can be calculated from experimental or simulated transition data, such as MD trajectories, or a series of string-like sequences typically used in genomics. K fully defines the dynamic properties of the MC and can be used to derive several useful quantities relating to its underlying dynamics. Assuming detailed balance, the spectral decomposition of K into right, ψ^(l), and left, ϕ^(l), eigenvectors and corresponding eigenvalues, λ_l, is given by:

K = Σ_l λ_l ψ^(l) ϕ^(l)T,

where the index l orders the eigenvalues in descending order. These eigenvalues are upper-bounded by 0 and are guaranteed to be real due to the detailed balance assumption. The right eigenvector corresponding to the largest eigenvalue (λ_1 = 0), ψ^(1), is known as the equilibrium probability and is commonly denoted by p_eq, corresponding to the distribution the system converges to as t → ∞. The other eigenvectors contain useful kinetic information corresponding to the relaxation and mixing times of the MC. These eigenvectors can be used to derive an expression for the average time it takes for a random walk starting from state i to first arrive at state j, i.e., the MFPTs of an MC, denoted t_ji:

t_ji = (1/p^eq_j) Σ_{l=2}^{n} (1/λ_l) ψ^(l)_j (ϕ^(l)_i − ϕ^(l)_j),

which can alternatively be expressed in matrix form. KC is the surprising result that the sum over all states of the product of the MFPTs with the equilibrium probabilities, p^eq_j, is a constant value for all i, i.e.:

ζ = Σ_j p^eq_j t_ji = Σ_{l=2}^{n} 1/|λ_l|,

where the facts that Σ_j ϕ^(l)_j ψ^(l)_j = 1 for all l and Σ_j ψ^(l)_j = δ_l1 were used [48]. Intuitively, therefore, the Kemeny constant is an important quantity to measure the overall timescales present in a dynamical system, as it is equal to the sum of all relaxation times, 1/|λ_l|.
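The constancy of ζ can be checked numerically on a toy three-state chain. The sketch below uses the row convention (rows of the rate matrix sum to zero, whereas the paper's K acts on column vectors) and the standard deviation-matrix route to the MFPTs, which is a textbook identity rather than anything taken from the paper:

```python
import numpy as np

# A reversible 3-state chain with eigenvalues 0, -1, -2.
K = np.array([[-1.0, 1.0, 0.0],
              [0.5, -1.0, 0.5],
              [0.0, 1.0, -1.0]])

# Equilibrium distribution: null left eigenvector of K, normalised to sum to 1.
w, V = np.linalg.eig(K.T)
p_eq = np.real(V[:, np.argmin(np.abs(w))])
p_eq /= p_eq.sum()

# Deviation matrix D gives the MFPTs: t(i -> j) = (D_jj - D_ij) / p_eq_j.
Pi = np.ones((3, 1)) @ p_eq[None, :]
D = np.linalg.inv(Pi - K) - Pi
T = (np.diag(D)[None, :] - D) / p_eq[None, :]     # T[i, j] = MFPT from i to j

zeta = (T * p_eq[None, :]).sum(axis=1)            # one value per start state i
lam = np.linalg.eigvals(K)
zeta_spectral = sum(1.0 / abs(l) for l in lam if abs(l) > 1e-10)
```

For this chain every entry of `zeta` equals 1.5, matching the spectral sum 1/1 + 1/2 of the relaxation times.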
By considering a clustering with the assignment matrix, S, from Eq. (3), we can define the equilibrium probability of cluster L, P^eq_L, as simply the sum of the populations of the individual nodes:

P^eq_L = Σ_{i∈L} p^eq_i.

KC can then be redefined in terms of a partitioned MC's KC, ζ_S = Σ_L P^eq_L t̄_LK, where t̄_LK is the coarse-grained mean first passage time from cluster K to cluster L. We can define t̄_LK via [48]:

t̄_LK = 1/(P^eq_L P^eq_K) Σ_{i∈K, j∈L} p^eq_j p^eq_i t_ji − 1/(P^eq_L)² Σ_{i,j∈L} p^eq_j p^eq_i t_ji,

which leads to a variational bound, ζ_S ≤ ζ, compared with the unclustered KC [29]. Rearranging this relation yields our partitioning metric ∆K:

∆K = ζ − ζ_S. (17)

From this equation, it is clear that for the trivial single-state partitioning M = 1, ∆K = ζ holds, and when M = N, the value of ∆K is 0. As KC represents the sum of relaxation times in a system, and a clustering speeds up relaxation times by removing intra-cluster relaxation times, maximizing the KC of a proposed partitioning aims to preserve the original timescales as closely as possible. From Eq. (17) (with the coarse-grained MFPT expression above requiring a crisp clustering), we can see that maximizing the clustered KC is identical to minimizing ∆K, and so this provides a useful metric of how kinetically disconnected the clusters are, in a form suitable for gradient-based optimization methods like neural networks.
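These definitions can be exercised on a small example. The sketch below is our reading of the coarse-grained MFPT expression; the equilibrium distribution and MFPT matrix are those of a simple three-state birth-death chain, worked out analytically beforehand (t[i, j] denotes the MFPT from i to j):

```python
import numpy as np

p = np.array([0.25, 0.5, 0.25])                 # equilibrium populations
T = np.array([[0.0, 1.0, 4.0],
              [3.0, 0.0, 3.0],
              [4.0, 1.0, 0.0]])                 # MFPTs, T[i, j] = time i -> j
zeta = (T * p[None, :]).sum(axis=1)[0]          # unclustered Kemeny constant

def delta_K(clusters):
    """Delta-K = zeta - zeta_S for a crisp clustering given as index lists."""
    P = np.array([p[c].sum() for c in clusters])            # cluster populations
    def tbar(L, K):
        first = sum(p[j] * p[i] * T[i, j] for i in clusters[K] for j in clusters[L])
        second = sum(p[j] * p[i] * T[i, j] for i in clusters[L] for j in clusters[L])
        return first / (P[L] * P[K]) - second / P[L] ** 2
    # zeta_S is independent of the source cluster; use cluster 0 as the source.
    zeta_S = sum(P[L] * tbar(L, 0) for L in range(len(clusters)))
    return zeta - zeta_S
```

The limiting cases behave as stated in the text: the single-cluster partition returns ∆K = ζ, the all-singletons partition returns ∆K = 0, and any intermediate partition lies in between.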

Graph Neural Networks
To describe GNNs in this work we adopt a message-passing formalism [49] where a GNN-based encoder maps a node v_i in G to a context-aware embedding using T separate layers via the iterative process:

m^{t+1}_i = Σ_{j∈N(v_i)} M_t(h^t_i, h^t_j),    h^{t+1}_i = U_t(h^t_i, m^{t+1}_i),

where t ∈ {1, ..., T} denotes the current network depth, N(v_i) is the local neighbourhood of all nodes connected to node v_i by edges e_ij, and h^t_i represents the hidden representation of node v_i at depth t. For consistency, the initial node feature vectors F_i are represented by h^0_i. The functions M_t and U_t are commonly implemented as neural networks, known as the message-passing function and vertex update function, respectively. Their exact formula depends on the type of graph layer used. Two different graph layers are tested in our methodology: the GraphSAGE [50] and the GATv2 [51] layers. The former was chosen to provide a simpler, lower-parameter function, which takes fewer resources to train, while the latter provides a more complex layer with the capability to better attune to all nodes. GraphSAGE layers use a simple message-passing function consisting of either the sum or mean, and a vertex update function which concatenates these expressions and feeds them into a feed-forward neural network:

h^{t+1}_i = σ(W_t [h^t_i || AGG_{j∈N(v_i)}(h^t_j)]),

where [•||•] denotes concatenation, AGG is the sum or mean aggregator, and σ is any non-linear function.
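A single GraphSAGE update with mean aggregation fits in a few lines of numpy (an illustrative sketch; the paper's implementation uses PyTorch layers):

```python
import numpy as np

def graphsage_layer(A, H, W):
    """One GraphSAGE update: h'_i = ReLU(W [h_i || mean_{j in N(i)} h_j])."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    agg = (A @ H) / deg                               # mean over neighbours
    return np.maximum(np.concatenate([H, agg], axis=1) @ W.T, 0.0)

# Path graph 0-1-2 with one-hot node features:
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H_out = graphsage_layer(A, np.eye(3), np.ones((2, 6)))
```

With all-ones weights each input row and its neighbourhood mean both sum to 1, so every output entry is 2.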
GATv2 layers replace this with a formulation based on the scaled dot-product attention seen in transformer layers. This differs by performing the aggregation step individually for each node in a neighbourhood to calculate a context-aware attention mask, which is then used to perform a weighted vertex update:

e_ij = a^T LeakyReLU(W_t [h^t_i || h^t_j]),    α_ij = softmax_{j∈N(v_i)}(e_ij),    h^{t+1}_i = σ(Σ_{j∈N(v_i)} α_ij W_t h^t_j),

which can be generalised across multiple attention heads.
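The attention mechanism can be sketched for a single head as follows (our own illustration; the value transform is omitted for brevity and isolated nodes are assumed absent):

```python
import numpy as np

def gatv2_layer(A, H, W, a, slope=0.2):
    """GATv2-style scores e_ij = a^T LeakyReLU(W [h_i || h_j]),
    softmax over each neighbourhood, then a weighted vertex update."""
    N = H.shape[0]
    E = np.full((N, N), -np.inf)                      # mask non-edges
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                z = W @ np.concatenate([H[i], H[j]])
                E[i, j] = a @ np.where(z > 0, z, slope * z)   # LeakyReLU
    alpha = np.exp(E - E.max(axis=1, keepdims=True))  # masked softmax
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ H                                  # weighted update

# With zero weights every neighbour gets equal attention (a mean over N(i)):
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H_out = gatv2_layer(A, np.eye(3), np.zeros((2, 6)), np.zeros(2))
```

The key difference from GraphSAGE is that the aggregation weights α_ij are computed per edge from the node pair itself, rather than being fixed by the graph structure.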

METHODOLOGY
The proposed method assumes that a given neural network architecture, Φ(θ), is sufficiently expressive such that the mappings it explores, from the space of d_feat-dimensional node features, F ∈ R^{N×d_feat}, to the space of possible M-state partitionings, S ∈ R^{N×M}, include the optimal solution, S_opt, to the GP problem given our clustering criteria, along with a region of parameter space which can be optimized via gradient descent to yield it. Taking inspiration from Monte Carlo approaches to similar problems, we look to determine S_opt by initializing Φ with several different sets of parameters θ and optimizing them in parallel using modern ML tools. In this framework, GNNs are an effective Ansatz to begin partitioning graphs using complex criteria. They also provide a generic and intuitive optimization framework with gradient descent. As Φ is a general model, operating in R without any problem-specific restrictions on its values or gradients, it can yield the trivial solutions to Eq. (17), such as the aforementioned M = N case.
To overcome this, we take inspiration from similar work on producing balanced clusters [33,34], and add a penalty function to ∆K, which drives the network away from trivial solutions and gives the final form of the loss function, L:

L = ∆K + η Σ_L exp(−n_L² / (2σ²)),

where n_L = Σ_i S_iL is the (soft) number of nodes assigned to cluster L. Optimal KC partitionings can have unbalanced subsets, with clusters that contain only a few nodes often being the most interesting transition states [11]. Therefore, the form of the penalty function is designed to penalise any invalid clustering (where any cluster has zero nodes), while also allowing the network to converge to significantly unbalanced partitions with at least one node. The penalty function is based on a Gaussian centered at 0, parameterized in this work using σ = 1.0 and η = 15. This means the penalty increases as any cluster approaches 0 nodes and becomes negligible for clusters with 1 or more nodes.
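The text fixes only the parameters (σ = 1.0, η = 15); one plausible per-cluster form consistent with the description is a Gaussian of the soft cluster size, decaying as the occupancy grows (the exact expression is our guess, not taken verbatim from the paper):

```python
import numpy as np

def cluster_penalty(n_L, eta=15.0, sigma=1.0):
    """Gaussian penalty centred at zero cluster size: eta * exp(-n^2 / (2 sigma^2))."""
    return eta * np.exp(-n_L**2 / (2.0 * sigma**2))

# Penalty is maximal for an empty cluster and decays with occupancy:
p_empty = cluster_penalty(0.0)
p_one = cluster_penalty(1.0)
p_three = cluster_penalty(3.0)
```

Summed over clusters and added to ∆K, such a term pushes gradients away from assignments that empty out a cluster while leaving well-populated clusters essentially unpenalised.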
As MCs do not fit directly into the graph structure outlined in Eq. (2), simple methods were derived for interpreting node connectivity using an adjacency matrix, A, and node feature matrix, F. The connectivity of the graph was decided using a cutoff method, taking all node pairs (i, j) for which Q_ij(τ) > c, where c is a constant. In practice, we found c = 0.015 to be sufficient for all graphs we worked with, although in other applications this could be changed depending on the density of the resulting network.
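The cutoff construction can be sketched as follows (symmetrisation and the removal of self-loops are our assumptions, not stated explicitly in the text):

```python
import numpy as np

def adjacency_from_markov(Q, c=0.015):
    """Binary adjacency from a Markov matrix: an edge wherever either
    transition probability between a node pair exceeds the cutoff c."""
    A = ((Q > c) | (Q.T > c)).astype(float)   # symmetrise the connectivity
    np.fill_diagonal(A, 0.0)                  # drop self-loops
    return A

Q = np.array([[0.90, 0.10],
              [0.01, 0.99]])
A = adjacency_from_markov(Q)
```

Here the 0 → 1 probability (0.10) exceeds c, so the pair is connected even though the reverse probability (0.01) falls below the cutoff.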
A systematic exploration of various node features was undertaken. We tested using the columns of various matrices which can be identified with nodes, such as the adjacency matrix, the rate matrix, the Markov matrix, and the MFPT matrix, as node features. For each node, we represented its connectivity, transition rates/probabilities to other nodes, or average time required to reach each node, by splitting these matrices into vectors. Alongside these, principal component analysis (PCA), implemented as the singular value decomposition (SVD), of all previously mentioned matrices, and the eigenvectors of the rate or Markov matrix were also tested. Finally, we also tested both a simple linear layer, which combined the adjacency, rate/Markov, and MFPT columns for each node, and a trainable embedding layer, similar to those often used as dictionaries in large language models [52]. More details regarding the features used are given in SI(A).
All neural network architectures tested used a GNN-based encoder, mapping the d_feat-dimensional features into an embedding dimension d_embed via a hidden representation of size d_hidden. Mean pooling was used for the aggregation function in the GraphSAGE layers as it provided better stability during training. The encoder was then followed by either a series of linear layers or dot-product attention-based transformer layers of dimension d_hidden as a decoder [53]. When using attention, the output of each transformer layer was normalised and added back to the input before being processed by a feed-forward layer. Our implementation used two transformer layers with a dropout layer between them. The output of a single trainable graph layer was used as positional encoding before the decoder. This transformer-like, attention-based decoder can process a whole clustering at once as a sequence of node embeddings. It was chosen to provide a comparison between a simpler model and one that is larger, more expressive, and better suited to handle highly contextual sequences, such as the node-cluster assignment matrix. All neural networks used are described in detail in Fig. 1. Building on the simplest model with GraphSAGE layers and the linear decoder (GraphSAGE + LD), we validated this model and extended it in two directions to test whether more complex architectures could improve performance. Besides the GraphSAGE encoder, we also implemented the GATv2 encoder. In addition to the linear decoder, we also implemented a transformer decoder (TD). This results in four models: (i) GraphSAGE layers with the linear decoder (GraphSAGE + LD), (ii) GATv2 layers with the linear decoder (GATv2 + LD), (iii) GraphSAGE layers with the transformer decoder (GraphSAGE + TD), and finally (iv) GATv2 layers with the transformer decoder (GATv2 + TD).
To enable fast and efficient training of the networks, a general parallel methodology for training GNNs for partitioning was established, outlined in Fig. 2. This requires three additional functions to be defined, as well as a valid loss function and the number of processes P:

Generator Function I: This function returns valid parameters θ for a given neural network Φ. This provides the initial starting point in parameter space for optimization, and complex generators can be used to provide quicker convergence. In this work, however, simple initialization using both the Kaiming and Glorot algorithms was tested [54,55]. We used the Glorot method due to its slight advantage observed in early tests. Additionally, we also explored bespoke initializations based on converged models, where a loss function was used that corresponded to the PCCA+ clustering.

Culling Function C: This function determines whether or not a worker thread has finished training a network, and is called after a predetermined number of optimization steps has finished. Observing individual runs, it is common to see neural networks which initialize in local minima of parameter space and are unable to be trained. An effective culling function allows for these types of runs to be terminated early, allowing for more time to be spent in interesting regions of parameter space. To show the validity of the method, all experiments in this paper terminate after a predetermined number of training steps.

Termination Function T: This function determines when to stop training, and when combined with effective choices for I and C, can allow for the implementation of complex methodologies for determining the convergence of an experiment.

This methodology aims to be as general as possible to make implementing future applications for different criteria straightforward. To test the validity and effectiveness of using GNNs to cluster Markov chains, the method was first tested on synthetic kinetic graphs. Firstly, a simple stochastic
block model (SBM) [56] was employed to generate random networks with 4 neighbourhoods. The initial set of networks used sampled inter-cluster connection probabilities from a Gaussian distribution scaled to be in the range [0, 0.25] and set intra-cluster probabilities to a constant 0.5. This led to networks with dense clusters, where the optimal partitioning always successfully separated them. We also designed SBM networks where the optimal clustering was less clear, sampling all connection probabilities from the same Gaussian as before. For both sets of examples, SBM networks with disconnected nodes were discarded, and the random-walk normalized Laplacian matrix, L, was used to generate rate matrices via:

K = −(I − D⁻¹A),

where I is the identity matrix, D is a matrix with the degrees of each node along the leading diagonal, and A is the adjacency matrix given by Eq. (1). Results on these synthetic graphs were compared with clusterings and respective ∆K values obtained from the PCCA+ clustering. We note that as PCCA+ uses a different kinetics-based objective function [29] that reduces the clustering problem to M² dimensions, the optimization is readily solvable using Schur decomposition, and therefore only a single optimal clustering is obtained using PCCA+. Each SBM experiment trained 100 independently initialized models for 250 steps of gradient descent on 500 randomly generated graphs, ranging from N = 150 to N = 300 nodes in increments of 50.
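The rate-matrix construction from the random-walk normalised Laplacian can be sketched as follows (our reconstruction of the omitted equation, written in the row convention so that rows sum to zero):

```python
import numpy as np

def rate_from_adjacency(A):
    """K = -(I - D^{-1} A): off-diagonal rates D^{-1}A, unit escape rate per node."""
    D_inv = np.diag(1.0 / A.sum(axis=1))
    return D_inv @ A - np.eye(len(A))

# A 3-node star: node 0 connected to nodes 1 and 2.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
K = rate_from_adjacency(A)
```

Each row of K sums to zero, so probability is conserved under propagation by e^{Kt}.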
To establish a baseline performance for all models on a single graph, the network was tested on a graph corresponding to a rate matrix derived from an analytical 1D potential simulation [29]. The system consisted of 100 nodes distributed along the potential, which comprised four distinct wells: three with 20 nodes, and one with 40 nodes. The optimal KC partitioning successfully separated all 4 wells into individual partitions. This example of a linear chain network architecture was chosen to test the network in situations where connectivity is limited. All four models were tested 1000 times on this example for 500 gradient descent steps each. A feature comparison was also performed on this system, where 1000 GraphSAGE + LD models were trained for 500 steps using each set of features described above.
Ultimately, the method was tested on a network derived from MD data. We used the MSM benchmark system of PyEMMA [57], corresponding to 500 ns long implicit-water MD simulations of a small five-amino-acid system, here referred to as the pentapeptide system. The simulation initially underwent clustering into 250 states using K-means on tICA-projected data, followed by the calculation of a Markov matrix [58,59], which served as the input to our method. Each experiment consisted of training 1000 models for 500 gradient descent steps to cluster the 250-state MSM into 5 states. Given the relevance of this example to biomolecular simulations, an additional hyperparameter search was conducted to identify the optimal network architecture. For this purpose, we again trained 1000 GraphSAGE + LD models for 500 steps while varying the number of layers in both the decoder and the encoder, along with the dimensions d_hidden and d_embed. Furthermore, we performed a feature comparison following the same methodology as previously described. We also compared the results of this example to the PCCA+ and parallel tempering variational clustering (PTVC), our previous MCMC approach [29]. Additionally, we used the DBI score with the Markov matrix as the node feature vectors as a metric to assess the quality of the partitionings produced.
The neural networks and relevant code were implemented in Python 3.10.6 using PyTorch 2.0 [60] or TensorFlow [61]. Network diagrams were generated using the Netgraph library [62]. To perform gradient descent steps, the Adam optimizer [63] was employed with a learning rate of 1 × 10⁻⁴. All networks utilized d_embed = 32 and d_hidden = 64, with 4 GNN layers in the encoder and 2 layers in the decoder, unless otherwise specified. Due to the numerical sensitivity and dependence of the MFPT calculation on accurate eigenvalues, 64-bit floats were used for all calculations, which were run on the CPU. Each experiment involved the simultaneous execution of 90 processes in parallel, with each process running on a single core. To compute the DBI scores, the scikit-learn implementation was utilized [64].
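The parallel protocol described in this section (generator I, culling C, termination T) can be sketched schematically with a toy one-dimensional loss standing in for the GNN; the names and signatures below are ours and purely illustrative:

```python
import random

def train_population(I, C, T, loss, grad, lr=0.1, n_workers=8, n_steps=50):
    """Repeatedly launch independently initialised runs (in parallel in
    practice), cull hopeless ones early, and stop once T accepts the best."""
    best = (float("inf"), None)
    while not T(best):
        for _ in range(n_workers):
            x = I()                             # generator: random starting point
            for step in range(n_steps):
                x -= lr * grad(x)               # one gradient-descent step
                if C(loss(x), step):            # culling: abandon stuck runs
                    break
            best = min(best, (loss(x), x))
    return best

random.seed(0)
best_loss, best_x = train_population(
    I=lambda: random.uniform(-10, 10),
    C=lambda l, step: step > 10 and l > 50.0,   # cull runs still far from 0
    T=lambda best: best[0] < 1e-4,              # stop once a good optimum is found
    loss=lambda x: x * x,                       # toy stand-in for Delta-K + penalty
    grad=lambda x: 2 * x,
)
```

In the actual method each "worker" is a full GNN trained on the ∆K-based loss, but the control flow, and the division of responsibility between I, C, and T, is the same.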

RESULTS & DISCUSSION
Our GNN approach successfully found the optimal clustering for all SBM networks in the first set (e.g., Fig. 3a), where the clusters were well-defined, in 100 training runs or fewer. The optimal clustering matched the PCCA+ solution, confirming the viability of KC as the optimization criterion. Figs. 3b and 3c detail the performance on the second set of SBM graphs featuring a more diverse range of networks. Our implementation identified partitionings which better optimized ∆K than PCCA+ (Fig. S1). However, the improvement diminished as the number of nodes in the network decreased, as the clustering often became clearer. The stochastic nature of the GNN-based method, however, lends itself to these systems, where several viable partitionings have a low ∆K value. In these examples the optimal clustering is close to many other similar solutions, so it is helpful to sample many initializations in order to cover most of the very low ∆K partitionings. Violin plot distributions of the optimal KC values found are also shown in Fig. S1.

The 1D potential (Fig. 4a) was used to test our approach on sparse graphs where connectivity is limited to a linear chain, and the majority of effective ∆K partitionings lie in a small region of partitioning space. In this regime gradient descent steps can become less effective, as the majority of initial partitions are far from the region of interest and can be unable to escape their local minima. Fortunately, simply training more neural networks can alleviate the problem by providing a wider range of starting points for optimization. In future applications, an effective choice of the initialization function I could speed up convergence, whereby I can learn to avoid initializing in uninteresting regions of parameter space. To further test whether different node features could improve the frequency of successful optimization outcomes, we tested several additional node features, sampling the outcome 1000 times each. In total, we tested nine different node feature types: (i) the adjacency matrix, (ii) the PCA vectors created from the adjacency matrix, (iii) the rate matrix, (iv) the PCA vectors created from the rate matrix, (v) the MFPTs (default in other examples), (vi) the PCA vectors created from the MFPT matrix, (vii) the eigenvectors of the rate matrix, (viii) trainable embeddings, and (ix) a neural network.

The feature comparison results (Fig. 4b) show that it is possible to use improved node features, such as the adjacency matrix or a trainable embedding in this case, to achieve a much more successful optimization for the 1D model potential-derived network. When using the rate matrix as node features in this example, the GNN fails to find the correct clustering even once. This may be due to the difficulty in discriminating nodes at high energies (with low populations) based solely on the rate matrix, as these nodes have similar transitions. The adjacency matrix, however, provides more information on the node location in the global structure of the network, leading to a noticeable advantage. Our GNN implementation allows for simple changes like this to enable defining the node features on a problem-specific basis, therefore giving the method additional viability across a wide range of MCs which vary greatly in structure. We also tested the four GNN architectures on the 1D potential, training 1000 models of each variety using the adjacency matrix as node features. Results are shown in Fig. 5b and described in Table 1. We found that simpler models perform better at finding optimal partitionings than more complex models in this framework. Despite the better average performance the GATv2 and TD models provided, no models but the simplest GNN were able to find the optimal clustering over 1000 runs.

The wide variance of the observed results showcases the poor ability of all models to optimize initial partitionings on this network when started from many regions of parameter space; in the smaller models, however, this is mitigated by the smaller solution space explored and the shorter training time required. The large magnitude of some solutions is indicative of a partitioning problem with only a small region of interest in partitioning space, leading larger models like the GATv2 and TD models to struggle to find this region, possibly because the gradients become more complex and unclear when node embeddings are more co-dependent, and additional resources are required to train the extra parameters. There is a sharp cutoff in the performance of all larger models, with the best value attained being consistently around 806; the corresponding clustering is shown in Fig. 5a. We were unable to pinpoint the exact cause of this lower bound. We also note that in current MSM building strategies only metastable clusters are identified, unless discretized collective variables are introduced. Typically, a k-means clustering is performed that assigns very rarely sampled transition regions to metastable states. Therefore, the optimal clustering even from more complex GNNs (Fig. 5a) is able to separate the three main metastable states, in line with current practical MSM applications. This problem is particularly challenging for GP objective functions that are not derived from kinetics-based measures, and such measures, e.g. Modularity [29], produce qualitatively incorrect optimal clusterings. Accordingly, the non-optimal solution of the GNNs represented in Fig. 5a corresponds to a lower DBI value than that of Fig. 4a, which suggests that the DBI measure is also inadequate for this system (Fig. S2). The optimal ∆K clustering, however, is identical to the solution found by PCCA+ in this example and is consistently found by the PTVC method due to the small size of this network.

Table 1. Network architecture comparison for the 1D potential-derived network. The four network architectures are compared in terms of the mean (standard deviation) and minimum over 1000 experiments, along with the percentage of runs in which the optimal solution was found.

Examples derived from numerical simulations of biomolecular systems often result in noisy derived MCs with dense connections, as in the pentapeptide case. Here, the optimal PCCA+ clustering does not correspond to the optimal ∆K solution: four nodes are assigned to different clusters, highlighted in Fig. 6a, which contribute 7.4% of the overall equilibrium probability (Fig. S3). As shown in Fig. 7b and Table 2, the simplest GraphSAGE and linear decoder method again outperforms all other GNN architectures, similarly to the 1D example above. The PTVC method does not identify the optimal ∆K clustering in general when using a random initialization, but has been observed to find the correct solution when given the PCCA+ clustering as an initial clustering. Here, we also compared the optimal solutions from different methods in Fig. 7a by evaluating the DBI score. Interestingly, our optimal KC clustering also corresponds to the lowest DBI score, indicating a better clustering by this measure as well.
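The DBI comparisons above can be reproduced with scikit-learn's `davies_bouldin_score` (lower is better). Here a toy two-cluster example stands in for the node coordinates and cluster labels used in the text:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Davies-Bouldin index for a toy example with two well-separated
# clusters; in the text, node coordinates (e.g. MFPT-based layouts)
# and the cluster assignments from each method would be used instead.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(3.0, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(davies_bouldin_score(X, labels))  # small value: tight, distant clusters
```

Comparing this score across the clusterings produced by PCCA+, PTVC, and the GNN models gives the kind of ranking reported in Fig. 7a.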

As the pentapeptide system is relatively small, significant structural differences cannot be identified within the small microstates identified. Correctly identifying these microstates is an open problem, with significant implications for the dynamics of the MSM [65]. We expect more significant differences between states for larger networks that correspond to larger systems, going beyond prototype peptide simulations.

In this example, our method finds the optimal clustering less frequently than in the synthetic examples of SBM networks. This could be due to the increased number of low-∆K partitionings available, which provide many local minima that are more difficult to escape. However, this problem is not unique to the GNN method: the PTVC method encounters the same issue, having difficulty finding the relevant areas of partitioning space and failing to find the optimal clustering in this example when using random initialization. The dense and interconnected nature of the nodes also makes it difficult for PCCA+-based methods to identify the optimal clustering, as more optimal clusters may not differ much in terms of metastability. The larger models suffer from a vanishing gradient due to the heavily conditional nature of optimizing all node partition probabilities at the same time, resulting in an inability to find the optimal clustering even once, while nevertheless providing a better result on average than the GraphSAGE and LD model. The best clusterings found by the GraphSAGE and TD models are shown in Fig. S4 for reference. Still, they are outclassed by PCCA+, and all but the GraphSAGE with linear decoder model perform worse than PTVC, highlighting the efficiency of simpler networks for this kind of task.
While further analysis of learning-rate adjustments could provide an effective way to train larger and more complex models on this problem, it is clear that training simple networks is both quicker and more reliable in finding the optimal clustering. Feature comparisons on the GraphSAGE + LD example show that the method succeeds in finding the optimal partitioning over 1000 individual runs with all features tested except the eigenvectors (Fig. 6b). MFPTs show the best performance here, finding the optimal solution the greatest number of times. This is most likely due to the computational simplicity of finding effective mappings from the MFPTs to optimized partitionings, as the value of Eq. (23) directly depends upon them. The poor performance of the eigenvectors is most likely down to numerical stability, as some of them are very small 64-bit floats.
One advantage of the PTVC method, which uses an MCMC-based algorithm, over our method is the ability to be initialized from any arbitrary partitioning. The results shown are for PTVC given no initial clustering information; however, when the method is initialized from the PCCA+ solution, it consistently finds the optimal solution on all occasions. For the GNN models, we also changed our initialization function I by pretraining the models to replicate the PCCA+ solution using a mean-squared-error loss that is minimal for the optimal PCCA+ solution. Once the GNN is trained with this loss function and identifies the PCCA+ clustering, it is then further trained using the Kemeny loss of Eq. (23). This pretraining, however, did not improve the performance of the GNN optimization (data not shown). Fig. 8 shows the results of a hyperparameter search using the best-performing simple GNN model to identify the optimal parameters for the network on the pentapeptide example. It highlights that the network's effectiveness is highly dependent on the chosen hyperparameters and on the size of the network. The parameter space for a given problem must be sufficiently large to express the wide multitude of possible S matrices, while also being small enough to allow effective optimization in a feasible time. This is highlighted by the poorer performance of most networks featuring four linear layers in the decoder, where the method either fails to find the optimal solution or does so so rarely that it is classified as an outlier. Future work could verify whether a varying or learnable adjustment to the learning rate provides a reasonable improvement, as this would allow networks to probe the relevant regions of parameter space more efficiently. Alternatively, further work could evaluate whether it is possible to quantify the relationship between the size of the graph being clustered and the optimal network size.
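The two-stage schedule described above (MSE pretraining toward the PCCA+ assignments, then Kemeny-loss training) can be sketched as follows. The model, features, and PCCA+ targets here are toy placeholders, with a linear layer standing in for the GNN encoder-decoder:

```python
import torch
import torch.nn as nn

# Toy stand-ins: a tiny model mapping node features to soft cluster
# assignments S (N nodes, k clusters). In the paper the model is a GNN;
# here a linear layer keeps the sketch self-contained.
N, d_feat, k = 20, 8, 4
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(d_feat, k), nn.Softmax(dim=-1))
features = torch.randn(N, d_feat)
# Hypothetical PCCA+ soft assignments used as the pretraining target.
pcca_assignments = torch.softmax(torch.randn(N, k), dim=-1)

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
mse = nn.MSELoss()

# Stage 1: pretrain the model to reproduce the PCCA+ clustering.
for _ in range(200):
    opt.zero_grad()
    loss = mse(model(features), pcca_assignments)
    loss.backward()
    opt.step()

# Stage 2 (not run here): continue training the same model with the
# Kemeny loss of Eq. (23), starting from this PCCA+-like initialization.
print(float(loss))
```

The softmax output guarantees each node's cluster probabilities sum to one, so the stage-2 loss can consume S directly.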
Table 2. Network architecture comparison for the pentapeptide system. The four network architectures are compared in terms of the mean (standard deviation) and minimum over 1000 experiments, along with the percentage of runs in which the optimal solution was found.

CONCLUSION
We describe a GNN-based network clustering optimization method that uses the computing infrastructure and tooling that have arisen around GNNs to provide a working basis for difficult optimization problems on graphs. The benefits are numerous, as GNNs provide an ample Ansatz for initializing the optimization of partitionings in situations where finding an efficient representation for the task would otherwise prove difficult. We provide a novel implementation of a GP approach that clusters graphs based on kinetic optimization criteria; here we use the KC to obtain metastable, kinetically optimal clusters. We outline a general methodology for training GNNs to optimize graph partitionings and compare the performance of four different architectures across three different kinetic network types, using several different node features. Our results show that smaller and simpler networks often perform better in our test cases, as larger networks become more difficult to optimize and often cannot find the optimal solutions. However, as we are only interested in the best solution with the maximal Kemeny constant, multiple evaluations can lead to the correct global solution, even if individual initializations obtain it less frequently. GraphSAGE layers combined with a linear decoder achieved both greater consistency and higher accuracy than any tested attention-based alternative.
To show that our proposed method provides a viable alternative to existing methods for coarse-graining such systems, we compare the optimal clusterings obtained with those identified by PCCA+ clustering. It is important to note, however, that as PCCA+ uses a different kinetics-based objective function, the task of identifying the optimal clustering using the KC-based measure is independent of this comparison. Testing the method on SBM graphs generated to have a clear optimal partitioning, we attain results similar to PCCA+, but with more optimal values of the Kemeny constant. A linear chain corresponding to an analytic 1D free energy profile also poses a difficult task for GNNs due to its low connectivity, while PCCA+ can produce the optimal solution in this case. We show how adapting the node features the GNN uses and increasing the number of runs can alleviate this problem and improve performance. However, the optimal clustering is found in only about 5% of attempts, and by only one GNN architecture, so improving the reliability of the method still requires future work. The outline of our general methodology, introducing the strategy-defining generator I, culling C, and termination T functions, provides an ample framework for future work in this direction.
Our MD-simulation-derived MSM for a pentapeptide is the most complex example considered here. It has a slightly more optimal ∆K clustering that seems to capture the metastable states better than the PCCA+-derived clusters, as also seen by comparing the corresponding DBI scores. Only one of the four architectures can find this optimal solution, and in less than 3% of runs, highlighting a problem of GNNs that further work on I functions can hopefully solve. Additionally, GNNs do not provide a straightforward way to initialize the search from a close-to-optimal solution, which makes the PTVC algorithm more efficient in that setting.
In this work, we explored a number of node features used by the GNNs to enable efficient optimization. The MFPT-based features were among the most successful, as they directly relate to the definition of ∆K. However, for examples derived from biomolecular simulations, features specifically relevant to physical and structural observables (angles, distances, energy terms, etc.) could outperform the general features used here. This is a promising avenue for future work.
In summary, we have shown the validity of using GNNs as an effective method for solving the GP problem applied to MCs, alongside a general framework outlining the strategy for optimizing partitionings using GNNs in parallel on graph-like structures. We propose using such a framework to solve the many difficult non-convex optimization tasks that are commonplace in computational fields using graph structures, taking full advantage of the speed, availability, and efficiency of modern ML tooling and associated hardware. The general training paradigm outlined provides a working basis for generalizing this strategy to different problems, and the results highlight the usefulness of simpler GNN architectures for such optimization problems. It remains to be seen how this can be improved by implementing more complex parameter-space sampling strategies to aid the identification of more optimal solutions, taking inspiration from similar techniques used for enhanced sampling in MD.

SI(A) FEATURIZATION
In order to learn effectively, GNNs require a distinct feature vector F_i associated with each of the N nodes, which is used as the initial input to the model. There are several different ways to produce the N feature vectors of dimension d_feat. Here, we aimed to use general features based on the properties of the MSM and did not include system-specific information, such as structural properties, which could also be used for molecular MCs. Below is a list of the different featurization methods we used for KC-based clustering.
• Markov/Rate Matrix: As an MC can be fully defined by either a Markov matrix Q or a rate matrix K, both of shape (N × N), the simplest featurization is to use the columns of either defining matrix, corresponding to the outgoing rates/transition probabilities, as the N feature vectors with d_feat = N. This produces a set of features for each node where each feature vector is simply a column of the Markov or rate matrix: F_i = (Q_1i, ..., Q_Ni), or analogously, F_i = (K_1i, ..., K_Ni).
We also tested whether performing PCA on the matrix K or Q prior to splitting it into individual F_i would improve performance, as done in other work on clustering using GNNs [2]. To do so, we perform PCA using the scikit-learn Python library [1, 4]. Interpreting each column F_i as a sample and the rows as features, we keep all components of the PCA. This can be seen as a linear rotation of the defining matrix onto the perpendicular directions along which the node transition probabilities or rates vary the most, with the aim of making nodes with similar transitions appear more similar.
• Adjacency Matrix: As in the main text, we interpret the adjacency matrix of an MC as a simple cutoff of the defining matrix, (Q_ij + Q_ji)/2 > c = 0.015. This yields a binary (N × N) matrix, which we split into columns representing each node with d_feat = N. We again also tested a PCA of this matrix, using the same strategy as before.
• MFPT Matrix: As the MFPTs are among the most relevant properties for the KC, we tested using them as node features. First, the MFPT matrix with elements t_ji of the unclustered graph was calculated using Eq. (12), giving an (N × N) matrix of the MFPTs between nodes in the network. If needed, we then extract the PCA features before finally splitting the matrix into feature vectors F_i of size d_feat = N using the columns of the MFPT matrix.
• Eigenvectors: As the eigenvalues and eigenvectors of the Markov matrix describe important kinetic properties, we also used the eigenvectors of the Markov or rate matrices as feature vectors.
• Trainable Embedding: Trainable embeddings have found much success in Large Language Models (LLMs), where they provide an efficient way to represent the sentiments of different tokens, and they have been employed previously as random features for GNNs. A trainable embedding is equivalent to an associative array data structure, where each key in a predefined set points to a vector of arbitrary size. Each vector is initialized with random numbers that are then trained along with the rest of the network during back-propagation. We used the torch implementation [3], which uses integer keys from {1, ..., N} representing each node. Several embedding sizes were tested before settling on d_feat = 128, which provided the best performance across all systems.
• Single-layer linear network: A single feedforward layer that takes as input, for each node, the 3N-dimensional vector formed by concatenating the corresponding columns of the (i) Markov/rate, (ii) adjacency, and (iii) MFPT matrices. The output dimension was d_feat = 128 in all systems, and the weights of this layer were updated as part of the training process.
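As a sketch of how these featurizations can be computed, the snippet below builds each feature type for a toy N-node Markov matrix. The MFPT computation uses the standard fundamental-matrix (Kemeny-Snell) route, which may differ in convention from Eq. (12); the cutoff c, the toy matrix values, and the embedding size are illustrative only:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
N = 8

# Toy row-stochastic Markov matrix Q standing in for a real MSM.
Q = rng.random((N, N))
Q /= Q.sum(axis=1, keepdims=True)

# (1) Defining-matrix columns: F_i = (Q_1i, ..., Q_Ni), d_feat = N.
F_markov = Q.T.copy()  # row i holds column i of Q

# (2) PCA of the defining matrix: columns as samples, rows as features.
F_pca = PCA(n_components=N).fit_transform(Q.T)

# (3) Adjacency cutoff: binary matrix (Q_ij + Q_ji)/2 > c.
c = 0.015
A = ((Q + Q.T) / 2 > c).astype(float)
F_adj = A.T

# (4) MFPT matrix via the fundamental matrix Z (Kemeny & Snell):
#     M[i, j] = (Z_jj - Z_ij) / pi_j = mean first-passage time i -> j.
evals, evecs = np.linalg.eig(Q.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                          # stationary distribution
W = np.tile(pi, (N, 1))                 # every row equals pi
Z = np.linalg.inv(np.eye(N) - Q + W)    # fundamental matrix
M = (np.tile(np.diag(Z), (N, 1)) - Z) / W
np.fill_diagonal(M, 0.0)
F_mfpt = M.T

# (5) Trainable embedding: integer node ids -> learnable 128-d vectors.
d_feat = 128
embedding = nn.Embedding(N, d_feat)
F_embed = embedding(torch.arange(N))    # (N, 128), trained by back-prop

# (6) Single linear layer on the concatenated (3N)-dim column vectors.
x = torch.tensor(np.hstack([F_markov, F_adj, F_mfpt]), dtype=torch.float32)
F_linear = nn.Linear(3 * N, d_feat)(x)

print(F_markov.shape, F_pca.shape, F_embed.shape, F_linear.shape)
```

Each F_* array has one row per node and can be passed directly as the node-feature matrix of the GNN encoder.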

Figure 1. Details of all four GNN architectures used, along with a description of the preprocessing required to begin training and an overview of the training-loop process for a single gradient optimization step. Implementations of the GraphSAGE and GATv2 GNN layers used in the encoder are provided, along with a schematic description of the transformer decoder (TD) and linear decoder (LD) architectures. Activation functions for the various layers are omitted for brevity.

Figure 2. Schematic description of the algorithm used to train multiple networks in parallel, and of the use of the general functions I, C, and T.

Figure 3. Performance of the SBM network clustering for randomly sampled cluster connection probabilities. (a) Example of a partitioning produced by the GNN optimization. (b, c) Quantitative assessment of the method's performance over 100 runs for the 500 different networks: (b) the proportion of runs in which the method yields a partitioning with a lower ∆K value than PCCA+, and (c) the average time in seconds for one training run when limited to a single 2.20 GHz CPU core.

Figure 4. (a) The 1D potential used for the linear chain network example. Nodes are coloured according to the optimal 4-state clustering for the system. (b) Violin plots of the feature comparisons on the 1D potential network. Each experiment was run 1000 times with each set of features. The corresponding distribution of final ∆K/KC values is shown, along with a dotted blue line marking the ∆K/KC value of the optimal solution for each problem.

Figure 5. (a) The clustering corresponding to the lowest ∆K/KC values found by all GNN models other than GraphSAGE + LD. (b) The distribution of ∆K/KC values from training 1000 copies of each neural network architecture on the 1D potential-derived network. A logarithmic scale is used to highlight the large differences between the models. The blue dashed line indicates the value associated with the optimal clustering for this system.

Figure 6. (a) Comparison of the PCCA+ solution with the optimal ∆K clustering for the pentapeptide system, with DBI scores of 0.823 and 0.758, respectively. Networks are displayed using a force-directed layout weighted by the MFPTs. Nodes that differ between the two clusterings are highlighted with a yellow border. (b) Violin-plot distributions for 1000 experiments with each of the nine feature sets. The corresponding distribution of final ∆K/KC values is shown, along with a blue dashed line marking the optimal ∆K/KC value.

Figure 7. (a) Bar plot of the DBI values associated with the optimal ∆K partitionings found by each method. (b) Violin distribution plot of ∆K/KC values from training 1000 copies of each neural network architecture on the pentapeptide system. The blue dashed line indicates the value associated with the optimal clustering.

Figure 8. Parameter grid search on the pentapeptide example. 1000 networks were initialized for each set of parameters, and the resulting ∆K/KC distribution is shown. The grey background indicates the parameters identified as optimal.

Figure S1. A comparison of the ∆K values of the partitionings found by PCCA+ (dotted black line) and by the GraphSAGE + LD GNN method on four SBM networks where the GNN provides a clustering with the more optimal value. These four networks represent the largest differences in our examples between the Kemeny optimum and the ∆K value corresponding to the PCCA+ clustering.

Figure S2. DBI values corresponding to the clusterings found for the 1D example.

Figure S3. Histogram of the equilibrium probability distribution over the nodes in the pentapeptide example. Red bars indicate nodes that are placed into different clusters by PCCA+ and by the GraphSAGE + LD method. These differing nodes represent 7.4% of the overall node populations.

Figure S4. A comparison of the pentapeptide network partitionings with the lowest ∆K value found by each method. Panels (a), (b), and (c) show those found by GATv2 + TD, GraphSAGE + TD, and GATv2 + LD, respectively.