Interpretable Embeddings From Molecular Simulations Using Gaussian Mixture Variational Autoencoders

Extracting insight from the enormous quantity of data generated from molecular simulations requires the identification of a small number of collective variables whose corresponding low-dimensional free-energy landscape retains the essential features of the underlying system. Data-driven techniques provide a systematic route to constructing this landscape, without the need for extensive a priori intuition into the relevant driving forces. In particular, autoencoders are powerful tools for dimensionality reduction, as they naturally force an information bottleneck and, thereby, a low-dimensional embedding of the essential features. While variational autoencoders ensure continuity of the embedding by assuming a unimodal Gaussian prior, this is at odds with the multi-basin free-energy landscapes that typically arise from the identification of meaningful collective variables. In this work, we incorporate this physical intuition into the prior by employing a Gaussian mixture variational autoencoder (GMVAE), which encourages the separation of metastable states within the embedding. The GMVAE performs dimensionality reduction and clustering within a single unified framework, and is capable of identifying the inherent dimensionality of the input data, in terms of the number of Gaussians required to categorize the data. We illustrate our approach on two toy models, alanine dipeptide, and a challenging disordered peptide ensemble, demonstrating the enhanced clustering effect of the GMVAE prior compared to standard VAEs. The resulting embeddings appear to be promising representations for constructing Markov state models, highlighting the transferability of the dimensionality reduction from static equilibrium properties to dynamics.


Introduction
Particle-based computer simulations can provide unprecedented mechanistic insight into the driving forces of complex molecular systems, in contexts ranging from biochemistry to materials science [1,2,3]. These simulations rely on numerical integration of the relevant equations of motion as a means to navigate the system's conformational space. Due to the high dimensionality of this space, which prevents the exhaustive enumeration of all microstates, exploration is typically achieved through importance sampling [4]. Conformational sampling leads to an estimate of the potential energy landscape (PEL), which follows a Boltzmann distribution at equilibrium. Unfortunately, characterization of the PEL suffers from the so-called curse of dimensionality [5]-organization of the data in the high-dimensional space is challenging due to low population density. This problem is often remedied by projecting the PEL onto a lower-dimensional manifold, i.e., by performing a dimensionality reduction. By averaging over presumably unimportant degrees of freedom, the resulting low-dimensional surface represents a free-energy landscape (FEL). The ideal FEL distinguishes between microstates that are separated by large barriers on the PEL, yielding a partitioning of configuration space into collections of microstates, i.e., metastable basins. If all the largest barriers are accounted for, 2 Theory and Methods

Autoencoder
Autoencoders are special types of neural networks that are used for the task of representation learning in an unsupervised manner. They are composed of two connected parts: the encoder compresses the input signal to a low-dimensional representation, whereas, the decoder aims to reconstruct the input at full dimensionality from the reduced-space representation. The reconstruction loss, usually defined as either the mean-squared error or cross-entropy between the input, x, and the output, x , is minimized via backpropagation. Since the bottleneck dimension is typically much less than the original dimension, autoencoders learn the most compact representation of the input. Furthermore, because neural networks are universal function approximators, the learned data projections can generally preserve much more of the relevant information than with PCA or other basic linear projection techniques. Figure 1 shows the schematic structure of autoencoder with mean-squared error loss. There are different types of autoencoders which are tailored for special tasks. For instance, sparse autoencoders impose sparsity constraints during optimization, whereas convolutional autoencoders utilize convolutional layers instead of fully-connected layers, in which case they learn the optimal filters. Variational autoencoders, which model the latent space probabilistically, are used for generative purposes, i.e., they can create new samples that look like the ones in the training dataset without simple data replication.

Variational Autoencoder (VAE)
Variational autoencoders were introduced in [25]. In general, the theory of VAEs is approached from two different perspectives: variational inference and neural networks. This section starts with the former interpretation and then illustrates the connection between them. We mostly follow the notation and reasoning used in [33]. The input data and the latent variable are denoted by x and z, respectively.
The objective of the VAE is to find the posterior distribution P (z|x), which can be written in terms of the likelihood P (x|z), the prior P (z), and the marginal probability density of x, P (x), using Bayes law as The denominator P (x) is called the evidence and it could, in principle, be calculated using once the prior is selected. However, the calculation is typically intractable, as it needs to be evaluated over all configurations of the latent variable z. Therefore, the posterior is approximated using variational inference with a chosen easy-to-evaluate family of distributions Q φ (z|x), e.g., Gaussian functions, where φ is the variational parameter of the distribution. In particular, P (z|x) is inferred using Q φ (z|x) by reformulating the problem within an optimization framework, such that the Kullback-Leibler divergence between Q φ (z|x) and P (z|x) is minimized. The KL divergence between Q and P is defined as Equation 1 is then inserted into the posterior definition.
Since the expectation is taken over z, P (x) can be moved out of the expectation.
ELBO(φ) (5) The initial objective of minimizing the KL divergence between the exact and the approximate posterior is equivalent to maximizing the ELBO (Evidence Lower BOund), defined in Equation 5.
Equation 5 can also be rewritten in terms of a different KL divergence: Here the neural network perspective comes into play, as depicted schematically in Figure 3 (a). Q φ (z|x) acts like an encoder (inference), and transforms the data into the latent variable z. On the other hand, P (z|x) (which can also be parametrized with the network parameter θ as P θ (z|x) 1 ) generates the data from the latent representation, analogous to a decoder (generator). The parameters correspond to the weights and biases of the neural networks. Note that the initial aim is to minimize D KL [Q φ (z|x)||P (z|x)], which is equivalent to minimizing the RHS of Equation 6. The first term enforces the encoder to be similar to the chosen prior P (z), which acts as a regularization, whereas the second term on the RHS deals with how well the reconstructions match the original input.

Standard Selections for the Family of Inference Distributions and for the Prior Distribution
In order to use Equation 6 in an optimization procedure, both the family of distributions for inference, Q φ (z|x), as well as the prior distribution, P (z), must be specified. The most common assumption is that Q φ (z|x) (P (z)) is a unimodal Gaussian distribution with mean µ(x) (0) and diagonal covariance Σ(x) (1). Then, D KL [Q φ (z|x)||P (z)] has a closed form solution: where d is the dimension of the Gaussian and tr denotes the trace. Although the unimodal Gaussian assumption simplifies the calculations, it also restricts the possible latent space representations, and may hinder the performance of the variational autoencoder by pushing the latent space to be described by highly-overlapping clusters.

Gaussian Mixture Variational Autoencoder
This section is largely distilled from the discussion and insights presented in [34]. The term Gaussian mixture variational autoencoder is open to misinterpretations. There exist several distinct architectures given this name, with variations in the choice of generative or inference models [30,35,36,37]. In the present work, we take both the approximate posterior, (i.e., the family of distribution functions for inference), Q φ (y, z|x), and the latent space distribution (i.e., the prior), P (z), to be Gaussian mixtures. Note that we have introduced a categorical variable, y, which identifies which Gaussian each particular data point belongs to. The inference model can be written as  In the probabilistic graph representation, circle nodes represent the random variables, and directed edges represent statistical dependencies between the variables in the two ends. Dot nodes are used to indicate the parameters of the model, while some of the nodes are intentionally filled to differentiate the observed random variables from the non-observed ones which are left empty.
The latent space is composed of k distinct Gaussians, i.e., Q φ (z|x, y i ) is assumed to be Gaussian, where i ∈ 0, 1, . . . , k − 1. Thus, the approximate posterior becomes a Gaussian mixture.
Similar to Equation 5, the ELBO can be written as where the number of Gaussians, k, is a hyperparameter, and the subscript m is used to distinguish ELBO m from the VAE ELBO. P θ (x, y, z) can be written as P θ (x, y, z) = P θ (x|y, z)P θ (z|y)P (y) using conditioning without any assumptions. Then, by assuming that x is conditionally independent of y, i.e., P θ (x|y, z) = P θ (x|z) (see the graph representation in Figure 2(b)), the joint probability can be expressed as P θ (x, y, z) = P θ (x|z)P θ (z|y)P (y) .
By inserting Equations 8 and 10 into Equation 9, ELBO m becomes Similar to the VAE, the third and fourth terms represent regularization and reconstruction contributions to the loss, respectively. The initial prior on y is selected as a uniform multinomial distribution, while E Q(y,z|x) [log Q φ (y|x)] can be interpreted as a conditional entropy, reflecting how informative x is on y. To directly control the impact of the clustering relative to the other loss terms during training, we introduced a weighting factor, α, on the mutual information between x and y: Figure 3 presents a more detailed schematic of the GMVAE architecture, while Table 1 presents a summary of the probability distributions utilized in the model. First, data points are probabilistically assigned to k clusters (NN(Q y )). Q(y|x) represents these cluster assignment probabilities, and has multinomial distribution. Since each cluster is assumed to have Gaussian distribution in the latent space, the mean and variance of each of these Gaussians (Q(z|x, y)) are learned via the encoder part of the neural network (NN(Q z )). The low-dimensional representation, z, is then obtained by first sampling and then taking the expected value of these samples, i.e., z = k−1 i=0 p(y i |x)z i . As the first step in decoding, the moments of the corresponding low-dimensional representation z is learned by NN(P z ) from each Gaussian-distributed individual cluster y i , which is then followed by a sampling operation. P (y) in the decoder is assumed to be uniformly distributed among the k clusters. Next, using the encodings, z i 's, the associated x reconstructions are obtained again by sampling from the x by the NN(P x ). Similar to the encoder, the decoder obtains a fixed reconstruction by taking the expected value of x i 's.

Determination of Cluster Labels and Thresholding Scheme
The clustering within the GMVAE is probabilistic, i.e., each data point is assigned membership probabilities (between 0 and 1) to each of the clusters. Since most configurations are assigned predominantly to a single cluster, we perform a hard cluster assignment by assigning each data point to the cluster with highest membership probability. However, in cases where a configuration has similar membership probabilities for multiple clusters, this simple assignment may introduce errors when determining properties (e.g., transition probabilities) of the clusters. Thus, we also considered a different approach by enforcing a thresholding value for cluster assignment. More specifically, each configuration is only assigned to a cluster if the largest membership probability is above a chosen cut-off value. A naive coring scheme followed the thresholding operation such that the points that had been identified as noise were assigned back to their previous cluster index for all other dynamical analyses.

GMVAE Architecture and Training Hyperparameters
The GMVAE algorithm was implemented in Tensorflow [38], and is available at https://github.com/yabozkurt/ gmvae. Training was performed in all cases with fully connected layers, using the Adam optimization algorithm [39]. The Softmax activation function was used for probabilistic cluster assignments, while ReLu activation functions were employed in all hidden layers. The means were obtained without any activation, whereas Softplus activation was employed to obtain the variances. Table 2 shows the values of the hyperparameters for each example system. Default values were employed wherever the parameters are not specified. Number of nodes (NN(·)) columns correspond to the neural networks labeled in Figure 3. NN(Q y ) performs probabilistic cluster assignments, NN(Q z ) is for learning the moments of each Gaussian distribution in the encoding, whereas NN(P z ) and NN(P x ) are for the decoding of the z and x, respectively. The lengths of the "Number of nodes" entries correspond to the number of hidden layers.
Hyperparameter optimization was carried out as follows. The number of nodes was initialized as [16,16]. The number of nodes in the decoder (NN(P x )) was then increased whenever a large and non-decreasing reconstruction loss was observed. Our overall observation for the considered examples is that the learning rate and batch size should be kept relatively low to promote the formation of distinct cluster. The VAE results (with unimodal Gaussian prior) that are provided as comparison are obtained using k = 1, while keeping the remaining parameters equal to the values in the corresponding GMVAE model.  Table 2: Architecture specification and training hyperparameters.

Markov State Models
Markov state models (MSMs) represent the dynamics generated by a molecular simulation trajectory as a series of memoryless jumps between a discrete set of states [40]. Given a configuration-space discretization, a transition probability matrix, P(τ ), is obtained by counting the transitions between pairs of states within a given lag time, τ , and then performing a maximum likelihood optimization [41]. The eigenvalues of P(τ ), {λ i (τ )}, are related to characteristic timescales of the system's dynamics: where t i (τ ) is the timescale corresponding to the i th eigenvalue, λ i (τ ). The time lag parameter τ is typically chosen by performing the "implied timescale test", which assesses the Markovianity of P(τ ) through the convergence of its timescales with increasing τ . In other words, {t i (τ )} is plotted as a function of τ , and τ is then chosen as small as possible such that the largest timescales are sufficiently converged. Once τ is chosen, the accuracy of P(τ ) is determined via the Chapman-Kolmogorov (CK) test, which compares the estimated and predicted probability decay out of a given state. The predicted values are obtained using the CK equation, i.e., using the Markovian property of the model: where p ij (τ ) is the probability of transitioning from state i to state j within time τ , and m is a positive integer. The CK test is often performed on metastables states of the system-collections of quickly interconverting microstates.
Within the standard Markov-state-modeling workflow, microstates are typically defined on low-dimensional projections of the full-dimensional configuration space. Therefore, obtaining a relevant transformation of the molecular simulation data is the key. To this end, time-lagged independent component analysis (TICA) [13,42] is one of the most commonly used dimensionality reduction methods, as its objective is to maximize the autocorrelation of the data at the given lag time, making it especially well suited for kinetic modeling purposes. Metastable states are typically obtained via a dynamical coarse-graining procedure, e.g., PCCA+ [43] whose objective is to retain an accurate description of the dominant eigenvectors of the transition probability matrix. The resulting metastable states are then used as representative collections of microstates for performing the CK test. In many cases, a coarse-grained MSM at the resolution of the metastable states is constructed, providing an easily interpretable, albeit often qualitative, picture of the long timescale processes. In this study, the GMVAE performs the dimensionality reduction and clustering simultaneously, yielding a coarse-grained description of configuration space directly, without the need for further dynamical clustering. The (coarse-grained) MSMs are constructed from the discretized trajectories obtained using the simple cluster assignment based on the GMVAE membership probabilities as described in Section 2.3.1. MSM construction and analysis was performed using the PyEMMA package [44].

Peptide Analysis
The helical propensity of the peptide was determined using the Lifson-Roig perspective, which assigns each residue to either a helical (h) or coil (c) state, according to the dihedral angles along the peptide backbone (i.e., the Ramachandran plot) [45,46]. Therefore, the number of different conformations of the peptide is limited to 2 N , where N is the number of residues; N = 15 for AAQAA 3 . The propensity of residue i to be part of a "helical segment", h i , is then defined as the probability that residue i as well as its two neighboring residues are simultaneously found in a helical state. The average fraction of helical segments, f h , is obtained by averaging h i over all residue positions: To distinguish between partial helical structures occuring at the N-and C-terminus ends of the peptide backbone, we Note that the terminus residue from each end is not taken into consideration.
The dRMSD measures the average deviation of internal distances from the corresponding distances in a reference structure, and is calculated as where X(t) represents the conformation at time t, X r is the conformation for the reference structure, and || · || denotes the Euclidean norm. Note that, unlike other RMSD metrics, no pre-alignment of structures is required. In this study, due to the large fluctuations of the end residues, two residues from each end of the peptide were excluded in the dRMSD calculations. dRMSD was calculated using the positions of the C α atoms only. Helix, hairpin-like, and extended (coil) structures were separately considered as reference structures as illustrated in Figure S12.

Results
Variational autoencoders (VAEs) have been previously applied for dimensionality reduction of molecular simulation data [28,17,47]. VAEs typically employ a normal distribution to represent both the prior distribution in the latent space and the family of distributions for variational inference. In this work, we extend traditional VAEs by representing these distributions with Gaussian mixture models. The resulting Gaussian mixture VAE (GMVAE) adopts the physics-based viewpoint that an optimal embedding of the simulation data should give rise to a free-energy landscape (FEL) with well-separated clusters of configurations, which correspond to metastable states that are separated by large barriers along the high-dimensional potential energy landscape. The GMVAE introduces a categorical variable, y, which represents the various underlying Gaussian distributions to which each configuration will be (probabilistically) assigned. As a consequence, the approach simultaneously performs a dimensionality reduction and clustering, while enabling direct control over the organization of configurations in the latent space. We demonstrate the properties of this architecture by considering two model systems and molecular simulations of alanine dipeptide as well as a more challenging disordered peptide ensemble. In the following, X ∈ R n represents the n dimensional input. The latent variable in the bottleneck is represented by z ∈ R d , d ≤ n.

One-dimensional 4-well Potential
We first consider a single particle in one-dimension interacting with a 4-well external potential, which has been previously employed for testing methods associated with constructing MSMs [48,49]. Figure 4(a) presents the potential, whose functional form and simulation details are given in Section S.II. We employ a GMVAE with a latent space dimension of 1, which assesses the clustering performance of the architecture in the absence of any dimensionality reduction. The GMVAE was trained with k = 4 according to the parameters in Table 2. Figure 4(b) presents the confusion matrix of the resulting model, which quantifies the probability that the model assigns a predicted label (x-axis) given the true label (y-axis). The true labels were determined using a coarse-grained representation of the system, where four metastable states are defined based on simple dividing surfaces, chosen as the maxima of the barriers between each potential well (dashed vertical lines in Figure 4(a)). The GMVAE assigns the state labels with 97% overall accuracy.
Figure 4(c) shows a normalized histogram of z values. Without dimensionality reduction, the GMVAE largely retains the description of the input space within the latent dimension. As a consequence, the decoder is able to quite accurately reconstruct the input from the latent variable (See Figure S1). This behavior is in stark contrast to traditional VAEs, which employ a Gaussian prior to represent the latent space distribution. As a result, anti-clustering effects can arise, leading to highly overlapping clusters of data in the reduced space. To demonstrate this effect, we constructed a traditional VAE for the present example. Figure 4(d) presents the corresponding normalized histogram of z values. In this case, even without a reduction in dimension, significant information is lost due to the constraint of the assumed prior distribution.
To further characterize the quality of the GMVAE clustering, we constructed an MSM from the trajectories of the predicted cluster IDs. Figure 5(a) presents the standard implied timescale test, which assesses the convergence of the characteristic timescales with increasing lag time parameter τ . Convergence indicates that the simulation dynamics, within the discrete-state representation, can be described within a Markovian approximation. The grey area indicates timescales that cannot be resolved by the model, since they are faster than the chosen lag time. From the test, the MSM with τ = 200 was chosen for further analysis. The accuracy of this model was assessed with the Chapman-Kolmogorov test, which compares the simulated and predicted decay of probability from a chosen set of metastable states. Figure 5(b) demonstrates that the predicted "cluster dynamics" accurately represent the long timescale kinetic properties of the underlying simulation trajectory.

Müller-Brown Potential
To assess both the dimensionality reduction and clustering performance of the GMVAE approach, we next consider a single Brownian particle in two dimensions interacting with an external Müller-Brown potential. The trajectory data was generated as the procedure suggested in [28] with the standard parameters [50] (see Section S.III for more details). As depicted in Figure 6(a), the resulting FEL contains two deep minima along with a less stable intermediate state.
We employ a GMVAE that is trained with a latent space dimension of 1 and with k = 5, according to the parameters in Table 2.
Despite employing k = 5, the resulting GMVAE model identified only 3 states with non-zero membership probabilities. Thus, somewhat remarkably, the GMVAE architecture was able to identify the inherent organization of the input data in the high-dimensional space, independent of the hyperparameter k. Figure 6(b) shows the identified clusters. We define the true cluster labels in this case using linear dividing surfaces, as shown in Figure S2 the confusion matrix from the GMVAE model with respect to these defined labels. Although it appears that there are errors in assigning state 1, this error is sensitively dependent on the precise definition of the true label dividing surfaces. Moreover, the overall classification accuracy is actually 99%, since state 1 corresponds to a very rarely sampled intermediate state. The model also demonstrates relatively high reconstruction accuracy (See Figures S2(b) and S2(c)). Figures 6(d) and 6(e) present normalized histograms of z values obtained from the GMVAE model and a traditional VAE model trained on the same data, respectively. The low-dimensional representations obtained from the GMVAE clearly demonstrate a better separation of metastable states. Additionally, the ability of the GMVAE to learn a nonlinear manifold is demonstrated in Figure S3, with respect to the linear embedding obtained using time-lagged independent component analysis (TICA). To further characterize the quality of the GMVAE clustering, we again constructed an MSM from the trajectories of the predicted cluster IDs. The implied timescale test (Figure 7(a)) shows two dominant processes. The MSM with τ = 10 was chosen for further analysis. Figure 7(b)) presents the Chapman-Kolmogorov test, which further verifies the accuracy of the GMVAE embedding.

Alanine Dipeptide
Alanine dipeptide is a representative model system for the characterization of conformational dynamics. Previous work [51,31,52,27,49] has shown that the (φ, ψ) backbone dihedral angles act as ideal collective variables for describing the metastable configurational basins and associated transition kinetics, making it an excellent system for testing the GMVAE framework within a more realistic molecular simulation context. Since in general the optimal set of input features is unknown a priori, we use this example to test the ability of the GMVAE to identify the proper collective variables from a larger set of input features. More specifically, we consider as input features both the normalized pairwise distances between heavy atoms as well as the (φ, ψ) dihedral angles (obtained from [53]). The pairwise distances were pre-processed using a kurtosis filter (with the threshold value of 0.03, see Figure S4 for more detail), to reduce the input dimension by removing the low-variance features. The dihedral angles were pre-processed by applying sin and cos transformations in order to account for periodicity [54]. angle space, with four labeled metastable basins corresponding to α R , α L , β, P II , and γ conformations [55]. The gray lines are drawn for reference and do not represent any sort of optimal dividing surface.   Figure 9(b) shows the simultaneouslyobtained 6 clusters (indexed from 0 to 5) as a part of the GMVAE algorithm. The GMVAE again obtains a FEL that better separates clusters of conformations, relative to a standard VAE ( Figure S8). The distribution of these clusters on the Ramachandran plot (Figure 8(b)) already strongly indicates their suitability for a kinetic analysis. The GMVAE clustering distinguishes all 5 of the metastable states, as well as a transition region between the α R and β states (cluster 4). An MSM was again constructed from the coarse GMVAE cluster assignments. The implied timescale and Chapman-Komolgorov tests are presented in Figure 10, demonstrating the accuracy of this kinetic model.
We found in this example that, unlike the toy systems, the clustering obtained using the GMVAE did not appear to be completely robust. In particular, the precise clustering probabilities depend on the random effects of the training procedure (e.g., random weight initialization and the random shuffling of the input data). This issue was most pronounced for the lowest populated state, whose probability differs from the other states by two orders of magnitude ( Figure S5(b)). As a consequence, the γ state was not always sufficiently separated from the α L state, resulting in a loss of one of the resolved kinetic processes (although the accuracy of the MSM remained intact, see Figure S7). Despite this issue, the obtained FEL appeared rather robust with respect to changes in the random factors during training. We observed a much more robust clustering for all other applications considered. Chapman-Kolmogorov test (at lag=20 steps).

AAQAA 3 Peptide -I
As a more challenging test, we consider simulation trajectories of the capped helix forming peptide AC-(AAQAA) 3 -NH2, which is a representative system for investigating helix-coil transitions. We employ a coarse-grained model [56], which describes the dominant attractive interactions, e.g., hydrogen bonding and effective hydrophobic interactions between side chains, with simple potentials between the C α and C β atoms. These interactions are the minimum required to sample the proper range of structures, (i.e., helix, coil, and hairpin-like). This model also represents excluded volume effects in near-atomic detail, which was demonstrated to be important for accurately characterizing the helix-coil kinetics. Here we employ a parametrization of the model that most accurately reproduces the experimental cooperativity of the helix-coil transition for AAQAA 3 . As a result, hairpin-like structures appear to have relatively low metastability (similar to the intermediate state in the Müller-Brown example, and the γ state in alanine dipeptide), as we discuss further below. The model and simulation protocol are discussed further in the Supporting Information, and also in [56,57]. The considered simulation trajectories correspond to a disordered ensemble of peptide configurations, representing a stringent test for dimensionality and clustering methods [58].
Similar to alanine dipeptide, the set of sin and cos augmented (φ, ψ) dihedral angles along the peptide backbone were used as conformational descriptors. Thus, the input dimension is 60 for the 15-residue AAQAA 3 peptide. We chose to consider only a latent space dimension of 2, given that the ultimate goal of dimensionality reduction is often to reduce the high-dimensional description to something that is easily visualizable. Unlike the simple model systems above, the number of clusters, k, is completely unclear a priori. In fact, we expect that this ensemble to have a hierarchical structure, such that differing number of clusters may be appropriate depending on the chosen level of resolution. While we initially considered the GMVAE with varying number of clusters, we found that the number of "non-zero clusters" (i.e., clusters with a significant probability of configuration assignment) was extremely insensitive to this choice, as discussed below. The GMVAE was trained according to the parameters in Table 2. Also in contrast to the previous examples, there is no definitive reference kinetic model with corresponding known metastable states. Instead, the analysis below assesses the GMVAE embedding and clustering (in terms of both statics and kinetics) with respect to the landscapes obtained using a standard VAE and also following the standard MSM workflow (i.e., TICA [13,42], see Section 2.4 for more details).  Figure 11 show the FELs obtained using the GMVAE and the traditional VAE, respectively. As in the model systems, the GMVAE method results in a latent space description with highly separated clusters, while the traditional VAE yields more overlapping states. The two-dimensional TICA landscape ( Figure S19) also separates a number of clearly distinct states, although there are large diffuse regions with relatively low free-energy values. The clusters obtained via the GMVAE are shown in Figure 12 (a). Despite employing k = 10 and obtaining a landscape that appears to have approximately 10 distinct basins, only 7 states (labeled 0, 1, . . . 6) were assigned non-zero membership probabilities (see Figure S9). Since standard metrics for analyzing peptide configurations do not yield a clear organization of the ensemble into a small number of metastable states, the distribution of these quantities are expected to be highly overlapping, even for a good clustering of the input data. Thus, to more easily visualize the characteristics of the GMVAE clusters, we applied a thresholding scheme, which removes configurations without a membership probability greater than 0.95 (see Section 2.3.1 for details and Figure S10 for cluster populations). Figure 12(b) shows 5 representative structures closest to the cluster centers. We stress that these images are intended to give the reader a rough idea of the types of structures contained in each cluster, but do not characterize the variance of structures within the clusters. This is a disordered ensemble and each cluster necessarily contains a diversity of structures. Nevertheless, Figure 12(b) indicates that the GMVAE successfully distinguishes between distinct secondary structures within the simulation data.

Panels (a) and (b) of
To characterize the structural properties of the clusters quantitatively, we calculated the distribution of the average fraction of helical segments, f h . Figure 13 (Figures S23(b) and S19(b), respectively), although the VAE does not characterize partially-helical structures as clearly as the GMVAE. Figure S11 presents the intra-cluster distributions of f h , which can be used to assess the quality of the clustering (relative to an alternative clustering). We expect that an optimal clustering will result in tight, unimodal f h distributions. The GMVAE clustering yields seemingly good distributions for the most and least helical clusters, while the partially-helical clusters appear broader and somewhat bimodal. For comparison, we consider three alternative clusterings obtained by performing a k-means clustering on a given landscape followed by the PCCA+ dynamical coarse-graining method [43] to define a set of metastable states (see Section 2.4 for more details): (i) an alternative clustering of the GMVAE landscape ( Figure S16), (ii) a clustering on the VAE landscape ( Figure S24), and (iii) a clustering on the TICA landscape ( Figure S20). The alternative clustering scheme on the GMVAE landscape, (i), does not improve the intra-cluster distributions of f h , demonstrating that the GMVAE clustering is reasonable, given the GMVAE embedding. Similar results were obtained from the VAE clustering, with slightly broader distributions for the most and least helical states. The TICA clustering resulted in somewhat improved distributions, in the sense that they appear to be mostly unimodal, although some of the distributions appear to be slightly broader.  We also characterized the average fraction of helical segments on the N-and C-terminus sides of the peptide: h N and h C , repectively (see Section 2.5 for more details). Figure 14 presents the difference of these quantities, h N − h C , plotted along the GMVAE embedding. Positive values (represented by blue) indicate conformations that contain helical structure on the N-terminus side of the peptide without helical structure on the C-terminus side. Conversely, negative values (represented by red) indicate conformations that contain helical structure on the C-terminus side of the peptide without helical structure on the N-terminus side. Values close to zero correspond to either fully helical or non-helical structures. Although the GMVAE embedding and clustering separate the most distinct structures in the ensemble (coils and full-helicies), some of the clusters (0, 1, 2) encompass partially-helical conformations on both sides of the peptide (see also Figure S15). This is not ideal since kinetic barriers within a cluster will negatively impact the accuracy of a kinetic characterization at the cluster level. However, it appears that this issue may have more to do with the clustering than the embedding itself, since blue-and red-labeled structures appear to be reasonably separated on the landscape. Similar to the other examples above, we also constructed an MSM directly from the discretized trajectories of GMVAE cluster indices. Although thresholding was applied in the results presented here (practically similar to coring methods for constructing kinetic models [59]), we found that this procedure had negligible effect on the accuracy of the resulting MSM. As shown in Figure S14, the MSM constructed from the GMVAE clustering displayed significant errors in describing, e.g., the decay of probability out of the helix state. Perhaps this is not so surprising, since coarse-grained MSMs are often only used as a qualitative analysis tool, while higher-resolution kinetic models that characterize configuration space with many microstates are used for quantitative reproduction of simulation kinetics. Thus, to more carefully assess the GMVAE embedding and to more easily compare to the VAE and TICA results, we constructed a higher-resolution MSM by performing k-means to define microstates on the landscape ( Figure S16). Although the resulting model demonstrates improved accuracy according to the Chapman-Kolmogorov test, the probability decay out of the metastable states occurs on a fast time scale relative to the chosen lag time. This may be indicative of poorly defined dividing surfaces between metastable states. The kinetic models constructed from the VAE and TICA landscapes ( Figures S24 and S20, respectively) demonstrate similar quickly decaying probabilities. Although coring procedures could be applied to attempt to fix this problem, it indicates that there are fundamental limitations of all of these landscapes in terms of characterizing the long timescale simulation kinetics. There are several possible reasons for these difficulties, including (i) the limitation of our embeddings to two dimensions, (ii) the limitation of the chosen input features in characterizing kinetically-distinct structures, (iii) the presence of many low-lying barriers along the potential energy landscape of this disordered ensemble, and (iv) the poor sampling of relatively rare transitions to the full helix conformation. Further investigation of this issue is beyond the scope of this initial study of the performance of the GMVAE, and is left for future work.

AAQAA 3 Peptide -II
To investigate the impact of the low sampling of helical structures on the GMVAE embedding, as in the AAQAA 3 -I simulations presented above, we also considered a second set of simulations which primarily samples helical-and hairpin-like structures, while only rarely sampling fully-coil structures. (Please see the Supporting Information for more details about the differences between the two sets of simulations). In addition to the dihedral angles, normalized pairwise distances between residues that are more than 3 residues apart were included as input features. Figure 15 presents the obtained GMVAE FEL (panel (a)), the corresponding clustering of 6 metastable states (panel (b)), and overlays of five structures that are closest to the cluster centers (panel (c)). The GMVAE embedding demonstrates significant separation of metastable states, relative to the landscape obtained with a standard VAE ( Figure S37(a)). Similar to the previous ensemble (AAQAA 3 -I), Figure 16 shows the separation of structures according to f h (panel (a)), and dRMSD hel (panel (b)). The VAE and TICA landscapes demonstrate similar trends ( Figures S37 and S33, respectively). The intra-cluster f h distributions are shown in Figure S28. The majority of the fully-helical structures are in cluster 3 and 5, while clusters 0, 1, 2 and 4 contain hairpin-like structures as well as partial helicies. The coil structures are gathered in the bottom-most part of the landscape (in cluster 4), though not separated as a distinct cluster by the GMVAE. The distributions are broader and less unimodal than those determined from the previous set of simulations, although these can be somewhat improved with the alternative clustering scheme on the GMVAE landscape ( Figure S32). Similar results are also obtained from the VAE and TICA landscapes ( Figures S40 and S36, respectively). Figure 17 presents the characterization of the N-and C-terminus, partially-helical conformations. In contrast to the AAQAA 3 -I embedding, the GMVAE embedding and clustering for AAQAA 3 -II more clearly separates the distinct types of structures. It appears that this difference may be due to the increased sampling of helical structures in AAQAA 3 -II, although the inclusion of pairwise distances as additional input features may also have played a role. N-and C-terminus partially-helical structures are mostly located in clusters 4 and 2, respectively, while both types of structures can be found to a lesser extent in cluster 5. Although the VAE and TICA landscapes also appear to largely distinguish between distinct partially-helical structures ( Figures S37 and S33, resepectively), the GMVAE landscape provides a significantly better clustering of these two distinct sets of conformations.
Despite the improved description of partially-helical structures, the MSM constructed directly from the GMVAE clustering for AAQAA 3 -II displayed similar discrepancies to the model built for AAQAA 3 -I ( Figure S29). Moreover, the high-resolution MSMs constructed from the GMVAE, VAE, and TICA landscapes ( Figures S30, S38, and S34, respectively) displayed very fast decay of probability out of the identified metastable states, as in the AAQAA 3 -I example.

Discussion and Conclusions
Variational autoencoders are quickly making an impact in the field of molecular simulations due to the inherent focus of the architecture on retaining the essential features of the system. Control over the topology of the latent space can increase the performance and interpretability of these methods by making a direct connection to the physics of the system through our physical intuition: an ideal free-energy landscape characterizes basins that are well-separated by the largest barriers along the higher-dimensional potential energy landscape. To explicitly enforce such features, we propose a Gaussian mixture model as the prior distribution in the latent space.
The performance of the Gaussian mixture variational autoencoder (GMVAE) was illustrated on two standard toy-model systems and on the standard benchmark alanine dipeptide, as well as on a challenging 15-residue-long disordered peptide. For each example, the GMVAE circumvents the aggregation of points in the latent space characteristic of traditional variational autoencoders. Instead, samples that are structurally distinct are clearly separated, leading to a latent space that displays apparent metastable basins and barriers. The GMVAE introduces a categorical variable that probabilistically assigns samples to a set of underlying clusters, each of which is Gaussian distributed. Thus, the approach combines the commonly distinct tasks of dimensionality reduction and clustering into a unified framework. In the absence of dimensionality reduction, the GMVAE retains the characteristics of the system within the latent space, while providing an accurate assignment between clusters. Remarkably, in the case of limited dimensionality reduction, the GMVAE identifies the inherent clustering of the input data, insensitive to the cluster-number hyperparameter.
Beyond statics, there have been several recent autoencoder architectures aiming at the characterization of molecular kinetics. Several of these methods directly incorporate kinetic information in the loss function, either by reconstructing time-lagged samples or by approximating the dynamical propagator [31,32,27,28,60,61]. The interpretability of the latent space is becoming a feature of increasing interest: Hernández et al. recently proposed an approach for identifying the most important input features for determining the one-dimensional latent-space representation within a time-lagged VAE framework [28], while Wang et al. relied on a linear encoder to interpret the relevant coordinates of interest [60]. Here, we argue that incorporating physical constraints into the architecture helps to construct an interpretable model for the kinetics, even when kinetic information is not used for learning the representation. The GMVAE architecture attempts to better mimic the shape of an ideal free-energy landscape within the latent space. In particular, the presence of barriers that separate metastable clusters determines the relevant kinetic properties through the separation of timescales between intra-and inter-basin transitions.
We report extremely encouraging results for constructing kinetic models from representations learned from static information alone. For the two toy models and for alanine dipeptide, the resulting Markov state models demonstrate excellent properties, as monitored by the implied timescale and Chapman-Kolmogorov (CK) tests. The disordered ensemble of the AAQAA 3 peptide proves more challenging: the CK test shows discrepancies for the decay of probability out of the longest-lived metastable states. Although higher-resolution MSMs constructed directly from the GMVAE landscape demonstrated an improved description of the simulation kinetics, these models were unable to resolve the longest timescale processes. Comparisons of two distinct peptide ensembles clarified the role that sampling can play in distinguishing distinct partially-helical structures on the GMVAE landscape. It remains unclear to what extent the restriction of our embeddings to two dimensions or the choice of input features prevented the GMVAE (as well as the more standard methods considered) from better describing the simulation kinetics. Moreover, the presence of many low-lying barriers along the potential energy landscape of this disordered ensemble may cause fundamental challenges in obtaining a clear few-metastable-state characterization of the conformational landscape. Thus, we propose that, in conjunction with simpler test systems that clearly assess a method's performance, such examples are important for significant advancements in data-driven characterizations of molecular simulation trajectories.
While we defer a more detailed investigation of these issues for future work, we highlight the promising performance of the GMVAE demonstrated through our results. First, in the context of static equilibrium properties, the incorporation of the Gaussian mixture model as a prior distribution on the latent space closely links our physical intuition about ideal free-energy landscapes, resulting in an inherently more interpretable latent space. Secondly, our results show encouraging performance when constructing kinetic models from the learned representations-an aspect that is entirely absent in the loss function, representing an independent validation of the procedure.

S.I Overview
This document presents additional results to the main text. X ∈ R n denotes the reconstructions. The sampling operation in the reconstructions (shown in the decoding part of Figure 3), corresponds to taking the means of the Gaussians for simplicity.

S.1
Interpretable embeddings from molecular simulations using Gaussian mixture variational autoencoders

S.II One dimensional 4-well potential
The trajectory data is obtained as suggested in [49], and using the code provided in [62]. 100 × 100 transition probability matrix is obtained among the equally-spaced 100 bins in the interval [-1, 1] as follows where V i and V j are the potential energies at the centers of bins i and j, which are defined according to the potential of the form: V (X) = 2(X 8 + 0.8e −80X 2 + 0.2e −80(X−0.5) 2 + 0.5e −40(X+0.5) 2 ), and C i is the normalization factor. The system is initialized randomly, and propagated according to P ij 5 × 10 6 steps in time. Figure S1 shows the reconstructions in a scatter plot. The X = X line shows the lossless reconstructions. Figure S1: X vs X for one-dimensional 4-well potential. Reconstructions are obtained via the GMVAE. S.2
The true labels are defined as shown in Figure S2 Figure S3 further demonstrates the ability of the GMVAE to learn a nonlinear manifold that separates the three distinct free-energy basins, compared with time-lagged independent component analysis (TICA), which can only find a linear separatrix for the basins. Figure S3, showing the projections obtained with the GMVAE and TICA, was constructed following [28], with the colors indicating values of the latent variable while the gray dots correspond to trajectory data. Figure S3: Projections via the GMVAE and TICA. The GMVAE learns the nonlinear dividing surface in the lowdimensional space.

S.IV Alanine dipeptide
As the input features, dihedral angles and pairwise distances for heavy atoms that are provided in [53] for three simulations of length 250 ns each are used. Dihedral angles are transformed to their sin / cos representaions, and the pairwise distances whose variance are low are removed from the feature set (using kurtosis function from scipy.stats library [64], with threshold value of 0.03). We applied TICA to the set of pairwise distances only, followed by a kinetic coarse-graining with the PCCA+ method into 4 metastable states. Figure S5(a) presents the resulting clusters plotted on the Ramachandran plot. Figures S5(b) and S5 (c) show the histograms of these metastable states, and the GMVAE clusters, respectively.

S.V.2 GMVAE Landscape and the Cluster Assignments
Since the GMVAE method is a probabilistic clustering method, each data point has a probability of assignment to each of the k clusters. For a data point d i , the probability of assignment to each of the clusters has probability values p di,0 , p di,1 , . . . p di,k−1 for k number of clusters. In the ideal case, all of the probability values except the true cluster is equal to 0, and the true cluster has a value of 1. Figure S9 separately shows the histogram of probability distributions of all of the data points for a cluster. For instance, for cluster 0, the probability distributions are accumulated at probability values 0 and 1. In other words, with high certainty, cluster 0 is differentiated from the others. None of the data points is assigned to clusters 7 − 8 − 9. Note that although the network is initially trained for 10 clusters, it is not possible to separate more than 7 clusters under the specified loss function. This suggests a way to find the inherent number of clusters, i.e., metastable states, provided that k is chosen larger than that true value. Cluster ID's are obtained after Figure S9: The population distribution as a function of probability of belonging to each of the clusters after the training. None of the data points is assigned to clusters 7 − 8 − 9.
a thresholding step as explained in Section 2.3.1. Figure S10 shows the cluster populations. Cluster -1 indicates the datapoints that are not assigned to any of the clusters. Figure S11 shows the inter-cluster f h distributions. To further Figure S10: Cluster populations characterize the clustering of secondary structures, we separately calculated dRMSDs with respect to three reference structures: helix (hel), hairpin-like (hp), and extended (coil). Figure S12(a) presents both the reference structures (right) and corresponding dRMSD distributions (left). The first and the second small peaks in the dRMSD hel distribution represent helical conformations, while the peak corresponding to dRMSD hel values between 2 − 3.5 hints at the presence of the hairpin-like structures. Note that there is an offset in dRMSD hp values due to (i) the scarcity of the well-defined hairpins in the trajectory data, and (ii) the subjectivity involved in choosing the reference structures. By plotting the two-dimensional free-energy surface along dRMSD hel and dRMSD hp , shown in Figure S12 Figure S12(c) presents inter-cluster free-energy surfaces along dRMSD hel and dRMSD hp , generated by considering only conformations within a single cluster.
In addition to f h and dRMSD hel , the radius of gyration R g distribution is also analyzed. R g measures the massweighted deviations from center of mass, and gives an idea on the overall spread and compactness of the molecule, and is calculated as where m i is the mass of atom i, r i the coordinates of atom i, and r c is the coordinates of the center of mass. Figures 13(b), S13(a), S13(b), and S13(c) show the heat map of dRMSD hel , dRMSD hp , dRMSD coil , and R g on the FEL obtained via the GMVAE, respectively.
As a final characterization of the clustering, we constructed an MSM directly from the discretized trajectories of GMVAE cluster indices. Although thresholding was applied in the results presented here (practically similar to coring methods for constructing kinetic models [59]), we found that this procedure had negligible effect on the accuracy of the resulting MSM. Figure S14(a) presents implied timescale test. The kinetic model can resolve two of longest characteristic processes and demonstrates reasonable convergence, although there is a small increase in the longest timescale with increasing lag time. This subtle discrepancy already indicates that there may be some issues with the accuracy of the kinetic model. An MSM is constructed at lag time 700 to balance between the convergence of the timescales and the resolution of shorter timescale processes. Figure S14(b) presents the CK test from this model. There are significant errors in the description of probability decay from each of the metastable states (i.e., clusters), especially states 0 and 1. First, we note that coarse-grained MSMs (i.e., MSMs built on a small number of metastable states) are often not expected to be quantitatively accurate due to difficulties in accurately defining the dividing surfaces between states [40]. However, we anticipate that it should be possible to make a more accurate coarse-grained MSM for this particular simulation trajectory. The discrepancies in the model can then originate from two coupled problems: (i) the GMVAE latent space definition places structures close together that are kinetically distinct (i.e., there are hidden barriers) or (ii) the GMVAE clustering fails to identify/separate distinct metastable states. The FEL within the latent space (Figure 11(a)) contains clearly separated basins that are not identified as unique clusters by the GMVAE. In particular, within clusters 0 and 1, there seems to be 2 and 3 separate states, respectively. Figure S11 shows that cluster 0 (1) contains structures with a range of helicities ranging from 0.46-1.0 (0.15-0.69). According to the conventional picture of the helix-coil transition, the overarching kinetics can be described by two timescales: (i) the rate at which a single helical segment is formed and (ii) the elongation rate of helical segments along the chain. By grouping together conformations with a single helical segment and several helical segments, the GMVAE has convoluted these two timescales, resulting in non-Markovianity in the kinetics described on these clusters. To further clarify the source of these errors, we constructed an MSM in the conventional way, directly from the latent space distribution. More specifically, we applied k-means algorithm with 1000 cluster centers, and then applied PCCA+ [43]. In order to enable comparison, we continued with the previous number of metastable states (7). The CK test for the resulting model with lag time τ = 700 (obtained from the implied timescale test, Figure S16(b)) is presented in Figure S16(c), and demonstrates slightly improved accuracy.