A collaborative visual analytics suite for protein folding research

https://doi.org/10.1016/j.jmgm.2014.06.003

Abstract

Molecular dynamics (MD) simulation is a crucial tool for understanding principles behind important biochemical processes such as protein folding and molecular interaction. With the rapidly increasing power of modern computers, large-scale MD simulation experiments can be performed regularly, generating huge amounts of MD data. An important question is how to analyze and interpret such massive and complex data.

One of the many challenges in analyzing MD simulation data computationally is the high dimensionality of such data. Given a massive collection of molecular conformations, researchers typically need to rely on their expertise and prior domain knowledge to retrieve conformations of interest. It is not easy to make and test hypotheses, as the data set as a whole is somewhat “invisible” due to its high dimensionality. In other words, it is hard to directly access and examine individual conformations from a sea of molecular structures, and to further explore the entire data set. There is also no easy and convenient way to obtain a global view of the data or its various modalities of biochemical information.

To this end, we present an interactive, collaborative visual analytics tool for exploring massive, high-dimensional molecular dynamics simulation data sets. The most important utility of our tool is to provide a platform where researchers can easily and effectively navigate through the otherwise “invisible” simulation data sets, exploring and examining molecular conformations both as a whole and at individual levels. The visualization is based on the concept of a topological landscape, which is a 2D terrain metaphor preserving certain topological and geometric properties of the high-dimensional protein energy landscape. In addition to facilitating easy exploration of conformations, this 2D terrain metaphor also provides a platform where researchers can visualize and analyze various properties (such as contact density) overlaid on top of the 2D terrain. Finally, the software provides a collaborative environment where multiple researchers can assemble observations and biochemical events into storyboards and share them in real time over the Internet via a client-server architecture.

The software is written in Scala and runs on the cross-platform Java Virtual Machine. Binaries and source code are available at http://www.aylasoftware.org and have been released under the GNU General Public License.

Introduction

Proteins are the building blocks of cells and are responsible for nearly all cellular functions [1]. One of the fundamental goals of biological research is to understand the mechanics of an organism in terms of its constituent proteins. Recent advancements in gene sequencing have provided massive amounts of genomic data encoding the amino acid sequences of the constituent proteins of an organism. However, this sequence alone is insufficient for determining the native conformational structure (i.e. the stable three-dimensional shape), and consequently the function, of a protein. It is through the process of folding that the conformational structure of a protein transitions from a random coil to the functional native conformation uniquely determined by its amino acid sequence [3].

Understanding this folding process is of paramount importance. Hence, significant effort has been devoted to investigating the dynamics and kinetics of protein folding. Molecular dynamics (MD) simulations are key tools in this effort. For example, an important theory of protein folding (the so-called “energy landscape” theory [39], [8], [25]) assumes that folding occurs through an organizing ensemble of structures via many possible pathways and intermediates. Through analysis of MD simulation data, key local and global interactions can be detected along the folding pathway obtained by, for example, projecting simulation data onto certain parameters (so-called reaction coordinates), e.g. native contacts, radius of gyration, and principal components [32], [27], [52], [48].

With the computational power of massively parallel modern computers, large-scale MD simulations are now routinely performed, yielding huge amounts of simulation data. As such, there is a pressing need for better tools to help users explore and interpret these massive, high-dimensional simulation data sets. In this paper, we describe an intuitive and effective visualization platform to facilitate the exploration of simulation data in a collaborative environment.

Given a set of MD simulation data, one can interpret it as a sampling of the conformational space of a given molecule. One important concept associated with a molecular conformation is its energy, which is often quantified in terms of free or potential energy. This energy function (defined on the molecular conformational space) has been fundamental in understanding protein conformational spaces and folding mechanisms.

The molecular conformational space is intrinsically very high dimensional. In order to obtain a global view of a molecular simulation data set, a natural approach is to project the data to ℝ² (or ℝ³) in some way. For example, one can project the molecular conformations into ℝ² by choosing a pair of reaction coordinates, then represent each conformation in terms of these two coordinates [41], [34], [53]. One can then visualize the energy function on this low-dimensional projection by plotting the contours (isocurves) of the energy function. However, it is often difficult to judge the goodness of a set of reaction coordinates, and there is little consensus on this in the literature [39], [32], [52]. Furthermore, projecting data onto specific reaction coordinates may cause loss of information about other important properties.
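
As a concrete illustration of such a projection, the following minimal Python sketch maps each conformation to two commonly used reaction coordinates, the radius of gyration and the fraction of native contacts. The function names, the toy four-bead chain, the equal-mass assumption, and the contact cutoff of 1.5 are illustrative assumptions and are not taken from the software described here.

```python
import math

def radius_of_gyration(coords):
    """RMS distance of the atoms from their centroid (equal masses assumed)."""
    n = len(coords)
    centroid = tuple(sum(p[k] for p in coords) / n for k in range(3))
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in coords) / n)

def native_contact_fraction(coords, native_contacts, cutoff):
    """Fraction of native contact pairs (i, j) that lie within `cutoff`."""
    formed = sum(1 for i, j in native_contacts
                 if math.dist(coords[i], coords[j]) <= cutoff)
    return formed / len(native_contacts)

def project(conformations, native_contacts, cutoff=1.5):
    """Map every conformation to a 2D point (Rg, fraction of native contacts)."""
    return [(radius_of_gyration(c),
             native_contact_fraction(c, native_contacts, cutoff))
            for c in conformations]

# Toy data (hypothetical): an extended and a compact four-bead chain.
extended = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
compact = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
native_contacts = [(0, 3)]  # assumed: beads 0 and 3 touch in the native state

points = project([extended, compact], native_contacts)
# extended chain: Rg is about 1.118, native contact fraction 0.0
# compact chain:  Rg is about 0.707, native contact fraction 1.0
```

Every other structural property of a conformation is discarded by this mapping, which is the loss of information noted above.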

Rather than committing to a set of reaction coordinates, one could instead use a general-purpose dimensionality reduction algorithm to project the data into low dimensions; e.g., [4], [13], [19], [42]. For example, Das et al. [13] and Plaku et al. [42] use a nonlinear dimensionality reduction approach based on the Isomap algorithm [47] to produce a 2D map from a set of MD conformations. By assigning an energy-correlated color to each point in the 2D map, their algorithm ScIMAP produces effective visualizations for several data sets generated from coarse-grained simulation. Hamprecht et al. [19] use multidimensional scaling techniques to obtain a low-dimensional representation of conformational space. They also employ an interesting basin spanning tree idea for visualizing the structure within energy basins. This in turn provides a glimpse into the topography of the high-dimensional energy landscape.

While many elegant dimensionality reduction algorithms have been developed in the past few years (see e.g. [47], [7], [43], [21], [12], [11] in the field of machine learning), these methods usually aim to preserve either global or local (distance) metrics. If the intrinsic dimension of the protein conformational space is far above 2 or 3, then distance distortion is unavoidable and can be arbitrarily large. If we now visualize a scalar field (such as the potential energy function) over the projected domain, then topological features (also referred to as topographical features), such as peaks and valleys in the low-dimensional projection, are often merely artifacts of distance distortion; that is, they do not correspond to true “peaks” and “valleys” in the high dimensional energy landscape. One example of this phenomenon is given by Harvey and Wang [20].
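
The distortion can be quantified for a toy case. The sketch below is illustrative only: it uses a naive coordinate-dropping projection as a stand-in for a dimensionality reduction algorithm, and the `distortion` helper is a hypothetical name. Four mutually equidistant points in 3D (the vertices of a regular tetrahedron) cannot remain equidistant in the plane, so some pairwise distances must shrink relative to others.

```python
import math
from itertools import combinations

def distortion(high, low):
    """Ratio between the largest and smallest pairwise distance scaling.

    A value of 1.0 means all distances are preserved up to a common scale.
    Assumes distinct points stay distinct after projection.
    """
    ratios = [math.dist(low[i], low[j]) / math.dist(high[i], high[j])
              for i, j in combinations(range(len(high)), 2)]
    return max(ratios) / min(ratios)

# Vertices of a regular tetrahedron: every pair is sqrt(8) apart.
tetrahedron = [(1.0, 1.0, 1.0), (1.0, -1.0, -1.0),
               (-1.0, 1.0, -1.0), (-1.0, -1.0, 1.0)]

# Naive projection to 2D: keep the first two coordinates only.
flattened = [p[:2] for p in tetrahedron]

d = distortion(tetrahedron, flattened)
# The projected points form a square: sides shrink to 2 while the
# diagonals keep length sqrt(8), so d equals sqrt(2), about 1.414.
```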

Given the importance of the energy function defined on the molecular conformation space, there has been a different line of work using the so-called disconnectivity graph to capture energy basins (valleys) and the connection between them through saddles. The disconnectivity graph was first proposed in [6] for potential energy, and has since been extended to free energy and has been widely used to understand the high-dimensional energy landscape for complex systems; see e.g., [15], [29], [26], [49], [50]. The disconnectivity graph is usually a rooted tree with leaves corresponding to local energy minima and internal nodes corresponding to connecting saddles of the energy function. While the disconnectivity graph provides a good way to show the key topographical features of the high dimensional energy landscape, it is not easy to explore and access the input molecular simulation data through this tree representation. As we will see later, the method proposed in this work will provide a terrain platform for such trees to enable visual analysis of the high dimensional simulation data.
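
To make the tree structure concrete, the following sketch builds a simplified merge tree of energy basins from sampled conformations. It assumes that the conformations are given as vertices of a neighborhood graph with one energy value each, and it sweeps them in order of increasing energy using a union-find structure. The name `basin_merge_tree` and the toy data are hypothetical, and this is a discrete analogue for illustration rather than the construction of [6].

```python
def basin_merge_tree(energy, edges):
    """Sweep vertices by increasing energy, tracking connected basins.

    Returns (minima, merges): `minima` lists vertices that start a new
    basin (leaves of the tree); `merges` lists (saddle_vertex, minima of
    the basins joined at that vertex), i.e. the internal nodes.
    """
    n = len(energy)
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    parent = list(range(n))   # union-find forest
    lowest = list(range(n))   # lowest[root] = deepest minimum of that basin
    seen = [False] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    minima, merges = [], []
    for v in sorted(range(n), key=lambda i: energy[i]):
        seen[v] = True
        roots = {find(u) for u in neighbors[v] if seen[u]}
        if not roots:
            minima.append(v)  # no lower neighbor: a local energy minimum
            continue
        if len(roots) > 1:
            merges.append((v, sorted(lowest[r] for r in roots)))
        # Merge all touched basins into the one with the deepest minimum.
        keep = min(roots, key=lambda r: energy[lowest[r]])
        for r in roots:
            parent[r] = keep
        parent[v] = keep
    return minima, merges

# Toy 1D "landscape" (hypothetical): seven samples along a chain.
energy = [3, 1, 4, 0, 5, 2, 6]
edges = [(i, i + 1) for i in range(6)]
minima, merges = basin_merge_tree(energy, edges)
# minima == [3, 1, 5]
# merges == [(2, [1, 3]), (4, [3, 5])]
```

In this toy run, vertices 3, 1, and 5 are the leaves (local minima), while vertices 2 and 4 act as the saddles at which neighboring basins connect.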

Finally, we note that Stone et al. [46] presented an immersive visualization environment which uses specialized data structures and interactive techniques for trajectory visualization. However, their approach mainly focuses on efficient trajectory animation and does not deal with the problem of visual summarization or navigation of the data in its entirety. We also note that the term “terrain metaphor” was used before in [54], which presents a nice tool for visualizing and analyzing patterns in folding trajectories. In particular, the terrain in [54] is built based on patterns (identified by clustering algorithms) in folding trajectories and the frequency of their occurrences. The principle and visualization methodology are both different from our current work.

When metric preservation in dimensionality reduction becomes hopeless, we search for other sources of relevant information which can be conveyed in low dimensions. Given the importance of (free and potential) energy of a protein conformation in understanding molecular conformational spaces and folding mechanisms, we aim to preserve characteristic features of the energy landscape (which is the graph of the high dimensional energy function) using a low-dimensional metaphor. Specifically, the information which we preserve is similar to that which is encoded in the disconnectivity graph. However, rather than conveying this information in the form of a tree, we communicate this data in the form of a 2D landscape (see the right panel in Fig. 1) which provides some additional advantages and opportunities for interaction and visualization.

In particular, recall that given a set of molecular simulation data, we consider it to be a sampling of the conformational space C of this molecule. Now consider the energy landscape E: C → ℝ, which is simply a high-dimensional scalar function with the function value at each point (conformation) being the energy of this conformation. We then build a two-dimensional landscape L: [0, 1] × [0, 1] → ℝ (i.e., a scalar field defined on the square [0, 1] × [0, 1]) as a metaphor for E with the property that E and L share the same contour tree (we will make this precise shortly in Section 2). L is called a topological landscape metaphor for the scalar field E, a concept originally proposed by Weber et al. [51] and further developed in [20]. See the right panel in Fig. 1, where we show the landscape (terrain) L (the graph of the function L), with the vertical direction indicating the energy function value. Intuitively, L preserves the mountain peaks and valleys (basins) of E. Merging or splitting of peaks/valleys in L indicates corresponding events in the high-dimensional energy landscape E. Furthermore, areas of mountain peaks and valleys are proportional to the volumes of their high-dimensional counterparts as well.

While the current formulation of the software uses the contour tree of the high-dimensional energy function to build the 2D terrain, it would be straightforward to substitute an alternative structure in place of the contour tree to achieve a variety of different landscapes. Any tree with a scalar function defined over its nodes could be used as a substitute for the contour tree. Hence we can also build a landscape metaphor for the disconnectivity graph [6], [49] for data exploration and navigation.

Finally, the Mapper algorithm, developed and used in [45] for generic data analysis and successfully utilized in several biomedical applications [37], [31], uses a graph structure to summarize a high-dimensional scalar field. This graph is embedded in three dimensions and serves as a platform for high-dimensional data exploration. Our software explores an orthogonal direction to the Mapper algorithm, where we focus on building an intuitive and interactive information visualization framework centered on a 2D terrain: the terrain metaphor facilitates easy selection and inspection of molecular conformations, and allows the overlay of other (e.g. biochemical) information on the terrain (see Section 3.3).

We integrate this landscape metaphor to build an interactive, collaborative visual analytics tool for exploring massive, high-dimensional molecular dynamics simulation data sets. See Fig. 1 for one part of the interface of our software, where we link molecular structures and secondary structural information to the 2D landscape metaphor.

  • Our tool is interactive and intuitive: Researchers can now “see” and navigate the entire set of conformations as a whole, as well as interactively select and examine individual conformations. For example, by traversing our 2D landscape and annotating key conformations, users can perform conformational analysis ranging from the sub-ensembles induced by local fluctuations (contained within certain energy-minimum basins) all the way to the globally unfolded conformations (e.g., energy maxima). These data exploration utilities can help researchers form and test hypotheses about folding mechanisms.

  • Our tool provides a platform for information integration: Researchers can easily visualize other information of interest as overlays on the 2D terrain. Two specific examples include the fraction of certain secondary structure elements formed and the fraction of native contacts present in each conformation. Such integration would be much harder to visualize on a tree or a graph.

  • Our tool also provides a collaborative environment where multiple researchers can assemble interesting observations and biochemical events into storyboards, share them, and have discussions in real time over the Internet via a client-server architecture.

We present various encouraging preliminary examples to illustrate these points and the utility of our software in Section 3.

To the best of our knowledge, no previous system provides a concise summarization of the global structure of MD simulation data while also allowing local interactive exploration of the sampled conformations. In particular, the full structural information of every simulated conformation is preserved in our software, and can be easily accessed using our 2D landscape. We believe that an intuitive and effective tool for exploring large-scale, high-dimensional simulation data is essential, yet currently lacking. Our tool takes an important step toward closing this gap.

Background on topological landscapes

This section serves as an accessible introduction to the mathematical and topological foundations of our proposed visualization framework for non-experts in computational topology. For a more rigorous treatment of theoretical concepts we refer readers to the work of Weber et al. [51], and Harvey and Wang [20].

Results and discussion

The goal of our software is to provide exploration utilities to enable domain experts and researchers to examine the simulation data, generate new insights, and help to build and test hypotheses. In this section we provide some examples to illustrate the basic utility and the potential of our software by applying it to two data sets described below.

Conclusions

We propose a visual analytics platform for analysis of massive, high-dimensional MD simulation data sets in a collaborative environment. The platform is built upon the idea of the topological landscape as a metaphor for the high-dimensional (potential) energy landscape. This landscape metaphor preserves important topological and geometric properties of the input data. More importantly, it provides a platform for users to explore and navigate the high-dimensional simulation data, allows them to

Acknowledgements

We thank the anonymous reviewers for their helpful comments and the Ohio Supercomputer Center for generous computing resources. This work is partially supported by the National Science Foundation under projects DBI-0750891 and CCF-1319406.

References (54)

  • B. Alberts et al.

    Molecular Biology of the Cell

    (2002)
  • D. Attali et al.

    Vietoris–Rips complexes also provide topologically correct reconstructions of sampled shapes

    Comput. Geom.

    (2013)
  • C.B. Anfinsen

    Principles that govern the folding of protein chains

    Science

    (1973)
  • B. Bienfait et al.

    Checking the projection display of multivariate data with colored graphs

    J. Mol. Graph. Model.

    (1997)
  • S.P. Bhat

    Crystallins, genes and cataract

    Prog. Drug Res.

    (2003)
  • O.M. Becker et al.

    The topology of multidimensional potential energy surfaces: theory and application to peptide structure and kinetics

    J. Chem. Phys.

    (1997)
  • M. Belkin et al.

    Laplacian eigenmaps for dimensionality reduction and data representation

    Neural Comp.

    (2003)
  • J.D. Bryngelson et al.

    Funnels, pathways, and the energy landscape of protein folding: a synthesis

    Proteins

    (1995)
  • F. Chazal et al.

    Towards persistence-based reconstruction in Euclidean spaces

  • H. Carr et al.

    Computing contour trees in all dimensions

    Comput. Geom. Theory Appl.

    (2003)
  • S. Dasgupta et al.

    Random projection trees and low dimensional manifolds

  • D.L. Donoho et al.

    Hessian eigenmaps: locally linear embedding techniques for high-dimensional data

    Proc. Natl. Acad. Sci. U. S. A.

    (2003)
  • P. Das et al.

    Low-dimensional free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction

    Proc. Natl. Acad. Sci. U. S. A.

    (2006)
  • H. Edelsbrunner et al.

    Topological persistence and simplification

    Discrete Comput. Geom.

    (2002)
  • D.A. Evans et al.

    Free energy landscapes of model peptides and proteins

    J. Chem. Phys.

    (2003)
  • S.L. Flaugh et al.

    Contributions of hydrophobic domain interface interactions to the folding and stability of human gammaD-crystallin

    Protein Sci.

    (2005)
  • J.C. Gower

    Generalized procrustes analysis

    Psychometrika

    (1975)
  • Y. Hwang et al.

    A fast nearest neighbor search algorithm by nonlinear embedding

  • F.A. Hamprecht et al.

    A strategy for analysis of (molecular) equilibrium simulations: configuration space density estimation, clustering, and visualization

    J. Chem. Phys.

    (2001)
  • W. Harvey et al.

    Generating and exploring a collection of topological landscapes for visualization of scalar-valued functions

    Comput. Graph. Forum

    (2010)
  • P. Indyk et al.

    Low-distortion embeddings of finite metric spaces

  • J. Jester

    Corneal crystallins and the development of cellular transparency

    Semin. Cell Dev. Biol.

    (2008)
  • A. Jeyaprakash et al.

    Structure of a Survivin–Borealin–INCENP core complex reveals how chromosomal passengers travel together

    Cell

    (2007)
  • Jmol: An Open-Source JAVA Viewer for Chemical Structures in 3D

    (2013)
  • M. Karplus

    Behind the folding funnel diagram

    Nat. Chem. Biol.

    (2011)
  • S.V. Krivov et al.

    Free energy disconnectivity graphs: application to peptide models

    J. Chem. Phys.

    (2002)
  • I.V. Kalgin et al.

    Folding of a SH3 domain: standard and hydrodynamic analyses

    J. Phys. Chem. B

    (2009)