CHAPERONg: A tool for automated GROMACS-based molecular dynamics simulations and trajectory analyses

Molecular dynamics (MD) simulation is a powerful computational tool used in biomolecular studies to investigate the dynamics, energetics, and interactions of a wide range of biological systems at the atomic level. GROMACS is a widely used free and open-source biomolecular MD simulation software recognized for its efficiency, accuracy, and extensive range of simulation options. However, the complexity of setting up, running, and analyzing MD simulations for diverse systems often poses a significant challenge, requiring considerable time, effort, and expertise. Here, we introduce CHAPERONg, a tool that automates the GROMACS MD simulation pipelines for protein and protein-ligand systems. CHAPERONg also integrates seamlessly with GROMACS modules and third-party tools to provide comprehensive analyses of MD simulation trajectories, offering up to 20 post-simulation processing and trajectory analyses. It also streamlines and automates established pipelines for conducting and analyzing biased MD simulations via the steered MD-umbrella sampling workflow. Thus, CHAPERONg makes MD simulations more accessible to beginner GROMACS users whilst empowering experts to focus on data interpretation and other less programmable aspects of MD simulation workflows. CHAPERONg is written in Bash and Python, and the source code is freely available at https://github.com/abeebyekeen/CHAPERONg. Detailed documentation and tutorials are available online at dedicated web pages accessible via https://abeebyekeen.com/chaperong-online.


Introduction
Molecular dynamics (MD) simulation is a robust and valuable tool for studying the dynamic behavior, energetics, and interactions of diverse biological systems, including proteins, protein-ligand complexes, nucleic acids, and membrane lipids [1].These simulations provide insights-in full atomic details and at precise temporal resolutions-into the dynamics, stability, and functional properties of biomolecules, complementing experimental observations and providing guidance for further investigations [2,3].GROMACS [4] is a widely used MD simulation software.It is one of the gold standards for biomolecular simulation not only because of its efficiency, accuracy, and extensive range of simula-Fig. 1.An overview of the workflows and functionalities that CHAPERONg offers and automates.These include the entire GROMACS conventional MD simulation (left) and the steered MD-umbrella sampling (right) workflows, as well as several post-simulation analyses (middle).Ligand topologies are generated using parameters obtained from various servers/tools for the CHARMM, AMBER, GROMOS, and OPLS-AA force fields.Functionalities highlighted with various colors indicate those integrated with GROMACS but are offered or largely facilitated by CHAPERONg (blue) and other third-party tools such as DSSP (grey), PyMOL (red), g_mmpbsa (yellow), MD DaVis (green), etc. QA: Quality assurance, RMSD: Root mean square deviation, RMSF: Root mean square fluctuation, Rg: Radius of gyration, SASA: Solvent-accessible surface area, Hbond: Hydrogen bond, WHAM: Weighted histogram analysis method.
with working on the command line [5,6].In addition, setting up and running MD simulations requires a lot of manual tasks, making the process time-consuming, labor-intensive, and error-prone.Since many of these steps are repetitive and programmable, automation tools would greatly minimize manual user interventions, thereby improving efficiency and empowering users (beginners and experts) to focus on other aspects like optimization of parameters, data analysis, and result interpretation [9].
Furthermore, MD simulations typically generate huge amounts of trajectory data, but processing the data into particularly meaningful, relevant, and informative forms often requires programming and data analysis skills [10,11].Thus, beginner and intermediate users with limited or no such skills are only able to gain superficial insights from MD output data in spite of MD simulation being a computationintensive process.The ability to transform simulation data into more interpretable forms and consequently obtain optimally useful information would facilitate the gaining of relevant biological (structural and functional) insights from MD simulations [11].
In order to assuage the aforementioned challenges, several tools integrated into standalone or web-based graphical user interfaces (GUIs) have been developed to automate or simplify various steps or aspects of the GROMACS MD simulation process.The earliest GUI-based programs including GUIMACS [12], jSimMacs [13], and GROMITA [14] that offered some capability to carry out GROMACS MD simulation of protein (only) systems have not been updated for a long time, making them incompatible with recent GROMACS versions [6].Other GUI-integrated plugins like Dynamics PyMOL plugin [5,15], Enlighten2 (a PyMOL plugin and Python package) [16], and YAMACS (a YASARA plugin) [6] have such limitations as restrictions to the simulation of specific systems (protein only or protein complexes), support for only select force field(s), lack of trajectory analysis functions, non-trivial installation of dependencies, or the need to learn other software interfaces upon which they depend [16].Existing web-based interfaces include MDWeb [17] and WebGro [18]; both of which offer MD simulation over limited timescales, and CHARMM-GUI [19]; a toolkit for generating input files for MD simulations using the CHARMM force field.MDWeb does not support the simulation of protein-ligand complexes, and for Web-Gro, the support is limited to the GROMOS force field.VisualDynamics [8] and BioBB-Wfs [9] are two recent web-based initiatives that also offer MD simulations over limited duration.While they are excellent platforms, they, however, only provide basic analyses of simulation trajectories.They also do not offer advanced simulation workflows (such as biased or enhanced sampling simulations).
In this work, we present CHAPERONg, a comprehensive automated pipeline for GROMACS MD simulations and trajectory analyses.CHAP-ERONg is a command-line interface to GROMACS that automates and streamlines the entire MD simulation protocols for protein, proteinligand, and protein-DNA systems (Fig. 1).It supports ligand topology parameters obtained from popular external parameterization programs for the CHARMM, AMBER, GROMOS, and OPLS-AA force fields.CHAP-ERONg seamlessly integrates with GROMACS modules and third-party tools to enable an extensive automated workflow of up to 20 different post-simulation trajectory and end-point analyses.In addition, it automates the steered MD and umbrella sampling simulations, a biased enhanced simulation protocol often employed to overcome sampling limitations and investigate rare events.Thus, CHAPERONg would not only make MD simulation more accessible to beginner GROMACS users but also expand the toolset of experts by facilitating improved efficiency and providing a platform upon which advanced and customized analyses and scripting could be built.

Methods and code implementation
CHAPERONg has been developed using the Bash shell scripting and the Python 3 programming language.The framework and primary modules of the CHAPERONg source code were written using Bash shell scripting because GROMACS is a Linux-based software.This allows a seamless GROMACS-CHAPERONg integration and ensures that the only real dependency of CHAPERONg is simply a functional GROMACS installation.Thus, the entire MD simulation pipelines can be automatically executed without the need for installation of additional dependencies or software save those required by GROMACS itself.
Other modules of CHAPERONg which provide additional and advanced functionalities are written using Python.Various Python libraries are used including Numpy [20], Pandas and Scipy [21]; for data manipulation and numerical and scientific calculations, and Matplotlib [22]; for generating graphical plots and figures.PyMOL [23], ImageMagick, or ffmpeg is used for generating simulation movies.Secondary structure elements are analyzed using DSSP [24].Xmgrace is used for graph plotting and conversion.The MD DaVis package [11] is used for the construction of hydrogen bond matrices and interactive three-dimensional visualizations of free energy landscapes.Installation of CHAPERONg is achieved by simply running the install script provided in the package.To make all features offered by CHAPERONg easily accessible to users, an isolated Anaconda Python environment with all needed dependencies can be set up by running a conda setup script also provided in the package.
CHAPERONg offers automated GROMACS-based workflows for unbiased conventional MD simulation of protein(-only) and protein complex systems, using established and previously reported protocols [25,26].In addition, up to 20 automated analysis types covering system setup and simulation quality assurance analyses as well as postsimulation trajectory analyses are provided.A GROMACS-based workflow for the steered MD and enhanced umbrella sampling simulations for protein complexes are also automated [27,28,25].Automated quality assurance analyses and the WHAM (weighted histogram analysis method)-based free energy calculations [29,30] are also provided for the biased simulations.

CHAPERONg features and functionalities
CHAPERONg can be run in one of two modes of automation depending on the user's choice; either as full-auto or semi-auto.In the full-auto mode, all simulation steps and post-simulation analyses are automatically carried out based on the simulation type and user-provided parameters.This greatly reduces repetitive and tedious manual interventions, and the user is only prompted for inputs in a very few exceptional cases where automatic or pre-defined choices might not be trivial or suitable (e.g.determining the box size of an umbrella sampling simulation).The semi-auto mode still has most of the simulation and analyses automated, but the user is prompted more for inputs and confirmation of automatically selected choices to give more flexibility and control over the simulation parameters.

Automated conventional MD simulation
CHAPERONg offers automated MD pipelines for various systems, namely protein-only (including protein-protein complexes), proteinligand complexes, and protein-DNA complexes.For protein-ligand systems, the pipeline recognizes small molecule ligand topologies generated via popular ligand parameterization programs and webservers, including CGenFF (for CHARMM) [31], ACPYPE (for AMBER) [32], PRODRG2 (for GROMOS) [33], and LigParGen (for OPLS-AA) [34].The automated protocol is organized into 12 major steps, enabling the user to start or resume from any step of the simulation process.The minimum input files required to run CHAPERONg are the starting structure and appropriate GROMACS parameter (.mdp) files.

System preparation and quality assurance analyses
Once launched, CHAPERONg automatically runs through the conversion of the input structure file to the GROMACS format, generation of protein topology (and ligand topology, if applicable), definition of the simulation box, addition of ions to the system, energy minimization and NVT/NPT equilibration steps.For each of these steps, the user has full control over how the system is set up.The system and topology files are automatically updated accordingly, depending on the type of system.Following the energy minimization and equilibration steps, quality assurance analyses-such as plots of the progression of the potential energy, density, temperature, pressure, and other thermodynamic parameters-are run.These enable the user to monitor the convergence of the indices and, hence, the quality of the simulation system.

Hydrogen bonding analysis
Hydrogen bond (Hbond) analysis is often used in MD simulations of biomolecules for the investigation and understanding of protein structure, folding, function, and ligand binding as well as other biomolecular interactions [35].Hbond calculation in GROMACS is carried out using the gmx hbond module.Depending on the type of system, the numbers of intra-and inter-molecular Hbonds are calculated and plotted as a function of simulation time.Several other output files, such as the Hbond matrix and index files, are also generated and processed by CHAPER-ONg to parse them as input to other analyses, like the MD Davis-based interactive Hbond matrix calculations.

Principal component analysis
Principal component analysis (PCA) is a statistical technique used to reduce the high-dimensional simulation data-i.e., the coordinates of atoms over time-into a smaller set of orthogonal (principal) components.It helps to visualize the essential dynamics and conformational changes in the trajectory by identifying the most important collective motions in the system [43].PCA in GROMACS is carried out using the gmx covar and gmx anaeig modules.The principal components are also processed by CHAPERONg and parsed as input for further conformational analyses-e.g., as order parameters for constructing free energy landscapes.

Clustering analysis
Clustering in MD simulation is another common technique that is also used to reduce the complexity of trajectory data.It involves grouping similar conformations based on defined structural similarity, enabling the identification of dominant conformational states, dynamics, and transitions [44].The gmx cluster module in GROMACS carries out the analysis, and the automation by CHAPERONg maintains the flexibility and array of options it offers.Fig. 3A shows examples of two of the output data plots generated by the analysis.

Secondary structure analysis
Secondary structure (SS) analysis of MD simulation trajectory involves identifying and quantifying the protein secondary structure elements throughout the simulation.The SS analysis in GROMACS is carried out by the gmx do_dssp module which relies on the DSSP program [24] for the assignment of SS elements.In addition to the SS analysis plot featuring the default seven SS types assigned by DSSP (see Fig. 3B, left), CHAPERONg reprocesses the SS elements matrix data to generate a second copy of the plot containing only the four basic SS elements-helices, beta-sheets, turns and coils-as shown in Fig. 3B (right).This simplifies the appearance of the plot to aid its visualization and analysis.

Simulation movie
An MD simulation can be summarized into a movie, which is a collection of frames extracted at a specified interval from the trajectory.Simulation movies facilitate the analysis, interpretation, and communication of the simulation results [45].They provide an animated overview and visual representation of simulations, and can help to easily visualize the motions of regions of interest, such as active sites and pockets, or to observe conformational movements, interactions or displacement of ligands.The minimum requirement for CHAPERONg to For details of the simulation system and setup, see Supplementary Method section 3. (B) Secondary structure analysis of the human erythrocytic ubiquitin (PDB ID 4GD6) simulation trajectory.Two plots with the seven-SS-type (left) and the four-SS-type (right) representations are produced.For details of the simulation system and setup, see Supplementary Method section 2. create a simulation movie is PyMOL, a widely used molecular visualizer.CHAPERONg also utilizes either of the ImageMagick convert tool or ffmpeg (when either of them is detected on the user's machine) for improved movie quality.Supplementary Files S1 and S2 show two example movies generated by CHAPERONg for the example simulations of ubiquitin and ligand-bound KEAP1 Kelch domain, respectively.

Free energy landscapes
Free energy landscapes (FELs) provide insights into the energetics and stability of different conformational states in MD simulation trajectories.CHAPERONg offers three alternative automated ways for the construction of two-dimensional representations of the FEL (Fig. 4).These are enabled by the GROMACS gmx sham module for 2D visualizations (Fig. 4A), the CHAPERONg energetic landscape module for enhanced 2D visualizations (Fig. 4B), and the MD DaVis tool for interactive 3D visualizations (Fig. 4C).Each of these alternatives requires the user to specify two order parameters for the FEL calculations.Global parameters that describe the state of the system can be used as input, including RMSD, Rg, principal components, fraction of native contacts or number of Hbonds, backbone dihedral angles and configurational distance, etc. [46].Two preset pairs of order parameters-principal components from a PCA run and the RMSD-Rg pair-are available in CHAP-ERONg.A third option that allows the user to provide other quantities of interest as input is also available.
The FEL calculation by CHAPERONg employs a modified version of a previously described method [46,47].The relative free energies of states are estimated using Boltzmann inversion as shown in Equation (1).The relative free energy of the most probable state is set to zero while other states are computed to have more positive relative free energies.For all the three approaches, CHAPERONg also automates the extraction of the lowest energy structures from the FELs, as well as other FEL-guided structures or frames specified by the user.
where  is the Boltzmann constant,  is the simulation temperature,   () is the probability of the system being in a particular state  characterized by some reaction coordinate  (quantities of interest) and is obtained from a histogram of the MD data,   () is the probability of the most populated bin (i.e., most probable state), and Δ  is the free energy change of the state .

Kernel density estimation
Kernel density estimation (KDE) is a non-parametric technique used to estimate the probability density function (PDF) of a given dataset.This technique utilizes a smooth function, using the Numpy and Scipy Python libraries, CHAPERONthe kernel, centered at sampled datapoints or bins.The Gaussian kernel is one of the commonly used kernels.Given a sample  =  1 ,  2 , … ,   with an unknown density  at any given point .The kernel density estimator of the shape of the function  is defined as shown in Equation (2).
standing of the SASA data other than the time-dependent information provided in Fig. 2D.

Interactive hydrogen bond matrix
CHAPERONg automates the integration of the MD DaVis tool with GROMACS for the construction of a Hbond matrix.To achieve this, CHAPERONg prepares a reference (the first structure from the trajectory) and the Hbond list (from the Hbond index file produced by gmx hbond).These files are then parsed as input to MD DaVis to produce an interactive .htmlplot (Fig. 6) that gives detailed information about the Hbond contacts recorded in the trajectory [11].

Binding free energy calculations using g_mmpbsa
The g_mmpbsa [48] is a widely used tool that integrates MD simulation with MM-PBSA binding free energy calculation for protein complexes.It also carries out the decomposition of the calculated free energies into contributions per residue [48].CHAPERONg automates and streamlines the workflow for these calculations.Since the original g_mmpbsa is only compatible with GROMACS versions 5.x (or lower) and does not support the more recent and upgraded versions, it has become a common practice for users to install the older GROMACS version as a second copy for use by g_mmpbsa.Thus, the user would need to provide CHAPERONg with the path to the appropriate gmx executable.However, the g_mmpbsa code has recently been updated by other people [49] and is supposed to support newer GROMACS ver-Fig.6.An interactive hydrogen bond matrix generated with MD DaVis using the KEAP1 Kelch domain MD simulation trajectory as an example.For details of the simulation system and setup, see Supplementary Method section 1. sions.In this case, there would be no need to specify any path and CHAPERONg would automatically call the active GROMACS.

Averaged plot of replica analysis plots
It is a common practice to conduct replica MD simulations of a system, yielding multiple independent trajectories with a higher probability of a wider sampling of the conformational space.Typically, the analysis of the simulations is carried out as means of the replica runs to obtain statistically reliable data, ensure reproducibility, and provide error estimates [50].CHAPERONg offers a way to automatically generate averaged plots of multiple replica analysis plots, such as the replica plots of the RMSD, Rg, RMSF, SASA, number of hydrogen bonds, or some other user-provided replica plots.

Steered MD and umbrella sampling simulations
The steered MD-umbrella sampling simulation workflow is a powerful technique for estimating the free energy of binding for protein complexes [51,27,52], and for studying ligand unbinding pathways [28,53].This involves a pulling simulation driven by a biasing potential along a given reaction coordinate (Fig. 7).Umbrella sampling simulations are then carried out on a series of configurations in different sampling windows (Fig. 8A).A technique such as the WHAM is finally used to de-bias the system, calculate the potential of mean force (PMF), and consequently, estimate the free energy of binding (Fig. 8B).This entire workflow is streamlined and automated by CHAPERONg as briefly described below.

System preparation
Depending on the type of system being simulated, the protein and ligand topologies are generated.Then, a placeholder cubic unit cell is generated and the user is interactively guided to adjust the box and center-of-mass dimensions in an iterative visualize-and-adjust manner (using a molecular visualizer such as PyMOL).This is followed by the solvation, ion adding, energy minimization, and equilibration steps.Several system setup quality assurance analyses are then carried out.

Steered MD simulation and movie
Steered MD simulation involves the pulling apart of the defined pulled and reference groups (illustrated in Fig. 7A).Examples of some of the output files are shown in Fig. 6, including plots of each of the displacement of the pulled group and the pull force against time (Figs.7B and 7C) and a plot of the pull force against the displacement (Fig. 7D).In addition, a movie of the pulling simulation is also generated.Supplementary File S3 and Supplementary File S4 show example movies for protein-protein and protein-ligand steered MD simulations, respectively.Using the PyMOL interface, the user can customize the renderings in the movie, and then re-run CHAPERONg to effect the modifications.

Umbrella sampling
Coordinates are extracted from the steered MD trajectory and the COM distance for each frame is calculated using the gmx distance module.Based on user-specified spacing, CHAPERONg further uses the distances to identify the starting configurations for the umbrella sampling simulations.Umbrella sampling is then iteratively run for each sampling window.

Example test cases
Four detailed tutorials using example test cases are available online at dedicated web pages accessible via https://abeebyekeen .com/ chaperong -online -tutorials.These include individual tutorials for: 1. Protein-only systems MD simulation.2. Protein-ligand complex MD simulation.3. Protein-ligand Umbrella sampling simulation.
In addition, two studies that utilized CHAPERONg have recently been published [54,55] while this article was under review.These works demonstrate the application of CHAPERONg for GROMACS MD simulations in drug discovery projects.

Conclusions
In this work, we have developed CHAPERONg, an easy-to-use opensource software that automates the GROMACS MD simulation pipelines for conventional unbiased MD, steered MD, and enhanced umbrella sampling simulations for diverse biomolecular systems.It also offers automated extensive system setup, post-simulation quality assurance analyses, and comprehensive trajectory analyses.Thus, CHAPERONg makes MD simulation more accessible to users who have limited experience working with the command line or lack programming skills.It also enables users to gain more insights into MD simulation data by providing an interface that overcomes the technical barriers to processing and analyzing trajectory data.We aim to continuously enhance the usability of CHAPERONg based on users' feedback.Future updates would include additional functionalities to expand the capabilities of the software.

Declaration of competing interest
Authors declare no competing interest.

Fig. 2 .
Fig. 2. Some trajectory analysis plots generated from the MD simulation of the Kelch domain of the KEAP1 protein (PDB ID 4IQK) as an example.Analyses include the (A) Root mean square deviation (RMSD), (B) Root mean square fluctuation (RMSF), (C) Radius of gyration, and (D) Solvent accessible surface area (SASA).For details of the simulation system and setup, see Supplementary Method section 1.

Fig. 3 .
Fig. 3. Example output plots of clustering and secondary structure (SS) analyses of MD simulation trajectory.(A) Sizes of clusters (left) and time-dependent distribution of cluster members (right) for the clustering analysis of the ligand-bound Kelch domain of the KEAP1 protein (PDB ID 4IQK) MD simulation trajectory.For details of the simulation system and setup, see Supplementary Method section 3. (B) Secondary structure analysis of the human erythrocytic ubiquitin (PDB ID 4GD6) simulation trajectory.Two plots with the seven-SS-type (left) and the four-SS-type (right) representations are produced.For details of the simulation system and setup, see Supplementary Method section 2.

Fig. 4 .
Fig. 4. Examples plots of the free energy landscapes (FELs) generated using the free Kelch domain of the KEAP1 protein (PDB ID 4IQK) MD simulation trajectory as an example.(A) A 2D plot of the FEL based on principal components generated with gmx sham.(B) A CHAPERONg-based enhanced 2D plot of the FEL using Rg and RMSD as order parameters.(C) An interactive 3D visualization of the Rg-RMSD FEL generated with MD DaVis.For details of the simulation system and setup, see Supplementary Method section 1.

Fig. 7 .
Fig. 7. Example steered MD simulation pulling a ligand way from the KEAP1 Kelch domain.(A) Illustration of the pulling simulation.The pulled group (ligand) is shown in magenta sticks, and the restrained reference group (receptor) is shown in green cartoon.(B-D) Plots of (B) pull force against simulation time, (C) displacement of the pulled group (the ligand) with time, and (D) pull force against the displacement of the pulled group.For details of the simulation system and setup, see Supplementary Method section 4.

Fig. 8 .
Fig. 8. Analysis of an example steered MD and umbrella sampling simulations of the ligand-bound KEAP1 Kelch domain.(A) Histograms of the umbrella sampling simulations.(B) Potential of mean force (PMF) curve of the ligand unbinding obtained via WHAM calculations.For details of the simulation system and setup, see Supplementary Method section 4.