Modeling of protein-peptide interactions using the CABS-dock web server for binding site search and flexible docking

Protein-peptide interactions play essential functional roles in living organisms and their structural characterization is a hot subject of current experimental and theoretical research. Computational modeling of the structure of protein-peptide interactions is usually divided into two stages: prediction of the binding site at a protein receptor surface, and then docking (and modeling) the peptide structure into the known binding site. This paper presents a comprehensive CABS-dock method for the simultaneous search of binding sites and flexible protein-peptide docking, available as a users friendly web server. We present example CABS-dock results obtained in the default CABS-dock mode and using its advanced options that enable the user to increase the range of flexibility for chosen receptor fragments or to exclude user-selected binding modes from docking search. Furthermore, we demonstrate a strategy to improve CABS-dock performance by assessing the quality of models with classical molecular dynamics. Finally, we discuss the promising extensions and applications of the CABS-dock method and provide a tutorial appendix for the convenient analysis and visualization of CABS-dock results. The CABS-dock web server is freely available at http://biocomp.chem.uw.edu.pl/CABSdock/


INTRODUCTION
Despite the significant progress in experimental and theoretical studies of proteinpeptide interactions, the understanding of their role in the cellular machinery remains quite limited. Over the years it has become clear that understanding a particular protein function as a combination of its functional domains is not complete [1]. It has been shown that in higher eukaryotes up to 40% of protein-protein interactions (PPIs) are mediated by peptides [2]. Peptides responsible for PPIs are not necessarily independent molecules, but more often appear as disordered regions within proteins (at termini, between domains or flexible loops) that can act as a separate peptide molecule. The view that proteins can be understood through their discrete segments has already provided important insights into protein function [1]. Especially, protein-peptide interactions can be found in intracellular signaling pathways, cell localization, immune response, and protein degradation. Their new functional roles are constantly being discovered [3]. Importantly, many of these interactions are implicated in human diseases such as cancer or autoimmune disease [4][5][6][7]. Therefore, structure-based studies directed toward the design of completely new or modified receptor-interacting peptides have become a hot spot of current biomedical research.
In comparison to PPIs, protein-peptide interactions are more transient and interaction affinity is significantly weaker. Together with the high conformational flexibility of peptides, these factors make structural characterization of protein-peptide complexes really challenging. Therefore, there is an urgent need for the development of complementary computational approaches, such as effective molecular docking [8]. Assuming that the structure of a protein receptor has been solved experimentally or modeled with good accuracy, the modeling protocol for searching new protein-peptide interactions usually has two or three major steps: (1) The first step involves identification of the binding site on the protein surface. This goal can be accomplished by bioinformatics methods using data from already known protein and protein-peptide structures or simply protein sequences [9][10][11][12]. They mostly aim at creating a library of sequence, structure or surface landscape motifs that could be universally detected in unknown proteins [13][14][15]. It should be noted that due to its simplicity, information only about sequence patterns that occur within binding sites is not sufficient for accurate binding site prediction and could result in a high ratio of falsepositives [1].
(3) Third, those methods for local peptide docking may also serve in the final modeling step: high resolution refinement of initially generated peptide poses.
The first two steps of modeling protein-peptide interactions can also be achieved using techniques for the combined search of binding sites and peptide poses [13,22,23]. Usually, these methods allow the identification of a binding site, although the quality of resulting peptide models is often unsatisfactory [2].
Recently, we have developed a CABS-dock method and a web server for the simultaneous prediction of binding sites and protein-peptide docking [24]. The CABS-dock simulation engine, based on the coarse-grained CABS model, enables efficient docking search of fully flexible peptides over the entire surface of flexible proteins in a reasonable time, typically 1 to 8 CPU hours (which is thousand-fold shorter than analogical simulations using rapid molecular dynamics adapted to peptide docking [22]). CABS-dock has been extensively tested over the largest benchmark set of non-redundant protein-peptide interactions available to date (including docking to bound and unbound receptor forms). For over 80% of bound and unbound dataset cases, we obtained high or medium accuracy models (expected to be of sufficient accuracy for high resolution refinement) [24].
In comparison to other protein-peptide docking tools (listed above), the CABS-dock offers the following major advantages: (1) the method does not require knowledge of the binding site nor any information about the peptide conformation, (2) during docking peptide conformation is allowed to be fully flexible, and (3) it is possible to simulate significant conformational changes of the protein receptor structure (see section 3.2.1). These advantages become even more apparent in comparison to general purpose protein-ligand docking tools, which are usually less efficient in sampling conformational changes than methods dedicated to protein-peptide docking. The possible CABS-dock disadvantages include: (1) lack of option to guide the docking with the knowledge of the binding site (this will be available in the next CABS-dock update planned in 2015), however, it is possible to exclude some receptor areas from the docking search, and thereby to enforce more effective search in a closer neighborhood of the potential binding site (see section 3.2.2); (2) a small set of 10 best scored models may not show the high accuracy models (that may be present in the large set of CABS-dock predicted models), however, this is also the case for the other docking methods (scoring problem is discussed in section 3.4).
In this work, we evaluate CABS-dock performance and focus on particular examples of protein-peptide docking. The examples discussed illustrate CABS-dock performance using the default server mode as well as its advanced options. We also address the possibility of improving CABS-dock performance using an external scoring method over a large set of CABS-dock generated models. This can be achieved by a two-step procedure involving: (1) reconstruction and local optimization of CABS-dock models, followed by their (2) scoring using short simulations by all-atom molecular dynamics with explicit solvent. An Appendix is also provided with this paper where we provide a tutorial for the display and analysis of CABS-dock models and trajectories using VMD [25], a molecular graphics program.

METHOD
The CABS-dock method [24] is based on the CABS model (described in detail in ref. [26]) that was originally designed for the structure prediction of globular proteins and simulation of protein dynamics. CABS comes from the letters of pseudo-atoms used to represent a single protein residue: carbon alpha (CA), carbon beta (B) and side chain (S). An additional pseudo-atom, defined in the geometrical center of the virtual CA-CA bond, is used to define the main chain hydrogen bonds. To speed up calculations, the coordinates of the CA atoms are restricted to the beads of a dense cubic lattice with lattice spacing arbitrary set to 0.61Å. The remaining pseudo-atoms are located off the lattice and follow the movement of the main chain. The force field is based on knowledge-based statistical potentials derived from structural regularities seen in known protein structures. Sampling is controlled by the asymmetric Metropolis criterion. Additionally, CABS uses the Replica Exchange protocol for better sampling coverage of the energy landscape.
The CABS model was initially used for protein structure prediction and it performed exceptionally well in CASP6 (Critical Assessment of protein Structure Prediction, a community-wide blind test of structure prediction approaches). Using the CABS-based approach, the Kolinski-Bujnicki group scored best or second best, depending on the evaluation method [27,28]. CABS-based protocols for the de novo and consensus prediction of protein structure were made freely available to the academic community on an automated web server [29]. The CABS model has also been successfully used to simulate the dynamics of denatured protein states [30], protein folding mechanisms [31][32][33][34][35][36], the flexibility of globular proteins [37][38][39] and its influence on protein aggregation [40]. Finally, the CABS model has been optimized for the investigation of protein interactions and prediction of structures of protein complexes. It has been used to build a model of human telomerase [41], protein-peptide and protein-protein docking [42][43][44][45] and to investigate the mechanism of simultaneous folding and binding of an intrinsically disordered peptide [46].
The pipeline of the CABS-dock server is a multistage protocol that consists of multiple programs and associating scripts, with the CABS model (version dedicated to handle multimeric protein chains) at its center. As shown in Figure 1, the whole procedure consists of four main stages 1) flexible docking by the CABS algorithm; 2) initial filtering of probable solutions from all generated models; 3) further selection of representative models by the clustering protocol and 4) reconstruction to all-atom representation and local optimization of final models. The method is fully automated. As an input it requires only the 3D structure of the receptor and the peptide sequence. On output the server returns 10 top scored models of the protein-peptide complex. Each step of the procedure is briefly described in the following paragraphs. Sets of models generated with the pipeline are available for download from the "Docking prediction results" tab ("Download all files" button).

Flexible docking with the CABS model
The CABS-dock method requires two inputs: (1) amino acid sequence of the peptide, and (2) 3D structure of the protein receptor (the obligatory and non-obligatory input recommendations are listed in [24]). In the first CABS-dock modeling step, 10 copies of the protein-peptide system are generated as starting models for the Replica Exchange Monte Carlo sampling method. Each starting copy contains a random peptide structure that is placed in a random position within 20 Å from the input receptor structure (see Figure 2, left). During simulation of coupled peptide binding and folding, the CABS-dock protocol allows full flexibility of the peptide and small fluctuations of the receptor backbone. The simulation result is the set of 10,000 models (visualized in Figure 2, middle) in the C-alpha representation, collected in 10 trajectories. Each trajectory counts 1000 models and shows system evolution for each replica. Three stages are shown: 10 starting random peptide conformations (left), 10,000 models (middle), 10 final models (right). The protein receptor is showed as gray surface, peptides in cyan. The picture shows the human androgen receptor as an example (PDB ID: 2AM9).

Initial filtering
From the 10,000 models generated during the simulation up to 1000 are selected for further steps in the following procedure: 1) all unbound states (where interaction energy between the peptide and the receptor is zero) are rejected 2) from the remaining models up to 100 from each of the 10 trajectories are picked by the lowest interaction energy. This procedure usually retains the best of the generated models in the filtered set (see example in Figure 3). for the entire complex structure, while the bottom panels for protein-peptide interaction only. The left panels present the data for all models generated in the CABS-dock simulation (10,000), while the right panels for 1000 models selected in the initial filtering procedure (see the pipeline, Figure 1). Ligand-RMSD is the root mean square deviation calculated on the peptides after superposition of the receptor molecules. Note that for each prediction case a similar plot can be easily created by the user (see Appendix for instructions).

Clustering
The 1000 filtered models are subsequently grouped into clusters in the k-medoid clustering procedure. The clustering is run 100 times with different initial seeds and k=10. Consensus medoids are selected as the final models. The density of the clusters (defined as an average difference between cluster elements divided by the number of elements) is used to finally rank the models. Ligand-RMSD (root mean square deviation calculated on the peptides after superposition of receptor molecules) is used as the differentiation measure between cluster elements.

Reconstruction and local optimization of the final models
The last step of the CABS-dock protocol (also discussed in section 3.3) is the all-atom reconstruction of the final models using the MODELLER [47] program. 10 medoids serve as templates for the reconstruction and optimization procedure in the DOPE potential [48]. As a result, energy minimized all-atom representations of the final models are obtained.

Docking without prior knowledge of the binding site
The recent publication on the CABS-dock server [24] describes CABS-dock performance (with default server settings) over a large dataset of peptide-protein complexes, including docking to bound and unbound (when available) forms of protein receptors. For over 80% of bound and unbound cases we obtained high or medium accuracy models (see Figure  4). The quality assessment criteria are based on ligand-RMSD (root-mean-square deviation) between the predicted model and the experimental peptide structure after superimposition of the receptor molecules (high accuracy: ligand-RMSD<3 Å; medium accuracy: 3 Å ≤ ligand-RMSD ≤ 5.5 Å; low accuracy: ligand-RMSD > 5.5 Å). The percentages shown refer to the best quality models found in the set of all 10,000 generated models (all) and in the set of 10 top-scored models (top 10). More details can be found in our recent publication [24] and in online materials at http://biocomp.chem.uw.edu.pl/CABSdock/benchmark.
In the next subsections, detailed analysis of modeling cases selected from the benchmark dataset [24] is presented. All CABS-dock predictions for the entire benchmark dataset are available for download from http://biocomp.chem.uw.edu.pl/CABSdock/benchmark

An example when the accurate model is top ranked
The example described below has been obtained with the default CABS-dock server settings using the following input data:  Peptide sequence: SSRFESLFAG  Peptide secondary structure: CHHHHHHHHC  Receptor input structure: PDB ID: 2AM9, crystal structure of human androgen receptor in the unbound form Reference structure used for the calculation of ligand-RMSD values to the experimentally determined peptide-bound structure:  Peptide-receptor complex structure: PDB ID: 1T7R, crystal structure of human androgen receptor in complex with the peptide.
In 10 final models, peptides were docked to five different binding sites (see Figure 5). In two of the models (including the first, top-ranked model) the peptide was bound in the correct binding site. As shown in Table 1, the one top ranked (representative of the most dense cluster) is the most accurate. More details about cluster content can be displayed under the "Clustering details" tab (see Figure 6). The figure shows an experimental protein peptide-bound form (the receptor is colored in gray, peptide in magenta, PDB ID: 1T7R) together with CABS-dock-predicted peptide poses (colored in cyan). Top 10 CABS-dock models are presented, which were docked in five potential binding sites (one of the peptide models is not visible because it is docked at the opposite receptor surface). In the native binding site (marked by the rectangle), two models were docked. One of these models (which is the top-ranked first model, representative of the top ranked cluster, see Table 1) is presented on the right with the experimental peptide structure. Ligand-RMSD between the first model and the experimental peptide structure is 2.22 Å. Table 1. Example details of structural clustering. Clusters are ranked according to cluster density. In the present case the most dense cluster is the most numerous one (226 models out of 1000) and the most similar to the experimental model. The table shows data for the prediction case described in Figure 5. Ligand-RMSD values are also presented (root mean square deviation calculated on the peptides after superposition of receptor molecules).  In this case model number 828 from the first trajectory is shown, which is assigned to the first, the most dense cluster (see also Table 1). This model is the best among the 1000 top scored models (ligand-RMSD to the experimental peptide structure: 1.93 Å).

An example when a medium accuracy model is top ranked and a high accuracy model exists in the trajectory
Example 3.1.1 shows the most favorable situation when a high-accuracy model was identified as the first model in the final top 10. However, due to the high complexity of the problem, high-accuracy models may be missing among the top 10 models, but they may exist in simulation trajectories (in a set of 10,000 models).
The example described below was obtained with the default CABS-dock server settings using the following input data:  Peptide sequence: RRNLKGLNLNLH  Peptide secondary structure: CCCCCCCCCCCC  Receptor input structure: PDB ID: 2B9F, crystal structure of mitogen-activated protein kinase FUS3 in the unbound form Reference structure used for the calculation of ligand-RMSD values to the experimentally determined peptide-bound structure:  Peptide-receptor complex structure: PDB ID: 2B9H, crystal structure of mitogenactivated protein kinase FUS3 in the peptide-bound form.
In the 10 final models, peptides were docked to five different binding sites; however, only one model (representative of the second cluster) was close to the experimental peptide binding site with a ligand-RMSD value of 4.16 Å to the reference peptide structure in the 2B9H complex (see Figure 7 and Table 2). A more accurate model can be found in the set of all 10,000 models generated by CABS-dock with a ligand-RMSD value of 2.33 Å. In such modeling cases, external scoring methods may be useful for fishing out the best accuracy models from the large pool of generated models (see section 3.4). , together with CABSdock-predicted peptide poses (colored in cyan) and the most accurate prediction from the entire trajectory (in green). 10 top ranked models have peptides docked in five different areas. One of these areas is the native binding site (marked by the rectangle) with a medium-accuracy model (model 2, ligand-RMSD to the experimental structure: 4.16 Å). On the right, the native binding site is shown with the most accurate prediction from all simulation data (ligand-RMSD between the predicted and experimental peptide structure: 2.33 Å). Table 2. Example details of structural clustering (as displayed in the CABS-dock server, column containing ligand-RMSD to the experimental form added). Clusters are ranked according to cluster density. The cluster whose medoid is the most similar to the experimental model is the most numerous one, but ranked as second (according to cluster density). The table shows data for the prediction case described in Figure 7.

CABS-dock modeling with advanced options
Together with its basic functionality, CABS-dock enables advanced modification of docking simulation settings. This is done using two advanced options for: (1) increasing the level of flexibility for selected receptor fragments, (2) excluding user-selected binding modes from docking search.

Increasing the flexibility of receptor fragments
The advanced option for increasing flexibility for selected receptor residues is available from the main page by checking the "Mark flexible regions" option.
For each selected residue, the user may choose from two preset settings: moderate or full flexibility. Technically this is achieved by changing the default distance restrains (used to keep the receptor structure near to the input conformation). The assignment of moderate flexibility decreases the strength of restrains, while full flexibility assigned removes all the restraints imposed on the selected residue.
Below, we describe a practical example of using the "Mark flexible regions" option with the following input data: According to the experimental studies [49] the unbound form of biotin binding protein has a flexible loop close to the binding site. Using the CABS-dock "Mark flexible regions" option, we selected 10 residues (from 45 th to 54 th ) forming the flexible loop and assigned the "fully flexible" setting to those residues.
As shown in Figure 8, the initial position of the loop in the unbound form would prevent correct binding of the peptide. Assigning full flexibility to the loop residues enabled us to uncover the binding site and to obtain a high accuracy model (ligand-RMSD of the first topranked model to the experimental structure: 2.03 Å).

Excluding binding modes from docking search
In the default mode, CABS-dock allows peptides to explore the entire receptor surface. However, in certain modeling cases it is known that some parts of the protein are not accessible (for example due to binding to other proteins) and therefore it could be useful to exclude these regions from the search procedure. Such a feature is available by selecting the "Mark unlikely to bind region" option on the main page. This option works in two ways: 1) by selecting the residues to be excluded (available from the main page by checking the "Mark unlikely to bind regions" option) 2) by re-submitting a previously run job (a Resubmit button is available in the "Project information" tab) and selecting resulting models (binding modes) to be excluded from future results. Thus, in practice, the excluding option can be also used to force the CABS-dock algorithm to search for additional binding sites not found in the previous runs.
Below, we describe a practical example of the excluding option by re-submitting a previously run job. The first simulation run resulted in 10 top-ranked models having peptides bound mostly in a single area far from the native binding site (see Figure 9). Excluding these peptide poses (by re-submitting a previously run job) resulted in new predictions among which the first top-ranked model was consistent with experimental structure. , green (2 poses from the first run not excluded from the re-submitted job) and in cyan (result of the re-submitted job). In the native binding site (marked by the rectangle), two models were docked. One of these models (which is the top-ranked first model, representative of the top ranked cluster) is shown on the right with the experimental peptide structure. Ligand-RMSD between the first model and the experimental peptide structure: 2.89 Å.

All-atom reconstruction and local optimization 3.3.1 Accessing CABS-dock models in coarse-grained representation
The main output of the CABS-dock server is a set of 10 top scored models in all-atom representation. These models are representatives of a broad set of 10,000 initial models and are selected through stepwise filtering and clustering procedures (see Figure 1 and Methods section). From the "Docking predictions results" tab, a compressed archive containing the complete set of CABS-dock generated models in coarse-grained representation can be downloaded. The archive consists of the following sets of models (see the Appendix for a detailed description of the archive file):  model_*.pdb -10 final models, numbered from 1 to 10 (in the PDB file format and all-atom representation)  cluster_*.pdbcluster models (groups of models that have been classified in structural clustering to particular clusters), numbered from 1 to 10 (in the PDB file format and C-alpha representation). Cluster numbering corresponds to model numbering, i.e. model_7.pdb is a representative model of models grouped in the seventh cluster (ranked as seventh) (cluster_7.pdb).  trajectory_*.pdbcomplete set of 10 trajectories, numbered from 1 to 10 (in the PDB file format and C-alpha representation). Each trajectory contains 1000 models.  top1000.pdbtop 1000 models (selected for further clustering and analysis) from the 10 trajectories (in the PDB file format and C-alpha representation) All the sets of models in coarse-grained representation can be reconstructed to physically-sound all-atom models, as we demonstrated in [34,50]. Those reconstructed models can be later used in applications that require all-atom details (e.g. studies of energy properties using all-atom modeling tools, or as an input for all-atom analysis and visualization).

Strategies for reconstruction and local optimization
Various strategies can be employed for the reconstruction of protein models from Calpha trace to all-atom models. Successful approaches in the application to CABSgenerated models are presented in [34,50]. The reconstruction of atomic details is typically a two-stage process, where at first only the backbone atoms are reconstructed [51]. In the next step the remaining atoms of the side chains are added to the backbone and a complete model is subsequently optimized to remove structural inaccuracies, such as improper bond lengths and angles, or steric clashes.
There are several methods available for protein backbone reconstruction from C-alpha coordinates using fragments taken from known protein structures [51][52][53][54][55][56][57][58]. An important attribute of the method chosen for backbone rebuilding should be insensitivity for small (unphysical) local distortions of C-alpha distances present in CABS generated models [34,50]. Such distortions are for example well-handled using the approach of Claessens et al. [53] or the ModRefiner tool [59] (which unfortunately doesn't handle multimeric protein chains). For the second rebuilding step, side chain reconstruction, the SCWRL program can be recommended [60].
The reconstruction of models should be followed by local structural optimization [61]. Ideally, a fully reconstructed model should undergo a short MD simulation with explicit solvent. However, for practical applications such an approach may be computationally too demanding, especially when thousands or more models need to be processed. Therefore, efficient algorithms handling the entire reconstruction and optimization process have also been proposed [54,59,62].
Rebuilding models after protein-peptide docking raises another challenge: dealing with multiple protein chains. However, this is rather a technical than conceptual problem as most of the above-listed tools are not capable of optimizing the structure of a multi-chain protein complex. In the CABS-dock server, we have implemented a procedure based on the Modeller package [47] for combined reconstruction and optimization. The optimization step is done by the minimization of the DOPE (Discreet Optimized Protein Energy) statistical potential [48]. The script for the Modeller reconstruction and optimization procedure is available in the CABS-dock online tutorial at http://biocomp.chem.uw.edu.pl/CABSdock/tutorial.

Database of large sets of CABS-dock models in an all-atom representation
Large sets of CABS-dock generated models can be used for the all-atom scoring of protein-peptide interactions or for the development of new scoring potentials. For such exercises, we have rebuilt to all-atom representation sets of top-scored models (comprising 100 or 1000 models) for the entire benchmark set of protein-peptide complexes (103 bound cases and 68 unbound cases that have been used for validation of CABS-dock performance [24]). The benchmark sets of all-atom models are available for download at http://biocomp.chem.uw.edu.pl/CABSdock/benchmark.

Scoring of predicted models 3.4.1 Results of scoring exercises using all-atom MD with explicit solvent
One of the main problems in modeling protein interactions is the scoring problem [8,63,64]. In general terms, it is usually a problem of selection of predicted models that are closest to the native (real) complex out of a large set of diverse models. There are three kinds of scoring functions: physics-based [65], empirical [66], and knowledge-based [67]. The former two calculate binding energy as a sum of individual energy terms. Usually, in the physics-based approach, summation of Lennard-Jones and Coulombic potential functions from popular force-fields is used. Empirical scoring functions involve a sum of weighted uncorrelated terms, trained to reproduce the known experimentally determined protein-peptide binding affinities. Knowledge-based scoring functions are based on statistical observations of intermolecular interfaces found in known protein complexes. In many physics-based scoring functions [68][69][70], water is not treated explicitly, and the solvent effect is either neglected or treated as a continuum dielectric medium. The golden standard to account for the solvent effect is to use explicit solvent molecules in an all-atom MD technique. While using this technique involves a sampling problem (due to the large system size and consequent timescale limitations), it can be implemented as a short (computationally inexpensive) simulation procedure in combination with coarse-grained simulations [36].
For five cases of unbound docking (from the benchmark results [24]), we examined the possibility of scoring 100 models (top-scored by CABS-dock) using all-atom MD simulations in explicit solvent with different force fields (OPLS [71], CHARMM [72], AMBER99SB [73], and GROMOS41a [74]) and water models (SPC [75] and TIP3P [76]). The sets of 100 models were subjected to the following procedure: (1) reconstruction and optimization using Modeller (see section 3.3), (2) short MD simulations in explicit solvent (simulation setup details are given at the end of this subsection), (3) binding energy calculation. The results are summarized in Table 3 (in terms of the lowest ligand-RMSD found in 10 top scored models) and compared to CABS-dock scoring. In general, all-atom force-fields provided better scoring results than the CABS-dock scoring scheme (except for the GROMOS43a1 force-field that performed better in two cases, comparably in one case and worse in two cases).  (9) 1.88 (7) 1.88 (8) 1.88 (5) Additionally, for the 1N83 modeling case, in Figure 10 we present ligand-RMSD as a function of binding energy according to the coarse-grained CABS force-field ( Figure 10a) and different all-atom force-fields (Figure 10b) (note that the final CABS-dock ranking is essentially based on structural clustering, not energetic scoring). As shown in the figure, the coarse-grained force-field provides a similar ranking of different quality models as allatom force-fields in terms of general tendencies, i.e. a few of near-native models (having ligand-RMSD values lower than 5 Å) are on average ranked better than the other models of lower quality. However, all-atom force-fields are more efficient in discriminating the lowest ligand-RMSD models from the near-native models. As presented in Table 3, in the case of GROMOS and OPLS, the best quality model was ranked among the 10 top-scored models. Figure 10. Scoring results for the 1N83 case using interaction energy values. (a) ligand-RMSD versus CABS energy plot for the 1N83 system. (b) ligand-RMSD versus all atom energies plots for the 1N83 system. Red, green, blue and magenta points correspond to AMBER99SB, GROMOS43a1, OPLS and CHARMM force fields, respectively.
Although the above scoring results are promising, larger studies will be required using entire benchmark results and/or larger sets of CABS-dock generated models: 1000 or 10,000 (in which the best quality models were not identified using CABS-dock scoring [24]).

Details of the all-atom MD scoring procedure
In the described scoring exercises, we used four popular force fields for all-atom modeling: OPLS [71], CHARMM [72], AMBER99SB [73], and GROMOS41a [74] in conjunction with water models: SPC (Simple Point Charge) [75] (for GROMOS43a1, AMBER99SB and OPLS) and TIP3P [76] (for CHARMM). As all-atom MD modeling results might be force-field and water-model dependent [77,78], it is always advisable to check different force-field schemes.
Simulations were carried out using a double precision version of the Gromacs-4.6 package [79]. The simulation details are as follows. The receptor-peptide system was placed in a dodecahedral box of a size that the minimal distance between the system and any periodic box edge was 1.0 nm, followed by filling up the box with water molecules. An appropriate number of ions were added to neutralize system charge. To avoid improper structures, the whole system was minimized with the steepest-descent method, before being equilibrated at 300K with two successive molecular dynamics runs, 10 ps each; the first one at constant volume (NVT equilibration), the second at constant pressure (NPT equilibration). The positions of the receptor and ligand were restrained by a biasing potential during NVT and NPT equilibrations for restrained moves of the receptor and ligand and unrestrained moves of water molecules. The pre-equilibrated conformations were used as starting structures for 20 ps production MD simulations. During a production run the ligand restraints were released, while the receptor was kept restrained. Temperature (300K) and pressure (1atm) were controlled by a v-rescale thermostat [80] and a Berendsen barostat [81], respectively. We used periodic boundary conditions and calculated electrostatic interactions by means of the Ewald particle mesh method [82]. Non-bonded interaction pair lists are updated every 10 fs, using a cutoff of 1.4 nm. All bond lengths are constrained with the linear constraint solver LINCS [83] enabling the integration of the equations of motion with a time step of 2 fs. Initial velocities of the atoms were generated from Maxwell distribution at 300K. Data analysis was done using corresponding Gromacs tools.
All-atom interaction energy was calculated as a sum of Coulomb and Lennard-Jones interaction energies between the receptor and ligand and averaged over all saved configurations of a simulation run. During the simulation run atomic coordinates were saved every 0.2 ps. We obtained an average speed of ~0.5ns/day/CPU (Intel(R) Xeon(R) E5649 2.53GHz). Approximately 1 CPU hour is required for a single model 20 ps simulation run (for multiple CPUs, a speed of up to 5-6 can be achieved).

SUMMARY
In this paper, we demonstrated the performance and functionalities of the CABS-dock method and web server (server details are described in [24]). The unique CABS-dock capabilities of efficient docking of fully flexible peptides to flexible proteins without prior knowledge of the binding site make it a promising approach for modeling protein-peptide interactions. The promising extensions of the CABS-dock method include: 1. Combination of CABS-dock as a tool for initial peptide pose generation with the methods for high resolution refinement of protein-peptide interactions (e.g. HADDOCK [19] or Rosetta FlexPepDock [17]). 2. Extension of the CABS-dock procedure by exact scoring methods (as discussed in section 3.4) for better selection of accurate models out of large sets of generated models. 3. Incorporation of experimental data into the modeling process (such as NMR shifts, amide proton exchange [84,85], phage display peptide libraries and mutational alanine scanning [86], or combination with other bioinformatics approaches enabling identification of binding site or any protein-peptide interaction features). 4. Increasing the flexibility of appropriate receptor fragments (not user defined as it is possible now, but automatic using for example the CABS-flex approach [37,39] or bioinformatics-based identification of flexible receptor fragments).
The potential applications of CABS-dock embrace a vast array of strategies for the characterization of protein-peptide but also protein-protein interactions (PPIs), important for modern drug design. CABS-dock is a method for global and cost-effective search for interaction 'hot spots' or 'hot segments' [87], or binding areas that dominate the interaction. It is also possible to search for specific interaction sites by applying a fragment of the protein or peptide of known binding function as a query. Such an approach could be applied for peptide design that could specifically block PPIs with proven involvement in human disease and thus improve drug development [88,89]. Importantly, many proteins engaged in PPIs are considered "undruggable" [90]. Such status has been assigned due to the lack of an internal cavity in these protein structures (that could be occupied by a small organic molecule), however they interact through flat and expanded binding sites, and thus may be targeted by peptide/protein therapeutics. CABS-dock prediction capabilities (and particularly the advanced option of excluding user-selected binding modes from docking search) can also be useful in computational epitope mapping, where multiple binding sites have to be considered [91]. It must be emphasized that the development of epitope-based vaccines and other antigen-based drugs is not possible without clear identification of the antigenic region involved in binding [92]. In vivo epitope mapping can be challenging due to difficulties in expressing and purifying antigens such as membrane proteins and complexes [93]; thus, progress in reliable computational methods is greatly anticipated.

APPENDIX
The authors acknowledge Dr Michal Jamroz for his valuable support in programming modeling scripts, crafting online documentation, constant improving of the CABS-dock systems and for cheerful discussions.
The Appendix contains a tutorial for CABS-dock results visualization and analysis in VMD [25] (a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting).

Introduction
This tutorial describes how to:  load all the necessary files into the VMD molecular graphics program  create simple graphic representations for viewing trajectories  align models from trajectories with a known receptor structure  calculate peptide RMSDs (root mean square deviation from the reference structure)  create plots from calculated RMSDs vs. energies from the CABS-dock results The following software is required:  VMD with an RMSD Trajectory Tool plug-in (included in VMD ver. 1.8.8 and higher).
VMD is user friendly yet advanced software for the analysis and visualization of structure and molecular dynamics trajectories of biological systems. In this tutorial some features are omitted, e.g. VMD for basic protein analysis: Sequence Viewer, Contact Map or Ramachandran Plot. Those plug-ins are very well documented on the VMD website.
 gnuplot . gnuplot is software for creating high quality plots from user data.
Both programs are free to use, available for major operating systems and fairly easy to install.

CABS-dock result files and reference structure
In this tutorial, we used an example prediction run described in the manuscript (in section 3.1.1). The example was created using the following input data:  Peptide sequence: SSRFESLFAG  Peptide secondary structure: CHHHHHHHHC  Receptor input structure: PDB ID: 2AM9, crystal structure of human androgen receptor in the unbound form (without a peptide) Reference structure used for the calculation of RMSD values to the experimentally determined peptide-bound structure:  Peptide-receptor complex structure: PDB ID: 1T7R, crystal structure of human androgen receptor in complex with the peptide Typically the structure of the bound complex is unknown, so one of the resulting models may be used as a reference (for example the top scored model: model_1.pdb).
First, from the "Docking predictions results" tab, download a compressed archive with the output data for further analysis. The archive contains the following files:  model_*.pdb -10 final models, numbered from 1 to 10 (in the PDB file format and all-atom representation)  cluster_*.pdbcluster models (groups of models that have been classified in structural clustering to particular clusters), numbered from 1 to 10 (in the PDB file format and C-alpha representation). Cluster numbering corresponds to models numbering, i.e. model_7.pdb is a representative model for models grouped in the seventh cluster (ranked as seventh) (cluster_7.pdb).
 trajectory_*.pdbcomplete set of 10 trajectories, numbered from 1 to 10 (in the PDB file format and C-alpha representation). Each trajectory contains 1000 models.
 top1000.pdbtop 1000 models (selected for further clustering and analysis) from the 10 trajectories (in the PDB file format and C-alpha representation)  input.pdbinput structure of the receptor  READMElog file with all information to recreate the simulation  energy.txtlog file with energy values (from the coarse-grained CABS force-field) for all resulting models; columns contain the following data (in the order listed): -

Changing graphical representations
This section describes how to change the default representation (see screenshot on the right) to a more convenient one.

Chain identification
First, it is necessary to find chain identifiers.

From the VMD Main window Graphics>select
Representations….
In the Graphical Representations window choose Selections tab >Keyword list >chain.
The characters that appear in the Values section are chain IDs of the molecules in the currently selected molecule (Selected Molecule list).

RMSD analysis
This part concerns calculating peptide-based RMSDs through 10,000 models of an example trajectory. It is also possible to view and save a simple RMSD vs. Frame plot by choosing Plot data.

Trajectory movie
After aligning receptor structures, it is easy to observe conformational sampling of the peptide.
In the OpenGL Display window click and hold the left mouse button to rotate representations to expose the peptide from the reference structure.
Buttons in the VMD Main window: Play forward and Play in reverse allow the user to view the whole trajectory as a movie. To adjust speed or frame skipping use the speed slider and step counter.
To find a particular frame use Step forward, Step in reverse or use the frame slider by clicking on it and holding the left mouse button to drag the frame marker.
gnuplot> set xlabel "RMSD" gnuplot> set ylabel "Interaction Energy" gnuplot> set title "Energy vs RMSD" gnuplot> plot "enermsd.txt" using 9:6 with dots notitle vs. RMSD", its X axis labeled "RMSD" and Y axis labeled "Interaction Energy". Numerical values in the last command indicate the columns used in format "X:Y", so changing it to 8:9 will create a RMSD vs. frame plot and to 8:6 will create an Interaction energy vs. frame plot. Do not forget to also change the appropriate filename, title and labels. For other options please check the gnuplot documentation.
See the aforementioned figures below: