Information-Driven Modeling of Biomolecular Complexes

Proteins play crucial roles in every cellular process by interacting with each other, with nucleic acids, metabolites, and other molecules. The resulting assemblies can be very large and intricate and pose challenges to experimental methods. In the current era of integrative modeling, it is often only by a combination of various experimental techniques and computations that 3D models of those molecular machines can be obtained. Among the various computational approaches available, molecular docking is often the method of choice when it comes to predicting 3D structures of complexes. Docking can generate particularly accurate models when taking into account the available information on the complex of interest. We review here the use of experimental and bioinformatics data in protein-protein docking, describing recent software developments and highlighting applications for the modeling of antibody-antigen complexes and membrane protein complexes, and the use of evolutionary and shape information.


Introduction
Macromolecules such as proteins and nucleic acids are involved in most cellular functions responsible for maintaining life, performing their tasks in most cases by interacting with other molecules. Understanding these interactions is fundamental, not only to gain insight into the molecular machinery of living organisms but also to gather high quality information to drive innovation in, for example, protein engineering and drug design.
Theoretical approaches to the study of macromolecular interactions, such as docking, take full advantage of robust spatial search algorithms like rigid-body minimization, swarm optimization, and grid-based search methods (to name a few) to probe the interaction space of the molecular components of a complex. These algorithms can generate hundreds to tens of thousands of possible conformations (models), which must be scored and ranked [1]. The scoring and ranking of models are crucial steps in docking. Good scoring functions must be able to identify the valuable models, i.e., those that represent near-native conformations and should be analyzed further. This can be done in different ways: classical approaches use a molecular engine to evaluate energetic components such as intermolecular van der Waals, electrostatic, and desolvation energies. Alternatively, one might use statistical potentials derived from the analysis of known complexes. More recently, the field has seen a rise in the use of machine learning techniques for scoring [2][3][4][5][6].
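As a minimal sketch of the classical approach, a score can be written as a weighted sum of energy terms, and models can then be ranked by that score. The term names, the weights, and the example energies below are hypothetical and chosen only for illustration; they do not correspond to any specific docking program.

```python
# Hypothetical weights and term names, for illustration only.
WEIGHTS = {"vdw": 1.0, "elec": 0.2, "desolv": 1.0}


def weighted_score(terms, weights=WEIGHTS):
    """Linear combination of energy terms; lower is assumed to be better."""
    return sum(weights[name] * terms[name] for name in weights)


def rank_models(models, weights=WEIGHTS):
    """Return model names sorted from best (lowest score) to worst."""
    return sorted(models, key=lambda name: weighted_score(models[name], weights))


# Made-up energies for three docking models.
models = {
    "model_1": {"vdw": -40.0, "elec": -150.0, "desolv": 5.0},
    "model_2": {"vdw": -55.0, "elec": -300.0, "desolv": -2.0},
    "model_3": {"vdw": -20.0, "elec": -80.0, "desolv": 10.0},
}
ranking = rank_models(models)
```

Statistical potentials or machine-learned scoring functions would replace `weighted_score` in such a scheme while leaving the ranking step unchanged.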
Over the last years, computational methods to study macromolecular interactions have been steadily incorporating different types of data to guide, filter, or validate their predictions [7][8][9][10][11][12][13]. The use of various types of information in macromolecular docking is commonly referred to as integrative modeling and has been a convergence point in the field, being implemented in most software under active development [14]. Such information can be used a priori, to guide the spatial search step, or a posteriori, to aid in the scoring of models (Figure 1). A prime example is the Integrative Modeling Platform (IMP) [15], a renowned tool that can handle highly heterogeneous information. IMP has been used to unveil the structure and functional anatomy of a nuclear pore complex [16], the 26S proteasome holocomplex [17], and the molecular architecture of the yeast spindle pole body core [18]. Since the main focus of this review is on docking, IMP, which is formally not a docking platform, will not be discussed further.

Figure 1:
This figure illustrates the information-driven modeling of biomolecular complexes. The central panel shows various information sources; the left panel shows a docking protocol that uses the information only in the filtering stage, after sampling the interaction space; and the right panel shows an information-guided docking protocol that uses the data to bias the sampling and to score the resulting models.
To drive molecular docking, data from a variety of experimental or computational sources can be used. To mention a few, hydrogen-deuterium exchange experiments make it possible to identify regions of a target molecule that become protected from exchange upon binding, information which can be used to define the interface region [19]. Mutagenesis experiments can reveal whether a residue is essential for the interaction, but give no specific information about its position in the interface or the contacts it makes [20]. Cross-linking experiments detected by mass spectrometry provide specific information between pairs of residues in the form of maximum distances, the length of which depends on the nature of the cross-linking reagent used [21,22]. Interactions between molecules can also be studied by measuring the Förster Resonance Energy Transfer (FRET) that occurs when two fluorescently labeled proteins are in close proximity to each other [23]. FRET measurements can demonstrate the interaction between molecules in situ. Coupled with quantitative analysis, they can provide valuable information for modeling protein structures and their complexes [23].
Interface and distance information from experimental methods can be incorporated into docking to make sure that the resulting models match the experimental information. In principle, any method that can provide some kind of interface and/or distance information can benefit docking [1]. Since each technique has advantages, disadvantages, and limitations, most studies rely on information obtained with more than one experimental method. Whilst experimental data provide in principle higher quality information, access to experimental facilities and sample availability are often major limiting factors, especially for large-scale studies. In such cases, bioinformatics analysis can leverage large volumes of sequence information to yield valuable predictions about interfaces, as well as specific contacts. The latter can be extracted by statistical analysis of co-evolving residues in multiple sequence alignments [24].
In this review we focus on how different docking software use a variety of data in their predictions, as well as recent applications in the integrative modeling of biomolecular complexes, especially antibody-antigen and membrane complexes. We also discuss some recent developments in the use of evolutionary information and give an outlook on the use of shape information in biomolecular modeling.

Docking Software
HADDOCK (High Ambiguity Driven DOCKing) pioneered the use of experimental information in macromolecular docking [25]. It allows interface information to be entered as a set of active residues that have the highest probability of being part of the interaction interface and passive residues which are likely in the vicinity of the interface. This set of residues defines Ambiguous Interaction Restraints (AIR), associated with a maximum effective distance that draws the interfaces together without pre-defining their relative orientation. HADDOCK also supports the definition of specific distance restraints, e.g., from cross-linking MS experiments or co-evolution predictions. In the HADDOCK score used to rank the resulting models, a penalty is assigned for restraints that are not satisfied and combined with intermolecular energy terms.
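The way an ambiguous restraint can draw interfaces together without fixing their orientation can be sketched as follows. We assume here the 1/r^6 sum-averaged effective distance that, to our understanding, is described in the original HADDOCK publication [25]; the function names, the 2.0 Å upper bound, and the simple linear violation are illustrative assumptions, not the actual HADDOCK implementation.

```python
def effective_distance(distances):
    """Sum-averaged effective distance: (sum of d^-6) ^ (-1/6).

    `distances` holds all candidate atom-atom distances (in Angstrom)
    between one active residue and the residues it may contact on the
    partner. The result is dominated by the shortest distances.
    """
    return sum(d ** -6 for d in distances) ** (-1.0 / 6.0)


def air_violation(distances, upper_bound=2.0):
    """Amount by which the effective distance exceeds the upper bound.

    The upper bound value and the linear form are assumptions made
    for this sketch.
    """
    return max(0.0, effective_distance(distances) - upper_bound)
```

Because the sum runs over all candidate pairs, a single short distance is enough to satisfy the restraint, which is what makes it ambiguous: `air_violation([1.5, 20.0])` is zero even though one of the two pairs is far apart.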
Using the same format as for HADDOCK, a file containing restraint definitions can be submitted to ATTRACT [26]. ZDOCK accepts a list of contacting residues, which is used to filter out rigid-body docking solutions in which these residues are not near the partner molecule [27]. This a posteriori filtering is also done by ClusPro, which accepts distance restraints that are used to select conformations that match the available data. The score used to rank the resulting models, however, does not include any restraint term [28]. pyDock [13], another rigid-body docking method, offers a posteriori use of distance restraints through the pyDockRST module, which provides a score that combines the percentage of satisfied restraints with electrostatics and desolvation energies [29]. Distance restraints have also been implemented in the protein-peptide docking software CABS-dock and its web server [30,31].
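A posteriori filtering with distance restraints can be illustrated with a short sketch. The data layout (a dictionary mapping a hypothetical (chain, residue number) key to one representative coordinate per residue) and the function names are assumptions made for this example and do not reflect the input formats of ZDOCK, ClusPro, or pyDockRST.

```python
import math


def satisfied_fraction(coords, restraints):
    """Fraction of restraints whose residue pair lies within its maximum distance.

    `coords` maps a residue key to an (x, y, z) tuple; each restraint is
    a (residue_a, residue_b, max_distance) triple.
    """
    if not restraints:
        return 1.0
    hits = sum(
        1
        for res_a, res_b, max_dist in restraints
        if math.dist(coords[res_a], coords[res_b]) <= max_dist
    )
    return hits / len(restraints)


def filter_models(models, restraints, min_fraction=1.0):
    """Keep the names of models that satisfy at least `min_fraction` of the restraints."""
    return [
        name
        for name, coords in models.items()
        if satisfied_fraction(coords, restraints) >= min_fraction
    ]


# Two hypothetical cross-links with a 30 Angstrom maximum distance.
restraints = [(("A", 10), ("B", 25), 30.0), (("A", 42), ("B", 7), 30.0)]
models = {
    "near": {("A", 10): (0, 0, 0), ("B", 25): (10, 0, 0),
             ("A", 42): (0, 20, 0), ("B", 7): (5, 25, 0)},
    "far": {("A", 10): (0, 0, 0), ("B", 25): (50, 0, 0),
            ("A", 42): (0, 20, 0), ("B", 7): (5, 25, 0)},
}
kept = filter_models(models, restraints)
```

The satisfied fraction could equally be combined with energy terms into a single score, in the spirit of what is described above for pyDockRST, rather than being used as a hard filter.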
One of the newest members in the family of integrative docking software is LightDock [32]. This software uses a swarm-based algorithm to distribute initial positions of the ligand relative to the receptor and is able to take into account interface information in different ways [7]. Given a set of interface residues on the receptor, the ligand swarms are positioned only around the defined interface region; each swarm is then optimized using the Glowworm Swarm Optimization algorithm. If the interface on the ligand side is also known, the molecule is rotated in relation to the receptor so that the specified residues face the receptor in the starting swarm orientations. The scoring of the models during the docking also reflects the interface information. The resulting conformations are filtered to include only those that are closest to the defined interface. Rather than assigning a penalty for unsatisfied restraints, in LightDock a bonus is defined based on the percentage of interface residues that are in contact with the binding partner.
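The idea of rewarding satisfied interface information, rather than penalizing violations, can be sketched as follows. This is not LightDock's actual code: the function names, the 5 Å contact cutoff, the use of one representative point per residue, and the additive way the bonus enters the score are all assumptions made for illustration.

```python
import math


def interface_contact_fraction(interface_coords, partner_coords, cutoff=5.0):
    """Fraction of declared interface residues with a partner point within `cutoff`.

    Both arguments are lists of (x, y, z) tuples, one representative
    point per residue (a simplification for this sketch).
    """
    if not interface_coords:
        return 0.0
    in_contact = sum(
        1
        for point in interface_coords
        if any(math.dist(point, other) <= cutoff for other in partner_coords)
    )
    return in_contact / len(interface_coords)


def biased_score(raw_score, fraction, bonus_weight=1.0):
    """Add a bonus proportional to the satisfied fraction (higher score assumed better)."""
    return raw_score + bonus_weight * fraction
```

For instance, with interface points `[(0, 0, 0), (10, 0, 0)]` and partner points `[(0, 0, 4), (30, 0, 0)]`, only the first interface residue is in contact, so the fraction is 0.5 and a raw score of 2.0 becomes 2.5 with the default weight.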
The Exhaustive Rotational Search based Docking (EROS-DOCK) [33] is another docking software that uses information to avoid sampling the subspace that does not satisfy a given restraint. It also belongs to the very exclusive group of docking software that can handle both restraints and multi-body docking (i.e., the modeling of complexes consisting of more than two components), together with the pioneer HADDOCK and ATTRACT [25,34]. EROS-DOCK, in contrast to grid-based docking methods, applies a quaternion π-ball that represents the space of all possible Euler rotations. This π-ball representation is systematically explored with the objective of locating the global minimum of pairwise docking energies, avoiding steric clashes. Here restraints can be defined as amino acid or atom pairs together with their maximum separation distance. These restraints are used to build a constraint π-ball which is then used to identify poses that will never satisfy the restraints and thus should be discarded. The application of this spatial sampling methodology to both Protein-Protein Docking Benchmark v4 [35] and a self-made multi-body benchmark resulted in higher quality models with than without restraints [9•].
Several docking software also support the use of density- or shape-related information, e.g., from cryo-Electron Microscopy (cryo-EM) or Small-Angle X-ray Scattering (SAXS). These are discussed in a separate section further down. A non-comprehensive list of docking software and which type of information each of them can use is presented in Table 1.
Notes to Table 1:
a) Residue-level information about specific residues that are important for the interaction, but without specific information about the contact they make or their position within the interface.
b) Contacts / distances between specific pairs of residues or atoms.
c) Density/shape information about the shape of the complex (e.g., from cryo-EM and SAXS) and topological information such as provided for example by the membrane for membrane-related complexes.

Recent Developments in Antibody-Antigen Docking
To investigate the possibilities and limitations of antigen-antibody docking, Ambrosetti et al. recently compared the performance of ClusPro, HADDOCK, LightDock, and ZDOCK [54], four commonly used software packages that can make use of knowledge about the antibody hypervariable loops. The software packages were tested in three setups: using information about the hypervariable loops but no information about the epitope, with a low-resolution definition of the epitope, and with the real interface observed in experimental structures. HADDOCK outperformed the other software both in the quality of the generated models and in success rate, and a detailed protocol has been made available [55].
The RosettaAntibody and Rosetta SnugDock methods for antibody structure prediction and antibody-antigen docking have also been made more robust, with a simplified user interface, an expanded and automated template database, options to model single-domain antibodies with a more generalized kinematics engine, and new loop modeling techniques [50]. Very recently, an updated, extended, and more diverse benchmark for antibody-antigen docking was published, including binding affinities. In relation to docking, the authors compared ZDOCK, ClusPro, and Rosetta SnugDock and highlighted the challenges posed by monoclonal antibodies, interactions with glycoproteins, and camelid nanobodies [56•].

Moving into the Membrane
The majority of docking software has been developed and benchmarked on soluble complexes. Membrane proteins (MPs) and their complexes, which are extensively involved in vital biological processes, have so far received rather limited attention. Many MPs act as receptors and are involved in signal transduction pathways. Understanding how such proteins interact with other macromolecules is therefore of great value for drug discovery. The fast development of molecular crystallography techniques such as in situ data collection, microcrystallography, cryo-EM, and other state-of-the-art experimental techniques [57] has shed new light on MPs [58].
Meanwhile, macromolecular complexes involving MPs can now also be modeled by docking using experimental data and/or the information provided by the membrane itself. Several docking software have implemented specialized protocols for modeling membrane protein complexes, including DOCK/PIERR [38], Memdock [48], and RosettaMP [53]. In a recent publication, LightDock was combined with HADDOCK in a novel approach to model membrane-associated protein assemblies [47••]. This protocol proposes a way to study the interaction between an MP and a free ligand that is not bound to the membrane by using the "meta-information" of the membrane topology. The latter is taken from the MemProtMD database [59], which uses coarse-grained molecular dynamics simulations to produce a theoretical model of the membrane architecture around the protein. LightDock is able to leverage this "meta-information" by taking into account the simulated membrane topology, considering only the phosphate atoms and limiting the search space to search points that are outside the membrane. The membrane itself is also used to penalize models that would penetrate it.
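How membrane topology can restrict the search space and penalize penetrating models can be sketched with a deliberately simplified geometry. We assume here that the membrane normal is aligned with the z axis and that the membrane can be approximated by a slab bounded by the lowest and highest phosphate beads; the function names and the atom-counting penalty are hypothetical and do not reproduce the published LightDock protocol.

```python
def membrane_slab(phosphate_beads):
    """Approximate the membrane as a slab spanning the z range of the phosphate beads."""
    z_values = [z for _, _, z in phosphate_beads]
    return min(z_values), max(z_values)


def outside_membrane(points, slab):
    """Keep only the search points that lie above or below the slab."""
    z_min, z_max = slab
    return [p for p in points if p[2] < z_min or p[2] > z_max]


def penetration_penalty(ligand_atoms, slab, weight=1.0):
    """Penalty proportional to the number of ligand atoms inside the slab."""
    z_min, z_max = slab
    buried = sum(1 for _, _, z in ligand_atoms if z_min <= z <= z_max)
    return weight * buried


# Made-up phosphate bead positions (Angstrom) and candidate search points.
beads = [(0, 0, -20), (5, 5, 20), (1, 1, -19.5), (2, 2, 19)]
slab = membrane_slab(beads)
starting_points = outside_membrane([(0, 0, 35), (0, 0, 0), (3, 3, -40)], slab)
```

A curved or tilted membrane would require a local test against nearby beads instead of a global slab; the slab is used here only to keep the example short.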
While LightDock's membrane protocol is designed to dock soluble ligands to transmembrane proteins, JabberDock recently introduced an approach for modeling transmembrane dimers [46•]. This method uses surfaces derived from Spatial and Temporal Influence Density (STID) maps, which represent the dynamics and electrostatics of a protein based on a short Molecular Dynamics (MD) simulation. In the docking process, the surface complementarity of the two interaction partners is then maximized. For membrane proteins, the MD simulations are performed on the individual proteins embedded in a membrane, so that the properties of the TM region are captured in the resulting STID maps [46].

Evolution to the rescue
A large portion of the software that supports the definition of restraints uses the classical definition of pairwise distances, which, although a reductive interpretation of highly complex experiments such as cross-linking and mutagenesis, has been producing high-quality models. Distance information to be used in docking can also be extracted from co-evolution analysis of pairs of proteins [60,61]. A docking software that focuses on the integration of evolutionary information is InterEvDock [44,45,62]. In a recent work, the InterEvDock group used the targets from the community-wide Critical Assessment of PRediction of Interactions (CAPRI) [63] rounds 38-45 to explore the extent to which evolutionary information can be used to model protein-protein complexes [62•]. By deriving recurrent interface features from homologous interfaces, applying techniques that were used for covariation-based folding, and using template-based docking, they were able to generate acceptable or better models in the top 5 predictions for 11 targets. Template-based approaches also fall into the category of evolutionary information since they are based on homology. These have been discussed elsewhere [64].
Not only can co-evolution analysis be used to derive distance restraints for the modeling of complexes, but it can also be applied to predict which proteins interact on a proteome-wide scale, as recently demonstrated on E. coli genome sequence data [65]. In this work, a logistic regression model was derived using a set of true positive interacting pairs with known structures as well as non-interacting pairs, based on yeast-two-hybrid experiments. The focus of the analysis was cell membrane protein interactions, which include approximately 1.25 million potential pairs. Using their EVcouplings framework [24] the authors could reveal 529 novel protein interactions and their interacting residues. The latter were then used in HADDOCK to model the predicted complexes.
Additionally, co-evolving mutations usually represent key residues involved in physical coupling; these can be determined from an analysis of multiple sequence alignments and provide valuable evolution-based information for docking [66]. The concept is not novel and was already introduced in the '90s by Valencia and co-workers [67], but recent tools like pydca [61] and various web servers such as EVcouplings [24] and RaptorX ComplexContact [68] are simplifying its use. Many of those servers have participated in the CASP contact prediction experiment, whose results have been discussed in the related CASP assessment papers [69].

Use of shape information in macromolecular docking
Although cryo-EM provides increasingly high-resolution electron density maps, these are not always sufficient to obtain a complete model of a biomolecular complex at an atomic level, and this is even more true in the case of cryo-Electron Tomography (cryo-ET). The use of EM densities has been implemented in several rigid-body fitting methods, from the grid-based tool CoLoRes in Situs [70] to MultiFit [71]. FlexEM, which can be run by the MODELLER software [39], combines rigid-body fitting to a cryo-EM density with refinement, where parts of the structure are kept rigid, for example the secondary structure elements [72]. FlexEM can also incorporate distance restraints like those described above. These methods usually fit one protein at a time into the density, thus neglecting the intermolecular interactions in this process.
Docking methods that actually account for the interface of a complex while guiding the modeling with cryo-EM data include ATTRACT-EM [36] and HADDOCK [11,42]. In ATTRACT-EM, the resolution of the density map is reduced for the initial fitting of the components, after which the top models are refined in the original map [36]. In HADDOCK, centroids are first placed within the density map and (ambiguous) distance restraints are used to draw each molecule into its predicted position within the density. During the following refinement steps, the molecules are then restrained by the EM density itself. These programs can simultaneously apply classical distance restraints in addition to cryo-EM restraints.
Lower resolution shape information can be provided by Small Angle Scattering (SAS) methods such as Small-Angle X-ray or Neutron Scattering (SAXS/SANS) [73]. Scattering data from such experiments can be used in several docking methods, including SASREF [74], pyDockSAXS [49,75], IMP's FoXSDock [40,41], ATTRACT-SAXS [37], and HADDOCK [76]. All of these methods use SAXS data to filter models, selecting them based either directly on the fit (χ) of their theoretical scattering curve to the experimental one, or by integrating the χ value in a more generalized score. ClusPro and RosettaDock SAXS filter models in a similar way, but disregard χ in the final scores (and thus ranking) that they return [51,52]. Some can also incorporate a radius of gyration (derived from SAXS data) restraint, as implemented in HADDOCK.
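The fit between a theoretical and an experimental scattering curve can be sketched as follows. We assume one common reduced form of the discrepancy, χ² = 1/(N-1) Σ [(I_exp(q_i) - c·I_calc(q_i)) / σ_i]², with c the least-squares scale factor; individual programs may use a different normalization or fit additional parameters, and the function names below are hypothetical.

```python
import math


def scale_factor(i_exp, i_calc, sigma):
    """Least-squares scale factor c that best maps the calculated curve onto the data."""
    num = sum(e * m / s ** 2 for e, m, s in zip(i_exp, i_calc, sigma))
    den = sum(m * m / s ** 2 for m, s in zip(i_calc, sigma))
    return num / den


def chi(i_exp, i_calc, sigma):
    """Discrepancy between experimental and scaled calculated intensities.

    All three sequences must be sampled at the same q values; `sigma`
    holds the experimental errors.
    """
    c = scale_factor(i_exp, i_calc, sigma)
    residuals = sum(((e - c * m) / s) ** 2 for e, m, s in zip(i_exp, i_calc, sigma))
    return math.sqrt(residuals / (len(i_exp) - 1))


# Made-up intensities: the first model differs from the data only by a scale factor.
i_exp = [100.0, 80.0, 50.0, 20.0]
sigma = [1.0, 1.0, 1.0, 1.0]
chi_good = chi(i_exp, [10.0, 8.0, 5.0, 2.0], sigma)
chi_bad = chi(i_exp, [10.0, 10.0, 10.0, 10.0], sigma)
```

Models can then be filtered by a χ threshold or ranked by χ directly, or χ can enter a combined score as one more weighted term.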
SAXS scattering curves are also commonly translated into bead representations consisting of a set of dummy atoms to visualize the shape of a molecule or complex, with tools such as those available for example from the ATSAS software suite [77,78]. To our knowledge, the only method that makes use of such bead representations for protein-protein docking is ATTRACT-SAXS, where the search space is constrained by an atom density mask derived from a bead model [37]. Analogous applications have been reported in the field of small-molecule data, where the binding pocket may be defined as a set of 3D points [73]. This information can be used to define restraints that guide the small molecule to the correct position. Such bead models could provide a very versatile manner of representing a variety of experimental data such as SAS, low to medium resolution EM data, or any kind of volumetric data, and we expect that they will find their way into macromolecular docking in the near future. Note that shape information has been used in IMP as low-resolution representations of components for which no high-resolution 3D structures are available [79], as well as in LightDock, which uses a very simplified bead representation on the membrane (see above).

Conclusions
An increasing variety of both experimental and predicted information can be used in the modeling of protein-protein complexes. While classical (ambiguous) distance restraints work well in many scenarios, other creative ways of accounting for data on the interface as well as other characteristics of the complex are being developed. Protocols harvesting specific information for a given type of complex, such as those involving membrane-embedded proteins and antibody-antigen interactions, make it possible to generate models with increasing accuracy. With the explosion in genomic data, information extracted from sequences is now increasingly used in several evolution-centered docking approaches. Finally, a growing number of methods now allow shape-based information to be incorporated to ensure that the global shape of the generated models matches the experimentally derived shapes. All these developments clearly underscore the continuous increase in the use of information to drive the modeling of biomolecular complexes in the current era of integrative structural biology.
[46•] The authors tackle the docking of transmembrane protein complexes, a category of interactions that is poorly represented in the PDB (4% of total structures), mostly due to the challenges of their experimental determination. JabberDock uses a shape representation of the membrane protein structure derived from short molecular dynamics simulations. On a self-made benchmark of 20 alpha-helical transmembrane proteins, JabberDock achieves a success rate of 75%, the highest observed so far for transmembrane docking.
[47••] Using LightDock, a flexible framework for the determination of protein complexes based on the Glowworm Swarm Optimisation algorithm, the authors describe a protocol that accounts for the topological information provided by the membrane to guide the docking process. The resulting models are refined with HADDOCK to remove clashes. This work expands the capabilities of LightDock as integrative modelling software (see also reference 7).
[56•] The prediction of antibody-antigen complexes has been a challenge for the field of computational biology and is of great interest for the development of pharmaceuticals. In this publication an expanded benchmark is presented, containing more than double the number of targets and binding affinities in comparison to previous benchmarks. The performance of several docking software and binding affinity predictors is compared.