Conformational Sampling in Template-Free Protein Loop Structure Modeling: An Overview

Accurately modeling protein loops is an important step in predicting three-dimensional structures and in understanding the functions of many proteins. Because of their high flexibility, modeling the three-dimensional structures of loops is difficult and is usually treated as a “mini protein folding problem” under geometric constraints. In the past decade, there has been remarkable progress in template-free loop structure modeling due to advances in computational methods as well as the steadily increasing number of known structures available in the PDB. This mini review provides an overview of recent computational approaches for loop structure modeling. In particular, we focus on approaches for sampling loop conformation space, which is a critical step in obtaining high-resolution models with template-free methods. We review the potential energy functions for loop modeling, loop buildup mechanisms to satisfy geometric constraints, and loop conformation sampling algorithms. Recent loop modeling results are also summarized.


Introduction
A loop, also called a coil, is a flexible segment of contiguous polypeptide chain that connects two secondary structure elements in a protein. Loop regions play critical roles in protein function: they form parts of the catalytic sites of enzymes [1], contribute to molecular recognition [2][3][4], and participate in ligand binding sites [5][6][7]. As a result, accurate prediction of loop conformations in proteins is important for a variety of structural biology applications, including determining the surface loop regions in comparative modeling [8], defining segments in NMR spectroscopy experiments [9], designing antibodies [10], identifying function-associated motifs [11], and modeling the dynamics of ion channels [12,13].
According to the loop length distribution illustrated in Figure 1, 93.2% of loops have lengths ranging from 2 to 16 residues, although loops can sometimes stretch much longer. Nevertheless, due to their high flexibility, loop regions are usually more difficult to model and analyze than other secondary structures such as helices or strands. Indeed, in many (complete) protein models derived from computational methods, the loop regions, particularly the long ones, contribute a large share of the error [77]. In an early attempt at loop modeling, Flory [14] assumed that the backbone torsion angles of one residue are random or, more precisely, statistically independent of the backbone torsions of its neighbors. However, a growing body of experimental [15], evolutionary [16], and statistical [17] data has shown that loops are far from random and that the influence of nearby residues in sequence is strong enough to account for substantial changes in the overall structure of loops. Figure 2 shows the ϕ-ψ propensity maps of Leucine in loops when hydrophobic residues (ILE and VAL) are present as neighbors at different distances. The backbone dihedral angle conformations of Leucine correlate strongly with the types of residues at the nearest and second nearest neighboring positions. However, the influence of residues at more distant positions is much weaker. The ϕ-ψ propensity maps of Leucine with ILE and VAL as neighbors three positions away are almost indistinguishable from that of singlet Leucine, indicating that influences from neighboring loop residues three or more positions away are negligible. Moreover, studies have demonstrated that identical peptide segments can adopt completely different structures in different proteins [18,19].
Hence, in addition to the residues in a loop, the residues surrounding the loop are also important in determining its conformation, particularly for a loop deeply embedded in the protein structure. Furthermore, the distance between the anchor points that the loop spans in the rest of the protein likely influences the loop conformation as well, particularly when the loop is short.

CSBJ
In general, loop structure modeling methods can be categorized into template-based (database search) methods and template-free (ab initio) methods. The template-based methods [22][23][24][25] search the PDB for loop structure templates that fit the geometric and topological constraints of the loop stems. These methods depend heavily on the quality and number of known structures in the PDB. Because the number of possible loop conformations grows exponentially with loop length, the template-based methods are limited to relatively short loops. In contrast, the template-free methods can avoid this problem by sampling loop conformation space guided by energy functions. In this mini-review, we focus on the template-free methods only.
The template-free loop modeling problem is regarded as a "mini protein folding problem" [47] under geometric constraints, such as loop closure and avoidance of steric clashes with the remainder of the protein structure. As in the protein folding problem, most loop structure prediction methods include the modeling steps of coarse-grained sampling, filtering, clustering, fine-grained refining, and ranking. During the coarse-grained sampling step, guided by knowledge- or physics-based energy functions, the loop conformation space is explored to produce a large ensemble of reasonable, coarse-grained models satisfying geometric constraints. These coarse-grained models usually use reduced representations for loop structures, such as ϕ-ψ angles, backbone atoms, Cα atoms only [59], or side chain centers of mass [68]. Afterward, the coarse-grained models are filtered to eliminate the unreasonable ones in the ensemble [60], and representative models are then selected by a clustering algorithm to reduce redundancy. These representative models are used to build fine-grained models in the refining phase, usually guided by a more accurate energy function associated with more structural information such as side chains and hydrogen atoms. Finally, in the ranking phase, the final models are assessed and the top-ranked ones are reported as the predicted results [62]. Among all these modeling steps, the coarse-grained sampling phase is of particular importance: if the sampling process cannot reach conformations close enough to the native structure, a high-resolution near-native model is unlikely to be obtained in the end. Moreover, the success of sampling relies on the underlying energy (scoring) functions, which are required to provide not only accurate but also sensitive guidance for exploring the protein loop conformation space.
A great deal of work has been done on modeling protein loops since the late 1960s. Given the limited length of this mini review, it is not our intention to provide a thorough review of loop modeling approaches. Instead, we focus on the recent computational sampling approaches developed for protein loop structure prediction using template-free methods. We emphasize the important factors that affect loop conformation sampling efficiency, including energy functions for modeling loops, loop buildup algorithms to satisfy geometric constraints, and coarse-grained sampling algorithms. The results of recent work in loop structure prediction are also summarized.

Energy Functions for Loop Modeling
According to Anfinsen's thermodynamic hypothesis [26], the native conformation of a protein has the minimum Gibbs free energy among all accessible conformations. As in the general protein folding problem, many loop modeling efforts focus on minimizing the protein potential energy described by physics-based energy functions. Zhang et al. took advantage of the concept of "colony energy" by considering the loop entropy, an important component in flexible loops, as part of the total free energy.
Since the main goal in loop structure prediction is to model loop conformations with high accuracy instead of describing the underlying physics [47], an alternative approach to assessing the correctness of a loop conformation is to use knowledge-based energy functions. The rationale of a knowledge-based energy function is to obtain a "pseudo energy" based on statistical preferences of conformations for different geometries, as obtained from the database of known protein structures. Compared to the physics-based energy functions, the knowledge-based energy functions have several attractive advantages. First, the knowledge-based energy functions implicitly capture interactions that are difficult to model in physics-based energy functions. Second, the knowledge-based scoring functions usually do not require all-atom information for the loops, which is ideal for rapidly generating coarse-grained models. Third, the knowledge-based potentials tend to be "softer" and tolerate structural imperfection, allowing better handling of uncertainties and deficiencies in computer-generated models.
Sippl's potentials of mean force approach [43] is one of the most notable methods to obtain knowledge-based energy functions. According to the inverse-Boltzmann theorem, the knowledge-based energy potential for a feature f is calculated as

$$E(f) = -kT \ln\left(\frac{P_{\mathrm{obs}}(f)}{P_{\mathrm{ref}}(f)}\right)$$

where k is the Boltzmann constant, T is the temperature, $P_{\mathrm{obs}}(f)$ is the observed probability in the database of known structures, and $P_{\mathrm{ref}}(f)$ is the probability of the reference state. Possible features to which a pseudo-energy term can be assigned include pairwise atom distances, torsion angles, amino acid contacts, side chain orientation, solvent exposure, or hydrogen bond geometry.
Although quite a few energy functions derived in different ways are available for loop structure modeling, no single energy function is currently accurate enough to always differentiate near-native structures from incorrect ones across all protein loops. Figure 3(a) [46], and backbone torsion potential using triplets [17]. None of these energy functions can correctly identify a near-native decoy (< 1.0 Å) with the lowest energy value, although some near-native decoys exhibit low energy values in various energy functions. Figure 3(b) shows the loop decoy structures with the lowest energy values in different energy functions.
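As a minimal sketch of the inverse-Boltzmann idea, taken here as E(f) = -kT ln(P_obs(f) / P_ref(f)), the Python snippet below turns feature counts into pseudo-energies. The bin names, the counts, the pseudocount smoothing, and the value of kT are all assumptions made for illustration; none of them come from the methods reviewed here.

```python
import math
from collections import Counter

KT = 0.593  # kcal/mol near 298 K; an assumed value for illustration


def pseudo_energy(observed, reference, pseudocount=1.0, kt=KT):
    """Return E(f) = -kT * ln(P_obs(f) / P_ref(f)) for every feature bin.

    A pseudocount is added to each bin so that empty bins do not
    produce log(0); this smoothing is an illustrative choice.
    """
    bins = set(observed) | set(reference)
    n_obs = sum(observed.values()) + pseudocount * len(bins)
    n_ref = sum(reference.values()) + pseudocount * len(bins)
    energies = {}
    for b in bins:
        p_obs = (observed.get(b, 0) + pseudocount) / n_obs
        p_ref = (reference.get(b, 0) + pseudocount) / n_ref
        energies[b] = -kt * math.log(p_obs / p_ref)
    return energies


# Hypothetical counts of a torsion-angle feature in three coarse bins.
observed = Counter({"alpha": 600, "beta": 300, "other": 100})
reference = Counter({"alpha": 400, "beta": 400, "other": 200})
energies = pseudo_energy(observed, reference)
```

Bins observed more often than the reference state predicts receive negative (favorable) pseudo-energies, and under-represented bins receive positive ones.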

Loop Closure
The computer-generated loop models produced during the sampling process must satisfy the loop closure condition, i.e., the endpoints (C- and N-termini) of a loop model must seamlessly bridge the anchored endpoints (C- and N-anchors) of the given protein structure. Figure  [49] generalized the applicability of the analytical solutions to 6 not necessarily consecutive torsion angles in peptides of any length, while also allowing small perturbations in bond angles and peptide torsion angles. The random tweak method [55] applies small random changes to ϕ-ψ angles and then uses an iterated linearized Lagrange multiplier algorithm to satisfy the loop closure constraints with minimal conformational perturbations. Wriggling [56] takes advantage of the linear dependency of every four angles of rotation to keep the combined motion of the loop localized. The cyclic coordinate descent (CCD) algorithm [57] treats the loop closure problem as an inverse kinematics problem: it fixes one loop endpoint at one anchor and then iteratively modifies the ϕ-ψ angles in sequential order to minimize the distance between the other loop endpoint and the target anchor. The Full CCD (FCCD) algorithm [59] extends the applicability of CCD to a reduced loop representation with Cα atoms only by using a singular value decomposition-based optimization of a general rotation matrix. The bi-directional inverse kinematics method [58] adopts a "meet in the middle" strategy by generating half-loops from both the C- and N-anchors and then assembling the endpoints of the half-loops, which is particularly suitable for modeling long loops. In [60], Soto et al. provided a comparison of effectiveness and computational performance among various loop closure algorithms.
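The inverse-kinematics view behind CCD can be sketched on a deliberately simplified problem. The Python example below closes a planar (2D) chain of rigid links onto a target point by adjusting one joint angle at a time. It is an illustration of cyclic coordinate descent in general, not the published protein algorithm [57]: real loop closure works on ϕ-ψ torsions in 3D and must match three backbone atoms at the anchor. The function names, link lengths, starting angles, and target are hypothetical.

```python
import math


def forward(angles, lengths, origin=(0.0, 0.0)):
    """Return the joint positions of a planar chain.

    Each angle is the turn relative to the previous link, so changing
    angles[i] rotates everything downstream of joint i about that joint.
    """
    x, y = origin
    heading = 0.0
    points = [(x, y)]
    for angle, length in zip(angles, lengths):
        heading += angle
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points


def ccd_close(angles, lengths, target, tol=1e-3, max_sweeps=500):
    """Adjust one joint at a time so the chain end approaches the target."""
    angles = list(angles)
    gap = float("inf")
    for _ in range(max_sweeps):
        for i in range(len(angles)):
            points = forward(angles, lengths)
            pivot, end = points[i], points[-1]
            to_end = math.atan2(end[1] - pivot[1], end[0] - pivot[0])
            to_target = math.atan2(target[1] - pivot[1], target[0] - pivot[0])
            # Rotating about the pivot so the end lies on the pivot-target
            # ray minimizes the end-target distance for this single joint.
            angles[i] += to_target - to_end
        end = forward(angles, lengths)[-1]
        gap = math.hypot(end[0] - target[0], end[1] - target[1])
        if gap < tol:
            break
    return angles, gap


lengths = [1.0] * 6          # six rigid links of unit length (assumed)
target = (3.0, 1.5)          # position of the fixed "anchor" (assumed)
closed_angles, gap = ccd_close([0.3] * 6, lengths, target)
```

Because each single-joint update can only reduce or preserve the end-to-target distance, the gap never grows from one step to the next, which is part of what makes this family of methods attractive for closure.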
The above methods ensure loop closure but do not consider other geometric constraints such as steric clashes. Several methods have been proposed to account for additional geometric constraints in loop modeling. Xiang et al. [51] imposed a non-bonded energy term on the iterated Lagrange multiplier in the random tweak method to avoid steric clashes while simultaneously satisfying loop closure. Liu et al. [61] designed a self-organizing algorithm that performs fast weighted superimpositions of rigid fragments and adjusts distances between random atom pairs to resolve steric clashes, so that not only loop closure but also steric, planar, chiral, and even experimentally derived constraints can be satisfied simultaneously.

Loop Conformation Sampling
Loop conformation sampling is usually done by sampling backbone torsion angle conformations with deterministic or statistical sampling methods. In practice, it is not computationally feasible to sample all combinations of discretized torsion angles for a relatively long loop. Indeed, a large portion of these torsion angle combinations are infeasible due to steric clashes, failure to close the loop, excluded volume for side chains, etc. In principle, both deterministic and statistical sampling techniques try to avoid as many of these infeasible conformations as possible.
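To give a rough sense of why exhaustive enumeration becomes infeasible, the short Python calculation below counts the grid points obtained when every ϕ and ψ angle of a loop is discretized independently. The 30° step size is an assumption chosen for illustration, and the count ignores every geometric filter, so it is an upper bound on the raw grid rather than a number of valid conformations.

```python
def grid_size(n_residues, step_deg=30):
    """Number of phi-psi grid points for a loop of n_residues residues."""
    bins_per_angle = 360 // step_deg
    # Two backbone torsions (phi and psi) per residue.
    return bins_per_angle ** (2 * n_residues)


short_loop = grid_size(4)    # 12**8
long_loop = grid_size(12)    # 12**24
```

Under these assumptions a 4-residue loop already has 429,981,696 grid points, and a 12-residue loop has more than 10^25.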
Deterministic sampling intends to find all possible loop conformations with reasonable but diversified structures. Galaktionov et al.  , an ensemble of conformations with pair-wise RMSD greater than 0.2 Å is collected using a round-robin algorithm, in which a suitable ϕ-ψ combination satisfying geometric constraints is selected to gradually grow the loops. Lee et al. [75] produced loop conformations by sequentially adding randomly chosen 7-residue fragments obtained from a database of known structures. Ring and Cohen [70] sampled loop conformations with Genetic Algorithms (GA). More commonly, the Markov Chain Monte Carlo (MCMC) method [27,28,48,49,52,61,63,64,67,68] is adopted to explore loop conformation space. The fundamental idea of MCMC is to perform local MC moves that propose new loop conformations satisfying loop closure and other geometric constraints without disturbing the rest of the protein structure, and then to decide acceptance according to the Metropolis acceptance-rejection criterion. Generally, from an algorithmic point of view, GA is usually more effective than MC in terms of the number of iteration steps to convergence, mainly because genetic operators such as crossover give GA a better capability to escape local minima [86]. However, in loop modeling, new conformations generated by crossover or mutation are likely to break the loop closure condition and may also cause steric clashes. Additional quality control steps, which can be computationally costly, are necessary to correct these violations of geometric constraints [86]. In contrast, local MC moves in MCMC sampling guarantee satisfaction of the geometric constraints and are thus more favorable for exploring loop conformation space.
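The Metropolis acceptance-rejection step can be sketched in a few lines of Python. The example below samples two torsion angles on a toy energy surface; the energy function, the Gaussian proposal, the temperature factor, and all names are hypothetical. In particular, the proposal here perturbs angles freely, whereas the local moves used in actual loop modeling are constructed so that loop closure is preserved.

```python
import math
import random


def wrap(angle):
    """Map an angle in degrees onto the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def toy_energy(state, minimum=(-60.0, -45.0)):
    """Hypothetical single-well energy over (phi, psi); not a force field."""
    return sum(1.0 - math.cos(math.radians(a - m))
               for a, m in zip(state, minimum))


def propose(state, rng, step=20.0):
    """Local move: perturb one randomly chosen torsion angle."""
    i = rng.randrange(len(state))
    new = list(state)
    new[i] = wrap(new[i] + rng.gauss(0.0, step))
    return tuple(new)


def metropolis_accept(delta_e, kt, rng):
    """Always accept downhill moves; accept uphill with prob exp(-dE/kT)."""
    return delta_e <= 0.0 or rng.random() < math.exp(-delta_e / kt)


def mcmc(start, n_steps=5000, kt=0.3, seed=0):
    """Run a Metropolis chain and return the lowest-energy state visited."""
    rng = random.Random(seed)
    state, energy = start, toy_energy(start)
    best, best_e = state, energy
    for _ in range(n_steps):
        candidate = propose(state, rng)
        cand_e = toy_energy(candidate)
        if metropolis_accept(cand_e - energy, kt, rng):
            state, energy = candidate, cand_e
            if energy < best_e:
                best, best_e = state, energy
    return best, best_e


start = (120.0, 120.0)
best, best_e = mcmc(start)
```

Occasionally accepting uphill moves is what lets the chain leave shallow minima; lowering `kt` makes the search greedier.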
After sampling, a set of coarse-grained loop models exhibiting good geometric properties is generated. Refining the loop models, usually guided by a more accurate and sensitive energy potential associated with more structural information such as side chains and hydrogen atoms, is needed to build fine-grained models. As with refining a complete protein structure, commonly used approaches to refine loop structures include local optimization [34], MC [81], and more often Molecular Dynamics (MD) simulations [82][83][84][85]. Furthermore, it is important to note that coarse-grained and fine-grained sampling can be combined to enhance exploration of loop conformation space, as in the example shown in [67], where MC and MD simulations are integrated by a replica exchange algorithm.
Each loop modeling method has certain inevitable inaccuracy due to limitations of the sampling methods, uncertainty in energy functions, numerical errors, etc. A new strategy is to integrate different modeling methods to account for different sources of inaccuracy. Deane and Blundell [76] generated consensus predictions from two separate algorithms based on real fragments and computer-generated fragments, respectively. Li et al. [72] developed a Pareto Optimal Sampling (POS) method based on the Multi-Objective Markov Chain Monte Carlo (MOMCMC) algorithm [73] to sample the function space of multiple knowledge- and physics-based energy functions and discover an ensemble of diversified structures yielding Pareto optimality. Jamroz and Kolinski [74] proposed a multi-method approach using MODELLER, Rosetta, and a coarse-grained de novo modeling tool, which leads to better loop models than those generated by each individual method.

Recent Loop Prediction Results
Table 1 summarizes the energy functions, sampling methods, and loop closure mechanisms, and Table 2 lists the loop prediction accuracies in recently (since 2000) published works. Due to advances in computational loop modeling methods, quite a few of the methods shown in Table 2 have achieved highly accurate models, with resolution comparable to experimental results, for loops of fewer than 8 residues. Several recent methods [37,49,72] can predict loop conformations within or close to 1 Å RMSD for loop targets of up to 13 residues. Another important factor leading to improved loop modeling is the steady growth of the number of known structures in the PDB, which allows one to derive more sensitive knowledge-based energy functions, calibrate physics-based energy functions to achieve higher accuracy, and obtain richer loop fragment or rotamer libraries. Nevertheless, for very long loops, for example those over 18 residues, no significant breakthrough has been reported yet. According to Galaktionov et al. [65], modeling very long loops is a "different problem" due to their significantly higher flexibility compared to relatively short loops, which demands "different methodological approaches."
It is also important to note that Table 2 is not intended for comparing prediction accuracy between different methods. First, the prediction accuracies of different methods are reported on different loop targets. Some loop targets are significantly "harder" than others due to strong external influences from ions, ligands, disulfide bonds, and/or interactions with external chains or other units in the crystallographic unit cell. Several difficult loop targets (1poa(79:83), 1eok(A147:A159), 1hxh(A87:A99), and 1qqp(2_161:2_173)) are analyzed in [62]. Second, different methods use different criteria to measure the accuracy of their prediction results. The RMSD may be calculated over all heavy atoms, backbone atoms, or Cα atoms only. Moreover, the RMSD comparison may be carried out directly between the predicted model and the native structure, between the model and the relaxed native structure minimized by a force field, or between structures after global superimposition. Third, loops are modeled under different assumptions in different methods. For example, Rosetta repacks all side chains of the protein [48,49], while most of the other methods keep the native side chain conformations in the rest of the protein during the loop modeling process. Therefore, Table 2 does not form a fair basis for comparing performance among different loop prediction methods, but is instead used to reflect the recent progress in loop modeling.
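How much the choice of atoms can change a reported RMSD is easy to see on a toy case. The Python sketch below computes RMSD over all shared atoms, backbone atoms, or Cα atoms only. It assumes the model and native coordinates are already in a common frame (no superposition is performed), and the data layout, atom names, and coordinates are hypothetical.

```python
import math

BACKBONE = {"N", "CA", "C", "O"}


def rmsd(model, native, atom_names=None):
    """RMSD over atoms present in both structures.

    model, native: dicts mapping (residue_index, atom_name) -> (x, y, z).
    atom_names: optional set restricting which atom names are compared.
    """
    keys = [k for k in native
            if k in model and (atom_names is None or k[1] in atom_names)]
    if not keys:
        raise ValueError("no atoms in common for the requested selection")
    total = sum((a - b) ** 2
                for k in keys
                for a, b in zip(model[k], native[k]))
    return math.sqrt(total / len(keys))


# One-residue toy example: only the side chain atom CB is misplaced.
native = {(1, "N"): (0.0, 0.0, 0.0),
          (1, "CA"): (1.0, 0.0, 0.0),
          (1, "CB"): (1.0, 1.0, 0.0)}
model = {(1, "N"): (0.0, 0.0, 0.0),
         (1, "CA"): (1.0, 0.0, 0.0),
         (1, "CB"): (1.0, 1.0, 3.0)}

rmsd_all = rmsd(model, native)
rmsd_backbone = rmsd(model, native, BACKBONE)
rmsd_ca = rmsd(model, native, {"CA"})
```

In this constructed case the backbone and Cα RMSD values are zero while the all-atom value is not, which illustrates why numbers computed under different conventions should not be compared directly.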

Summary
Loops play a critical role in performing important biological functions of proteins. However, due to their high flexibility and variability, modeling the 3D structures of loops is more difficult than modeling other secondary structures. Loop structure modeling is regarded as a "mini protein folding problem" under geometric constraints such as loop closure and avoidance of steric clashes. The computational loop modeling methods can be categorized into template-based and template-free methods. The template-based methods rely on database search, which is limited by the number of known structures in the PDB, particularly when modeling relatively long loops. In comparison, the template-free methods can avoid this problem by diversely sampling loop conformation space to search for appropriate structures. Hence, sampling loop conformation space is the cornerstone of the template-free methods. Successful sampling methods rely on accurate and sensitive energy functions, fast buildup mechanisms to generate reasonable loop models satisfying geometric constraints, and efficient sampling algorithms.
There has been remarkable advancement in template-free loop structure modeling in the past decade, mainly due to new computational methods as well as the increasing number of known structures available in the PDB. Quite a few loop modeling methods with various strategies have successfully predicted short loops (< 8 residues) with resolution comparable to experimental results. Several recent methods have even achieved near sub-angstrom accuracy for longer loops of up to 13 residues. However, modeling very long loops of over 18 residues remains an unsolved challenge. A recent study by Raval et al. [87] on protein structure refinement using very long (>100 μs) MD simulations has shown that inaccuracy in current force fields limits MD-based protein structure refinement. Similarly, given that loop modeling is a "mini protein folding problem," for difficult or long loop targets sampling is no longer the critical issue [87], and the development of more precise energy functions is now the key.