Bridging the gap between single-template and fragment based protein structure modeling using Spanner

,


Background
Homology-based protein structural modeling plays an important role in biomedical research by linking genomics and structural biology. As the number of known protein sequences and structures grows, so do the number of sequences that can be modeled. Knowledge of even an approximate threedimensional protein structure can provide valuable information about its structural neighbors. This knowledge can, in turn, shed light on the protein's evolutionary history, biochemical and biological functions. For example, structural modeling was recently used to predict the Mg-dependent RNase activity of Zc3h12a, a protein essential for regulating inflammatory cytokines in toll-like receptor 4 signaling [1]. Here, we introduce a novel structural modeling method using a wider range of protein targets, including a representative set of all known antibody structures.
Currently, the most accurate methods for modeling protein structure are extensions of the fragment assembly method originally implemented in the program Rosetta [2], and now found in the successful TASSER program [3]. In this class of methods, short fragments of known structure are mapped on to the query sequence and then assembled by combinatorial optimization using structure-dependent scoring http://www.immunome-research.net/xxx © 2011 Lis et al; licensee Nikolai Petrovsky Publishing. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. functions to create a hybrid template model. The strength and drawback of this approach, at least in current implementations, is that the size of the effective conformational space is very large. In practice, this means that the optimization procedure takes a long time. Waiting times on the most popular servers can be weeks to months, and users are usually limited to one query at a time.
For this reason, single-template threading, using profile-based scoring functions, is more widely used for routine homology model building. Results can be computed in minutes to hours, which fits well with a typical researcher's timeframe. Unfortunately, the single-template methods typically result in a significant number of insertions and deletions for query-template pairs with low sequence homology. Large insertions present challenges for constraint-based modeling software, such as MODEL-LER [4], since the inserted sequence is effectively unconstrained within the gaps and can appear as a random coil in the final model, even when the insertion is predicted to be structured.
Here we introduce a novel modeling method, implemented in the program Spanner, which uses fragment assembly to extend an initial single template such that there are no insertions or deletions with respect to the query. Because Spanner starts with an initial 'anchor' template, the search for fragments is constrained by the geometry of the gap endpoints, resulting in an efficient optimization protocol. Crucially, the use of internal coordinates as a database index allows fragments matching the The three main phases of the Spanner pipeline are shown. In the hybrid template assembly phase, the selection of one fragment is illustrated. In the threading phase, dark regions indicate parts of the structure that undergo energy minimization. http://www.immunome-research.net/xxx anchor regions within a given tolerance to be retrieved efficiently using a single PGSQL query. Furthermore, since the fragments are selected based on sequence and secondary structure similarity to the query, the insertions are likely to be structured if the corresponding query segment is predicted to be so. Spanner makes use of native and 3 rd -party software, including utilities for populating and updating fragment relational databases, fragment scoring and assembly, sidechain replacement, and energy refinement. A web interface that supports 3D graphical visualization and export of the resulting model to the SeSAW functional annotation server [5] is available, along with source code and data for local installation.

Results
In this section we describe results for the Ig and for the CASP and ASH data sets using the fragment retrieval module and the full Spanner pipeline, respectively. Figure 3A contains results for the entire Ig set binned by loop length. From this figure we can see that, overall, the RMSD grows roughly linearly with the loop length, as has been reported before [7]. In general, the backbone (N, C , C, O) RMSDs of the Spanner loops lie below those of the loops built http://www.immunome-research.net/xxx with MODELLER. The loops in the Ig set that have been assessed previously using the RosettaAntibody program are shown in Figure 3B. For this sub-set of loops, the mean and standard deviation of the Spanner backbone RMSD (3 +/-1.5 Å) lies between that of MODELLER (4 +/-2 Å) and RosettaAntibody (2 +/-1 Å). We note that the Spanner results in Figures  3A and 3B are overall consistent with each other, so we can expect similar performance to that shown in Figure 3B on larger data sets.
For the Ig results, MODELLER jobs required an average of 3 +/-1 CPU hours, while the Spanner fragment retrieval module required 2.4 +/-0.6 minutes.

CASP and ASH sets: Spanner accuracy
Here, we examine the accuracy of the full Spanner pipeline using the CASP set, which allows us to evaluate the performance of Spanner with actual alignments generated for a range of query-template pairs using typical alignment tools. For benchmarking purposes we ran MODELLER using the automodel class with the same alignments. Note that in this exercise, we excluded any fragment from the DB if its sequence identity to the query was 30% or more, in order to make the comparison with MOD-ELLER, which does not make use of existing structural fragments, as fair as possible. Figure 4 shows the average RMSD (C and all-atom) within a range of sequence identity bins. From this figure we can see that, as expected, the average accuracy increased while the standard deviations decreased with sequence identity. There is a slight improvement in terms of RMSD for Spanner over MODELLER in some cases, but the differences are much smaller than the spread in the data. These results confirm that, on average, Spanner produces models that are at least as accurate as those of MODELLER, a stateof-the-art structural modeling tool.
The ASH set represents 'perfect' input for a set of low-homology query-template pairs. The results, shown in Figure 5, are consistent with the CASP alignment results. Here too, we see that the differences between Spanner and MODELLER are very small compared with the deviations for each program within a given sequence identity bin.
We also assessed the CPU times for Spanner and MODELLER for the CASP and ASH sets. In this case, MODELLER average CPU times (17 +/-14 s) were over 20 times shorter than Spanner (377 +/ 4 s). There are two reasons for the reverse trend here as compared to the Ig set. First, the MODEL-LER automodel class is much faster than the dope_loopmodel class. Second, the fragment retrieval module (used in the previous section) is much faster than the full Spanner pipeline, which, in addition to fragment selection, performs sidechain replacement and energy refinement.

Discussion and Conclusions
In this article we present an approach for utilizing the strengths of both single and multiple-template protein modeling. The results clearly demonstrate that for gap regions, a fragment-based approach is at least as good, on average as the template-free constraint-driven approach, at a much lower computational cost; however, the performance is not yet equal to that reported for RosettaAntibody. Whether this is due to the superior sampling in the Rosetta program or the use of a specialized fragment database and sequence rules to identify kinked conformations is not known. Nevertheless, the contrast between the Ig model results and those for the CASP and ASH sets suggests that when gap regions repre- sent a significant fraction of the alignment, local sequence and secondary structural information can be exploited to improve model accuracy. When gaps represented a small fraction of the alignment (CASP and ASH sets), we found that the differences between Spanner and MODELLER were less than the deviations within either program. In such cases, Spanner performed marginally better, in terms of accuracy, but at a higher computational cost. The average computational cost for Spanner (approx. 6 minutes) were, nevertheless, consistent with that of most profile-based threading methods, which typically finish within minutes to hours. Taken together, these results suggest that Spanner represents a rational and scalable approach to fragment-based structural modeling.

Methods
The modules in Spanner are arranged as a pipeline, as illustrated in Figure 1. The inputs to this pipeline are the query-template alignment, and template structure. The output is a model structure of the query. There are two main phases in the calculation: hybrid template assembly and gapless threading to the hybrid template. In the first phase, fragments are chosen based on their geometric similarity to the template at the anchor points and on their primary and secondary structural similarity to the query. The second phase involves sidechain replacement for the selected fragments and overall structural refinement. These individual steps are described in detail below.

Inputs
Spanner requires a template structure (in PDB format) and a template-query alignment (in FASTA format). In addition, the fragment selection process (described below) uses secondary structure information for the query; the secondary structure can be specified as optional input, or computed automatically using PSIPRED [8].

Definition of fragments
Spanner replaces continuous sequence segments (indels) in the template for every gap in the querytemplate alignment. The procedure is illustrated in Figure 2. First, the insertion point around each indel is established: for deletions (Figure 2A) in the alignment, the deleted residues are excised from the template; for insertions ( Figure 2B), the template sequence gap itself identifies the insert location. To allow for small geometrical differences between the template and the inserted fragment, a margin of 1 or more residues on each side of the insert location is also excised from the template. Adjacent insertion points separated by fewer than 4 residues-i.e., too short to support an anchor-are merged into one insertion point with a larger insertion sequence. For greater user control, the indels and margins can be specified as optional input. The four residues on each side of the insert point (the anchors) are used to efficiently search a fragment database for suitable insertion candidates, as described below.

Fragment storage
A representative set of protein chains is maintained using the cd-hit program [9] at 100% sequence identity. All continuous fragments of length 8 or more are regularly extracted from this set of chains and stored in a PostgreSQL relational database (RDB), indexed by the internal coordinates of the fragment endpoints. The internal coordinates consist of C -C distances between the following 4 residue pairs: first, last; first+1, last-1; first+2, last-2; first+3, last-3. In addition to the internal coordinates themselves, the PDB identifier, chain ID, sequence, secondary structure, as defined by STRIDE [10], and the beginning and ending atom indices of the fragment in the corresponding PDB entry, are stored in http://www.immunome-research.net/xxx the RDB. As an example, the first 10 fragments of length 12 for PDB entry 1nag, chain A are stored as: A separate RDB is prepared for each fragment length. Currently, fragments of length 8-60 are stored. In addition to the fragment RDB described above, two additional types of RDBs are created to store fragments used to fill N and C-terminal gaps. For N-terminal (C-terminal) gaps, the internal coordinates consist of all unique C -C distances pairs in the last (first) 4 residues of the fragment. Other fragment information is the same in the terminal RDBs.

Fragment retrieval
For a given fragment, a fragment index is generated from the template anchor residues. A tolerance in the fit to the anchor residues is used to specify a range of index values. The index range is used to generate a PostgreSQL query to the appropriate fragment database and all fragments satisfying the range of indices are returned. PDB entries that should be excluded from the RDB search can be specified by the user, a feature that was utilized in the present work in order to screen out close homologs when benchmarking the program. Since the number of returned fragments is sensitive to the tolerance in the fit to the anchor residues, the retrieval step starts with a small value (0.5 Å by default), and incrementally increases the tolerance until the required number of fragments (1000 by default) or a maximum tolerance (2.5 Å by default) is reached. Each of the above parameters can be modified on the command line.
The fragments returned from the RDB are then sorted by a simple score that is a function only of the primary and secondary structure similarities between the query and the candidate fragment ( 1 ) where S seq is proportional to a log-odds sequence substitution matrix score derived from a large number of structure alignments [11] and S sec is pro-portional to a secondary structure substitution matrix score [12]. A specified number of candidate fragments (100 by default) is then retained. These retained candidates are then re-scored using a more sensitive function that takes structure into account and is given by ( 2 ) where S clash is a weighted sum of clashes between the fragment and the rest of the template structure, excluding residues that are to be replaced by the fragment. Since side chains are not expected to fit perfectly at this point, the weight of sidechainsidechain and sidechain-backbone clashes is set to 1/6 that of backbone-backbone clashes. Also, only severe clashes (interatomic distance < 2 Å) are counted at this point. RMSD fit is given by the rootmean square deviation of C atoms in the fitted anchor residues. The user-specified number of topscoring fragments (1 by default) is then output. The weights and number of resulting models can be adjusted on the command-line.

Threading to the hybrid template
After the model's backbone has been established by splicing in all of the indels, as described above, the query sequence is 'threaded' onto the template and the resulting structure is optimized via energy minimization. We use the term threading loosely, as the alignment is trivial (there are no gaps); the procedure involves only the sidechain replacement and relaxation steps of threading.
First, the sidechains from the query sequence are placed on the template structure's backbone and their rotamers optimized by using either the deadend elimination (DEE) algorithm [13,14] or the SCWRL4 rotamer selection algorithm [15]. To allow the inserted fragments maximum flexibility, their sidechains are first replaced by amino-acids that do not have rotamers: prolines and glycines are used where they appear in the query sequence, and all other sidechains are replaced with alanines (A/G/P representation). Because the inserts will not exactly fit the backbone, they are next allowed to relax in the A/G/P representation via conjugate gradient energy minimization. The minimization is carried out using either the PRESTO ver 3 [16] molecular dynamics package with AMBER force field parameters (default) or Gromacs [17]. To close the gaps between the insert ends and the template backbone, Spanner freezes all residues in the model except the inserted fragments, and runs the minimization for 1000 steps. Once the inserted backbones have been