The Use of Structural Templates in Protein Backbone Modeling

The procedures used to model a protein structure are well established when the novel protein has high sequence similarity to a protein of known structure. Many proteins of interest have low (i.e. <50%) sequence similarity to any known structure. In these cases new approaches to prediction of structure are required. The use of sequence profiles which relate sequence to known structure has been proposed as one method to assign local regions of structure. As a first stage, templates or “icons” of the many relevant substructural motifs found in proteins must be defined. The sequences which gave rise to these structures are then aligned and a weighted profile obtained. Average structures of the 8 and 12 residue helix-turn and turn-helix motifs have been prepared. These coordinate templates were then used to scan through the Brookhaven protein structural database for similar, superimposable fragments. A composite template of 100 similar fragments for each element was found to be internally consistent to a rmsd=0.92 Å for HT8, 1.54 Å for HT12, 0.41 Å for TH8 and 1.40 Å for TH12. All of the sequences, from these structures, were then used to create an overall sequence profile. The four sequence profiles were scanned against the amino acid sequences of the proteins in the Brookhaven database: tertiary structure was correctly identified only about 10% of the time. This value is too low for predictive purposes. However, it could be increased by checking for multiple occurrences of the template in one protein.

The procedures used to model a protein structure are well established when the novel protein has high sequence similarity to a protein of known structure. Many proteins of interest have low (i.e.<50%) sequence similarity to any known structure. In these cases new approaches to prediction of structure are required.
The use of sequence profiles which relate sequence to known structure has been proposed as one method to assign local regions of structure. As a first stage, templates or "icons" of the many relevant substructural motifs found in proteins must be defined. The sequences which gave rise to these structures are then aligned and a weighted profile obtained.
Average structures of the 8 and 12 residue helix-turn and turn-helix motifs have been prepared. These coordinate templates were then used to scan through the Brookhaven protein struc-tural database for similar, superimposable fragments. A composite template of 100 similar fragments for each element was found to be internally consistent to a rmsd=0.92 A for HT8, 1.54 A for HT12, 0.41 A for TH8 and 1.40 A for TH12. All of the sequences, from these structures, were then used to create an overall sequence profile.
The four sequence profiles were scanned against the amino acid sequences of the proteins in the Brookhaven database: tertiary structure was correctly identified only about 10% of the time. This value is too low for predictive purposes. However, it could be increased by checking for multiple occurrences of the template in one protein.

Introduction
The process of protein modeling relies upon the database of structures determined principally by xray crystallography or, more recently, 2-D NMR techniques. As a first step in modeling, the degree of sequence similarity of a novel protein is compared to all proteins of known structure. Given high sequence similarity (>50%) the techniques of homology modeling will certainly be used [1][2][3][4][5][6][7]. The effectiveness of this process has been demonstrated in the construction of models of insulin-like growth factor [8], t-PA [9], and immunoglobulin variable domain [10] to name a few. However, many proteins of interest have a lower degree of homology or obvious insertions or deletions in their sequence. Any methods which can be used to predict the structure of these proteins are of great interest to experimentalists and theoreticians alike.
The secondary structure of a protein can be predicted with methods such as Chou-Fasman but only to some 65% accuracy [11,12]. To improve upon this, the use of sequence specific profiles has been proposed [1, 13,14], The sequence specific requirements of/3 turns [15], N-cap, C-cap a helices [16] and proline-kinked a helices [17] have been Volume 94, Number I, January-February 1989 Journal of Research of the National Institute of Standards and Technology previously defined. Also, the sequence requirements of large domains are known for the globin fold [18,19], and the immunoglobulin fold [20].
A major assumption in this procedure is that certain linear amino acid sequences give rise to specific structural elements [21][22][23]. Many different approaches have been taken to identify zones in proteins which are very closely packed [24][25][26][27][28][29][30][31][32], Most methods are computationally intensive; one simple method is to count the number of residues which he within a sphere of a given radius around any atom. To prepare a profile, the relevant fragments are extracted from all proteins of known structure and aligned in space. The amino acid types are then checked at each residue position and a weighted sequence profile determined. Any novel amino acid sequence can then be checked against a bank of such known profiles and the most likely tertiary fragments identified. This procedure differs from the standard predictive methods of secondary structure in that it attempts to assign specific three-dimensional structure on the basis of sequence and not just regions of secondary structure.
In this work, two examples of both turn-helix and helix-turn structures were chosen for study. These structures were previously identified by Zefus as highly compact structures which were repeated throughout many protein structures [31]. The purpose of this work is to outline some of the steps involved in the identification of relevant templates and their application to structure prediction.

Methodology
All programs were written in Fortran 77 and run on a VAX 11/750 under the VMS rev 4.7 operating system.

Preparation of Stage I Templates
The number and identity of residues which surround each residue in the protein lysozyme (Brookhaven code ILZl) were determined. The radius of the sphere checked around each atom was over the range of 3.0 to 8.0 A.

Identification of Average Structural Template Coordinates
For the purposes of this work four structural units of a known compact nature were used. These were the 8 residue helix-turn (HT8), 12 residue he-lix-turn (HT12), 8 residue turn-helix (TH8), and 12 residue turn-helix (TH12) domains as assigned by Zefus [31].

Preparation of Stage I Templates
The backbone coordinates of each member associated with a structural template were superimposed using a conjugate gradient rotation/translation function. The root mean square deviation (rmsd) of each member to every other member was calculated for both the main-chain and side-chain atomic positions.
If a particular member appeared to be significantly different from all the other members it was discarded from further consideration. The mean X, Y, Z coordinates of the main-chain atoms were calculated from the fragments under consideration. This coordinate set was identified as a stage I template.

Preparation of Stage II Templates
Only proteins in the Brookhaven database (release October 1987) with a resolution of better than 2.5 A were used in this work [33]: 82 non-homologous proteins, 177 proteins in total were used in this subset of the database. The 100 fragments with the lowest rmsd to the stage I template were rank ordered and the average coordinate set calculated. Finally, the average of the standard deviation of the errors in the X, Y, and Z coordinates was determined. This new coordinate set was identified as stage II template.

Amino Acid Sequence Profiles
The amino acid sequences used to prepare the stage II template were assembled with the programs of the University of Wisconsin Genetics Computer Group (Ver 5.2) [34]. A sequence profile was prepared with the program PROFILE [13]. The Protein Identification Resource/NBRF (PIR) (Rel 15.0) database [35] of amino acid sequences was scanned with the program PRO-FILESEARCH and alignments calculated with PROFILESEGMENTS. A subset of the PIR database, which corresponded to the proteins used in the Brookhaven database, was also checked for alignments to the calculated profiles.

Results
For the purposes of modeling or structure prediction it is necessary to clearly define substructural elements. A number of canonical structures such as a helices, ^ sheets or larger super-secondary elements such as Greek keys, or a-/3-a units are well known. However, irregular or compound elements can have a very high packing density. Inter-residue contact plots are a convenient method for identification of both the contiguous and discontinuous zones of high density (data not shown).
The number of contacts which a particular residue makes with its neighbors increases in a linear way with the size of the probe distance [36]. As shown in figure 1 for lysozyme (ILZl) beyond a shell size of 4.0 A the shape of the compact domain did not change; there was an increase only in the number of residues involved. Two of the structural templates under investigation exist in the lysozyme structure and occur in regions of high packing density. Neither of the motifs in lysozyme were used to generate the stage I templates.
The fragments used for the preparation of stage I templates are given in table 1. A number of elements originally identified by Zefus as compact turn-helix 8 motifs were rejected for use in the preparation of the stage I TH8 template. Rejection was based upon an average rmsd, of the fragment to all other members of the test set (main-chain atoms only), of 1.5 A greater than the average rmsd for all residues in the NxN test set.   Superimposition of the coordinate sets was based solely upon the backbone atoms. Those side-chain atoms which had equivalent atom names at superimposed residues were checked for structural homology. For example, if the backbones of alanine and cystine were superimposed the rmsd was determined for the CyS atom position. On average, 1.5 side-chain atomic positions could be superimposed at each residue over all the paired coordinate sets.
The turn-helix 8 stage I template had the greatest degree of structural homology for both main-chain and the superimposable side-chain atoms. In each stage I template the greatest diversity occurred in the turn region: the helix was well defined. This may relate to actual differences in the structure and partly to the difficulty of building the original protein structure into x-ray density associated with irregular elements such as these turns. Alternatively, this may indicate that average rmsd error is a relatively insensitive indicator of similarity between protein fragments.
The Brookhaven protein database was scanned for the best 100 fragments which could be superimposed onto the stage I template. Due to the existence of multiple forms and multiple chains in a protein the database has significant redundancy. However, these redundant fragments had minor variations in three dimensional structure. Keeping and averaging these redundant forms reduced the structural error associated with the motif as found in any one particular crystal structure. Table 2 indicates the average rmsd values of the top 50 and top 100 fragments which were found in this manner for each template type. The average structure of the HT8 stage II template is shown in figure 2, HT12 in figure 3, TH8 in figure 4 and TH12 in figure 5. The sphere centered at each atom represents 50% of the standard deviation error in atomic position at that atom between all members used to generate the stage II template. The templates were relatively structurally homologous. The helix atoms in both 8 residue tem-    plates had an error (0.30 ±0.1 A) close to the experimental error of the protein coordinate sets whereas the atoms associated with the turn were less well defined (0.4±0.2 A). The longer 12 residue templates were less accurate with an average error of 0.7 ±0.3 A in the turn regions, double that of the helix region (0.3±0.1 A). The associated X, Y, Z coordinates are given in Appendix 1: phi, psi backbone angles of each template are given in table 3. Residues in the turn did not correspond to any of the standard yS turn types.
A capital letter (one letter amino acid code) signifies a weighting factor of > 0.5; lowercase is weighting > 0.3 and < 0.5. '' hpl-hydrophilic amino acids. "^ hpb-hydrophobic amino acids. ''.-no amino acids had a weighting factor > 0.3. ^ The amino acid set a, b, d, e, t g, k, p, s, t all had a 0.3 weighting.
The PIR database of amino acid sequences was scanned for sequences which had a close alignment to that of each sequence profile. The alignment of the profile to an amino acid sequence was scored on the basis of the Dayhoff evolutionary metric matrix with a penalty factor for each gap [37].
One restriction of the PROFILESEGMENT program, as currently implemented, is that only the "best" alignment found for each protein is reported. Consequently, the procedure does not report multiple occurrences of a close alignment to the profile in one protein. Table 5 shows the alignment scores of each profile to the database. The score for TH12 was significantly better for the best 100 hits to the PIR database versus the entire database. This was due to a single segment of hemoglobin as identified by the TH12 profile. Since there are more than 100 variants of hemoglobin in the PIR database this search score was artificially high. ■' Score is based upon alignment metric matrix of the number of conserved residues less a penalty for introduced gaps. '' Average score of all 6862 sequences in release 15.0 of the PIR database. '■ Average score for the 100 sequences which matched closest to the profile. '' Average score for the 82 sequences which are the non-homologous sequences corresponding to known structures in the Brookhaven database of better than 2.5 A resolution.
The ability of the profiles to correctly identify structural elements in amino acid sequences is summarized in table 6. The 12 residue templates had, on average, a higher discriminatory power than the 8 residue templates. In neither case were the profiles useful for predictive purposes. The number of sequences which were incorrectly identified as the "best" hit by PROFILEGAP was high at some 50%. Since only one hit is reported it is uncertain if any of the segments classified under "Multiple" in table 6 could be correctly identified by this procedure.

Discussion
The ability of a given protein sequence to rapidly and reproducibly adopt a single major backbone fold is believed to be inherent to its linear amino acid code. However, the initial sequencespecific signals which are associated with the initiation of the folding process are still unknown. Routes or pathways of folding have been proposed for a number of proteins [13]. Certain sites (e.g., certain turns stabilized by a few hydrogen bonds) have a higher degree of structural compactness and may be the primary cores at which folding was originated. The events associated with subsequent side-chain/side-chain stabilizations and further main-chain hydrogen bonds are only open to speculation at this point.
To make the transition between a novel linear amino acid sequence and a three-dimensional structure the protein modeler will need to be able to identify the critical sites necessary for the determination of the overall fold of the protein. This requires, however, the availability of coordinate sets for compact structures and the range of amino acids which can be used to create these sequences.
It is difficult, at this time, to assign structural elements from a protein to an average coordinate template from a family of possibilities. In this work, a rather arbitrary cutoff of a high rmsd of mainchain atoms was chosen. This may not be a very sensitive indicator of structural homology. Application of cluster analysis to side-chain atom contact plots, or to side-chain rmsd values, along with solvent accessibility values at each residue may be useful to help further categorize the fragments and thus better define the template [38], The accuracy of the turn-helix 8 template in the turn region as compared to the relative diffuseness at the turn region of the turn-helix 12 template illustrates this point well. Also, template definition may be improved during the superimposition procedure. In this work a rigid body rotation/translation algorithm was applied. An alternative would be to use a dynamic algorithm which could allow for breaks in the backbone chain during superimposition [39]. This will be of particular importance for the preparation of larger domain templates.
Once a particular structural template has been defined all sequences which give rise to it can be readily identified. The variability of the amino acids at each residue position over the template region is known as its sequence profile. These profiles are dependent upon the correct sequence alignment among many proteins. Obviously, knowledge of the structure is the ultimate check of the sequence alignment. Application of the standard Needleman-Wunsch algorithm to a small number of sequences will continue to suffer from the well-known alignment problem in which residues that occupy the same three-dimensional volume are often not equated. As a rule of thumb, if the structure is unknown but some 20-f homologous sequences are known, the correct alignment can probably be achieved.
In the absence of structure, a diagnostic sequence profile can still be prepared for certain elements. For example, the consensus profile for the DNA binding zinc finger motif has been defined [13,40].
The metric matrix of Dayhoff (based upon evolutionary relationships) which is used during the sequence alignment procedure may not be appropriate in all cases. It has been shown, in certain structural elements, that otherwise conservative replacements are not possible. For example, the replacement of aspartic acid by glutamic acid is not possible at the N-cap position of an a helix [16].
The identification, preparation, and application of these profiles is still a matter of some debate [41]. For example, if the domain of interest is large, as in the case of a globin fold, it is a reasonably straightforward matter to achieve a correct sequence alignment among many homologous sequences. To be useful for the modeling of proteins de novo, significantly shorter domains or substructural elements must be accurately identified: the profile sequences of elements such as a helices or /3 turns may not be sufficiently specific to discriminate their existence in a sequence. The procedure may thus be limited to finding only a few very specific substructural elements or large folded domains.
If a specific element or fold has been identified from a given structure, a statistically large sample of sequences relating to the template will be required to show the range of residues which can occupy any particular site. The databases of structure and sequences may still be too small to allow for statistical certainty at this time [41].
In the next stage of model building the zones of known structure are joined together to create a range of folding possibilities [42,43]. All the residues are set to alanine except for glycine and proline: this restricts the number of degrees of freedom in the folding problem. Distance geometry or combinatorial approaches can be used to fold the backbone [44]. Thi^ is a severely underdetermined system and additional information is certainly needed to constrain the system. The principal restrictions used to restrain the system can be understood easily enough: no atomic overlap; residues should be closely packed; hydrogen bonds are often formed [45]; charged residues are most often found on the surface [46]; restricted conformational possibilities for disulfide bonds [47] and proline residues [48]; sequence dependant statistical data [49,50] such as (flexibility, hydrophilicity, surface accessibility); side-chain volumes; average number of contacts for residues in given substructural regions [36]; Ramachandran plot preferences for phi, psi angles; and any known biochemical information such as disulfide bonding patterns, or specific residues which come together to form an active site.
A major assumption of this approach is that interactions between defined sub-structural domains will affect primarily the details of the side-chain packings [51]: the backbone configuration will remain relatively constant during subsequent model building steps. The placement of side-chains de novo is clearly a very difficult job. However, various models have hand-built the core of a protein with surprising ease [52,53]. The methodology to discriminate between competing core packing motifs is still under development. This level of precision, in the preparation of models, is beyond the scope of this work.
These models will be of interest from a variety of standpoints. First, by comparing the variety of ways of joining structural fragments it may be possible to identify why certain motifs are favoured in nature. That is, certain amino acids at specific points may lead to one particular fold. This can be seen most clearly with the role of glycine in allowing certain turn types to exist. Also, the refinement of x-ray crystal structures can also benefit from this approach. A current version of the graphics program FRODO incorporates a library of fragments which can be laid into the electron density map and thus help speed the process of interpretation and refinement [54].
A library of average secondary and super-secondary templates and their associated sequence profiles is currently in preparation. Due to the small size of the databases, the discriminatory power of these profiles may be low. However, the average coordinate sets will still be very useful for general modeling purposes.

Acknowledgments
The author thanks Dr. Shoshana Wodak for the preprint and Drs. Steve Bryant, Bob Bruccolerri, and John Moult for helpful discussions.