Prediction of Protein Helices with a Derivative of the Strip-of-Helix Hydrophobicity Algorithm*

The strip-of-helix hydrophobicity algorithm was devised to identify protein sequences which, when coiled as α or 3₁₀ helices, had one axial, hydrophobic strip and otherwise variably hydrophilic residues. The strip-of-helix hydrophobicity algorithm also ranked such sequences according to an index, the mean hydrophobicity of amino acids in the axial strip. This algorithm predicted T cell-presented fragments of antigenic proteins well. A derivative of this algorithm, the structural helices algorithm (SHA), was tested for the prediction of helices in crystallographically defined proteins. For the SHA, eight-amino acid sequences (2 cycles plus one amino acid in an α helix) with strip-of-helix hydrophobicity indices greater than 2.5 were selected, with overlapping segments joined. These selections were terminated according to simple “capping rules,” which took into account the roles of N-terminal Asn or Pro and C-terminal Gly in the stability of helices. In analyses of 35 crystallographically defined proteins with known α and 3₁₀ helices, the predictions with the SHA overlapped (had overlap indices x ≥ 0.5) with 34% of known helices, touched (had overlap indices 0.5 > x > 0) or overlapped with 66% of known helices, or were neighboring (came within 6 residues) or touched or overlapped with 82% of known helices. At each level of judging the quality of prediction, the SHA was usually less sensitive (correct predictions/total number of known helices) and more efficient than the Chou-Fasman method.

foreign antigenic proteins and mediating helper or cytotoxic T cell recognition are likewise amphipathic structures. Such T cell-presented sequences are frequently buried within proteins in contrast to antibody-recognized sequences, which are exposed, hydrophilic determinants (3). Ii, the electrophoretically invariant glycoproteins associated with class II MHC¹ α and β chains from time of synthesis until intracellular cleavage either to p25 or to p21/p10, has been suggested to regulate class II MHC antigen-presenting function (4-7). Transfection of the Ii gene into fibroblasts with class II MHC α and β chains greatly improves the efficiency of those cells' presentation of protein antigens (8). One specific hypothesis is that Ii(Phe146-Val164) could fill the class II MHC desetope until replaced by a structurally analogous foreign peptide (9, 10). The desetope of class I MHC molecules is composed of two helices placed on a hydrophobic β-pleated sheet (11). A similar structure is proposed for class II MHC molecules (12) and is supported by some structural studies (13).
The striking feature of Ii(146-164), when its amino acid sequence is displayed in a sheet projection, is the narrow, axial, very hydrophobic strip Phe146-Leu150-Leu153-Met157- . The remainder of the helix is variably hydrophilic with two salt bridges and several hydrogen bonds between side chains of adjacent loops of the putative helix. Hypothesizing that the principal allele-nonspecific force in desetope binding of such Ii(146-164)-like foreign peptides is the energy of desolvation of that axial, hydrophobic strip, we designed the strip-of-helix hydrophobicity algorithm (SOHHA) to look for similar structures in antigenic proteins. The SOHHA averages Kyte-Doolittle hydrophobicity values of amino acids in strips from positions n, n + 4, n + 7, n + 11, n + 14, n + 18, and n + 22 in primary sequences of proteins, up to 6 cycles of an α helix (9). The SOHHA in a computed form for 2-6 cycles of α or 3₁₀ helices with tabular and graphical displays is useful in identifying class II- and class I-restricted T cell-presented epitopes (10, 14).
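The strip average described above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the original Fortran program: the function names `strip_index` and `scan` are hypothetical, and the mapping of "cycles" to strip positions (k cycles use the first k + 1 offsets, so 2 cycles span n, n + 4, and n + 7) is an assumption inferred from the eight-residue window used elsewhere in the text. Only the α-helical offsets listed in the text are covered.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Strip offsets for an alpha helix as listed in the text: n, n+4, n+7, ...
ALPHA_STRIP_OFFSETS = (0, 4, 7, 11, 14, 18, 22)


def strip_index(sequence, start, cycles=2):
    """Mean hydropathy of the axial strip beginning at 0-based `start`.

    Assumption: k cycles use the first k + 1 offsets (2 cycles -> 0, 4, 7).
    """
    if not 2 <= cycles <= 6:
        raise ValueError("cycles must be between 2 and 6")
    offsets = ALPHA_STRIP_OFFSETS[: cycles + 1]
    if start < 0 or start + offsets[-1] >= len(sequence):
        raise ValueError("window runs past the end of the sequence")
    values = [KYTE_DOOLITTLE[sequence[start + k]] for k in offsets]
    return sum(values) / len(values)


def scan(sequence, cycles=2):
    """Strip index for every start position that fits in the sequence."""
    last_offset = ALPHA_STRIP_OFFSETS[cycles]
    return [strip_index(sequence, s, cycles)
            for s in range(len(sequence) - last_offset)]
```

With these assumptions, a window whose strip positions are all leucine scores 3.8, the Kyte-Doolittle value for Leu, regardless of the residues between the strip positions.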
We have found that the structural helices algorithm (SHA) derivative of the SOHHA predicts helices in a series of crystallographically defined proteins. The potential to form an axial, hydrophobic strip might promote nucleation of both segments of a folding nascent protein and fragments of T cell-presented protein antigens.

MATERIALS AND METHODS
Program-The program, originally created by Stille (10), was rewritten to display in tabular form ranked SOH hydrophobicity indices for six helical segments/100 amino acids at 2-6 cycles of α and 3₁₀ helices. Overlapping 8-amino acid α-helical predictions with SOH hydrophobicity indices greater than 2.5 were joined. Termini of these unions or freestanding segments were "capped" or modified according to the following rules derived from the observations of Richardson and Richardson (15). Within the first four N-terminal amino acids, the most C-terminal Asn became the N terminus. Within the first four amino acids or at position -1 or -2, the position preceding the most C-terminal Pro became the N terminus. Because a proline usually permits hydrogen bond formation between the amide nitrogen of the amino acid preceding it and the carboxyl of the second amino acid following it (16), if that second amino acid is within a predicted helix, the amino acid preceding the proline could be the first member of the proline-induced turn initiating that helix. The first Gly or Pro following an N terminus became the C terminus. In the absence of any modification of a terminus, it stood as selected in the unions or freestanding segments. A final selection had to be at least five amino acids long. This program was written in Fortran and run on the Harris 1000 mainframe of the computing facility of the University of Massachusetts Medical Center. The graphical display of the sequence and the various indicators of segments predicted by each method were generated as a two-dimensional array, facilitating both extensions to more complex displays and portability. Copies of the program are available upon request.
¹ The abbreviations used are: MHC, major histocompatibility complex; SHA, structural helices algorithm (for prediction of helices in native proteins); SOH, strip-of-helix; SOHHA, strip-of-helix hydrophobicity algorithm (for the identification of sequences with the potential to coil as an amphipathic helix for T cell presentation).
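The selection-and-joining step of the Program description can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Fortran program itself: the function name `join_windows` is hypothetical, the input is assumed to be a list of SOH indices with one value per window start position, and "overlapping" is read as sharing at least one residue. The capping rules and the five-residue minimum length are deliberately not implemented here, because their exact order of application is not fully specified in the text.

```python
def join_windows(indices, window=8, threshold=2.5):
    """Select windows whose SOH index exceeds `threshold` and join overlaps.

    `indices[i]` is the SOH index of the window starting at 0-based
    position i. Returns a list of (start, end) tuples, both inclusive.
    """
    segments = []
    for start, value in enumerate(indices):
        if value <= threshold:
            continue  # the text requires indices greater than 2.5
        end = start + window - 1
        if segments and start <= segments[-1][1]:
            # This window shares residues with the previous selection,
            # so extend that selection instead of starting a new one.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

For example, selected windows starting at positions 0 and 2 would be joined into a single segment covering positions 0-9, while a selected window starting at position 14 would remain a freestanding segment covering positions 14-21.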
Protein Sequences and Helices-Amino acid sequences were obtained from the Swiss Protein Sequence and the Protein Identification Resource of the National Biomedical Research Foundation data banks. Structural helices were defined to be those used by Richardson and Richardson (15) and for additional proteins as identified by Presta and Rose (17). The Chou-Fasman and Garnier-Robson predictions were found with The Protein Program (Comprehensive Microcomputer Systems for Molecular Biology, DNASTAR Inc., Madison, WI).
Statistical Analyses-A correct prediction was defined in terms of the overlap of the predicted segment with the true segment, with arbitrary definitions of the required amount of overlap. If merely touching the true segment counted as a correct prediction, then a perfect score would result from predicting the entire protein to be one helical segment. Therefore, we defined a match in three ways, one of which (overlap) made that one-segment strategy a poor method of prediction. The definitions are nested. In other words, any prediction which overlaps a true structure will also touch and be near.
1) Overlap occurred when the intersection of the true and predicted segments was 50% or more of the union of the two segments. For example, if the true segment occupied sequence positions 1-10 and the prediction occupied positions 4-13, then the union of the two covered positions 1-13, while the intersection covered positions 4-10 (seven positions), and 7 out of 13 exceeded 50%. With this definition, only one prediction could match a true segment. If two predicted segments overlapped with one true helix, then the two predictions must have touched and would have been merged. The other two definitions of a correct prediction did not have this property.
2) Touching of the prediction with the true segment was defined as a nonempty intersection of the two segments less than overlap. Two predictions can touch one true segment.
3) Nearness of the prediction to the true segment was defined as the two segments being within six amino acids of each other. Again, two predictions can be near a true segment.
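The three nested definitions can be expressed compactly. The following Python sketch is one possible reading, with hypothetical function names; segments are assumed to be (first, last) residue positions, both inclusive, and "within six amino acids" is taken to mean that six or fewer residues separate the two segments.

```python
def overlap_index(true_seg, pred_seg):
    """Intersection length divided by union length of two segments."""
    (t0, t1), (p0, p1) = true_seg, pred_seg
    inter = max(0, min(t1, p1) - max(t0, p0) + 1)
    union = (t1 - t0 + 1) + (p1 - p0 + 1) - inter
    return inter / union


def separation(true_seg, pred_seg):
    """Number of residues lying between two non-intersecting segments."""
    (t0, t1), (p0, p1) = true_seg, pred_seg
    return max(p0 - t1, t0 - p1) - 1


def classify(true_seg, pred_seg, near=6):
    """Return the strictest category that the pair satisfies."""
    x = overlap_index(true_seg, pred_seg)
    if x >= 0.5:
        return "overlap"
    if x > 0:
        return "touch"
    if separation(true_seg, pred_seg) <= near:
        return "near"
    return "none"
```

Applied to the worked example in definition 1, a true segment at positions 1-10 and a prediction at positions 4-13 give an overlap index of 7/13 and are classified as an overlap. Because the function returns the strictest category, the nesting of the definitions is implicit: anything reported as "overlap" would also satisfy the touching and nearness criteria.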
We compared the rates of correct predictions for the SHA, Chou-Fasman (18), and Garnier-Robson methods. The rates were compared by means of the t test and the χ² test. For the second and third definitions of a match, if several predictions matched a true segment, then only one correct prediction resulted. The extra predictions were not counted as errors. Similarly, when one prediction matched two (or more) true segments, then two (or more) correct predictions were recorded.

RESULTS
Development of the Algorithm-Rather than progressively refining our algorithm to fit a data base, we started with a biological model and then tested how well the predictions with the algorithm derived from the model actually fit a set of established data. Simply put, the model stated that, during protein folding, nucleations of helices could be catalyzed by surfaces which recognized and stabilized the axial, hydrophobic surfaces of those helices. The axial, hydrophobic strips of nucleated helices could either associate against each other through interdigitation of hydrophobic amino acids or bind to hydrophobic regions of a growing protein structure. In addition, the initial nucleation could catalyze the extension of helical structures into adjacent segments of the protein to be preserved when an axial, hydrophobic surface also exists in such evolving helical regions or when other favorable interactions exist. Although the initial nucleating region might be the one with the strongest axial, hydrophobic strip, touching or neighboring helices with weaker axial, hydrophobic strips could also be present. Because analyses of protein structure had reported a high frequency of 2-cycle, α and 3₁₀ helices (20), our basic window for evaluation was eight amino acids (2 cycles + 1 amino acid, or 3 turns) of an α helix.
Unions of these short helices could form longer helices. Finally, local structural interactions with Asn or Pro at N termini or C termini and/or interactions with the peptide backbone of the helix could promote stability of the termini, as described by Richardson and Richardson (15) and Presta and Rose (17). These principles were reflected in the algorithm formally presented under "Materials and Methods."
Quantitation of the Quality of Predictions-Predictions with the SHA, Chou-Fasman, and Garnier-Robson methods were compared to known helices. Representative examples are presented in Table I. Of the 180 predicted helices, 49 had indices of overlap with known helices greater than or equal to 0.5 and were termed "overlapping" predictions (Table II). Of the remaining 131 predicted helices, 48 intersected known helices with overlap indices below 0.5 and were called "touching," and 22 of the remaining predicted helices were separated by six or fewer amino acids from a known helix and were called "neighboring." The remaining 61 predicted helices were not within six amino acids of a known helix and were termed "wrong" predictions. The 27 of the 146 known helices that were not within six amino acids of a predicted helix were termed "missed" predictions.
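As a consistency check on these counts, the cumulative rates can be recomputed directly. The short Python sketch below is illustrative arithmetic only; the function name is hypothetical, sensitivity is taken as correct predictions divided by the number of known helices, and efficiency is taken as correct predictions divided by the number of predictions. It also assumes that each counted prediction corresponds to one distinct known helix, which the matching rules do not guarantee for touching and neighboring predictions, so the efficiency values in particular should be read as approximations rather than as figures reported by the authors.

```python
KNOWN_HELICES = 146
PREDICTED_HELICES = 180

# Counts quoted in the text, from strictest to loosest match category.
COUNTS = (("overlapping", 49), ("touching", 48), ("neighboring", 22))


def cumulative_rates():
    """Return (level, sensitivity, efficiency) for each nested match level."""
    rows = []
    correct = 0
    for level, count in COUNTS:
        correct += count  # the definitions are nested, so counts accumulate
        rows.append((level,
                     correct / KNOWN_HELICES,
                     correct / PREDICTED_HELICES))
    return rows


for level, sensitivity, efficiency in cumulative_rates():
    print(f"{level:12s} sensitivity={sensitivity:.1%} efficiency={efficiency:.1%}")
```

Rounded to whole percentages, the cumulative sensitivities come to 34%, 66%, and 82%, which agree with the overlap, touch, and neighbor figures quoted for the 35 proteins.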
The SHA method made 26% fewer predictions than the Chou-Fasman method and 7% fewer than the Garnier-Robson method and, as a result, identified fewer of the true segments. On the other hand, in terms of the proportion of correct predictions, the SHA method was better. These comparisons are presented in Table II. The Chou-Fasman method predicted 6.9 segments/protein averaging 10.4 amino acids in length. The Garnier-Robson method predicted 5.5 segments/protein averaging 10.8 amino acids in length. The SHA predicted 5.1 segments/protein averaging 10.2 amino acids in length, significantly fewer segments than the Chou-Fasman method (p < 0.02). The Chou-Fasman method made more correct predictions than the SHA method under all three definitions of a match, significantly more (p < 0.01) when touching or nearness was also used to define a match. However, the SHA method was more efficient than Chou-Fasman under all three definitions. It was significantly better when comparing neighboring, touching, and overlapping predictions (p < 0.01). The Garnier-Robson method was never first in any of the comparisons.
Evaluation of the Propensity of the Initial Helix to Extend or Migrate-The initial site for helical nucleation could be selected according to the strength of the axial, hydrophobic strip. However, that initial helix might migrate or extend to adjacent regions and such regions might be preserved in the final structure if stabilized. To evaluate the influence of the strength of axial, hydrophobic strips on helix formation in neighboring regions, a second program was generated changing only the SOH hydrophobicity index threshold to 2.0. With this second program, the number of predictions increased and several of the predictions which were originally classified as touching or neighboring became overlapping or touching, respectively. Also, previously missed relationships became predicted. However, such a lower threshold also resulted in more wrong predictions. Overall, the lower threshold did not improve the efficiency or sensitivity of this approach.
Prediction of T Cell-presented Sequences-For four proteins in this study, some T cell-presented peptides had been experimentally determined. Such identifications did not come from exhaustive surveys of all peptides to be presented by multiple MHC alleles but, rather, from studies of principal determinants recognized in one or a few mouse strains (21-27). Nevertheless, several peptides which were T cell-presented were found to originate in helices and were predicted by the SHA (Table I). T cell-presented staphylococcal nuclease peptides 11-30, 66-78, and 89-97 were predicted to be structural helices with high SOH hydrophobicity indexes but were not helices in the native protein. This observation supported the view that binding to class I or class II MHC desetopes depends on the potential of a peptide to coil as a helix with an axial, hydrophobic strip after proteolytic excision from a protein. Alternatively, such "potential" helical peptides acquire their helical conformation while in the presence of MHC molecules.

DISCUSSION
The SOH hydrophobicity index of the SOHHA predicted, in a ranked fashion, T cell-presented sequences in antigenic proteins (10, 14). A derivative of this procedure (with the addition of simple rules to cap termini), the SHA, also predicted many helices found in proteins of known crystallographic structure. In comparison to the Chou-Fasman and Garnier-Robson helix-predicting methods, the SHA predicted fewer helices at an overlapping or touching level of quality than did the Chou-Fasman algorithm. The SHA is also computationally simpler than methods such as those of Cornette et al. (30) and Finer-Moore and Stroud (31).
The fact that many of our predictions overlap or touch known helices supports the hypothesis that a generic feature of the axial, hydrophobic strip is a propensity to catalyze helix formation. In studies of the folding of cytochrome c by Roder et al. (32), nucleation of helices at the N and C termini was indicated by early protection of protein backbone amide protons from deuterium exchange, presumably by hydrogen bonding in a helix. Other than for retention of some residual structure about the heme group in the denatured form of cytochrome c, those helices were the earliest structural elements to demonstrate deuterium exchange protection. The N- and C-terminal helices made contact with each other about Gly6 and Tyr97, thereby stabilizing the intermediate and leading to a general condensation of the remainder of the molecule. A similar study of the folding of ribonuclease S by Udgaonkar and Baldwin (33) demonstrated early protection of β-pleated sheet amide protons. The core of ribonuclease is composed of two such sheets without helices. In another study of bovine pancreatic trypsin inhibitor, Oas and Kim (34) demonstrated (through analysis of both synthetic peptides and the protein) that a C-terminal α helix, a central antiparallel β-pleated sheet, and the hydrophobic core between them were stabilized by the formation of a disulfide bond between residues 30 and 51 in the β-pleated sheet and α helix, respectively. Our finding that putative axial, hydrophobic strips predicted structural helix formation supports the view that distinct intermediates with local helical structure stabilized by hydrophobic interactions and, in some instances, by disulfide bridge formation can form in the early time course of protein folding.
Not all predicted helices with strong axial, hydrophobic strips were found to be actual helices in the native proteins. This observation was consistent with the potential extension or migration of newly formed helices to adjacent regions with axial, hydrophobic strips. The finding of known helices flanked by touching and/or neighboring predictions supported this view. In attempting to design a migration indicator based on analysis of axial, hydrophobic strips with SOH indices 2.5 > x > 2.0, we found that although some helices improved their quality of prediction, e.g. going from the touching to overlapping category, other helices did not improve and some wrong selections were made. In general, we would not decrease the stringency of initial selections to include putative helices with SOH hydrophobicity indices less than 2.5. Nevertheless, because the SOH index is a continuum of values, an investigator focusing on a single protein, for example, in the context of determining a crystallographic structure, might wish to explore such additional calculations for clues which could be evaluated in the light of an evolving structural model.
The finding that a method to predict helices in proteins also identified T cell-presented epitopes relates to the mechanism of antigen processing and presentation. DeLisi, Berzofsky, and colleagues (2, 35-37) initially found that amphipathicity correlated to potential for T cell presentation in sequences of antigenic proteins. Kaiser and Kézdy (1) observed that amphiphilicity was a generic feature leading to the binding of many polypeptide hormones to their receptors or lipid bilayers. The β-pleated sheet floors of the MHC class I desetope and of the deduced MHC class II desetope are composed of hydrophobic or uncharged amino acids with allelic variations conserving the hydrophobic character of the floor with few exceptions (only in positions 114 and 116 of HLA A2 and 29 of E; (11, 12)). The principal allele-nonspecific force in binding of T cell-presented peptides could be hydrophobic interaction of the axial, hydrophobic strip with the hydrophobic floor of the desetope. It is not known whether, in addition, the hydrophobic strip catalyzes folding of the T cell-presented peptide so that the resulting helical dipole orients and attracts the peptide to the helices bounding the desetope or permits scavenging of digested peptides by molecules, which could protect antigenic epitopes from further proteolysis and/or catalyze their transfer to desetopes (38). The view that helical coiling of digested peptides is a step in processing of T cell antigenic sequences is supported by the finding that some sequences predicted both with the SHA and with the parental SOHHA are T cell-presented although they are not helices in the native protein.