Journal of Molecular Biology
Volume 316, Issue 2, 15 February 2002, Pages 341-363

Regular article
Wavelet transforms for the characterization and detection of repeating motifs1

https://doi.org/10.1006/jmbi.2001.5332

Abstract

Repeating motifs in protein structures are thought to act as modular building blocks which allow an economical way of constructing complex proteins. In this work novel wavelet transform analysis techniques are used to detect and characterize repeating motifs in protein sequence and structure data, where the Kyte-Doolittle hydrophobicity scale (HΦ) and relative accessible surface area (rASA) data provide residue information about the protein sequence and structure, respectively. We analyze a variety of repeating protein motifs: TIM barrels, propellor blades, coiled coils and leucine-rich repeat structures. Detection and characterization of these motifs is performed using techniques based on the continuous wavelet transform (CWT). Results indicate that the wavelet transform techniques developed herein are a promising approach for the detection and characterization of repeating motifs for structural data and, in some instances, sequence data.

Introduction

Repeating sequences and structures are common in nucleic acids and proteins. A recent survey indicates that 14% of proteins contain sequence repeats, half the number which are contained in nucleic acid sequences.1 Protein repeats come in considerable variety, ranging from repeats of a single residue, through heptad repeats in coiled coils and motif repeats (e.g. propellor blades), to the repetition of homologous domains of 100 or more residues. Here, we are primarily concerned with motif repeats, which are by definition secondary or supersecondary structural units (α-helices, β-strands, β-sheets, Rossmann folds, etc.) connected together by short, sometimes variable, lengths of peptide in a repeating pattern. In terms of tertiary structure, protein motif repeats can be viewed as modular building blocks which allow an economical way of constructing complex protein topologies.2 Motif repeats are commonly observed in proteins as a single motif repeated in tandem fashion along the protein sequence. A compendium of repeats is displayed in Figure 1(a) to (f).

Current protein repeat detection methods from sequence utilize standard sequence comparison algorithms adapted to find repeats. Andrade et al.10 use optimal and sub-optimal score distributions from profile analysis to find homologous families for 11 kinds of tandem repeats, which once detected, may be used to identify additional repeats in any other sequence. Repeats are identified based on the probabilities of finding matches of different sub-optimal alignments when compared to random sequences. Pellegrini et al.11 utilize multiple alignment techniques, based on a modified version of the Smith-Waterman dynamic programming algorithm,12 where a sequence is aligned against itself enabling internal repeats to be found. Heger & Holm13 use a similar but more refined technique which can validate distant repeats by profile alignment and optimizes repeat borders to yield a maximal integer number of repeats. For these methods detection of repeats is straightforward when the repeat in question is perfect. However, detection is complicated when evolution erodes any sequence similarity and when insertions, deletions and substitutions corrupt the regularity of the repeating pattern. Furthermore, repeats may be incomplete, widely spaced and be of multiple types interspersed throughout a sequence. As a consequence of these complications some repeats are not detected by current methods. To our knowledge, there is not yet even an automated method to assign repeats from 3D structure, which would provide valuable comparative data for assessing the performance of sequence-based structural repeat predictors.
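The self-alignment idea mentioned above, in which a sequence is aligned against itself to expose internal repeats, can be illustrated with a minimal sketch. This is not the method of Pellegrini et al. or of Heger & Holm: the function name `self_local_alignment`, the flat match/mismatch/gap scores (in place of a substitution matrix) and the toy sequence are all assumptions made for illustration. The dynamic programming is restricted to cells above the main diagonal so that the trivial alignment of the sequence with itself is excluded.

```python
def self_local_alignment(seq, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman-style local alignment of seq against itself.

    Only cells with j > i are filled, so the trivial identity
    alignment on the main diagonal is never scored.
    Returns (best_score, end_i, end_j) using 1-based end positions.
    Scoring values are illustrative assumptions, not a real
    substitution matrix.
    """
    n = len(seq)
    H = [[0] * (n + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # extend along the diagonal
                H[i - 1][j] + gap,    # gap in the second copy
                H[i][j - 1] + gap,    # gap in the first copy
            )
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best


# Toy sequence with the block ACDEFG repeated at an offset of 8 residues.
score, end_i, end_j = self_local_alignment("ACDEFGKKACDEFGPP")
repeat_offset = end_j - end_i
```

The offset between the two end positions gives the spacing of the best-scoring internal repeat. As noted in the text, this kind of approach degrades once substitutions, insertions and deletions erode the similarity between repeat copies.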

In this work an alternative approach to repeat detection is adopted. A suite of continuous wavelet transform analysis techniques will be used to detect and characterize a selection of repeating protein motifs from both sequence and structural data. For sequence data wavelet transform analysis can be considered as an ab initio approach to motif detection and characterization. Introduced in the early 1980s, wavelets have become a popular signal analysis tool due to their ability to elucidate simultaneously both spectral and temporal information within the signal. This overcomes the basic shortcoming of Fourier analysis, which is that the Fourier coefficients contain only globally averaged information, thus leading to location specific features in the signal being lost.14 Applications of wavelet analysis are now widespread and cover many fields of scientific research, including medical science, geophysics, engineering testing, image analysis, financial signal analysis and the topic of interest herein, proteins, where the dimension of “time” becomes that of sequence distance.

The body of literature concerning wavelet transform analysis and proteins (and DNA) is relatively small and comparatively recent. For proteins, wavelets have been used to predict hydrophobic cores from hydropathy data,15 the location of highly conserved residues in the hormone prolactin from electron ion interaction potential data,16 the structural families of protein hydrophobicity sequences17 and the location and topology of helices in transmembrane proteins.18 Other wavelet-based research has focused on DNA, where wavelets have identified regular features in nucleotides,19 the genome of Chinese hamster cells,20 transcriptive yeast cell cycle microchip data,21 and non-coding sequences.22 We now look at three of the above references in more depth. Hirakawa et al.15 define a wavelet-based method to predict the hydrophobic cores of globular proteins from hydropathy sequence data with 70% accuracy. This method predicts hydrophobic cores by thresholding the smallest wavelet scale to eliminate hydrophilic/neutral regions. It is worth noting that for these data sequential alignment techniques can predict hydrophobic cores with 76% accuracy,23 but tend to perform poorly when there are no homologues or sequence similarity is low; for these cases the wavelet-based method can predict cores at nearly 70% accuracy. Mandell et al.17 apply the continuous wavelet transform to protein hydrophobicity sequences. This technique suggests which structural family the sequence belongs to, for one example each of α, β and αβ proteins. The protein structure and its fractal dimension are used as reference criteria for the analysis. Lio & Vannucci18 have developed a discrete wavelet threshold technique to predict the location and topology of helices in transmembrane proteins. Predictions are made by applying discrete wavelet thresholding to a new propensity scale generated from 1087 transmembrane domain sequences. This method works by wavelet transforming the data to generate wavelet coefficients, then shrinking or setting to zero those coefficients below a certain size. A denoised signal is recovered by inverse transforming these thresholded coefficients. When compared to empirical methods based on hydrophobicity and/or helical propensity data for a test set of 83 proteins, it was found that this method permits an improvement in the automatic location of transmembrane helices.
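The three steps of the thresholding procedure described above (forward transform, thresholding of small coefficients, inverse transform) can be sketched as follows. The choice of the Haar wavelet, the hard-threshold rule and all function names are assumptions made to keep the example short; the wavelet family and threshold rule actually used by Lio & Vannucci are not stated here. The input length is assumed to be a power of two.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """Multilevel orthonormal Haar DWT; len(x) must be a power of two.

    Returns (approx, details) where details[0] is the finest scale.
    """
    approx = list(x)
    details = []
    while len(approx) > 1:
        idx = range(0, len(approx), 2)
        detail = [(approx[k] - approx[k + 1]) / SQRT2 for k in idx]
        approx = [(approx[k] + approx[k + 1]) / SQRT2 for k in idx]
        details.append(detail)
    return approx, details


def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest scale down."""
    approx = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for a, d in zip(approx, detail):
            rebuilt.append((a + d) / SQRT2)
            rebuilt.append((a - d) / SQRT2)
        approx = rebuilt
    return approx


def hard_threshold(details, thr):
    """Set detail coefficients with magnitude below thr to zero."""
    return [[d if abs(d) >= thr else 0.0 for d in level] for level in details]


def denoise(x, thr):
    """Transform, threshold the detail coefficients, inverse transform."""
    approx, details = haar_forward(x)
    return haar_inverse(approx, hard_threshold(details, thr))


# A step from about 1 to about 5 with small fluctuations superimposed.
noisy = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0]
smooth = denoise(noisy, thr=0.5)
```

With a threshold of zero the signal is reconstructed unchanged; with a larger threshold only the coefficient carrying the large step survives, so the small fluctuations are removed while the location of the step is preserved.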

Section snippets

Wavelet theory

In this work the continuous wavelet is the preferred wavelet representation; justification for its use can be found in the Appendix. A brief summary of continuous wavelet transform theory and a description of some wavelet tools which assist interpretation of wavelet coefficients is presented; more details are given in the Appendix. Both wavelet theory and tools are illustrated by simple examples which outline some of the key concepts of this technique.
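As a rough illustration of how a continuous wavelet transform localizes a repeating pattern in both position and scale, the following sketch computes a direct (unoptimized) CWT of a sampled signal. The Mexican hat wavelet, the 1/sqrt(a) normalisation, unit sample spacing and the function names are assumptions made for this example and are not necessarily the choices made in this work.

```python
import math


def mexican_hat(t):
    """Mexican hat (second derivative of a Gaussian), unnormalised."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def cwt(signal, scales):
    """Direct CWT by summation; rows[i][b] is the coefficient at
    scale scales[i] and position b.

    The wavelet is real and symmetric, so no conjugation is needed.
    The 1/sqrt(a) factor is one common normalisation (an assumption).
    """
    n = len(signal)
    rows = []
    for a in scales:
        row = []
        for b in range(n):
            acc = 0.0
            for k in range(n):
                acc += signal[k] * mexican_hat((k - b) / a)
            row.append(acc / math.sqrt(a))
        rows.append(row)
    return rows


def dominant_scale(rows, scales, lo, hi):
    """Scale whose squared coefficients, summed over positions
    lo <= b < hi, are largest (interior only, to limit edge effects)."""
    energies = [sum(c * c for c in row[lo:hi]) for row in rows]
    return scales[energies.index(max(energies))]


# A pure sinusoid with a period of 20 samples stands in for a
# perfectly regular repeat.
sig = [math.sin(2.0 * math.pi * k / 20.0) for k in range(200)]
scales = list(range(1, 13))
best = dominant_scale(cwt(sig, scales), scales, 50, 150)
```

Under these assumptions the energy of a sinusoid of period 20 peaks near scale 5, i.e. the dominant scale is roughly a quarter of the repeat period. A different wavelet or normalisation would change this conversion factor, so any scale-to-length mapping has to be calibrated for the wavelet actually used.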

The continuous wavelet transform (CWT)

The wavelet transform of a continuous

Data types

The data utilized in this study are relative accessible surface area (rASA) and simple hydrophobicity (HΦ) which, for each residue of a protein, provide information derived from the protein structure and sequence, respectively. rASA is generated using Hubbard’s NACCESS program,27 which implements the Lee & Richards accessibility calculation.28 This measures the relative accessibility of every residue to solvent in the 3D protein structure. More specifically rASA calculates the accessible surface of

Propellor

In this section wavelet-transformed rASA and HΦ data for the C-terminal four-bladed propellor domain of rabbit serum haemopexin (RCSB PDB code 1hxn, Faber et al.31) are used to detect the location of each propellor blade in the protein and provide information about the topology within the blade. This protein “is a serum glycoprotein that binds haem reversibly and delivers it to the liver where it is taken up by receptor mediated endocytosis”31 (see Figure 4(a)). This domain consists of a

Discussion

Protein repeats are important as they are very common and clearly reflect the evolutionary development of stable proteins. Here wavelet transform analysis for the detection and characterization of repeating motifs from rASA and HΦ data can be considered a success for all but the coiled coil of cortexillin I (using the HΦ data), although these data provide information on the location of interhelical salt bridges. Furthermore, by investigating high energy scales smaller than the motif scales it

Acknowledgements

We thank David Jones, Sheena Radford and Adrian Shepherd for useful discussions. This work is supported by the BBSRC.

References (47)

  • M. Altaiski et al.

    Wavelet analysis of DNA sequences

    Genet. Anal.

    (1996)
  • H. Wako et al.

    Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. I. Solvent accessibility classes

    J. Mol. Biol.

    (1994)
  • J. Kyte et al.

    A simple method for displaying the hydropathic character of a protein

    J. Mol. Biol.

    (1982)
  • H.R. Faber et al.

    1.8 Å crystal structure of the C-terminal domain of rabbit serum haemopexin

    Structure

    (1995)
  • P. Burkhard et al.

    The coiled-coil trigger site of the rod domain of cortexillin I unveils a distinct network of interhelical and intrahelical salt bridges

    Struct. Fold. Des.

    (2000)
  • P. Fey et al.

    Cortexillin I is required for development in Polysphondylium

    Dev. Biol.

    (1999)
  • A. Lupas

    Predicting coiled-coil regions in proteins

    Curr. Opin. Struct. Biol.

    (1997)
  • J. Walshaw et al.

    Socket: a program for identifying and analysing coiled-coil motifs within protein structures

    J. Mol. Biol.

    (2001)
  • D.W. Banner et al.

    Atomic coordinates for triose phosphate isomerase from chicken muscle

    Biochem. Biophys. Res. Commun.

    (1976)
  • C.A. Orengo et al.

    SSAP: sequential structure alignment program for protein structure comparison

    Methods Enzymol.

    (1996)
  • E.M. Marcotte et al.

    Census of protein repeats

    J. Mol. Biol.

    (1998)
  • R.A. Sayle et al.

    RASMOL: biomolecular graphics for all

    Trends Biochem. Sci.

    (1995)
  • J. Sondek et al.

    Crystal structure of a G-protein beta gamma dimer at 2.1 Å resolution

    Nature

    (1996)

    1

    Edited by G. von Heijne
