Journal of Molecular Biology
Volume 295, Issue 3, 21 January 2000, Pages 613-625

Regular article
Identification of related proteins on family, superfamily and fold level¹

https://doi.org/10.1006/jmbi.1999.3377

Abstract

Proteins might have considerable structural similarities even when no evolutionary relationship of their sequences can be detected. This property is often referred to as the proteins sharing only a “fold”. Of course, there are also sequences of common origin in each fold, called a “superfamily”, and within them groups of sequences with clear similarities, designated a “family”. Developing algorithms to reliably identify proteins related at any level is one of the most important challenges in the fast-growing field of bioinformatics today. However, it is not at all certain that a method proficient at finding sequence similarities performs well at the other levels, or vice versa.

Here, we have compared the performance of various search methods on these different levels of similarity. As expected, we show that it becomes much harder to detect related proteins as their sequences diverge. For family-related sequences the best method gets 75 % of the top hits correct. When the sequences differ but the proteins belong to the same superfamily this drops to 29 %, and in the case of proteins with only fold similarity it is as low as 15 %. We have made a more complete analysis of the performance of different algorithms than earlier studies, also including threading methods in the comparison. With this approach a more detailed picture emerges, showing that multiple sequence information improves detection on the two closer levels of relationship. We have also compared the different methods of including this information in prediction algorithms.

For lower specificities, the best scheme to use is a linking method connecting proteins through an intermediate hit. For higher specificities, better performance is obtained by PSI-BLAST and some procedures using hidden Markov models. We also show that a threading method, THREADER, performs significantly better than any other method at fold recognition.

Introduction

As the genome projects proceed, we are presented with an exponentially increasing number of protein sequences without any knowledge of their structure or biochemical function. Since structure and function determination is a non-trivial task even for a single protein, the best way to gain understanding of all these sequences is to relate them to other proteins with known properties by searching databases. Improving the search algorithms is one of the fundamental challenges in bioinformatics today. By determining how sequences are related to known proteins we can make predictions of their structural, functional and evolutionary features. The relationships between proteins span a broad range, from the case of almost identical sequences to apparently unrelated sequences sharing only rough 3D structure. This poses different challenges to the detection algorithms used: a method excellent at finding sequence similarity might not perform very well in the case of only structural relationship, or vice versa.

Here we have compared the performance of various recognition methods on different levels of similarity, extending earlier studies by (i) completely separating relationship levels; (ii) including more categories of sequence recognition methods; and (iii) including fold recognition (threading) methods. This has resulted in new insight into how best to utilize evolutionary and structural information, as well as ideas on the relative merits of different methods.

During the last few years several excellent studies have changed the view of the best methods to detect relationships between proteins. These studies differ in detail but have one common denominator: they use the Scop classification by Murzin et al. (1995) to create a benchmark for evaluating the performance of different recognition methods. Scop is a hierarchical scheme where each protein domain is classified into a family, which in turn belongs to a superfamily that is a subclassification of the fold category. The Scop database is to a large extent hand-tuned by Alexei Murzin, who gives the following explanations of the levels: proteins sharing a family have a “clear evolutionary relationship”; those within a superfamily are of “probable common evolutionary origin”; while the fold level is characterized by “major structural similarity”. This manual classification of the proteins makes Scop independent of any specific sequence or structure comparison algorithm and thereby ideal for comparison between such methods. Furthermore, Scop is considered to be of very high quality.

Three earlier studies (Abagyan & Batalov, 1997; Brenner et al., 1998; L. Arvestad et al., unpublished results) of sequence comparison methods have resulted in a rather clear picture with some important conclusions: (i) the common method of describing similarity as the fraction of identical residues should be abandoned; (ii) the exact choice of parameters such as gap penalties is crucial when choosing the best methods; and (iii) heuristic methods such as FASTA (Pearson & Lipman, 1988; Pearson, 1995) and BLAST2 (Altschul et al., 1997) do not perform as well as methods using the optimal alignments. These studies differ in the hierarchical level of Scop used: Brenner et al. (1998) chose the superfamily classification, Abagyan & Batalov (1997) the fold level, while L. Arvestad et al. (unpublished results) studied both fold and family classifications. In practice, however, all three studies compared the power of methods to identify relationships within a family. This is because the hits on, e.g., fold level include pairs on superfamily and family level. For obvious reasons it is much easier to identify proteins within a common family; therefore, this part will dominate the correctly identified pairs. In several other studies the Scop classification has also been used to compare different fold recognition methods (Rice & Eisenberg, 1997; Hargbo & Elofsson, 1999). In these cases all hits within a family (or superfamily) were ignored. To extend our understanding of different search methods we have made a complete separation of the different levels in the classification. We have thus discarded family hits when studying superfamilies and ignored both family and superfamily hits when judging fold level performance. This makes it possible to distinguish methods performing well on different levels in the Scop hierarchy.
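The separation of levels described above can be illustrated with a minimal sketch. It is not the authors' code: the tuple encoding of a Scop classification, the function names, and the choice to count a pair related only at a more distant level as incorrect are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Hypothetical encoding of a Scop classification:
# (class, fold, superfamily, family)
Scop = Tuple[str, int, int, int]

LEVELS = ["family", "superfamily", "fold"]  # closest to most distant


def closest_level(a: Scop, b: Scop) -> Optional[str]:
    """Return the closest Scop level shared by two domains, or None."""
    if a[:4] == b[:4]:
        return "family"
    if a[:3] == b[:3]:
        return "superfamily"
    if a[:2] == b[:2]:
        return "fold"
    return None


def label_pair(a: Scop, b: Scop, level: str) -> Optional[bool]:
    """Label a pair for a benchmark restricted to one level.

    True  = correct hit at this level
    None  = ignored (the pair is already related at a closer level)
    False = incorrect hit (assumption: pairs related only at a more
            distant level, or not at all, count as incorrect)
    """
    shared = closest_level(a, b)
    if shared is None:
        return False
    if shared == level:
        return True
    if LEVELS.index(shared) < LEVELS.index(level):
        return None
    return False
```

With this labelling, a pair from the same family is dropped from the superfamily and fold benchmarks rather than counted as an easy success.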

Even within a group of proteins with a common origin, a single pair of sequences can differ quite substantially in composition. For many years it has been assumed that the use of evolutionary information to create a multiple sequence alignment helps to detect such distant relationships. However, only recently was this clearly shown to be true using a large and comprehensive benchmark (Park et al., 1997; Park et al., 1998). In the latter of these studies it was shown that including evolutionary information detects three times as many remote homologues at the same false positive rate. The authors also concluded that the best way of employing multiple sequence information was to use it in the iterative SAM-T98 hidden Markov model. In a similar study, using CATH (Orengo et al., 1997) rather than Scop as the reference database, Salamov et al. (1999a) found results similar to those of Park et al. (1998).

There are several quite different ways of using evolutionary information (see Figure 1). One possibility, used in e.g. PFAM (Sonnhammer et al., 1997), is to start from a family already aligned and then search for more members belonging to the same set. PSI-BLAST and the hidden Markov models used by Park et al. (1998) utilize an iterative approach that starts from a single sequence, finds all related sequences and creates a multiple sequence alignment. From this alignment a new iteration is started, and this procedure is repeated until it converges. A third alternative is to consider two proteins to be related if they are identified by the search algorithm directly or if they both are found to be related to a third protein domain (Holm & Sander, 1997; Park et al., 1997; Abagyan & Batalov, 1997; Salamov et al., 1999b; L. Arvestad et al., unpublished results). All these methods have different strengths: the direct search method is clearly fastest, with a runtime linear in the size of the database, while the iterative method should be at least n times slower, n being the number of iterations made. Often the iterative method is even slower, since a larger database is used for the iterative searches. The last method should only be a factor of two slower, but also in this case a larger database for the intermediate search is common, making it substantially slower than the direct approach.
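The third alternative, linking through an intermediate hit, can be sketched as follows. This is a simplified illustration rather than the procedure used in any of the cited studies: the `direct` dictionary of significant hits per query and the function name `related` are hypothetical, and score thresholds are assumed to have been applied beforehand.

```python
from typing import Dict, Set


def related(direct: Dict[str, Set[str]], a: str, b: str) -> bool:
    """Decide whether two proteins are related, directly or via a link.

    `direct` maps each query to the set of database entries its search
    reported as significant hits (hypothetical, precomputed input).
    """
    hits_a = direct.get(a, set())
    hits_b = direct.get(b, set())
    # Direct relationship: one search finds the other protein.
    if b in hits_a or a in hits_b:
        return True
    # Linked relationship: both searches find a common third entry.
    return bool(hits_a & hits_b)


# Toy example: A and B both hit the intermediate X; C hits only Y.
direct_hits = {"A": {"X"}, "B": {"X"}, "C": {"Y"}, "D": {"A"}}
print(related(direct_hits, "A", "B"))  # linked through X
print(related(direct_hits, "A", "C"))  # no direct hit, no shared hit
```

Because each protein is searched once and the hit sets are then intersected, the cost is roughly that of two direct searches per pair, which matches the factor of two mentioned above when the same database is used.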

There are many proteins with similar structure where no obvious homology has been detected. Methods developed to identify this structural relationship are often referred to as fold recognition (or threading) methods. They can roughly be divided into two categories: prediction-based methods (Sheridan et al., 1985; Fischer & Eisenberg, 1996; Rice & Eisenberg, 1997; Di Francesco et al., 1997; Rost et al., 1997; Hargbo & Elofsson, 1999) and structural methods (Bowie et al., 1991; Jones et al., 1992; Flockner et al., 1997). Besides these two categories it is of course possible to use purely sequence-based methods even for fold recognition (Karplus et al., 1997), or to combine several approaches (Elofsson et al., 1996).

The structure-based methods differ from all others described here, since they do not directly use any sequence information to detect whether two proteins share a fold or not. Instead they create an energy function describing how well a probe sequence matches a target fold. The energy function is often obtained from a database of known protein structures and may for instance describe the environment of each residue (Bowie et al., 1991), or the probability of finding two residues at a certain distance from each other (Jones et al., 1992; Flockner et al., 1997).

Proteins having a similar fold by definition also have very similar secondary structure, meaning that even when amino acid compositions are unrelated the secondary structure should largely be the same within a fold. Since secondary structure can be predicted with an accuracy of more than 70 % today (Rost & Sander, 1993), several attempts have been made to use this information to improve fold recognition methods (Fischer & Eisenberg, 1996; Rice & Eisenberg, 1997; Rost et al., 1997; Hargbo & Elofsson, 1999). These methods add a positive score to the sequence alignment score if the predicted secondary structure for a certain residue agrees with the secondary structure state of the residue it is aligned to.
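The scoring idea in the last sentence can be sketched for an already aligned, ungapped pair of sequences. The identity-based substitution function, the bonus value of 1.0, and the names `toy_subst` and `aligned_score` are assumptions for illustration only; the cited methods use real substitution matrices, gap handling and their own weighting.

```python
from typing import Callable


def toy_subst(a: str, b: str) -> float:
    """Hypothetical substitution score: match 2.0, mismatch -1.0."""
    return 2.0 if a == b else -1.0


def aligned_score(
    query: str,
    target: str,
    query_pred_ss: str,
    target_ss: str,
    subst: Callable[[str, str], float] = toy_subst,
    ss_bonus: float = 1.0,
) -> float:
    """Score an ungapped alignment with a secondary structure bonus.

    For each aligned position the substitution score is used, and
    `ss_bonus` is added when the predicted secondary structure state of
    the query residue equals the observed state of the target residue.
    """
    total = 0.0
    for q, t, q_ss, t_ss in zip(query, target, query_pred_ss, target_ss):
        total += subst(q, t)
        if q_ss == t_ss:
            total += ss_bonus
    return total


# Three aligned positions; states are H (helix) and E (strand).
print(aligned_score("ACD", "AGD", "HHE", "HEE"))
```

Setting `ss_bonus` to zero recovers the plain sequence score, which makes it easy to see how much the secondary structure term contributes to a given alignment.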

Every two years, starting in 1994, the Casp conference has been organized to evaluate the ability to blindly predict the structure of proteins (Moult et al., 1997). Blind prediction was deemed necessary to evaluate the different methods, as it was considered difficult to avoid creating a biased benchmark. One important outcome from the Casp process regarding fold recognition is that several groups using fundamentally different methods consistently perform very well. An extreme example is the excellent predictions by Murzin in Casp2, where no fold recognition methods were used but mainly biochemical knowledge (Murzin & Bateman, 1997). However, a complication in Casp when evaluating fold recognition methods is the mix of fold and (easier) superfamily level targets.

It is thus our belief that a complementary way of assessing fold recognition methods is to use a complete benchmark while simultaneously separating the levels of similarity by ignoring hits also present in lower levels. Unfortunately, few fold recognition methods are publicly available, and others are still very time consuming. Therefore, we have only used three different methods, all showing some success in the latest Casp process: THREADER (Jones et al., 1992), SAM-T98 (Karplus et al., 1997) and ssHMM (Hargbo & Elofsson, 1999). The results from the fold recognition methods were compared with results from standard sequence alignment methods.


Results

The results of the all-against-all comparison of the 976 protein sequences are summarized as specificity-sensitivity (spec-sens) curves in Figures 2, 3 and 4 and as top ranks in Tables 1, 2 and 3.
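As a rough illustration of the two summary measures, the sketch below builds a spec-sens curve from scored pairs and computes the fraction of queries with a correct top hit. The definitions used here are assumptions, since this excerpt does not state them: specificity is taken as correct hits divided by all hits above a score cut-off, and sensitivity as correct hits divided by all true pairs. The function names and input layout are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def spec_sens_curve(
    scored_pairs: Sequence[Tuple[float, Optional[bool]]],
    n_true: int,
) -> List[Tuple[float, float]]:
    """Return (specificity, sensitivity) after each ranked hit.

    `scored_pairs` holds (score, is_correct); a higher score means a
    more confident hit. Pairs labelled None are ignored, for example
    pairs already related at a closer level. Ties are not treated
    specially in this sketch.
    """
    ranked = sorted(
        (p for p in scored_pairs if p[1] is not None),
        key=lambda p: p[0],
        reverse=True,
    )
    curve = []
    tp = fp = 0
    for _score, correct in ranked:
        if correct:
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / n_true))
    return curve


def top_rank_fraction(top_hit_correct: Sequence[bool]) -> float:
    """Fraction of queries whose highest-scoring hit is correct."""
    return sum(top_hit_correct) / len(top_hit_correct)


pairs = [(9.0, True), (8.0, True), (7.0, False), (6.0, True)]
for spec, sens in spec_sens_curve(pairs, n_true=4):
    print(f"specificity={spec:.2f} sensitivity={sens:.2f}")
```

Under these assumed definitions, the sensitivity at 100 % specificity is the fraction of true pairs ranked above the first incorrect hit.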

Starting on family level, Table 1 shows that the best method, the linking algorithm, finds 75 % of the sequences in top rank and that all methods except THREADER find more than 65 % sequences in first place. Figure 2 shows that at 100 % specificity the best sensitivity is obtained by HMMER-PSIBLAST with 40 %,

A total of 40% of family level pairs but only 4% of superfamily pairs can be detected reliably

The most common use of sequence comparison methods is to search in databases to find proteins belonging to the same family, i.e. those with similar function and clear evolutionary relationship. Both the results from ranking (see Table 1), and from the spec-sens curves (see Figure 2), indicate that the best performance is obtained by BLAST-LINK. The exception is at specificities above 97 %, where PSI-BLAST and HMMER-PSIBLAST perform better. It should be noted that all methods using sequence

Conclusions

Detecting related proteins is of extreme importance as the genome projects proceed, as this is the best method to assign structural, evolutionary and functional knowledge to a gene. For many years the standard method for detecting relationships between two proteins was to use a pairwise sequence alignment method. It was generally assumed that using multiple sequence alignments helps to find more proteins, but only limited large scale benchmarking was done until recently. A few years ago things

Benchmark database

In order to assess the performance of protein recognition algorithms it is important to use a large and broad set of related and unrelated protein domains with few errors. We created our benchmark by starting from the PDB40d set of Scop version 1.37. This database consists of a Scop subset where no two proteins have more than 40 % sequence identity (Brenner et al., 1998). Since some of the algorithms needed the secondary structure and multiple sequence alignments, we used the definition in the

Acknowledgements

This work was supported by grants from the Swedish Natural Sciences Research Council and the Swedish Research Council for Engineering Sciences to A.E. We thank Jeanette Hargbo, Björn Larsson, Erik Wallin and Gunnar von Heijne for valuable discussions and help.

References

  • A. Bairoch et al. The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucl. Acids Res. (1996)
  • J.U. Bowie et al. A method to identify protein sequences that fold into a known three-dimensional structure. Science (1991)
  • S.E. Brenner et al. Assessing sequence comparison methods with reliable structurally identified evolutionary relationships. Proc. Natl Acad. Sci. USA (1998)
  • V. Di Francesco et al. Fold recognition using predicted secondary structure sequences and hidden Markov models of protein folds. Proteins: Struct. Funct. Genet. (1997)
  • S.R. Eddy. Profile hidden Markov models. Bioinformatics (1998)
  • D. Fischer et al. Protein fold recognition using sequence-derived predictions. Protein Sci. (1996)

¹ Edited by F. C. Cohen
