Long indels are disordered: A study of disorder and indels in homologous eukaryotic proteins

https://doi.org/10.1016/j.bbapap.2013.01.002

Abstract

Proteins evolve through point mutations as well as by insertions and deletions (indels). During the last decade it has become apparent that protein regions that do not fold into three-dimensional structures, i.e. intrinsically disordered regions, are quite common. Here, we have studied the relationship between protein disorder and indels using HMM–HMM pairwise alignments in two sets of orthologous eukaryotic protein pairs. First, we show that disordered residues are much more frequent among indel residues than among aligned residues, and are also more prevalent among indels than in coils. Second, we observed that disordered residues are particularly common in longer indels. Disordered indels of short-to-medium size are prevalent in the non-terminal regions of proteins, while the longest indels, ordered and disordered alike, occur toward the termini of the proteins, where new structural units are comparatively well tolerated. Finally, while disordered regions often evolve faster than ordered regions and disorder is common in indels, there are some previously recognized protein families where the disordered region is more conserved than the ordered region. We find that these rare proteins are often involved in information processes, such as RNA processing and translation. This article is part of a Special Issue entitled: The emerging dynamic view of proteins: Protein plasticity in allostery, evolution and self-assembly.

Highlights

► Disordered residues are more frequent among indels than among aligned residues. ► Disordered residues are particularly common in longer indels. ► Ordered non-terminal indels are short. ► The longest indels, ordered and disordered, occur toward the termini of the proteins.

Introduction

A number of different genetic mechanisms cause mutations in coding genes, ranging in size from point mutations, through insertions and deletions (indels) of a few residues, to rearrangements of protein domains and fusion of entire genes. In general, mutations occur at random but are under selective pressure. One general result of this is that residues in the core of a protein are more likely to be maintained through evolution compared to those on the surface of the protein [1]. Further, short indel events are more likely to occur in loops than in secondary structures.

Short indels occur by, for instance, DNA replication slippage during replication or repair [2]. Longer extensions can occur through the conversion of 3′ UTRs into coding regions [3] and through cassette duplications of protein domain repeats, a feature that is particularly common in higher eukaryotes [4]. Novel coding regions may also be created through tandem repetitions of short nucleotide sequences (microsatellites) within the coding region [5].

As some regions of a protein are less crucial to its function than others, it is reasonable to assume that indels within some regions are less likely to be deleterious than indels in other regions. Short indels that become fixed in the population preferentially occur in solvent-accessible loop regions [6]. Longer indel events involve the insertion or deletion of entire protein domains, primarily at the N- and C-termini of proteins [7] but also, when it comes to repeated domains, within the central parts of a protein [7]. The selective pressure acting on these longer indel events is less well understood. However, in the case of repeated proteins it is clear that the duplication of particular domain combinations is strongly favored [8]. The large length variation caused by indels of several protein repeat domains affects the binding properties of the proteins, i.e. longer indel events are often associated with functional changes [9].

During the last decade it has become evident that while most proteins contain folded domains, and indeed most proteins contain more than one domain [10], some proteins are partially or even fully disordered [11], [12], [13]. These sequences are characterized by two primary features: (i) a low level of hydrophobicity, which precludes the formation of a stable globular core; and (ii) a high net charge, which favors an extended structural state due to electrostatic repulsion [14]. As a consequence of these properties, intrinsically disordered proteins are, in general, more expanded under native conditions than foldable proteins [15].
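The two sequence features named above can be estimated directly from an amino acid sequence. The following is a minimal sketch, not the method used in this study. It assumes the Kyte-Doolittle hydropathy scale (rescaled to 0..1) as the hydrophobicity measure, and it counts only K/R as positive and D/E as negative, ignoring histidine and the termini. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values per residue.
# The choice of this scale is an assumption made for illustration only.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean hydropathy of seq, with each value rescaled from [-4.5, 4.5] to [0, 1]."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum((KD[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq):
    """Absolute net charge per residue, counting K/R as +1 and D/E as -1.

    Simplification (assumption): histidine and the chain termini are ignored.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return abs(positive - negative) / len(seq)
```

A sequence with low mean hydropathy and high mean net charge would, under these assumptions, fall on the side of sequence space associated with disorder; a dedicated predictor is still needed for per-residue assignments.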

One important observation concerning intrinsically disordered regions is that they are far less common in prokaryotes than in eukaryotes [16], suggesting that disorder could be a component required for higher complexity [17], although another possible reason for this finding is the compactness that characterizes prokaryotic genomes [18]. Intrinsically disordered regions are in general fast evolving, but there are also examples of highly conserved intrinsically disordered regions [14], [19]. Further, many intrinsically disordered regions are important for binding [13], and intrinsically disordered regions are a common feature of the hubs in the protein–protein interaction network of Saccharomyces cerevisiae [20], [21].

Here, we present an investigation into insertions and deletions within disordered regions. We show that indels, here defined as regions that are aligned against gaps, contain many more disordered residues than aligned positions. Further, the longer the indel, the more likely it is to be disordered. Finally, among the proteins where the disordered region is at least as conserved as the ordered region, we find an overrepresentation of proteins that are involved in processes related to translation.
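The definition of an indel as a region aligned against gaps can be made concrete with a short sketch. It assumes a pairwise alignment given as two equal-length gapped strings (with "-" as the gap character) and a per-residue list of booleans marking predicted disorder for the first sequence. The function names and the input format are hypothetical and are not taken from the study's pipeline.

```python
def indel_runs(aligned_a, aligned_b):
    """Return (start, length) for each run of A residues aligned against gaps in B.

    start is given in ungapped coordinates of sequence A.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    pos = 0        # index into the ungapped sequence A
    start = None   # start of the currently open indel run, if any
    for a, b in zip(aligned_a, aligned_b):
        in_indel = a != "-" and b == "-"
        if in_indel and start is None:
            start = pos
        elif not in_indel and start is not None:
            runs.append((start, pos - start))
            start = None
        if a != "-":
            pos += 1
    if start is not None:
        runs.append((start, pos - start))
    return runs


def disorder_fractions(aligned_a, aligned_b, disordered):
    """Fraction of disordered residues among indel and among aligned residues of A.

    disordered holds one boolean per ungapped residue of A.
    Returns (indel_fraction, aligned_fraction); a fraction is NaN if its class is empty.
    """
    if len(disordered) != len(aligned_a.replace("-", "")):
        raise ValueError("disorder list must match ungapped length of A")
    indel_positions = {
        i for start, length in indel_runs(aligned_a, aligned_b)
        for i in range(start, start + length)
    }
    indel_dis = sum(1 for i in indel_positions if disordered[i])
    aligned_dis = sum(1 for d in disordered if d) - indel_dis
    n_indel = len(indel_positions)
    n_aligned = len(disordered) - n_indel

    def fraction(count, total):
        return count / total if total else float("nan")

    return fraction(indel_dis, n_indel), fraction(aligned_dis, n_aligned)
```

The run lengths returned by `indel_runs` could also be binned to examine how the disordered fraction changes with indel length, and the run start positions could be compared with the sequence length to separate terminal from non-terminal indels.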

Results and discussion

We have applied two disorder predictors, Iupred [22] and Disopred, to analyze the evolutionary patterns of disordered residues, in particular with respect to indels. There are many flavors of protein disorder [13], [23]. For instance, short and long disordered regions appear to perform different functional roles, where the short disordered regions often serve as loops in otherwise structurally ordered proteins [16]. Such regions are less conserved than their structured surroundings [24], whereas

Conclusion

Here, we have studied homologous proteins from C. elegans and D. melanogaster, as well as homologous fungal proteins, with regard to the disorder content of indels. Because distantly related proteins, and disordered proteins in particular, are difficult to align even with state-of-the-art HMM–HMM alignment methods, the results should be regarded with a measure of caution. However, given that the results remain essentially the same irrespective of disorder prediction method and dataset used,

Orthologous protein pairs

Orthologous protein pairs between C. elegans and D. melanogaster were retrieved from pre-computed homology clusters from InParanoid (version 7) [35]. Additionally, an evolutionary distance filter was applied (Tree-Puzzle [36] distance ≤ 4) to avoid inclusion of non-homologs. In total, 3,736 protein pairs were included. Orthologous protein pairs between Saccharomyces cerevisiae and five other fungal species (Candida albicans, Candida glabrata, Debaryomyces hansenii, Kluyveromyces lactis and

Acknowledgements

This work was supported by grants from the Swedish Research Council (VR-NT 2009-5072, VR-M 2010-3555), SSF, the Foundation for Strategic Research, Science for Life Laboratory. The EU 6th Framework Program is gratefully acknowledged for support to the GeneFun project, contract no: LSHG-CT-2004-503567 and the 7th framework through the EDICT project, contract no: FP7-HEALTH-F4-2007-201924. Funding for SL was provided by BILS, Bioinformatics Infrastructure for Life Science.

References (43)

  • G.A. Reeves et al. Structural diversity of domain superfamilies in the CATH database. J. Mol. Biol. (2006)
  • M. Remm et al. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. (2001)
  • D. Jones. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. (1999)
  • M. Kalman et al. Quality assessment of protein model-structures using evolutionary conservation. Bioinformatics (2010)
  • G. Levinson et al. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. (1987)
  • M.G. Giacomelli et al. The conversion of 3′ UTRs into coding regions. Mol. Biol. Evol. (2007)
  • A.K. Björklund et al. Expansion of protein domain repeats. PLoS Comp. Biol. (2006)
  • W.J. Guo et al. Significant comparative characteristics between orphan and nonorphan genes in the rice (Oryza sativa L.) genome. Comp. Funct. Genomics (2007)
  • R. Kim et al. Systematic analysis of short internal indels and their impact on protein folding. BMC Struct. Biol. (2010)
  • A. Pascual-García et al. Quantifying the evolutionary divergence of protein structures: the role of function change and function conservation. Proteins: Struct. Funct. Bioinform. (2010)
  • V.N. Uversky et al. Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins (2000)

    1 Contributed equally.
