MITOS: Improved de novo metazoan mitochondrial genome annotation

https://doi.org/10.1016/j.ympev.2012.08.023

Abstract

About 2000 completely sequenced mitochondrial genomes are available from the NCBI RefSeq data base together with manually curated annotations of their protein-coding genes, rRNAs, and tRNAs. This annotation information, which has accumulated over two decades, has been obtained with a diverse set of computational tools and annotation strategies. Despite all efforts of manual curation it is still plagued by misassignments of reading directions, erroneous gene names, and missing as well as false positive annotations in particular for the RNA genes. Taken together, this causes substantial problems for fully automatic pipelines that aim to use these data comprehensively for studies of animal phylogenetics and the molecular evolution of mitogenomes. The MITOS pipeline is designed to compute a consistent de novo annotation of the mitogenomic sequences. We show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality. At the same time we avoid biases, inconsistencies of nomenclature, and typos originating from manual curation strategies. The MITOS pipeline is accessible online at http://mitos.bioinf.uni-leipzig.de.

Highlights

► High-quality de novo annotation of metazoan mitochondrial genomes.
► MITOS is available as a fully automatic web server.
► Consistent reannotation of available mitogenomes.

Introduction

A reliable and standardised genome annotation is an indispensable prerequisite for a systematic comparative analysis of genomic sequence data. This is true in particular for phylogenetic reconstruction, studies of the mechanisms of genome rearrangements, and the investigation of the effects of sequence variation. The need for accurate and unbiased annotations becomes even more pressing when automatised pipelines are employed to process the increasingly large amounts of data that are becoming available in the wake of new sequencing technologies.

At present, complete mitochondrial genome sequences are available for more than 2000 metazoan species from a wide variety of taxonomic groups. Metazoan mitogenomes are (with few exceptions) circular molecules with an average length of approximately 16,500 nt, with extremes such as 11,423 nt (Paraspadella gotoi NC_006083) and 43,079 nt (Trichoplax adhaerens NC_008151). Mitochondrial genomes have a well-preserved gene content, usually comprising 13 protein-coding genes, 22 tRNAs, two rRNAs, and one non-coding region containing most of the regulatory elements (Wolstenholme, 1992). This simple structure makes animal mitogenomes an attractive target for large-scale comparative studies.

Mitochondrial genes usually consist of a single continuous exon, although exceptions have been reported in some clades for protein-coding genes as well as rRNAs (Beagley et al., 1996; Dellaporta et al., 2006; Wang and Lavrov, 2008), and conserved frameshifts exist in some sauropsid groups (Mindell et al., 1998). In several cases there is also evidence for duplication and deletion events (e.g. San Mauro et al., 2006; Fujita et al., 2007). Peculiarities of mitogenomes include their use of deviant genetic codes, the presence of overlapping genes, and incomplete stop codons (Wolstenholme, 1992; Jühling et al., 2012); see Bernt et al. (2013b) for a more detailed overview. Taken together, these issues complicate the task of genome annotation and have made extensive manual “expert curation” indispensable. In this process a multitude of different tools have been used by different curators. As discussed e.g. by Boore (2006), this entails a number of problems: (a) tools used in older annotations may be outdated, i.e. improved methods are already available, (b) sequences used as the basis for homology annotation can be either wrong or incomplete, and (c) no generally accepted guidelines exist for the annotation.
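To make the effect of a deviant genetic code and of incomplete stop codons concrete, the following Python sketch translates a short, made-up sequence with the standard code and with the vertebrate mitochondrial code (NCBI translation table 2). The sequence and the function names `complete_stop` and `translate` are illustrative assumptions and are not part of MITOS; the sketch also assumes that a trailing T or TA is completed to TAA, as happens by polyadenylation of the transcript.

```python
from itertools import product

# Standard genetic code (NCBI table 1); codons enumerated in T, C, A, G order.
BASES = "TCAG"
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Vertebrate mitochondrial code (NCBI table 2) differs in four codons.
VERT_MITO = {**STANDARD, "TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"}


def complete_stop(cds: str) -> str:
    """Pad a CDS that ends in a partial stop codon (T or TA) with A's to TAA."""
    overhang = len(cds) % 3
    if overhang and cds[-overhang:] in ("T", "TA"):
        return cds + "A" * (3 - overhang)
    return cds


def translate(cds: str, table: dict) -> str:
    """Translate a CDS codon by codon; ambiguous bases are not handled."""
    cds = complete_stop(cds.upper())
    usable = len(cds) - len(cds) % 3
    return "".join(table[cds[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    seq = "ATGATATGAT"  # hypothetical CDS ending in an incomplete stop codon "T"
    print(translate(seq, STANDARD))   # MI**
    print(translate(seq, VERT_MITO))  # MMW*
```

With the standard code the reading frame appears to terminate at TGA, whereas the mitochondrial code reads through it as tryptophan, which illustrates why an annotation tool needs the correct genetic code as input.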

The most comprehensive and up-to-date resource for mitochondrial genomes and their annotation is NCBI RefSeq (Pruitt et al., 2007). Despite substantial efforts by the curators of RefSeq to improve the quality of the data, several inconsistencies and errors in the annotations have remained that cause problems for automatised analysis pipelines. These include missing or incorrect information on the reading direction (strand), erroneous gene designations, missing gene annotations, mistaken identity of trnL1/trnL2 and trnS1/trnS2 tRNAs, and inconsistencies in gene names (see Supplementary Table 1 for selected examples).

Boore (2006) suggested a number of possible solutions to overcome these problems: systematic error screening, standardisation of gene names, anticodon labelling of tRNAs, standards for designating genes and gene boundaries, and standards for accepting the reality of a gene assignment. Several data bases, reviewed in more detail in Bernt et al. (2013a), aim at providing improved annotations for RefSeq mitogenomes along these lines. METAMiGA (Feijao et al., 2006) and OGRe (Jameson et al., 2003) incorporate manual improvements of the data based on expert knowledge. MitoZoa (Lupi et al., 2010), a recently released data base, uses systematic semi-automatic error screening based on a list of rules derived from tRNAscan-SE (Lowe and Eddy, 1997), ARWEN (Laslett and Canback, 2008), and BLAST (Altschul et al., 1990) searches as well as expert knowledge.
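Two of these suggestions, standardisation of gene names and anticodon labelling of tRNAs, lend themselves to a small illustration. The Python sketch below maps a few common spelling variants onto one canonical name and builds a tRNA label that carries the anticodon. The canonical names (`cox1`, `nad4l`, `cob`, `atp6`), the set of recognised variants, and both function names are assumptions chosen for the example; they are not taken from RefSeq, MitoZoa, or MITOS.

```python
import re

ROMAN = {"i": "1", "ii": "2", "iii": "3"}


def normalise_gene_name(name: str) -> str:
    """Map common protein-coding gene name variants to one assumed canonical form."""
    key = re.sub(r"[\s_\-]+", "", name.lower())
    m = re.fullmatch(r"co(?:x)?(i{1,3}|[123])", key)      # COI, CO1, COX1, ...
    if m:
        return "cox" + ROMAN.get(m.group(1), m.group(1))
    m = re.fullmatch(r"na?dh?([1-6])(l?)", key)           # ND4L, NADH1, nad6, ...
    if m:
        return "nad" + m.group(1) + m.group(2)
    m = re.fullmatch(r"atp(?:ase)?([68])", key)           # ATPase 6, ATP8, ...
    if m:
        return "atp" + m.group(1)
    if key in ("cytb", "cob", "cytochromeb"):
        return "cob"
    return name.strip()  # unrecognised names are passed through unchanged


def trna_label(amino_acid: str, anticodon: str) -> str:
    """Label a tRNA by amino acid and anticodon, e.g. trnL(tag)."""
    return f"trn{amino_acid.upper()}({anticodon.upper().replace('U', 'T').lower()})"


if __name__ == "__main__":
    for raw in ("COI", "COX1", "ND4L", "Cyt b", "ATPase 6"):
        print(raw, "->", normalise_gene_name(raw))
    print(trna_label("L", "UAG"), trna_label("L", "UAA"))
```

Because the label includes the anticodon, the two leucine (or serine) tRNAs remain distinguishable even when the numeric suffixes 1 and 2 have been swapped in the source annotation.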

De novo annotation with a consistent set or pipeline of methods is a promising alternative to evaluating and improving existing annotations. DOGMA (Wyman et al., 2004) is a semi-automated pipeline of methods dealing with both mitochondrial and chloroplast genomes. It uses BLAST to identify coding and non-coding genes. COVE (Eddy and Durbin, 1994) is employed by DOGMA to identify tRNA candidates based on secondary structure. MOSAS (Sheffield et al., 2010) is a set of methods that focuses on the organisation of sequence data and annotation and was originally intended for insect mitogenomes. It employs ARWEN and tRNAscan-SE for tRNA prediction. BLAST is used by MOSAS to search for open reading frames and rRNAs based on a local data base of query sequences (currently from insects only). The need for user-defined cutoff values and manual improvements of the predictions makes this approach difficult to apply to large data sets and limits the comparability of the predictions.

The MITOchondrial genome annotation Server (MITOS) provides access to a fully automated pipeline for the de novo annotation of metazoan mitochondrial genomes. It uses a novel strategy based on aggregating BLAST searches with previously annotated protein sequences to identify protein-coding genes (Section 2.1), thereby avoiding the need for a built-in data base of specifically curated protein models. Both tRNAs and rRNAs are annotated using specific covariance models for each of the structured RNAs (Section 2.2). In this contribution we apply MITOS for the de novo annotation of all animal mitogenomes contained in RefSeq 39, focusing on a careful evaluation of the quality of the results (Section 3).

Materials and methods

MITOS requires only a sequence file in FASTA format and the corresponding genetic code as input. The pipeline proceeds in two stages, first identifying candidate sequences for each gene, then reconciling these to derive a final annotation. In the following we provide a detailed description of the individual components of MITOS.
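The two inputs and the two-stage structure can be sketched in a few lines of Python. The example below parses FASTA-formatted text, takes the genetic code as an NCBI translation table number, and then reduces a list of gene candidates to a final set. The reconciliation shown here is a plain greedy selection of the highest-scoring candidates with a tolerated overlap; it is a stand-in chosen for illustration and is not the procedure used by MITOS. The names `Candidate`, `read_fasta`, `reconcile`, and `max_overlap`, as well as all coordinates and scores, are hypothetical, and the circularity of the genome is ignored.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    gene: str
    start: int    # 0-based, inclusive
    stop: int     # 0-based, exclusive
    score: float


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = (line[1:].split() or [""])[0]
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def reconcile(candidates, max_overlap=0):
    """Stage 2 (illustrative): keep best-scoring candidates, limiting overlap."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        clash = any(
            min(cand.stop, kept.stop) - max(cand.start, kept.start) > max_overlap
            for kept in chosen
        )
        if not clash:
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c.start)


if __name__ == "__main__":
    genome = read_fasta(">example_mitogenome made-up record\nACGTACGT\nacgt\n")
    genetic_code = 2  # assumed to be given as an NCBI translation table number
    # Stage 1 would produce candidates like these (values are invented).
    candidates = [
        Candidate("cox1", 0, 1500, 900.0),
        Candidate("cox1", 100, 1400, 300.0),
        Candidate("trnL", 1495, 1560, 50.0),
    ]
    print(genome, genetic_code)
    print([c.gene for c in reconcile(candidates, max_overlap=10)])
```

Allowing a small overlap in the second stage reflects the fact that neighbouring mitochondrial genes can overlap; with `max_overlap=0` the tRNA candidate in this example would be discarded.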

Results and discussion

In order to assess the quality of the MITOS predictions we employed our pipeline for a de novo annotation of RefSeq 39. By showing that the default parameters chosen in MITOS are suitable for the entire metazoan data set, we hope to relieve the user of the tedious empirical work of choosing appropriate cutoff values and parameters.

In the following we will refer to RefSeq or MITOS annotations as “genes”.

Conclusion

MITOS is an automated pipeline that tackles the problem of reliable metazoan mitochondrial genome annotation using state-of-the-art methods. Protein-coding genes are annotated by means of a sophisticated aggregation procedure based on BLAST searches, which allows for the detection of frameshifts, duplication events, and split genes. Structural conservation is utilised for non-coding RNA annotation by employing novel covariance models. MITOS allows for a systematic error screening, the

Acknowledgments

We thank the Center for Information Services and High Performance Computing (ZIH) of the TU Dresden (http://tu-dresden.de/zih/) for providing computational facilities.

This work was supported by the Deutsche Forschungsgemeinschaft [SPP-1174 – Deep Metazoan Phylogeny projects STA 850/2 and STA 850/3-2]; Centre National de la Recherche Scientifique (CNRS); Université de Strasbourg; Association Française contre les Myopathies (MNM1 2009); ANR MITOMOT (ANR-09-BLAN-0091-01); French-German PROCOPE

References

  • Beagley, C.T., et al., 1996. Two mitochondrial group I introns in a metazoan, the sea anemone Metridium senile: one intron contains genes for subunits 1 and 3 of NADH dehydrogenase. Proc. Natl. Acad. Sci. USA.
  • Boore, J.L., 2001. Mitochondrial genome rearrangement guide. Version...
  • Boore, J.L., 2006. Requirements and standards for organelle genome databases. OMICS.
  • Breton, S., et al., 2010. Novel protein genes in animal mtDNA: a new sex determination system in freshwater mussels (Bivalvia: Unionoida)? Mol. Biol. Evol.
  • Cock, P.J.A., et al., 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics.
  • Dellaporta, S.L., et al., 2006. Mitochondrial genome of Trichoplax adhaerens supports placozoa as the basal lower metazoan phylum. Proc. Natl. Acad. Sci. USA.
  • Eddy, S.R., et al., 1994. RNA sequence analysis using covariance models. Nucl. Acids Res.