& Evolutionary Biology The Mitochondrial Genome of Erpobdella octoculata (Hirudinida: Arhynchobdellida), as the First Representative for the Suborder Erpobdelliformes

Erpobdella octoculata Linnaeus, 1758 (Hirudinea: Arhynchobdellida: Erpobdellidae) is a very common and morphologically variable macrophagous predators of aquatic invertebrates. Here we determined the complete mitochondrial DNA (mtDNA) sequence of this species, as the first representative of the suborder Erpobdelliformes. This genome is 14,407 bp in length with an A+T content of 71.55%, containing 37 typical animal mitochondrial genes and a non-coding region (NCR). All genes were arranged in the same order as most reported annelid. A remarkable strand bias found for all protein coding genes (PCGs) was positive AT-skew and negative GC-skew. The models of secondary structures for the two rRNA genes of this species were predicted. rrnL consisted of six domains and 46 helices, while three domains and 34 helices for rrnS . The secondary structures of some transfer RNA genes were relatively special. Bayesian phylogenetic and maximum likelihood analyses, based on concatenated nucleotide sequences of 12 PCGs, placed E. octoculata within the suborder Hirudiniformes, and the relationships depicted within that order were not quite in agreement with previous morphological classifications. The complete mtDNA sequence of E. octoculata provides important genetic markers for identification, ecological and evolutionary studies of leeches.


Introduction
Leeches (Clitellata: Hirudinida) are hermaphrodite worms and abundant predators or ecto-parasites in various freshwater habitats. Many leeches have a significant adverse impact on freshwater culture and animal health [1][2][3]. Erpobdella octoculata Linnaeus, 1758 (Hirudinea: Arhynchobdellida: Erpobdelliformes) is one of the most common leech species in freshwaters [4]. Its assemblage typically dominates the lowlands in lake, streams and rivers as well as in urban park ponds [5][6][7]. Despite widely distributed, the current genomic knowledge of leeches is scant and taxon diversity is poorly constructed [8,9].
Animal mitochondrial DNA (mtDNA) is a single circular duplex molecule typically coding for 13 protein-coding genes, 2 rRNA genes and 22 tRNA genes [10]. The mitochondrial genome has been extensively used to study the phylogenetic relationships at several taxonomic levels [11,12]. Compared to other metazoan animals, only five complete mt genome sequences of leech species for the suborder Hirudiniformes have been sequenced and deposited in GenBank to date, and no mt genome for the suborder Erpobdelliformes has been reported [13,14]. In this study, we sequenced the complete mitochondrial (mt) genomes of E. octoculata, analysized its mitochondrial gene arrangements, gene compositions, translation and initiation codons and codon usage as well as the structure of rRNA and tRNA genes. In addition, we used the currently available nucleotide sequences to reconstruct the phylogenetic relationship among the included annelids. The new mt genome sequence may provide novel and useful mtDNA markers for investigating genetic composition of suborder Erpobdelliformes, as it is the first reported complete mtDNA sequence in this suborder.

Samples and DNA extraction
Specimens of E. octoculata were sampled from Hubei province, China. After morphological identification according to existing keys and descriptions [4], they were fixed in 95% (v/v) ethanol and stored at -20°C. Total genomic DNA was extracted from this specimen using standard phenol/chloroform extraction [15].

PCR amplification and sequencing
The mt genome of E. octoculata was generated using amplification of overlapping PCR fragments with each fragment overlapped upstream (5') by ~ 200 bp. First, 1-2kb fragments were amplified using primers designed from conserved region of published sequences of W. pigra and H. nipponia mt genome. Then, specifically designed primers based on the known sequence were used for the secondary PCRs (primer sequences list see Table S1). PCRs were performed with TransStart FastPfu DNA Polymerase (TransGen Co., Beijing, China) under the following conditions: initial denaturation cycle of 95°C for 2min, followed by 30 cycles of 95°C for 20 s, annealing of 50-60°C for 15 s, and extension at 72°C for 20-40 s, with a final elongation at 72°C for 5 min after the last cycle. After end repairing, all of the amplified products were inserted into T-vector pMD18-T (TaKaRa Co., Dalian, China) and sequenced in both directions. Individual clone sequences were analyzed with the DNASTAR software package (http://www. DNASTAR.com).
computer program Clustal X 1.83 [16] to identify gene boundaries. Protein coding regions were identified via ORF Finder (http://www. ncbi.nlm.nih.gov/gorf/gorf.html) with invertebrate mitochondrial genetic codes. Nucleotide composition, amino acid composition, and Relative Synonymous Codon Usage (RSCU) values were analyzed with MEGA5.0 [17]. AT-and GC-skew were calculated using the following formula: AT skew=[A-T]/[A+T], GC skew=[G-C]/[G+C] [18]. The two rRNA genes were predicted by comparison with those of other metazoan animals previously reported [13,14] and the secondary structure drawn by the software of XRNA 1.2.0b. The secondary structures of two rRNA genes were inferred based on models developed for other metazoan animals [19][20][21][22]. To infer the rRNA gene secondary structures, we used a commonly accepted comparative approach to correct for unusual pairings with RNA-editing mechanisms that were well known in mt genomes [23,24]. For analyzing tRNA genes, putative secondary structures of tRNA genes were identified using tRNAscan-SE [25] and the ARWEN program [26]. For tRNAscan-SE, the following parameters were used: Source=Mito/Chloroplast, Search Mode=tRNA scan-only, Genetic Code=Invertebrate Mito, Cove score cutoff=1. The remaining tRNA genes were identified by recognizing potential secondary structures and anticodon sequences by eye.

Phylogenetic analyses
All the currently available 14 Annelida complete mt genomes data and the sequences from this study were collected for phylogenomic analyses. The mt genome of Phascolosoma esculenta and Sipunculus nudus [27] served as out-groups. The nucleotide sequences for all 12 PCGs (excluding atp8) were aligned using MAFFT7 [28] with the default settings and concatenation. The concatenated set of amino acid sequences from all 12 PCGs was then used for phylogenetic analyses. Bayesian inferences (BI) and maximum likelihood (ML) analysis performed with MrBayes 3.2 [29] and PHYML online web server [30] with optimal model determined by Mrmodeltest 2.3, respectively. According to the Akaike information criterion, the GTR+I+G model was selected for analysis. The BI analyses were conducted under the following conditions: 1,000,000 generations, four chains (one cold chain and three hot chains), a burn-in step for the first 25% (2,500) trees and the remaining trees were used to calculate Bayesian posterior probabilities (BPP). For ML phylogeny construction, the confidences were estimated by 1000 bootstrap replicates.

General features of the mt genome of E. octoculata
Genome composition and gene arrangement of E. octoculata are summarized in Figure 1 and Table 1, and the sequences have been deposited in GenBank with the accession No. KC-688270. The complete mt genome of E. octoculata is 14,407 bp in length, which is relatively smaller than those reported annelids mt genomes (e.g., Lumbricus terrestris, Platynereis dumerilii). This difference is mainly due to the total length of non-coding sequence. This circular mt genome ( Figure  1) is typical of annelid mt genomes, including 13 PCGs (atp6, atp8, cox1-3, cytb, nad1-6 and nad4L), two ribosomal RNA genes(rrnL and rrnS), 22 transfer RNA genes and a large non-coding region ( Table 1). All mt genes of E. octoculata are transcribed in the same direction and the gene arrangement is consistent with those of the W. pigra and H. nipponia. In this species, protein-coding gene nad3 overlaps tRNA-SerAGN, tRNA-SerAGN overlaps nad2 by 15 nucleotides, respectively (Table 1, Figure 1). Nad4L and nad4 share an overlap by 7 nucleotides. Such overlaps are also found in reported Hirudinea mtDNA genomes [13,14] and common to annelids [31][32][33][34]. The mt genome of E. octoculata contains 54 bp of intergenic sequences which spread over 10 locations, ranging from 1 to 15 bp in size. The longest intergenic regions are located between the rrnS and tRNA-SerUCN, tRNA-Arg and tRNA-Gln genes. The Hirudinea leech mtDNA of W. pigra and H. nipponia also contained intergenic regions (1-61 bp in length) in 12 locations.
A total of 3,660 amino acids are encoded by the E. octoculata mt genome. Similar to other leech species, the nucleotide composition of E. octoculata mt genome is also present a clear bias, with the base contents of A, T, G, C in the entire mt genome are 4,456(30.93%), 5,852(40.62%), 1,910 (15.88%) and 1,807 (12.54%), respectively. Among the 13 PCGs, the lowest A+T content is 65.36% in cox1, while the highest is 78.29% in nad6 (Table S2). The AT-skew and GC-skew across 13PCGs and rRNA are shown in Figure 2. Within 13PCGs, the AT-skews ranging from -0.26 in nad3 to -0.07 incox2, and the average AT-skew in mt genomes is -0.14. This kind of base bias of nucleotides composition has been generally related to asymmetric mutational constraints in the process of replication [36]. The nucleotide bias toward AT was also reflected in the codon usage ( Table 2). The relative synonymous codon usage (RSCU) values also showed a biased use of A and T nucleotides in E. octoculata (Table-2). The high incidence of anticodons NNA and NNU indicated partiality for AT in the anticodon of PCGs. Leu2 (11.31%), Phe (8.57%), Ile (8.28%) and Met (7.74%) were the most frequent amino acids in E. octoculata mtDNA PCGs.

Ribosomal RNA genes
The rrnL (16S ribosomal RNA) and rrnS (12S ribosomal RNA) genes of E. octoculata were identified by sequence comparison with other annelids. These two genes are separated by tRNA-Val. The sizes of the rrnL and rrnS genes are 1145 and 732 bp respectively, which are the lowest among the leeches studied to date (Tables 1and S2), and their A+T contents are 74.79% and 70.86%. respectively. The secondary structure of rrnL and rrnS, which was inferred according to the models proposed for these genes in other species. The secondary structure of E. octoculata rrnL consisted of six structural domains ((labeled I, II, III, IV, V and VI) (Figure-3) and included 46 helices. Overall, the highest level of invariable positions was located on domain V-IV, while lowest level was on domains I-II compared with other leech species. Some helices (e.g., H25, H27, H30 and H37) were highly conserved in both sequence and secondary structure among the leech mtDNA, and only few nucleotides differences were found either in the terminal loops or in the terminal couplets of helices. The secondary structure of rrnS contained three domains (labeled I, II, III) and 34 helices (Figure 4). Conservative sites were mainly in domain III and some helices (e.g., H10, H11 and H16) in domain II.

Transfer RNA (tRNA) genes
A total of 22 tRNA sequences (ranging from 59 to 83 nucleotides in length) were identified in the E. octoculata mt genome. Except for tRNA-Gly (TCC) and tRNA-Pro (TGG) which lack DHU arms, most tRNA genes appeared to exhibit the standard cloverleaf structure. The predicted secondary structures of glycine and proline tRNA contained the DHU arm but lack the TψC arm, which was not common in previously reported annelids. For the serine tRNA, a four-armed structure with variable loop formed a 6-bp--Extra"arm" was feasible scanned by tRNAscan-SE, while a three-armed structure with a DHU-replacement loop was also possible ( Figure 5) when using the ARWEN program.

Phylogenetic analyses
In this study, the nucleotide sequences of the 12 PCGs were concatenated to construct phylogenetic relationships, which might result in a more complete analysis than analyzing each sequence separately. We incorporated species from Echiura, Oligochaeta, Polychaeta and Hirudinea in the phylogenetic analysis. Sipuncula species Phascolosoma esculenta and Sipunculus nudus were included as out-groups. Analysis using BI and ML resulted in two trees with the identical or similar topology except for some variation in node confidence values ( Figure  6). Numbers at each node indicate bootstrap support, percentages of Bayesian posterior probabilities(left value) and ML bootstrap support values (right value). In both cases, the mtDNA data supported the hypothesis that Sipunculans and Annelids (including echiurans) formed a monophyletic group. While in Annelids, phylogenetic tree indicated that echiurans were derived from annelids, the species from Oligochaeta and Hirudinea formed a monophyletic group Clitellata, while the class Polychaeta showed polyphyletic based on the mt genome ( Figure 6). The above results were consistent with previous research based on morphological classifications and some molecular studies [37][38][39][40][41][42][43][44]. Within the Hirudinea clade, the statistical data revealed that E. octoculata (Arhynchobdellida: Erpobdelliformes) was more closely related to W. pigra (Arhynchobdellida: Hirudiniformes) than to H. nipponia (Arhynchobdellida: Hirudiniformes), indicating that family Erpobdellidae was more closely related to family Haemopidae. The above result based on the nucleotide sequences of the 12PCGs was not consistent with previous classification based on their morphological features and using LSU and SSU rDNA and cox1 sequences [45,46]. Apparently, more taxon sampling and additional markers for further phylogenetic inference are required to resolve the controversy on the phylogeny of the Arhynchobdellida.

Conclusion
In the present study, the complete mtDNA sequences of E. octoculata was the first sequenced high quality mt genome of nonblood feeding leech of the suborder Erpobdelliformes. The availability of the E. octoculata mtDNA sequences can provide novel and useful genetic markers for investigating genetic composition, population genetics studies and phylogeographics of leech species. Phylogenetic  analyses using concatenated nucleotide sequences of 12 mt PCGs indicated that E. octoculata (Erpobdellidae) was more closely related to the Haemopidae than to Hirudinidae families.