The First Mitochondrial Genome of the Sepsid Fly Nemopoda mamaevi Ozerov, 1997 (Diptera: Sciomyzoidea: Sepsidae), with Mitochondrial Genome Phylogeny of Cyclorrhapha

Sepsid flies (Diptera: Sepsidae) are important model insects for sexual selection research. In order to develop mitochondrial (mt) genome data for this significant group, we sequenced the first complete mt genome of the sepsid fly Nemopoda mamaevi Ozerov, 1997. The circular 15,878 bp mt genome is typical of Diptera, containing all 37 genes usually present in bilaterian animals. We discovered inaccurate annotations of fly mt genomes previously deposited on GenBank and thus re-annotated all published mt genomes of Cyclorrhapha. These re-annotations were based on comparative analysis of homologous genes, and provide a statistical analysis of start and stop codon positions. We further detected two 18 bp of conserved intergenic sequences from tRNAGlu-tRNAPhe and ND1-tRNASer(UCN) across Cyclorrhapha, which are the mtTERM binding site motifs. Additionally, we compared automated annotation software MITOS with hand annotation method. Phylogenetic trees based on the mt genome data from Cyclorrhapha were inferred by Maximum-likelihood and Bayesian methods, strongly supported a close relationship between Sepsidae and the Tephritoidea.


Introduction
The mitochondrion (mt), descended from an alpha-proteobacterium, is one of the fundamental eukaryotic organelles [1][2][3] and retains a remnant, bacterial-like genome. The mt genome has been an extensively used marker in phylogenetic studies across broad taxonomic scales [4][5][6][7][8] and in a wide range of taxa since mt genome sequences are often more phylogenetically informative than shorter sequences from individual genes commonly used for shallow or species-level studies [9][10][11][12]. Since the first insect mt genome was published in 1985, there has been a rapid accumulation of sequenced insect genomes. Insects have been comprehensively sampled at higher taxonomic levels and mt genomes are available from every insect order [2]. Diptera is one of the most extensively sequenced orders amongst the Insecta, with 93 complete or near-complete Diptera mt genome sequences available on GenBank (as of July 2014), including 57 cyclorrhaphan species (46 complete genomes, 10 near-complete genomes without full control regions, and one partial genomes) representing 13 families (Table 1).
Sepsidae is a global distributed fly family with more than 320 described species [37]. Sepsid flies are important insect models for sexual selection research for three main reasons: 1. pronounced sexual dimorphisms (strongly modified male forelegs and movable abdominal appendages) [38][39][40]; 2. complex courtship behaviors (male display, female choice, and sexual conflict) [38,[41][42][43]; and 3. easily bred under lab conditions (they utilize rotting plant material or animal feces as breeding substrates) [44]. Recently, the transcriptome of a sepsid species Themira biloba has been assembled and analyzed [45] expanding the range of genetic resources for this family, however, no mt genomes are available from this family.
The procedures and software used for insect mt genomes annotation have recently been reviewed by Cameron (2014b) [3] noting that accurate annotations of mt genomes are necessary for all downstream analysis. Since the online implementation of the tRNA prediction software tRNAScan-SE [46] plus alignment with homologous genes is relatively efficient, there are few problems in identifying gene boundaries for tRNAs. However, despite protein-coding genes (PCGs) being used in virtually every phylogenetic and evolutionary biology study of mt genomes, gene boundaries of some PCGs are often difficult to identify. For example, the start codons of CO1 are wildly inconsistent and there are some inaccurate annotations in the GenBank (e.g. 132 incorrect annotations across 36 species of lepidopteran mt genomes [3]). Studies of expression profiles of mt genes should be the most effective way to identify gene boundaries [12,[47][48], however there are few RNAseq datasets for insect species whose mt genomes have also sequenced. In the absence of RNAseq data, comparative alignments of homologous mt genes from all the mt genomes available for a particular taxonomic group is also reliable [3].
Here, we sequenced the complete mt genome of the sepsid fly Nemopoda mamaevi Ozerov, 1996. We annotated with this genome using procedures and quality control methods proposed by Cameron (2014b) [3] and compared these annotation results with the automated annotation software MITOS [49]. We also re-annotated the mt genomes of all Cyclorrhapha species deposited on GenBank, based on comparative analysis of homologous genes, and undertook a statistical analysis of start and stop codons positions in their PCGs. We aligned and analysed two intergenic sequences across Cyclorrapha, which were highly conserved 18-bp motifs for the binding site of mtTERM. The mt genome contributes to reconstruction of the taxonomic positions and evolutionary relationships of the Sepsidae, and will help selecting optimized primer for atypical regions in further molecular research of related taxa. Phylogenetic trees based on the mt genome data from Cyclorrhapha were inferred by both Maximum-likelihood and Bayesian methods, which strongly supported a close relationship between Sepsidae and the Tephritoidea.

Ethics statement
No specific permits were required for the insects collected for this study. The specimen was collected by using light trap. The field studies did not involve endangered or protected species. The species herein studied are not included in the "List of Protected Animals in China".  [50]. Nemopoda mamaevi Ozerov is newly recorded from China. Whole genomic DNA was extracted from the thoracic muscle tissues using TIANamp Genomic DNA Kit (TIAN-GEN). The quality of PCR products was assessed through electrophoresis in a 1% agarose gel and stained with Gold View (ACME).

PCR amplification and sequencing
The Nemopoda mamaevi mt genome was amplified in 20 overlapping PCR fragments using NEB Long Taq DNA polymerase (New England BioLabs, Ipswich, MA). The PCR primers used follow Zhao et al (2013) [6]. Several species-specific primers were designed from initial sequencing with the amplification primers and used for internal PCRs (S1 Table). PCR cycling consisted of an initial denaturation step at 95°C for 30s, followed by 40 cycles of denaturation at 95°C for 10s, annealing at 42-55°C (depending on the primer pair used) for 50s, elongation at 65°C for 1 kb/min (depending on the size of target amplicon) (S1 Table), and a final elongation step at 65°C for 10 min. PCR products were evaluated by agarose gel electrophoreses.
All amplicons were sequenced in both directions using the BigDye Terminator Sequencing Kit (Applied Bio Systems) and the ABI 3730XL Genetic Analyzer (PE Applied Biosystems, San Francisco, CA, USA) using amplification primers and internal primers developed by primer walking.

Bioinformatic analysis
Sequences were proof-read and aligned into contigs in BioEdit version 7.0.5.3 [51]. After fully sequencing the mt genome, it was annotated using both automated annotation methods and by hand. For the automated annotation, we used MITOS [49]. Hand annotation method followed the procedures proposed by Cameron (2014b) [3]. Quality control of the hand alignments was performed by comparing with homologous sequences from previously sequenced Cyclorrhapha mt genomes to identify several tRNAs apparently absent from the N. mamaevi mt genome, and to verify PCGs and rRNAs annotations.

Phylogenetic analysis
A total of 20 species of brachyceran insects were used in phylogenetic analysis, including 18 cyclorrhaphans and two outgroup species from Tabanidae and Nemestrinidae. Details of the species used in this study were listed in Table 1.
Sequences of 13 PCGs, two rRNAs and 22 tRNAs were used in phylogenetic analysis. Each PCG was aligned individually based on Hand annotation method followed the procedures proposed by Cameron (2014b) [3]. The sequences of tRNAs and rRNAs were aligned respectively using MEGA 5.0 [52], ambiguous positions in the alignment of RNAs were filtered by hand. Individual genes were concatenated using SequenceMatrix v1.7.8 [54]. We assembled four datasets for phylogenetic analysis: 1) nucleotides of 13 PCGs (PCG123) with 11,058 residues, 2) nucleotides of 13 PCGs, two rRNAs and 22 tRNAs (PCG123RNA) with 14,226 residues, 3) nucleotides of 13 PCGs exclude the third codon sites (PCG12) with 7,372 residues and 4) nucleotides of 13 PCGs exclude the third codon sites, two rRNAs and 22 tRNAs (PCG12RNA) with 10,540 residues. PartitionFinder v1.1.1 [55] was used to select the optimal partition strategy and substitution models for each partition. We created an input configuration file with 63/29 (with 3 rd codon positions/ without) pre-defined partitions of the dataset, and used the ''greedy" algorithm with branch lengths estimated as ''unlinked" and Bayesian information criterion (BIC) to search for the best-fit scheme (S2 Table).
We performed maximum likelihood (ML) and Bayesian inference (BI) using the best-fit partitioning schemes recommended by PartitionFinder (S2 Table). For ML analysis, we used RAxML 8.0.0 [56] with 1,000 bootstrap replicates and using the rapid bootstrap feature (random seed value 12345) [57]. The Bayesian analysis was conducted with MrBayes 3.2.2 [58]. Two simultaneous runs of 2 million generations were conducted for the dataset, each with one cold and three heated chains. Samples were drawn every 1,000 Markov chain Monte Carlo (MCMC) steps, with the first 25% discarded as burn-in. When the average standard deviation of split frequencies was below 0.01, we considered the stationarity was reached and stopped run.

Genome Organization and Structure
The complete mt genome of N. mamaevi is a typical circular, double-stranded molecule (Gen-Bank accession number: KM605250; Fig 1) 15,878 bp in length. Mt genome length is mediumsized when compared to the mt genomes of other Diptera, that range from 14,922 bp (Sarcophaga peregrine, Sarcophagidae, NC_023794) to 19,517 bp (Drosophila melanogaster, Drosophilidae [26]). The mt genome of N. mamaevi contains the 37 genes, including 13 PCGs, 22 tRNA genes, two rRNA genes and a large control region, that are usually present in bilaterian animals [3]. The gene order is the same as the inferred ancestral insect mt genome pattern, which is conserved amongst all cyclorrhaphan flies sequenced to date. Twenty three genes are encoded on the majority strand (J-strand), while the minority strand (N-strand) encodes the remaining 14 genes.
There are 9 gene boundaries where sequence overlapped between neighboring genes, ranging from 1 to 8 bp in length, and 29 bp in total. The length of intergenic sequences (excluding the control region) was 1-18 bp found at 14 gene boundaries, and totalling 82 bp ( Table 2). The longest overlapping sequence belonged to both tRNA Trp and tRNA Cys . The longest intergenic sequence located between tRNA Glu and tRNA Phe .

Base composition and codon usage
Similar to mt genome of other sequenced cyclorrhaphan flies, and indeed most insects, the nucleotide composition of the N. mamaevi was biased toward A and T (A = 38.0%, T = 36.8%, G = 10.0%, C = 15.2%; Table 3). The overall A+T content of the mt genome was 74.8%, average amongst all reported cyclorrhaphan flies, which range from 67.2% (in Bactrocera minax, Tephritidae [20]) to 82.2% (in Drosophila melanogaster, Drosophilidae [26]). A comparative analysis of nucleotide composition (A+T%, G+C%) versus skew (AT-and GC-skew) across the Cyclorrhapha is shown in Fig 2. The average AT-skew of the cyclorrhaphan mt genomes was 0.033, ranging from -0.004 (Simosyrphus grandicornis, Syrphidae [13]) to 0.131 (Bactrocera  Table). The average GC-skew of cyclorrhaphan mt genomes was -0.190, ranging from -0.315 (in Bactrocera minax, Tephritidae [20]) to -0.124 (in Haematobia irritans, Muscidae, NC_007102), while the N. mamaevi mt genome exhibited a marked GC-skew (-0.206) (S3 Table). AT-and GC-skews of Cyclorrhapha mt genomes are consistent with the strand skew biases found in most metazoan mt genomes (weakly positive AT-skew and strongly negative GC-skew for the J-strand). Within insects the only exceptions are found in three families: Philopteridae (Phthiraptera), Aleyrodidae (Hemiptera) and Braconidae (Hymenoptera), which have positive GC-skew and negative AT-skew on the J-strand [60] and in termites (Isoptera) which have strongly positive AT-skew on the J-strand [61]. In insects, the degree of AT-skew is related to gene direction, replication and codon positions, whereas the degree of GC-skew is affected by reversals in replication orientation [60]. A and T bias is also reflected in relative codon usage by the PCGs. In the mt genome of N. mamaevi, A+T rich codons, such as ATT (Asn), TAA (Leu), AAA (Lys), ATA (Met), TTT (Phe), TAT (Tyr), are more frequently used than G+C rich codons. Among both strands, NNAform codons were the most frequently used and NNC-type the most infrequently used codons. On J-strand NNA were preponderant and NNG were the least frequent with the reverse found in N-strand encoded PCGs (NNU commonest, NNC least common) (Fig 3, S4 Table).

Protein-coding Genes
The overall A+T content of the 13 PCGs in N. mamaevi was 72.7%, with individual PCGs ranging from 65.8% (CO1) to 82.5% (ND6) ( Table 2). The A+T content of the third codon positions (85.1%) were much higher than either the first (66.8%) or second codon (66.4%). There was moderate negative AT-skew for the PCGs as a whole (-0.16) driven by strongly negative ATskew at second codon positions (-0.39), with weak skew at first, and third codon positions (-0.08 and -0.06 respectively). The absence of significant GC-skew across the PCGs as a whole (0.01) masks strong skews at each codon position with first codon positions strongly positive (0.23) balanced by strongly negative skews at second and third codon positions (-0.16 and -0.11 respecively) (Table 3).
ATN, GTG (Val) and TTG (Leu UUR ) are accepted as the canonical start codons for invertebrate PCGs [61], while TCG (Ser UCN ) has been proposed to be an additional start codon commonly found in flies [13]. All 13 PCGs in the N. mamaevi mt genome used canonical start codons. ND2, ND5 and ND6 started with ATT (Ile); ATC (Ile) were found in ATP8 and ND3; CO1 and ND1 start with TCG (Ser UCN ) and TTG (Leu UUR ), respectively; and ATG (Met) were used in the remaining six PCGs as start codons (Table 2).
Among the cyclorrhaphan flies sequenced to date, ATG (Met) is the most frequently used start codon followed by ATT (Ile). ATG (Met) is almost exclusively used in the ATP6, CO2,  as ATA (Met) and ATC (Ile) are also found in ATP8, CO3, ND3, ND5 and ND6 in a minority of fly species. TTG (Leu UUR ) is used in ND1 in most species and TCG (Ser UCN ) is used in CO1 in most species. (Fig 4, S5 Table). In addition, three non-canonical start codons GTG (Val), CCG (Pro) and GAG (Glu) have been proposed. GTG (Val) is found as the start codon for ATP8 in six Bactrocera species (Tephrididae) and for ND1 in two Liriomyza flies (Agromyzidae). CCG (Pro) is in the start codon of CO1 in Drosophila sechellia and D. simulans (Drosophilidae) [29]. GAG (Glu) was proposed as the start codon for ND1 in Rutilia goerlingiana (Tachinidae) [5]. ATP8, CO1, ND1, ND3 and ND6 therefore collectively utilise three to five different start codons. In contrast, the remaing of the PCGs utilise no more than two and often only one, start codon (Fig 4, S5 Table). The stop codons most commonly used in N. mamaevi are TAA (ATP6, ATP8, CO1, CO2, CO3, ND1, ND2, ND4L, ND6), or TAG (CYTB, ND3) found in 11 of the 13 PCGs. The remaining two PCGs (ND4, ND5) utilise the partial stop codon T, which has been found in many insect mt genomes and is completed to a full TAA stop codon via post-transcriptional polyadenylation [63]. In all reported cyclorrhaphan flies studied to date, TAA is the most common stop codon used found in every PCG in at least one species, most notably in the ATP6, ATP8, CO3, ND4, ND4L and ND6 genes from all species sequenced. TAG has been found in CO2, CYTB, ND1, ND2, ND3, ND4 and ND5. The incomplete stop codons TA and T are founded in CO2, CYTB, ND1, ND2, ND3, ND4 and ND5 as well as being commonly found in CO1 (Fig 4, S5 Table).
By aligning the homologous genes PCGs among closely related species, we detected an apparent sequencing error in the published mt genomes of Bactrocera minax (Tephritidae) [20]. Base position 896 bp of ND1 in B. minax, a C, is apparently an insertion as it results in a frameshift in the downstream amion-acid sequence that is significantly different from other cyclorrhaphan flies; removal of this C restores the correct reading frame resulting in the same amino acid sequence as in other cyclorrhaphans and the same stop-codon. In addition, CO2 in Procecidochares utilis (Tephritidae) (NC_020463) has an apparent deletion at position 678, resulting in a missing "A" base. As a result, the version of this genome on GenBank is 5 bp too long at the 3' end.

Intergenic sequences
Two conserved intergenic sequences blocks of genes coded on different strands (7 bp between ND1 and tRNA Ser(UCN) , 5 bp between tRNA Glu and tRNA Phe ) were found in typical insect mt genome [64][65]. Here, we revealed two 18 bp conserved sequences from both spacers across Cyclorrhapha.
The annotated sequence length and stop codons of ND1 varies considerably among sequences from cyclorrhaphan flies. For example all Calyptratae sequenced to date have canonical stop codons (TAA or TAG). Most of the sequenced Acalyptratae species lack complete stop codons, resulting in significant length variability. Moreover, some species lack even incomplete stop codons e.g. the bases immediately preceding tRNA Ser(UCN) are neither T nor TA. We aligned the region from ND1 to tRNA Ser(UCN) of all sequenced Cyclorrhapha, and re-annotated ND1 gene based on relatively well conserved sequences in this intergenic spacer and despite large variation in spacer length (e.g. up to 102 bp Hypoderma lineatum, Oestridae [35]). Sequences from 57 species/subspecies populations were aligned and analysed (excluding Bactrocera minax due to the sequencing errors noted above). Although the intergenic sequences between ND1 and tRNA Ser(UCN) ranges from 15-102 bp in size, there is a 16 bp highly conserved sequence, found in all Cyclorrhapha (except Drosophila simulans and Rutilia goerlingiana in which there is a 1 bp deletion). Species that lack complete stop codons in ND1 have two

Transfer RNAs
The typical complement of 22 typical tRNAs found in the arthropod mt genomes were found in the N. mamaevi mt genome, ranging in size from 63 bp (tRNA Arg ) to 72 bp (tRNA Val ) and 1465 bp or 9.23% of the total genome. The overall A+T content the tRNAs was 76.2%, while single genes ranged from 64.8% (tRNA Lys ) to 86.6% (tRNA Asp ). Fourteen genes are encoded on the J-strand and the remains are encoded on the N-strand. Most tRNAs could be folded into First Mt Genome of Sepsidae and Mt Genome Phylogeny of Cyclorrhapha the typical clover-leaf structure (Fig 6), whereas the tRNA Ser(AGN) was an exception lacking a DHU arm as has been observed for this gene in other metazoan mt genomes [62].
A total of 17 G-U pairs and two mismatched base U-U pairs were found in N. mamaevi mitochondrial tRNA secondary structures, no A-A pairs or A-C pairs were found. The G-U pairs are located in the AA arm (7 bp), DHU arm (8 bp), AC arm (2 bp), respectively. In contrast, all mismatched base U-U pairs are located in the TψC arm.

Ribosomal RNAs
The rRNAs of N. mamaevi were 1,321 bp for lrRNA and 783 bp for srRNA in length. Their A +T content were 80.6% and 77.6%, respectively. Because rRNA genes lack functional annotation features, analogous to the start and stop codons of PCGs, it is impossible to determine the boundaries from DNA sequence alone [11,68]. Hence, they were assumed to extend to the boundaries of flanking genes. As in other dipteran species, the lrRNA gene is flanked by tRNA-Leu (CUN) and tRNA Val , while the srRNA gene is flanked by tRNA Val and the control region.
We inferred secondary structures of lrRNA and srRNA of N. mamaevi using the published rRNA secondary structures of a leafminer Liriomyza sativae (Agromyzidae), the only dipteran so analysed to date, as a model [16]. The lrRNA had 49 helices in five structural domains (I-II, IV-VI, domain III is absent as in other insects), similar to L. sativae [16] and typical of arthropods [69]. The secondary structures of srRNA in N. mamaevi included three domains and 33 helices, again similar to other Diptera [16] (Figs 7 and 8).

The control region
The control region of N. mamaevi is 1067 bp long and located at the conserved insect position between srRNA and tRNA Ile . Control region A+T content was 86.4%, second only to tRNA Asp (86.6%) as the highest A+T content region in the N. mamaevi genome. Of the five conserved structural elements identified by Zhang and Hewitt (1997) [70] across insect mt control regions, we could detect three of them: (1) a 21 bp of poly-T stretch, which may be associated with the control of transcription and replication ( Fig 9A); (2) a (TA) n like stretch; (3) a stemloop structure at 3'-end of control region, however the 5' 'TATA' and a 3' 'G(A) n T' consensus regions were apparently absent (Fig 9C). We identified two additional, structural elements, not conserved in other insects and apparently unique to N. mamaevi: (1) two poly-A stretches (10 bp and 22 bp in length), ( Fig 9A); and (2) two non-tandem macro repeats: 5'-AAAAAAT ACCAGTAGCTGTTTTAAAACATAAATCTTCATT-3' (from 313 to 352, from 993 to 1,032), and 5'-GAACTAAATTTAATAAAATTT-3' (from 372 to 392, from 498 to 518). Both macro repeats could be folded into stem-loop structures (Fig 9B) and the second unit of each repeat type overlaps the poly-A and poly-T stretches, respectively (Fig 9A).

Automated annotation method with MITOS
MITOS (http://mitos.bioinf.uni-leipzig.de/index.py) [49] is the most advanced automated mt genome annotation pipeline yet produced, and is able to be operated over an external server allowing the batched submission of mt genomes without tying up the user's computer. However the accuracy of its annotations of protein-coding genes, in particular the location of start and stop codons, has been criticised [3]. We used MITOS to the mt genome of N. mamaevi, and compared the results with our hand annotation results.
MITOS identified all 37 genes in the correct gene order and direction on the mt genome. However not all the PCGs were correctly annotated, only the start codons of ATP8, CO2, CO3, CYTB, ND6 and stop codons of ND1, ND4, ND4L, ND5 were correct (S6 Table). For the tRNA genes, there was only one incorrect annotation: tRNA Ala lacked a base A at 3'-end (S6 Table). In contrast an older, widely used tRNA prediction package tRNAscan-SE Search Server v.1.21 (http://lowelab.ucsc.edu/tRNAscan-SE/) [46] proposed 25 potential tRNAs, and while 20 were correct, tRNA Arg and tRNA Ser(AGN) were not detected, and three spurious tRNA genes were predicted (S7 Table). These results are consistent with previous comparisons in the order Lepidoptera, that found MITOS outperformed tRNAScan for tRNA prediction, but was inferior to hand annotation of PCGs [3].
Phylogenetic analyses of mitochondrial genome data are known to be susceptible to compositional bias, particularly of third codon positions in the protein-coding genes [2] particularly in the Diptera [5,13], and it is likely that the results found here of unstable relationships within the Calyptratae and a paraphyletic Oestroidea are an example of this phenomena. Even if comparisons between the present study and the previous ones by Wiegmann et al. [77] and Kutty et al. [78] are restricted to comparing trees that find a monophyletic Oestroidea, there is considerable variation in inferred relationships with Oestridae + Tachinidae, the only interfamilial relationships found in more than one study (present study and [78]). Considerable additional data is required to reliably resolve relationships within the Oestroidea, while our results suggest the reliability of mt genome data in inferring phylogenetic relationships within the Cyclorrhapha generally, caution is warranted for its use within the Oestroidea given that we have demonstrated susceptibility to third codon biases.
Supporting Information S1