Comparative mitogenome analyses of twelve non-biting flies and provide insights into the phylogeny of Chironomidae (Diptera: Culicomorpha)

The family Chironomidae is represented by seven subfamilies in China, among which Chironominae and Orthocladiinae are the most diverse. To gain a better understanding of the architecture and evolution of the mitogenomes of Chironomidae, we sequenced mitogenomes of twelve species (including two published species) of the two subfamilies Chironominae and Orthocladiinae, and comparative mitogenomic analyses were performed. Thus, we identified highly conserved genome organization of twelve species with regard to genome content, nucleotide and amino acid composition, codon usage, and gene characteristics. The Ka/Ks values of most protein-coding genes were far smaller than 1, indicating that these genes were evolving under purifying selection. Phylogenetic relationships between the family Chironomidae were reconstructed using 23 species representing six subfamilies, based on protein-coding genes and rRNAs using Bayesian Inference and Maximum Likelihood. Our results suggested the following relationship within the Chironomidae: (Podonominae + Tanypodinae) + (Diamesinae + (Prodiamesinae + (Orthocladiinae + Chironominae))). This study contributes to the mitogenomic database of Chironomidae, which will be significant for studing the mitogenome evolution of Chironomidae.

Sequencing and genome assembly. Paired-end sequencing libraries with an insert size of 350 bp were constructed from purified DNA extracts according to a standard protocol for Illumina DNA library construction. Sequencing was carried out using an Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA) by a commercial service provider (Origingene, Shanghai, China). A quality assessment of raw FASTQ files of the two libraries was carried using FastQC v0.11.8 (http:// www. bioin forma tics. babra ham. ac. uk/ proje cts/ fastqc/). Adapter sequences were removed using Trimmomatic v0. 30 29 , and approximately 2 Gb clean data were retained per library after trimming. The original sequences were filtered to obtain high-quality clean sequences which were then assembled using spades v.3.11.1 30 and Getorganells 31 .
Mitogenome maps were produced using CG View server 1.0 35 . MEGA X 36 and Codonw 1.4.4 were used for statistical analyses of base composition, codon usage, and relative synonymous codon usage (RSCU). MISA 37,38 was used for the detection of simple sequence repeats (SSRs) throughout the genomes. DnaSP 6.12.03 39 was used for the analysis of non-synonymous substitution rates (K a ) and synonymous substitution rates (K s ). Nucleotide composition bias was calculated according to the AT-skew = (A−T)/(A+T) and GC-skew = (G−C)/(G+C), as previously reported 40 . The AT-Skew and GC-Skew data were normalized for visualization using R (package pheatmap).
Phylogenetic analyses. Phylogenetic analyses were performed in Bayesian Inference and Maximum Likelihood using all PCGs and rRNA genes in PhyloSuite 41 with several plug-in programs: the first procedure is to standardize synonymous gene names and identify problematic annotation features, followed by sequence extraction; thereafter, nucleotide and amino acid sequences of genes were individually aligned using Mafft v7.407 42 with the normal alignment mode and trimmed using Gblocks v0.91 43 with default settings to remove ambiguously aligned regions. Alignments of 13 PCGs and 2 rRNAs were then concatenated into a supermatrix with the function 'concatenate sequences' , additionally, this function can record the index of each gene during concatenation and generate a partition file, which can be used in the PartitionFinder software. Then, PartitionFinder2 44 selected best-fit partitioning schemes and models for the concatenated sequences based on the AICc criterion. Maximum likelihood (ML) analyses using partition mode were performed using IQ-TREE 1.6.8 45 , bootstrapping using ultrafast 46 , and the nodal support values were calculated with 5000 bootstrap replicates. Bayesian inference (BI) analysis was performed using MrBayes 3.2.6 47 with the partition models. Markov chain Monte Carlo runs of 200,000 generations were conducted, and trees were sampled every 1000 generations with the first 25% of the trees discarded as burn-in. Anopheles quadrimaculatus (NCBI accession NC000875), Wyeomyia confusa (MK575492), Simulium variegatum (NC033348), Simulium maculatum (NC040120), Forcipomyia sp. (MK000395), and Culicoides arakawae (NC009809) were used as outgroups. Tree topology was visualized using iTOL 48  www.nature.com/scientificreports/

Results
General structure, organization, and composition of mitogenomes. www.nature.com/scientificreports/ Dicrotendipes sp. to 16,184 bp in Chironomus nipponensis ( Table 1). Most of the size variation was due to differences in the control region, and some genomes showed additional noncoding regions within the coding region. Among these genes, four PCGs (ND1, ND4, ND4L, and ND5), eight tRNAs (trnQ, trnC, trnY, trnF, trnH, trnP, trnL, and trnV), and two rRNAs (12S rRNAs and 16S rRNAs) were encoded by the minority strand (N strand), while the other 23 genes were located on the majority strand (J strand) (Appendix S2). ATP8-ATP6 and ND4L-ND4 overlapped by seven nucleotides (ATG ATA A and ATG TTA A, respectively) in all twelve Chironomidae species. All mitogenomes examined here showed base composition biases as are typically observed in insect mitogenomes 49 . The AT content and skew statistics are shown in Table 1, indicating a pronounced A + T bias (75.55%-79.45%). The control region showed the highest A + T content, varying from 95.39% (Chironomus nipponensis) to 99.14% (Einfeldia sp.), whereas the first and the second codon positions of PCGs had the lowest A + T content, varying from 66.77 to 71.01%. With regard to complete mitogenomes, the AT-skew values were positive, varying from 0.010 (Einfeldia sp.) to 0.057 (Polypedilum nubifer), whereas GC-skew values were negative in all species, varying from − 0.257 (Polypedilum nubifer) to − 0.147 (Limnophyes sp.) (Fig. 2).
Analysis using MISA 37 showed the presence of 5-19 unequal SSRs in each of the 12 mitogenomes (Fig. 3), and the SSRs were mainly dominated by single nucleotide repeats; the number of dinucleotide repeats of each species was generally 1 or 2 (except for Dicrotendipes sp., Einfeldia sp.). Trinucleotide repeats occurred only in Nilodorum tainanus and Smittia aterrima. The single nucleotide was repetitive for A/T, and the dinucleotide repeated only AT/TA types. Trinucleotide repeats showed only ATT repeated, and no CG/GC types were detected. All SSRs contained high ratios of A or T bases.
Protein-coding genes. The mitogenomes comprised 13 classical PCGs, of which 7 were NADH dehydrogenase subunit genes (nad), 3 were cytochrome c oxidase subunit genes (cox), 2 were ATP synthase subunit genes (atp), and 1 was a cytochrome b gene (cytb). The total length of PCGs ranged from 11,106 to 11,220 bp, and the A + T content in this region ranged from 72.55 to 76.54%. The A + T content of the third codon position (82.32-91.98%) was higher compared to that of the first (67.58-71.01%) and second codons (66.77-68.33%) ( Table 1).
Six types of start codons (ATG, ATC, GTG, ATT, ATA, and TTG) were observed (Fig. 4). Among species, the start codons of the same PCG differed; the atp8 gene started with ATA, ATT, GTG, or ATC in different species; the coxI gene started with TTG, ATA, ATT, or ATG; the nad1 gene started with TTG, ATA, ATT, or GTG; the nad2 gene started with ATA, ATT, or ATG; the nad3 gene started with ATT or ATC; the nad5 gene started with ATT or GTG; the nad6 gene started with ATA, ATT, or ATC; the atp6, coxII, coxIII, cytb, nad4, and nad4L genes started with ATG. The presence of alternative start codons in mitochondrial genomes can produce implications for gene expression, translation efficiency, and translational accuracy. For example, a TTG start codon variant was identified for the coxI gene, rather than the conventional ATN start codon. This variation may led to a decrease in translation efficiency, thus causing incomplete protein expression.
Additionally, there were two types of stop codon: TAA and TAG. The nad4 gene terminated with TAA or TAG, other genes terminated with TAA (Appendix S2). The length of each PCG was consistent among species (Fig. S1), of which the nad5 gene was the longest (1600-1800 bp), followed by the coxI gene (1500-1600 bp), and the ATP8 gene was the shortest (< 200 bp).
The total number of codons ranged from 3702 to 3740 in the 13 protein-coding genes excluding the stop codons (Table 1). Codons encoding Ile, Leu2, Phe, and Ser2 (492-543, 390-469, 368-435, and 213-228, respectively) were more abundant compared to codons encoding other amino acids. The codons encoding Trp, Cys, and Met were fewer (5-15, 29-40, and 19-40, respectively) (Fig. 5). The 12 mitogenomes shared the same codon families and had similar characteristics of RSCU. For each amino acid, the most prevalent usage codons were NNA and NNU ( Fig. 6), which was consistent with the higher A + T content of the third codon of PCGs.
The ratio of K a /K s (ω) was used to investigate the evolutionary rates of mitochondrial PCGs (Fig. 7). The results showed that the ω values of the 12 PCGs (except nad5) were all below 1.0, indicating a strong repair mechanism against deleterious mutations in most PCGs. However, the K a /K s ratio differed significantly among individual genes, implying that the region comprised varying functional constraints. The K a /K s value of nad5 was the highest, implying the least purifying selective pressure. coxI exhibited the lowest K a /K s value, indicating the strongest purifying selective pressure. tRNAs, rRNAs, and control regions. Twenty-two typical tRNAs were found in twelve Chironomidae species. The number of base pairs in the 22 tRNAs ranged from 64 to 74 bp (trnK) (Appendix S2). The AT content of tRNA genes was A + T-biased, ranging from 79.21% (Limnophyes sp.) to 81.11% (Polypedilum nubifer). All tRNAs exhibited positive AT-skew and positive GC-skew (Table 1, Fig. 2).
The two genes encoding the ribosomal subunits were located between trnV and the control region and between trnL1 and trnV, respectively. The length of the rrnL genes ranged from 1299 to 1430 bp, and that of the rrnS genes ranged from 785 to 837 bp (Appendix S2). Both 12S rRNA genes and 16S rRNA genes were on the N-strand. The A + T content of 12S rRNA genes ranged from 82.29% (Smittia aterrima) to 84.32% (Polypedilum nubifer), and that of the 16S rRNA genes ranged from 83.76% (Smittia aterrima) to 85.67% (Polypedilum nubifer). The 12S rRNA genes exhibited negative a AT-skew in Dicrotendipes sp., Limnophyes sp., Polypedilum nubifer, and Tanytarsus formosanus and a positive AT-skew in the other species. The 16S rRNA genes showed a positive AT-skew in Einfeldia sp., Nilodorum tainanus, and Tanytarsus formosanus and a negative AT-skew in the other species. The 12S rRNA genes exhibited a positive GC-skew in all 12 mitogenomes. The 16S rRNA genes showed a negative GC-skew in Tanytarsus formosanus and a positive GC-skew in the other mitogenomes ( www.nature.com/scientificreports/ www.nature.com/scientificreports/  www.nature.com/scientificreports/ The control region is rich in A and T bases. It contains the initiation and regulatory elements of mitochondrial DNA transcription and replication and is the largest non-coding region in the mitochondrial genome 4 . The Chironomidae family is a highly diverse group of aquatic insects that possess complex control regions in their mitochondrial DNA. These control regions comprise of various regulatory elements, including conserved sequence blocks, promoter regions, stem-loop structures, polyadenylation sites, and other essential regulatory components. This complex arrangement of regulatory elements offers novel insights into the efficient regulation of mitochondrial function, including transcription and replication, which are necessary for sustaining mitochondrial physiology in Chironomidae and other organisms. In the present study, the length of the control region varied markedly between species, ranging from 69 bp (Chironomus kiiensis) to 682 bp (Dicrotendipes pelochloris), and it was located between rrnS and trnI. Its size can impact replication efficiency, and in some cases, smaller control regions may enhance replication efficiency by reducing the time required for the replication machinery to unwind the DNA and initiate replication. Analyses of the control region showed that the proportions of A, T, G, and C were 36.76-48.80%, 47.79-60.29%, 0.48-2.63%, and 0-2.67%, respectively. The control region showed a positive AT-skew in Nilodorum tainanus, Polypedilum nubifer and Tanytarsus formosanus and a negative ATskew in the other species. The control region exhibited a negative GC-skew in Dicrotendipes sp., Glyptotendipes tokunagai, Limnophyes sp., Nilodorum tainanus, Smittia aterrima, and Tanytarsus formosanus and a positive GC-skew in the other species (Table 1, Fig. 2).

Phylogenetic analyses.
Mitogenomes contain multiple types of valid information appropriate for phylogenetic and evolutionary analyses, such as the amino acid sequence, the secondary structure of RNA, and the arrangement order of genes [50][51][52] . In the present study, phylogenetic analyses based on the data set of PCGs and rRNAs using ML and BI methods showed similar phylogenetic relationships of six subfamilies within the Chironomidae in terms of topology. Phylogenetics trees (Fig. 8) showed distinct subfamilies and confirmed monophyly, apart from Orthocladiinae and Prodiamesinae. This result strongly supports that the monophyletic Propsilocerus of Orthocladiinae is a sister taxon to Monodiamesa + Prodiamesa of Prodiamesinae, which  www.nature.com/scientificreports/ is consistent with previous results 13,20 . According to Cranston et al. 13 , Propsilocerus probably belongs to the Prodiamesinae rather than to the Orthocladiinae; however, Lin et al. 20 considered Prodiamesinae a subgroup of Orthocladiinae. Based on our mitogenome phylogenies, it would be appropriate to transfer Propsilocerus as a subgroup of Prodiamesinae to make Orthocladiinae monophyletic. In addition, we compared the morphological characters between Propsilocerus and species in Prodiamesinae based on previous studies 53, 54 and identified a number of synapomorphies: medium to large species, color generally brown to black, abdomen typically with setae in paler spots; acrostichals typically absent, when present long and starting near scutal projection; R 4+5 ending distal to end of M 3+4 ; hind tibia with outer spur longer than 1/2 length of inner spur; inferior volsella typically conspicuously long and broad, arising from inner margin near base and extending posteriorly approximately to the level of the gonostylus. Combining mitogenomes analysis and morphological evidence, Propsilocerus should be transferred from the Orthocladiinae to Prodiamesinae, as proposed by Cranston et al. 13 . At the base of Chironomide, the subfamilies Podonominae and Tanypodinae are ancestral taxa and were supported (BS = 100, PP = 1) as sister groups to the remaining taxa. Diamesinae (BS = 100, PP = 1) was a sister clade to Prodiamesinae + (Orthocladiinae + Chironominae). Prodiamesinae (BS = 100, PP = 1) was a sister group to Orthocladiinae + Chironominae. In the present study, the mitogenomic data used to explore the phylogenetic www.nature.com/scientificreports/ relationships of six subfamilies within the Chironomidae generally showed consistent trends with traditional morphology-based systematics 10 . Therefore, we confirm that mitogenomic data are critical for phylogenetic reconstructions at subfamily level in the Chironomidae.

Conclusions
In the PCGs of 12 mitogenomes, the A + T content of the third codon position was markedly higher than those of the first and second positions, as in the published mitogenomes of other Chironomidae. For the whole mitogenomes, the AT-skew values were positive, while the GC-skew values were negative in all species, which is in line with the mitochondrial genomes of most insects contained more A than T, while G content was lower than that of C. SSRs mainly comprised mono-, di-, and trinucleotide repeats, with a high proportion of A or T bases. In terms of codon usage in this study, the relative frequency of synonymous codon usage showed that codons ending   www.nature.com/scientificreports/ in A/U were more frequent than those ending in G/C, which was consistent with the high AT characteristics of insect mitochondrial genomes. In the mitochondrial evolution rate analysis of the current study, the K a /K s ratio of all genes (except nad5) was < 1, indicating mainly the purification selection of PCGs. The nad5 was the largest among the 13 PCGs, but it is rarely used in the phylogeny of Chironomidae, likely owing to its rapid rate of evolution. The coxI gene was the second-longest, and its K a /K s value was the lowest; therefore, it provides the strongest and the most conserved effective phylogenetic and evolutionary signals among all genes, and it is frequently used as a key barcode marker to identify species and determine inter-species differences. In general, different K a /K s values reflected that genes were subjected to different degrees of purification selection.
According to the phylogenetic results, the basic branches of Chironomidae included the subfamilies Podonominae and Tanypodinae. When Propsilocerus is transferred to the Prodiamesinae subfamily, all six subfamilies are monophyletic. The phylogenetic relationship of the six subfamilies is thus (Podonominae + Tanypodinae) + (Diamesinae + (Prodiamesinae + (Orthocladiinae + Chironominae))). Obtaining more data on the mitochondrial genome of Chironomidae will provide important support for research on the phylogenetic relationship of Chironomidae.